jm + algorithm   10

_Optimal Probabilistic Cache Stampede Prevention_ [pdf]
'When a frequently-accessed cache item expires, multiple requests
to that item can trigger a cache miss and start regenerating
that same item at the same time. This phenomenon,
known as cache stampede, severely limits the performance
of databases and web servers. A natural countermeasure to
this issue is to let the processes that perform such requests
randomly ask for a regeneration before the expiration
time of the item. In this paper we give optimal algorithms
for performing such probabilistic early expirations. Our algorithms
are theoretically optimal and have much better
performance than other solutions used in real-world applications.'

(via Marc Brooker)
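
A minimal sketch of how such a probabilistic early expiration can look in practice (the exponentially-weighted variant usually attributed to this paper); the function and parameter names here are mine, not the paper's:

<pre>import math
import random
import time

def should_recompute(expiry_time, delta, beta=1.0, now=None):
    # Recompute before expiry with a probability that rises sharply as
    # expiry_time approaches.  delta is how long the last recomputation
    # took; beta > 1 favours earlier recomputation, beta < 1 later.
    now = time.time() if now is None else now
    r = random.random() or 1e-12   # guard against log(0)
    return now - delta * beta * math.log(r) >= expiry_time</pre>

Because each caller rolls independently, only a handful of requests recompute early instead of the whole herd stampeding at once.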
via:marcbrooker  caching  caches  algorithm  probabilistic  expiration  vldb  papers  expiry  cache-miss  stampedes 
may 2017 by jm
LinkedIn called me a white supremacist
Wow. Massive, massive algorithm fail.
On the morning of May 12, LinkedIn, the networking site devoted to making professionals “more productive and successful,” emailed scores of my contacts and told them I’m a professional racist. It was one of those updates that LinkedIn regularly sends its users, algorithmically assembled missives about their connections’ appearances in the media. This one had the innocent-sounding subject, “News About William Johnson,” but once my connections clicked in, they saw a small photo of my grinning face, right above the headline “Trump put white nationalist on list of delegates.” [.....] It turns out that when LinkedIn sends these update emails, people actually read them. So I was getting upset. Not only am I not a Nazi, I’m a Jewish socialist with family members who were imprisoned in concentration camps during World War II. Why was LinkedIn trolling me?
ethics  fail  algorithm  linkedin  big-data  racism  libel 
may 2016 by jm
Lamport timestamps
'The algorithm of Lamport timestamps is a simple algorithm used to determine the order of events in a distributed computer system. As different nodes or processes will typically not be perfectly synchronized, this algorithm is used to provide a partial ordering of events with minimal overhead, and conceptually provide a starting point for the more advanced vector clock method. They are named after their creator, Leslie Lamport.'

See also vector clocks (which I think would be generally preferable nowadays).
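
A minimal sketch of the update rules, assuming one integer counter per process (class and method names are mine):

<pre>class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: just advance the counter.
        self.time += 1
        return self.time

    def send(self):
        # Advance and attach the timestamp to the outgoing message.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Merge rule: jump past both our clock and the sender's.
        self.time = max(self.time, msg_time) + 1
        return self.time</pre>

If event A happened-before event B, A's timestamp is smaller; the converse doesn't hold, which is why this gives only a partial ordering and why vector clocks exist.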
vector-clocks  distributed  programming  algorithm  clocks  time  leslie-lamport  coding  distcomp 
may 2016 by jm
Darts, Dice, and Coins
Earlier this year, I asked a question on Stack Overflow about a data structure for loaded dice. Specifically, I was interested in answering this question: "You are given an n-sided die where side i has probability p_i of being rolled. What is the most efficient data structure for simulating rolls of the die?"

This data structure could be used for many purposes. For starters, you could use it to simulate rolls of a fair, six-sided die by assigning probability 1/6 to each of the sides of the die, or to simulate a fair coin by simulating a two-sided die where each side has probability 1/2 of coming up. You could also use this data structure to directly simulate the total of two fair six-sided dice being thrown by having an 11-sided die (whose faces were 2, 3, 4, ..., 12), where each side was appropriately weighted with the probability that this total would show if you used two fair dice. However, you could also use this data structure to simulate loaded dice. For example, if you were playing craps with dice that you knew weren't perfectly fair, you might use the data structure to simulate many rolls of the dice to see what the optimal strategy would be. You could also consider simulating an imperfect roulette wheel in the same way.

Outside the domain of game-playing, you could also use this data structure in robotics simulations where sensors have known failure rates. For example, if a range sensor has a 95% chance of giving the right value back, a 4% chance of giving back a value that's too small, and a 1% chance of handing back a value that's too large, you could use this data structure to simulate readings from the sensor by generating a random outcome and simulating the sensor reading in that case.

The answer I received on Stack Overflow impressed me for two reasons. First, the solution pointed me at a powerful technique called the alias method that, under certain reasonable assumptions about the machine model, is capable of simulating rolls of the die in O(1) time after a simple preprocessing step. Second, and perhaps more surprisingly, this algorithm has been known for decades, but I had not once encountered it! Considering how much processing time is dedicated to simulation, I would have expected this technique to be better-known. A few quick Google searches turned up a wealth of information on the technique, but I couldn't find a single site that compiled together the intuition and explanation behind the technique.


(via Marc Brooker)
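
A rough sketch of the alias method (Vose's variant) that the answer pointed to; this is my own illustration, not code from the article:

<pre>import random

def build_alias_table(probs):
    # O(n) preprocessing: split the scaled probabilities into columns of
    # equal height, each holding at most two outcomes.
    n = len(probs)
    prob, alias = [0.0] * n, [0] * n
    scaled = [p * n for p in probs]
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] = (scaled[l] + scaled[s]) - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:
        prob[i] = 1.0
    return prob, alias

def roll(prob, alias):
    # O(1) sampling: pick a column, then a biased coin picks the column's
    # own outcome or its alias.
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]</pre>

Each roll costs one uniform index plus one biased coin flip, which is where the constant-time claim comes from.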
via:marcbrooker  algorithms  probability  algorithm  coding  data-structures  alias  dice  random 
april 2016 by jm
Inceptionism: Going Deeper into Neural Networks
This is amazing, and a little scary.
If we choose higher-level layers, which identify more sophisticated features in images, complex features or even whole objects tend to emerge. Again, we just start with an existing image and give it to our neural net. We ask the network: “Whatever you see there, I want more of it!” This creates a feedback loop: if a cloud looks a little bit like a bird, the network will make it look more like a bird. This in turn will make the network recognize the bird even more strongly on the next pass and so forth, until a highly detailed bird appears, seemingly out of nowhere.

An enlightening comment from the G+ thread:

This is the most fun we've had in the office in a while. We've even made some of those 'Inceptionistic' art pieces into giant posters. Beyond the eye candy, there is actually something deeply interesting in this line of work: neural networks have a bad reputation for being strange black boxes that are opaque to inspection. I have never understood those charges: any other model (GMM, SVM, Random Forests) of any sufficient complexity for a real task is completely opaque for very fundamental reasons: their non-linear structure makes it hard to project back the function they represent into their input space and make sense of it. Not so with backprop, as this blog post shows eloquently: you can query the model and ask what it believes it is seeing or 'wants' to see simply by following gradients. This 'guided hallucination' technique is very powerful and the gorgeous visualizations it generates are very evocative of what's really going on in the network.
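
A toy sketch of the "I want more of it" loop (gradient ascent on the input image), assuming PyTorch and a hypothetical layer_model that returns the chosen layer's activations:

<pre>import torch

def dream_step(layer_model, image, lr=0.01):
    # Amplify whatever the chosen layer already sees in this image by
    # nudging the pixels along the gradient of its activations.
    image = image.clone().requires_grad_(True)
    loss = layer_model(image).norm()
    loss.backward()
    with torch.no_grad():
        image = image + lr * image.grad / (image.grad.abs().mean() + 1e-8)
    return image.detach()</pre>

Repeating this step (usually across several image scales) is the feedback loop the post describes: a faintly bird-shaped cloud gets more bird-like on every pass.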
art  machine-learning  algorithm  inceptionism  research  google  neural-networks  learning  dreams  feedback  graphics 
june 2015 by jm
Trend Micro Locality Sensitive Hash
a fuzzy matching library. Given a byte stream with a minimum length
of 512 bytes, TLSH generates a hash value which can be used for similarity
comparisons. Similar objects will have similar hash values which allows for
the detection of similar objects by comparing their hash values. Note that
the byte stream should have a sufficient amount of complexity. For example,
a byte stream of identical bytes will not generate a hash value.

Paper here: https://drive.google.com/file/d/0B6FS3SVQ1i0GTXk5eDl3Y29QWlk/edit

via adulau
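
A hypothetical usage sketch, assuming the py-tlsh Python bindings and their tlsh.hash() / tlsh.diff() functions; treat the details as illustrative rather than as documentation:

<pre>import tlsh

with open("sample_a.bin", "rb") as f:
    data_a = f.read()
with open("sample_b.bin", "rb") as f:
    data_b = f.read()

# Each input needs enough length and byte-level complexity to hash at all.
hash_a = tlsh.hash(data_a)
hash_b = tlsh.hash(data_b)

# Lower distance scores mean more similar byte streams.
print(hash_a, hash_b, tlsh.diff(hash_a, hash_b))</pre>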
nilsimsa  sdhash  ssdeep  locality-sensitive  hashing  algorithm  hashes  trend-micro  tlsh  hash  fuzzy-matching  via:adulau 
may 2015 by jm
The Algorists
<pre>if (creation && object of art && algorithm && one's own algorithm) {
include * an algorist *
} elseif (!creation || !object of art || !algorithm || !one's own algorithm) {
exclude * not an algorist *
}</pre>
algorism  algorithm  art  algorists  via:belongio 
march 2015 by jm
3 Rules of thumb for Bloom Filters
I often need to do rough back-of-the-envelope reasoning about things, and I find that doing a bit of work to develop an intuition for how a new technique performs is usually worthwhile. So, here are three broad rules of thumb to remember when discussing Bloom filters down the pub:

1. One byte per item in the input set gives about a 2% false positive rate.

2. The optimal number of hash functions is about 0.7 times the number of bits per item.

3. The number of hashes dominates performance.

But see also http://stackoverflow.com/a/9554448 , http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf (thanks Tony Finch!)
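
A quick sanity check of the first two rules against the standard false-positive approximation (my own sketch, not code from the post):

<pre>import math

def bloom_estimate(n_items, bits_per_item):
    m = n_items * bits_per_item                 # total bits in the filter
    k = max(1, round(0.7 * bits_per_item))      # rule 2: k is about 0.7 x bits per item
    fp = (1 - math.exp(-k * n_items / m)) ** k  # classic false-positive estimate
    return k, fp

# One byte (8 bits) per item gives k = 6 hashes and roughly a 2% false
# positive rate, matching rule 1:
print(bloom_estimate(1_000_000, 8))</pre>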
bloom-filters  algorithm  probabilistic  rules  reasoning  via:norman-maurer  false-positives  hashing  coding 
august 2014 by jm
Marc Brooker's "two-randoms" load balancing approach
Marc Brooker on this interesting load-balancing algorithm, including simulation results:
Using stale data for load balancing leads to a herd behavior, where requests will herd toward a previously quiet host for much longer than it takes to make that host very busy indeed. The next refresh of the cached load data will put the server high up the load list, and it will become quiet again. Then busy again as the next herd sees that it's quiet. Busy. Quiet. Busy. Quiet. And so on. One possible solution would be to give up on load balancing entirely, and just pick a host at random. Depending on the load factor, that can be a good approach. With many typical loads, though, picking a random host degrades latency and reduces throughput by wasting resources on servers which end up unlucky and quiet.

The approach taken by the studies surveyed by Mitzenmacher is to try two hosts, and pick the one with the least load. This can be done directly (by querying the hosts) but also works surprisingly well on cached load data. [...] Best of 2 is good because it combines the best of both worlds: it uses real information about load to pick a host (unlike random), but rejects herd behavior much more strongly than the other two approaches.


Having seen what Marc has worked on, and written, inside Amazon, I'd take this very seriously... cool to see he is blogging externally too.
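
A minimal sketch of the "best of two" pick over cached load data (my own illustration of the idea, not Marc's code):

<pre>import random

def pick_host(hosts, cached_load):
    # Sample two distinct hosts and keep the one whose (possibly stale)
    # load estimate is lower; ties go to the first.
    a, b = random.sample(hosts, 2)
    return a if cached_load[a] <= cached_load[b] else b</pre>

Random enough to break up herds, informed enough to avoid the busiest hosts.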
algorithm  load-balancing  distcomp  distributed  two-randoms  marc-brooker  least-conns 
february 2013 by jm
Dr. Neal Krawetz explains perceptual hashing
i.e. TinEye and other "images like this one" search engines. Nice explanation.
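
A minimal sketch of the simplest variant, the "average hash" (a DCT-based pHash makes the same comparison in the frequency domain); assumes Pillow is installed:

<pre>from PIL import Image

def average_hash(path, size=8):
    # Shrink, greyscale, then set one bit per pixel: brighter than the
    # mean becomes 1, darker becomes 0.
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def distance(h1, h2):
    # Small Hamming distance means "images like this one".
    return bin(h1 ^ h2).count("1")</pre>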
algorithm  images  analysis  programming  dct  hashing  perceptual-hash  tineye  via:hn  image 
june 2011 by jm
