jm + hashing   38

Fibonacci Hashing: The Optimization that the World Forgot (or: a Better Alternative to Integer Modulo)
Turns out I was wrong. This is a big one. And everyone should be using it. Hash tables should not be prime number sized and they should not use an integer modulo to map hashes into slots. Fibonacci hashing is just better. Yet somehow nobody is using it and lots of big hash tables (including all the big implementations of std::unordered_map) are much slower than they should be because they don’t use Fibonacci Hashing.

Apparently this is binary multiplicative hashing, and Google's brotli, webp, and Snappy libs all use a constant derived heuristically from a compression test corpus along the same lines (see comments).

(Via Michael Fogleman)
algorithms  hashing  hash  fibonacci  golden-ratio  coding  hacks  brotli  webp  snappy  hash-tables  hashmaps  load-distribution
june 2018 by jm
_Random Slicing: Efficient and Scalable Data Placement for Large-Scale Storage Systems_, ACM Transactions on Storage, July 2014
'The ever-growing amount of data requires highly scalable storage solutions. The most flexible approach is to use storage pools that can be expanded and scaled down by adding or removing storage devices. To make this approach usable, it is necessary to provide a solution to locate data items in such a dynamic environment. This article presents and evaluates the Random Slicing strategy, which incorporates lessons learned from table-based, rule-based, and pseudo-randomized hashing strategies and is able to provide a simple and efficient strategy that scales up to handle exascale data. Random Slicing keeps a small table with information about previous storage system insert and remove operations, drastically reducing the required amount of randomness while delivering a perfect load distribution.'
randomness  architecture  algorithms  storage  hashing  slicing  scaling
may 2018 by jm
google/highwayhash: Fast strong hash functions: SipHash/HighwayHash
HighwayHash: 'We have devised a new way of mixing inputs with AVX2 multiply and permute instructions. The multiplications are 32x32 -> 64 bits and therefore infeasible to reverse. Permuting equalizes the distribution of the resulting bytes. The internal state occupies four 256-bit AVX2 registers. Due to limitations of the instruction set, the registers are partitioned into two 512-bit halves that remain independent until the reduce phase. The algorithm outputs 64 bit digests or up to 256 bits at no extra cost. In addition to high throughput, the algorithm is designed for low finalization cost. The result is more than twice as fast as SipTreeHash.

We also provide an SSE4.1 version (80% as fast for large inputs and 95% as fast for short inputs), an implementation for VSX on POWER and a portable version (10% as fast). A third-party ARM implementation is referenced below.

Statistical analyses and preliminary cryptanalysis are given in https://arxiv.org/abs/1612.06257.'

(via Tony Finch)
siphash  highwayhash  via:fanf  hashing  hashes  algorithms  mac  google  hash
january 2018 by jm
This is absolutely spot on.

If Facebook wanted to implement a truly trusted system for revenge porn victims, they could put the photo hashing on the user side of things -- so only the hash is transferred to Facebook. To verify the claim that the image is truly a revenge porn issue, the victim could have the images verified through a trusted revenge porn advocacy organization. Theoretically, the victim then would have a verified, privacy-safe version of the photo, and a hash that could be also sent to Google and other sites.
facebook  privacy  hashing  pictures  images  revenge-porn  abuse  via:jwz
november 2017 by jm
Facebook asks users for nude photos in project to combat revenge porn
The photos are hashed, server-side, using the PhotoDNA hashing algorithm. This would have been way way better if it ran locally, on user's phones, instead though. Interesting to note that PhotoDNA claims to have a "1 in 10 billion" false positive rate according to https://www.itu.int/en/cop/case-studies/Documents/ICMEC_PhotoDNA.PDF
photodna  hashing  images  facebook  revenge-porn  messenger  nudes  photos
november 2017 by jm
Image comparison algorithms
Awesome StackOverflow answer for detecting "similar" images -- promising approach to reimplement ffffound's similarity feature in mltshp, maybe
algorithms  hashing  comparison  diff  images  similarity  search  ffffound  mltshp
april 2017 by jm
Accidentally Quadratic — Rust hash iteration+reinsertion
It was recently discovered that some surprising operations on Rust’s standard hash table types could go quadratic.

Quite a nice unexpected accidental detour into O(n^2)
big-o  hashing  robin-hood-hashing  siphash  algorithms  hashtables  rust
november 2016 by jm
Stop it with short PGP key IDs!
What happened today? We still don't really know, but it seems we found a first potentially malicious collision — that is, the first "nonacademic" case. Enrico found two keys sharing the 9F6C6333 short ID, apparently belonging to the same person (as would be the case of Asheesh, mentioned above). After contacting Gustavo, though, he does not know about the second — That is, it can be clearly regarded as an impersonation attempt. Besides, what gave away this attempt are the signatures it has: Both keys are signed by what appears to be the same three keys: B29B232A, F2C850CA and 789038F2. Those three keys are not (yet?) uploaded to the keyservers, though... But we can expect them to appear at any point in the future. We don't know who is behind this, or what his purpose is. We just know this looks very evil.
Now, don't panic: Gustavo's key is safe. Same for his certifiers, Marga, Agustín and Maxy. It's just a 32-bit collision. So, in principle, the only parties that could be cheated to trust the attacker are humans, right? Nope.
Enrico tested on the PGP pathfinder & key statistics service, a keyserver that finds trust paths between any two arbitrary keys in the strong set. Surprise: The pathfinder works on the short key IDs, even when supplied full fingerprints. So, it turns out I have three faked trust paths into our impostor.
pgp  gpg  keys  collisions  hashing  security  debian
june 2016 by jm
Rendezvous hashing - Wikipedia, the free encyclopedia

Rendezvous or Highest Random Weight (HRW) hashing[1][2] is an algorithm that allows clients to achieve distributed agreement on a set of k options out of a possible set of n options. A typical application is when clients need to agree on which sites (or proxies) objects are to assigned to. When k is 1, it subsumes the goals of consistent hashing, using an entirely different method.
hrw  hashing  hashes  consistent-hashing  rendezvous-hashing  algorithms  discovery  distributed-computing
april 2016 by jm
BLAKE2: simpler, smaller, fast as MD5
'We present the cryptographic hash function BLAKE2, an improved version
of the SHA-3 finalist BLAKE optimized for speed in software. Target applications include
cloud storage, intrusion detection, or version control systems. BLAKE2 comes
in two main flavors: BLAKE2b is optimized for 64-bit platforms, and BLAKE2s for
smaller architectures. On 64-bit platforms, BLAKE2 is often faster than MD5, yet provides
security similar to that of SHA-3. We specify parallel versions BLAKE2bp and
BLAKE2sp that are up to 4 and 8 times faster, by taking advantage of SIMD and/or
multiple cores. BLAKE2 has more benefits than just speed: BLAKE2 uses up to 32%
less RAM than BLAKE, and comes with a comprehensive tree-hashing mode as well
as an efficient MAC mode.'
crypto  hash  blake2  hashing  blake  algorithms  sha1  sha3  simd  performance  mac
april 2016 by jm
The general birthday problem
Good explanation and scipy code for the birthday paradox and hash collisions
hashing  hashes  collisions  birthday-problem  birthday-paradox  coding  probability  statistics
february 2016 by jm
AV vendors still relying on MD5 to identify malware
oh dear. I can see how this happened -- in many cases they may not still have samples to derive new sums from :(
md5  hashing  antivirus  malware  security  via:fanf  bugs
june 2015 by jm
Trend Micro Locality Sensitive Hash
a fuzzy matching library. Given a byte stream with a minimum length
of 512 bytes, TLSH generates a hash value which can be used for similarity
comparisons. Similar objects will have similar hash values which allows for
the detection of similar objects by comparing their hash values. Note that
the byte stream should have a sufficient amount of complexity. For example,
a byte stream of identical bytes will not generate a hash value.

nilsimsa  sdhash  ssdeep  locality-sensitive  hashing  algorithm  hashes  trend-micro  tlsh  hash  fuzzy-matching  via:adulau
may 2015 by jm
"Cuckoo Filter: Practically Better Than Bloom"
'We propose a new data structure called the cuckoo filter that can replace Bloom filters for approximate set membership
tests. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than
Bloom filters. For applications that store many items and target moderately low false positive rates, cuckoo filters have
lower space overhead than space-optimized Bloom filters. Our experimental results also show that cuckoo filters outperform previous data structures that extend Bloom filters to support deletions substantially in both time and space.'
algorithms  paper  bloom-filters  cuckoo-filters  cuckoo-hashing  data-structures  false-positives  big-data  probabilistic  hashing  set-membership  approximation
march 2015 by jm
How I created two images with the same MD5 hash
I found that I was able to run the algorithm in about 10 hours on an AWS large GPU instance bringing it in at about \$0.65 plus tax.

Bottom line: MD5 is feasibly attackable by pretty much anyone now.
crypto  images  md5  security  hashing  collisions  ec2  via:hn
november 2014 by jm
3 Rules of thumb for Bloom Filters
I often need to do rough back-of-the-envelope reasoning about things, and I find that doing a bit of work to develop an intuition for how a new technique performs is usually worthwhile. So, here are three broad rules of thumb to remember when discussing Bloom filters down the pub:

One byte per item in the input set gives about a 2% false positive rate.

The optimal number of hash functions is about 0.7 times the number of bits per item.

3 - The number of hashes dominates performance.

bloom-filters  algorithm  probabilistic  rules  reasoning  via:norman-maurer  false-positives  hashing  coding
august 2014 by jm
MinHash for dummies
A Java-oriented practical intro to the MinHash duplicate-detection shingling algo
shingling  algorithms  minhash  hashing  duplicates  duplicate-detection  fuzzy-matching  java
august 2014 by jm
NYC generates hash-anonymised data dump, which gets reversed
There are about 1000*26**3 = 21952000 or 22M possible medallion numbers. So, by calculating the md5 hashes of all these numbers (only 24M!), one can completely deanonymise the entire data. Modern computers are fast: so fast that computing the 24M hashes took less than 2 minutes.

(via Bruce Schneier)

The better fix is a HMAC (see http://benlog.com/2008/06/19/dont-hash-secrets/ ), or just to assign opaque IDs instead of hashing.
hashing  sha1  md5  bruce-schneier  anonymization  deanonymization  security  new-york  nyc  taxis  data  big-data  hmac  keyed-hashing  salting
june 2014 by jm
Jump Consistent Hash: A Fast, Minimal Memory, Consistent Hash Algorithm
'a fast, minimal memory, consistent hash algorithm that can be expressed in about 5 lines of code. In comparison to the algorithm of Karger et al., jump consistent hash requires no storage, is faster, and does a better job of evenly dividing the key space among the buckets and of evenly dividing the workload when the number of buckets changes. Its main limitation is that the buckets must be numbered sequentially, which makes it more suitable for data storage applications than for distributed web caching.'

Implemented in Guava. This is also noteworthy:

'Google has not applied for patent protection for this algorithm, and, as of this writing, has no plans to. Rather, it wishes to contribute this algorithm to the community.'
hashing  consistent-hashing  google  guava  memory  algorithms  sharding
june 2014 by jm
Shuffle Sharding
hashing  load-balancing  sharding  partitions  dist-sys  distcomp  architecture  coding
april 2014 by jm
_An Improved Construction For Counting Bloom Filters_
'A counting Bloom filter (CBF) generalizes a Bloom filter data structure so as to allow membership queries on a set that can be changing dynamically via insertions and deletions. As with a Bloom filter, a CBF obtains space savings by allowing false positives. We provide a simple hashing-based alternative based on d-left hashing called a d-left CBF (dlCBF). The dlCBF offers the same functionality as a CBF, but uses less space, generally saving a factor of two or more. We describe the construction of dlCBFs, provide an analysis, and demonstrate their effectiveness experimentally'
bloom-filter  data-structures  algorithms  counting  cbf  storage  false-positives  d-left-hashing  hashing
september 2013 by jm
Recordinality
a new, and interesting, sketching algorithm, with a Java implementation:
Recordinality is unique in that it provides cardinality estimation like HLL, but also offers "distinct value sampling." This means that Recordinality can allow us to fetch a random sample of distinct elements in a stream, invariant to cardinality. Put more succinctly, given a stream of elements containing 1,000,000 occurrences of 'A' and one occurrence each of 'B' - 'Z', the probability of any letter appearing in our sample is equal. Moreover, we can also efficiently store the number of times elements in our distinct sample have been observed. This can help us to understand the distribution of occurrences of elements in our stream. With it, we can answer questions like "do the elements we've sampled present in a power law-like pattern, or is the distribution of occurrences relatively even across the set?"
sketching  coding  algorithms  recordinality  cardinality  estimation  hll  hashing  murmurhash  java
august 2013 by jm
Lectures in Advanced Data Structures (6.851)
Good lecture notes on the current state of the art in data structure research.
Data structures play a central role in modern computer science. You interact with data structures even more often than with algorithms (think Google, your mail server, and even your network routers). In addition, data structures are essential building blocks in obtaining efficient algorithms. This course covers major results and current directions of research in data structures:

TIME TRAVEL We can remember the past efficiently (a technique called persistence), but in general it's difficult to change the past and see the outcomes on the present (retroactivity). So alas, Back To The Future isn't really possible.
GEOMETRY When data has more than one dimension (e.g. maps, database tables).
DYNAMIC OPTIMALITY Is there one binary search tree that's as good as all others? We still don't know, but we're close.
MEMORY HIERARCHY Real computers have multiple levels of caches. We can optimize the number of cache misses, often without even knowing the size of the cache.
HASHING Hashing is the most used data structure in computer science. And it's still an active area of research.
INTEGERS Logarithmic time is too easy. By careful analysis of the information you're dealing with, you can often reduce the operation times substantially, sometimes even to constant. We will also cover lower bounds that illustrate when this is not possible.
DYNAMIC GRAPHS A network link went down, or you just added or deleted a friend in a social network. We can still maintain essential information about the connectivity as it changes.
STRINGS Searching for phrases in giant text (think Google or DNA).
SUCCINCT Most “linear size” data structures you know are much larger than they need to be, often by an order of magnitude. Some data structures require almost no space beyond the raw data but are still fast (think heaps, but much cooler).

(via Tim Freeman)
data-structures  lectures  mit  video  data  algorithms  coding  csail  strings  integers  hashing  sorting  bst  memory
april 2013 by jm
Abusing hash kernels for wildly unprincipled machine learning
what, is this the first time our spam filtering approach of hashing a giant feature space is hitting mainstream machine learning? that can't be right!
ai  machine-learning  python  data  hashing  features  feature-selection  anti-spam  spamassassin
april 2013 by jm
java - Given that HashMaps in jdk1.6 and above cause problems with multi-threading, how should I fix my code - Stack Overflow
Massive Java concurrency fail in recent 1.6 and 1.7 JDK releases -- the java.util.HashMap type now spin-locks on an AtomicLong in its constructor.

Here's the response from the author: 'I'll acknowledge right up front that the initialization of hashSeed is a bottleneck but it is not one we expected to be a problem since it only happens once per Hash Map instance. For this code to be a bottleneck you would have to be creating hundreds or thousands of hash maps per second. This is certainly not typical. Is there really a valid reason for your application to be doing this? How long do these hash maps live?'

Oh dear. Assumptions of "typical" like this are not how you design a fundamental data structure. fail. For now there is a hacky reflection-based workaround, but this is lame and needs to be fixed as soon as possible. (Via cscotta)
java  hashmap  concurrency  bugs  fail  security  hashing  jdk  via:cscotta
february 2013 by jm
fail0verflow ::
Excellent demo of how use of a block cipher with a known secret key makes an insecure MAC. "In short, CBC-MAC is a Message Authentication Code, not a strong hash function. While MACs can be built out of hash functions (e.g. HMAC), and hash functions can be built out of block ciphers like AES, not all MACs are also hash functions. CBC-MAC in particular is completely unsuitable for use as a hash function, because it only allows two parties with knowledge of a particular secret key to securely transmit messages between each other. Anyone with knowledge of that key can forge the messages in a way that keeps the MAC (“hash value”) the same. All you have to do is run the forged message through CBC-MAC as usual, then use the AES decryption operation on the original hash value to find the last intermediate state. XORing this state with the CBC-MAC for the forged message yields a new block of data which, when appended to the forged message, will cause it to have the original hash value. Because the input is taken backwards, you can either modify the first block of the file, or just run the hash function backwards until you reach the block that you want to modify. You can make a forged file pass the hash check as long as you can modify an arbitrary aligned 16-byte block in it."
crypto  hashing  security  cbc  mac  sha1  aes
january 2013 by jm
SipHash: a fast short-input PRF
a family of pseudorandom functions optimized for short inputs. Target applications include network traffic authentication and hash-table lookups protected against hash-flooding denials-of-service attacks.

SipHash is simpler than MACs based on universal hashing, and faster on short inputs.

Compared to dedicated designs for hash-table lookup, SipHash has well-defined security goals and competitive performance. For example, SipHash processes a 16-byte input with a fresh key in 140 cycles on an AMD FX-8150 processor, which is much faster than state-of-the-art MACs.
hashing  siphash  djb  security  algorithms
october 2012 by jm
experimental CPU-cache-aware hash table implementations in Cloudera's Impala

'another cool piece of cloudera impala source: cpu-cache-aware hash table implementations by @jackowayed'. 'L1-sized hash table that hopes to use cache well. Each bucket is a chunk list of tuples. Each chunk is a cache line.'
hashing  hash-tables  data-structures  performance  c++  l1  cache  cpu
october 2012 by jm
Avoiding Hash Lookups in a Ruby Implementation
'If I were to sum up the past 6 years I've spent optimizing JRuby it would be with the following phrase: Get Rid Of Hash Lookups.'

This has been a particular theme of some recent optimization hacks I've been working on. Hashes may be O(1) to read, on average, but that doesn't necessarily mean they're the right tool for performance...

(via Declan McGrath)
via:declanmcgrath  hash  optimization  ruby  performance  jruby  hashing  data-structures  big-o  optimisation
september 2012 by jm
Analyzing Flame's MD5 Collision Attack [slides, PDF]
really detailed slide deck by Alex Sotirov, Co-Founder and Chief Scientist, Trail of Bits, Inc. (via Tony Finch) Plenty of security fail by MS, and also: PKI is clearly too hard
via:fanf  flame  security  malware  md5  collisions  hashing  pki  tls  ssl  microsoft
june 2012 by jm
feedback loop n-gram analyzer
'a simple parser of ARF compliant FBL complaints, which normalizes the email complaints and generates a 6-tuple n-gram version of the message. These n-grams are stored in a Redis database, keyed by the file in which they can be found. An inverse index also exists that allow you to find all messages containing a particular n-gram word.'
anti-spam  spam  fbl  feedback  filtering  n-grams  similarity  hashing  redis  searching
september 2011 by jm
Dr. Neal Krawetz explains perceptual hashing
ie. TinEye and other "images like this one" search engines. nice explanation
algorithm  images  analysis  programming  dct  hashing  perceptual-hash  tineye  via:hn  image
june 2011 by jm
'a (python) library and a tool to clusterize similar files using fuzzy hashing techniques. This project is inspired by the well known tool ssdeep.' Via Nelson
via:nelson  deeptoad  software  open-source  fuzzy  hashing  from delicious
november 2010 by jm
Stop using unsafe keyed hashes, use HMAC
why HMAC is more secure than secret-suffix and secret-prefix keyed hashing. good to know
hmac  security  crypto  hashing  md5  hashes  sha256  sha1  from delicious
october 2009 by jm

Copy this bookmark:

description:

tags: