jm + hashing   32

Image comparison algorithms
Awesome StackOverflow answer for detecting "similar" images -- promising approach to reimplement ffffound's similarity feature in mltshp, maybe
algorithms  hashing  comparison  diff  images  similarity  search  ffffound  mltshp 
14 days ago by jm
Accidentally Quadratic — Rust hash iteration+reinsertion
It was recently discovered that some surprising operations on Rust’s standard hash table types could go quadratic.

Quite a nice unexpected accidental detour into O(n^2)
big-o  hashing  robin-hood-hashing  siphash  algorithms  hashtables  rust 
november 2016 by jm
Stop it with short PGP key IDs!
What happened today? We still don't really know, but it seems we found a first potentially malicious collision — that is, the first "nonacademic" case. Enrico found two keys sharing the 9F6C6333 short ID, apparently belonging to the same person (as would be the case of Asheesh, mentioned above). After contacting Gustavo, though, he does not know about the second — That is, it can be clearly regarded as an impersonation attempt. Besides, what gave away this attempt are the signatures it has: Both keys are signed by what appears to be the same three keys: B29B232A, F2C850CA and 789038F2. Those three keys are not (yet?) uploaded to the keyservers, though... But we can expect them to appear at any point in the future. We don't know who is behind this, or what his purpose is. We just know this looks very evil.
Now, don't panic: Gustavo's key is safe. Same for his certifiers, Marga, Agustín and Maxy. It's just a 32-bit collision. So, in principle, the only parties that could be cheated to trust the attacker are humans, right? Nope.
Enrico tested on the PGP pathfinder & key statistics service, a keyserver that finds trust paths between any two arbitrary keys in the strong set. Surprise: The pathfinder works on the short key IDs, even when supplied full fingerprints. So, it turns out I have three faked trust paths into our impostor.
pgp  gpg  keys  collisions  hashing  security  debian 
june 2016 by jm
Rendezvous hashing - Wikipedia, the free encyclopedia

Rendezvous or Highest Random Weight (HRW) hashing[1][2] is an algorithm that allows clients to achieve distributed agreement on a set of k options out of a possible set of n options. A typical application is when clients need to agree on which sites (or proxies) objects are to assigned to. When k is 1, it subsumes the goals of consistent hashing, using an entirely different method.
hrw  hashing  hashes  consistent-hashing  rendezvous-hashing  algorithms  discovery  distributed-computing 
april 2016 by jm
BLAKE2: simpler, smaller, fast as MD5
'We present the cryptographic hash function BLAKE2, an improved version
of the SHA-3 finalist BLAKE optimized for speed in software. Target applications include
cloud storage, intrusion detection, or version control systems. BLAKE2 comes
in two main flavors: BLAKE2b is optimized for 64-bit platforms, and BLAKE2s for
smaller architectures. On 64-bit platforms, BLAKE2 is often faster than MD5, yet provides
security similar to that of SHA-3. We specify parallel versions BLAKE2bp and
BLAKE2sp that are up to 4 and 8 times faster, by taking advantage of SIMD and/or
multiple cores. BLAKE2 has more benefits than just speed: BLAKE2 uses up to 32%
less RAM than BLAKE, and comes with a comprehensive tree-hashing mode as well
as an efficient MAC mode.'
crypto  hash  blake2  hashing  blake  algorithms  sha1  sha3  simd  performance  mac 
april 2016 by jm
The general birthday problem
Good explanation and scipy code for the birthday paradox and hash collisions
hashing  hashes  collisions  birthday-problem  birthday-paradox  coding  probability  statistics 
february 2016 by jm
AV vendors still relying on MD5 to identify malware
oh dear. I can see how this happened -- in many cases they may not still have samples to derive new sums from :(
md5  hashing  antivirus  malware  security  via:fanf  bugs 
june 2015 by jm
Trend Micro Locality Sensitive Hash
a fuzzy matching library. Given a byte stream with a minimum length
of 512 bytes, TLSH generates a hash value which can be used for similarity
comparisons. Similar objects will have similar hash values which allows for
the detection of similar objects by comparing their hash values. Note that
the byte stream should have a sufficient amount of complexity. For example,
a byte stream of identical bytes will not generate a hash value.

Paper here:

via adulau
nilsimsa  sdhash  ssdeep  locality-sensitive  hashing  algorithm  hashes  trend-micro  tlsh  hash  fuzzy-matching  via:adulau 
may 2015 by jm
"Cuckoo Filter: Practically Better Than Bloom"
'We propose a new data structure called the cuckoo filter that can replace Bloom filters for approximate set membership
tests. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than
Bloom filters. For applications that store many items and target moderately low false positive rates, cuckoo filters have
lower space overhead than space-optimized Bloom filters. Our experimental results also show that cuckoo filters outperform previous data structures that extend Bloom filters to support deletions substantially in both time and space.'
algorithms  paper  bloom-filters  cuckoo-filters  cuckoo-hashing  data-structures  false-positives  big-data  probabilistic  hashing  set-membership  approximation 
march 2015 by jm
How I created two images with the same MD5 hash
I found that I was able to run the algorithm in about 10 hours on an AWS large GPU instance bringing it in at about $0.65 plus tax.

Bottom line: MD5 is feasibly attackable by pretty much anyone now.
crypto  images  md5  security  hashing  collisions  ec2  via:hn 
november 2014 by jm
3 Rules of thumb for Bloom Filters
I often need to do rough back-of-the-envelope reasoning about things, and I find that doing a bit of work to develop an intuition for how a new technique performs is usually worthwhile. So, here are three broad rules of thumb to remember when discussing Bloom filters down the pub:

One byte per item in the input set gives about a 2% false positive rate.

The optimal number of hash functions is about 0.7 times the number of bits per item.

3 - The number of hashes dominates performance.

But see also , (thanks Tony Finch!)
bloom-filters  algorithm  probabilistic  rules  reasoning  via:norman-maurer  false-positives  hashing  coding 
august 2014 by jm
MinHash for dummies
A Java-oriented practical intro to the MinHash duplicate-detection shingling algo
shingling  algorithms  minhash  hashing  duplicates  duplicate-detection  fuzzy-matching  java 
august 2014 by jm
NYC generates hash-anonymised data dump, which gets reversed
There are about 1000*26**3 = 21952000 or 22M possible medallion numbers. So, by calculating the md5 hashes of all these numbers (only 24M!), one can completely deanonymise the entire data. Modern computers are fast: so fast that computing the 24M hashes took less than 2 minutes.

(via Bruce Schneier)

The better fix is a HMAC (see ), or just to assign opaque IDs instead of hashing.
hashing  sha1  md5  bruce-schneier  anonymization  deanonymization  security  new-york  nyc  taxis  data  big-data  hmac  keyed-hashing  salting 
june 2014 by jm
Jump Consistent Hash: A Fast, Minimal Memory, Consistent Hash Algorithm
'a fast, minimal memory, consistent hash algorithm that can be expressed in about 5 lines of code. In comparison to the algorithm of Karger et al., jump consistent hash requires no storage, is faster, and does a better job of evenly dividing the key space among the buckets and of evenly dividing the workload when the number of buckets changes. Its main limitation is that the buckets must be numbered sequentially, which makes it more suitable for data storage applications than for distributed web caching.'

Implemented in Guava. This is also noteworthy:

'Google has not applied for patent protection for this algorithm, and, as of this writing, has no plans to. Rather, it wishes to contribute this algorithm to the community.'
hashing  consistent-hashing  google  guava  memory  algorithms  sharding 
june 2014 by jm
Shuffle Sharding
Colm MacCarthaigh writes about a simple sharding/load-balancing algorithm which uses randomized instance selection and optional additional compartmentalization. See also: continuous hashing, and
hashing  load-balancing  sharding  partitions  dist-sys  distcomp  architecture  coding 
april 2014 by jm
_An Improved Construction For Counting Bloom Filters_
'A counting Bloom filter (CBF) generalizes a Bloom filter data structure so as to allow membership queries on a set that can be changing dynamically via insertions and deletions. As with a Bloom filter, a CBF obtains space savings by allowing false positives. We provide a simple hashing-based alternative based on d-left hashing called a d-left CBF (dlCBF). The dlCBF offers the same functionality as a CBF, but uses less space, generally saving a factor of two or more. We describe the construction of dlCBFs, provide an analysis, and demonstrate their effectiveness experimentally'
bloom-filter  data-structures  algorithms  counting  cbf  storage  false-positives  d-left-hashing  hashing 
september 2013 by jm
a new, and interesting, sketching algorithm, with a Java implementation:
Recordinality is unique in that it provides cardinality estimation like HLL, but also offers "distinct value sampling." This means that Recordinality can allow us to fetch a random sample of distinct elements in a stream, invariant to cardinality. Put more succinctly, given a stream of elements containing 1,000,000 occurrences of 'A' and one occurrence each of 'B' - 'Z', the probability of any letter appearing in our sample is equal. Moreover, we can also efficiently store the number of times elements in our distinct sample have been observed. This can help us to understand the distribution of occurrences of elements in our stream. With it, we can answer questions like "do the elements we've sampled present in a power law-like pattern, or is the distribution of occurrences relatively even across the set?"
sketching  coding  algorithms  recordinality  cardinality  estimation  hll  hashing  murmurhash  java 
august 2013 by jm
Lectures in Advanced Data Structures (6.851)
Good lecture notes on the current state of the art in data structure research.
Data structures play a central role in modern computer science. You interact with data structures even more often than with algorithms (think Google, your mail server, and even your network routers). In addition, data structures are essential building blocks in obtaining efficient algorithms. This course covers major results and current directions of research in data structures:

TIME TRAVEL We can remember the past efficiently (a technique called persistence), but in general it's difficult to change the past and see the outcomes on the present (retroactivity). So alas, Back To The Future isn't really possible.
GEOMETRY When data has more than one dimension (e.g. maps, database tables).
DYNAMIC OPTIMALITY Is there one binary search tree that's as good as all others? We still don't know, but we're close.
MEMORY HIERARCHY Real computers have multiple levels of caches. We can optimize the number of cache misses, often without even knowing the size of the cache.
HASHING Hashing is the most used data structure in computer science. And it's still an active area of research.
INTEGERS Logarithmic time is too easy. By careful analysis of the information you're dealing with, you can often reduce the operation times substantially, sometimes even to constant. We will also cover lower bounds that illustrate when this is not possible.
DYNAMIC GRAPHS A network link went down, or you just added or deleted a friend in a social network. We can still maintain essential information about the connectivity as it changes.
STRINGS Searching for phrases in giant text (think Google or DNA).
SUCCINCT Most “linear size” data structures you know are much larger than they need to be, often by an order of magnitude. Some data structures require almost no space beyond the raw data but are still fast (think heaps, but much cooler).

(via Tim Freeman)
data-structures  lectures  mit  video  data  algorithms  coding  csail  strings  integers  hashing  sorting  bst  memory 
april 2013 by jm
Abusing hash kernels for wildly unprincipled machine learning
what, is this the first time our spam filtering approach of hashing a giant feature space is hitting mainstream machine learning? that can't be right!
ai  machine-learning  python  data  hashing  features  feature-selection  anti-spam  spamassassin 
april 2013 by jm
java - Given that HashMaps in jdk1.6 and above cause problems with multi-threading, how should I fix my code - Stack Overflow
Massive Java concurrency fail in recent 1.6 and 1.7 JDK releases -- the java.util.HashMap type now spin-locks on an AtomicLong in its constructor.

Here's the response from the author: 'I'll acknowledge right up front that the initialization of hashSeed is a bottleneck but it is not one we expected to be a problem since it only happens once per Hash Map instance. For this code to be a bottleneck you would have to be creating hundreds or thousands of hash maps per second. This is certainly not typical. Is there really a valid reason for your application to be doing this? How long do these hash maps live?'

Oh dear. Assumptions of "typical" like this are not how you design a fundamental data structure. fail. For now there is a hacky reflection-based workaround, but this is lame and needs to be fixed as soon as possible. (Via cscotta)
java  hashmap  concurrency  bugs  fail  security  hashing  jdk  via:cscotta 
february 2013 by jm
fail0verflow ::
Excellent demo of how use of a block cipher with a known secret key makes an insecure MAC. "In short, CBC-MAC is a Message Authentication Code, not a strong hash function. While MACs can be built out of hash functions (e.g. HMAC), and hash functions can be built out of block ciphers like AES, not all MACs are also hash functions. CBC-MAC in particular is completely unsuitable for use as a hash function, because it only allows two parties with knowledge of a particular secret key to securely transmit messages between each other. Anyone with knowledge of that key can forge the messages in a way that keeps the MAC (“hash value”) the same. All you have to do is run the forged message through CBC-MAC as usual, then use the AES decryption operation on the original hash value to find the last intermediate state. XORing this state with the CBC-MAC for the forged message yields a new block of data which, when appended to the forged message, will cause it to have the original hash value. Because the input is taken backwards, you can either modify the first block of the file, or just run the hash function backwards until you reach the block that you want to modify. You can make a forged file pass the hash check as long as you can modify an arbitrary aligned 16-byte block in it."
crypto  hashing  security  cbc  mac  sha1  aes 
january 2013 by jm
SipHash: a fast short-input PRF
a family of pseudorandom functions optimized for short inputs. Target applications include network traffic authentication and hash-table lookups protected against hash-flooding denials-of-service attacks.

SipHash is simpler than MACs based on universal hashing, and faster on short inputs.

Compared to dedicated designs for hash-table lookup, SipHash has well-defined security goals and competitive performance. For example, SipHash processes a 16-byte input with a fresh key in 140 cycles on an AMD FX-8150 processor, which is much faster than state-of-the-art MACs.
hashing  siphash  djb  security  algorithms 
october 2012 by jm
experimental CPU-cache-aware hash table implementations in Cloudera's Impala
via Todd Lipcon --

'another cool piece of cloudera impala source: cpu-cache-aware hash table implementations by @jackowayed'. 'L1-sized hash table that hopes to use cache well. Each bucket is a chunk list of tuples. Each chunk is a cache line.'
hashing  hash-tables  data-structures  performance  c++  l1  cache  cpu 
october 2012 by jm
Avoiding Hash Lookups in a Ruby Implementation
'If I were to sum up the past 6 years I've spent optimizing JRuby it would be with the following phrase: Get Rid Of Hash Lookups.'

This has been a particular theme of some recent optimization hacks I've been working on. Hashes may be O(1) to read, on average, but that doesn't necessarily mean they're the right tool for performance...

(via Declan McGrath)
via:declanmcgrath  hash  optimization  ruby  performance  jruby  hashing  data-structures  big-o  optimisation 
september 2012 by jm
Analyzing Flame's MD5 Collision Attack [slides, PDF]
really detailed slide deck by Alex Sotirov, Co-Founder and Chief Scientist, Trail of Bits, Inc. (via Tony Finch) Plenty of security fail by MS, and also: PKI is clearly too hard
via:fanf  flame  security  malware  md5  collisions  hashing  pki  tls  ssl  microsoft 
june 2012 by jm
feedback loop n-gram analyzer
'a simple parser of ARF compliant FBL complaints, which normalizes the email complaints and generates a 6-tuple n-gram version of the message. These n-grams are stored in a Redis database, keyed by the file in which they can be found. An inverse index also exists that allow you to find all messages containing a particular n-gram word.'
anti-spam  spam  fbl  feedback  filtering  n-grams  similarity  hashing  redis  searching 
september 2011 by jm
Dr. Neal Krawetz explains perceptual hashing
ie. TinEye and other "images like this one" search engines. nice explanation
algorithm  images  analysis  programming  dct  hashing  perceptual-hash  tineye  via:hn  image 
june 2011 by jm
deeptoad - Project Hosting on Google Code
'a (python) library and a tool to clusterize similar files using fuzzy hashing techniques. This project is inspired by the well known tool ssdeep.' Via Nelson
via:nelson  deeptoad  software  open-source  fuzzy  hashing  from delicious
november 2010 by jm
Stop using unsafe keyed hashes, use HMAC
why HMAC is more secure than secret-suffix and secret-prefix keyed hashing. good to know
hmac  security  crypto  hashing  md5  hashes  sha256  sha1  from delicious
october 2009 by jm

related tags

aes  ai  algorithm  algorithms  analysis  anonymization  anti-spam  antivirus  approximation  architecture  big-data  big-o  birthday-paradox  birthday-problem  blake  blake2  bloom-filter  bloom-filters  bruce-schneier  bst  bugs  c++  cache  calculators  cardinality  cbc  cbf  coding  collision  collisions  comparison  concurrency  consistent-hashing  counting  cpu  crypto  csail  cuckoo-filters  cuckoo-hashing  d-left-hashing  data  data-structures  dct  deanonymization  debian  deeptoad  diff  discovery  dist-sys  distcomp  distributed-computing  djb  duplicate-detection  duplicates  ec2  estimation  fail  false-positives  fbl  feature-selection  features  feedback  ffffound  filtering  flame  fuzzy  fuzzy-matching  google  gpg  guava  hash  hash-tables  hashes  hashing  hashmap  hashtables  hll  hmac  hrw  hyperloglog  image  images  integers  java  jdk  jruby  keyed-hashing  keys  l1  lectures  load-balancing  locality-sensitive  mac  machine-learning  malware  md5  memory  microsoft  minhash  mit  mltshp  murmurhash  n-grams  new-york  nilsimsa  nyc  open-source  optimisation  optimization  paper  partitions  perceptual-hash  performance  pgp  pki  probabilistic  probability  programming  python  random  reasoning  recordinality  redis  rendezvous-hashing  risk  robin-hood-hashing  ruby  rules  rust  salting  sdhash  search  searching  security  set-membership  sha  sha1  sha3  sha256  sharding  shingling  simd  similarity  siphash  sketching  software  sorting  spam  spamassassin  ssdeep  ssl  statistics  storage  strings  taxis  tineye  tips  tls  tlsh  trend-micro  via:adulau  via:cscotta  via:declanmcgrath  via:fanf  via:hn  via:jzawodny  via:nelson  via:norman-maurer  video 

Copy this bookmark: