nhaliday + hashing   23

Anti-hash test. - Codeforces
- Thue-Morse sequence
- nice paper: http://www.mii.lt/olympiads_in_informatics/pdf/INFOL119.pdf
In general, polynomial string hashing is a useful technique in construction of efficient string algorithms. One simply needs to remember to carefully select the modulus M and the variable of the polynomial p depending on the application. A good rule of thumb is to pick both values as prime numbers with M as large as possible so that no integer overflow occurs and p being at least the size of the alphabet.
2.2. Upper Bound on M
[stuff about 32- and 64-bit integers]
2.3. Lower Bound on M
On the other side Mis bounded due to the well-known birthday paradox: if we consider a collection of m keys with m ≥ 1.2√M then the chance of a collision to occur within this collection is at least 50% (assuming that the distribution of fingerprints is close to uniform on the set of all strings). Thus if the birthday paradox applies then one needs to choose M=ω(m^2)to have a fair chance to avoid a collision. However, one should note that not always the birthday paradox applies. As a benchmark consider the following two problems.

I generally prefer to use Schwartz-Zippel to reason about collision probabilities w/ this kind of thing, eg, https://people.eecs.berkeley.edu/~sinclair/cs271/n3.pdf.

A good way to get more accurate results: just use multiple primes and the Chinese remainder theorem to get as large an M as you need w/o going beyond 64-bit arithmetic.

more on this: https://codeforces.com/blog/entry/60442
oly  oly-programming  gotchas  howto  hashing  algorithms  strings  random  best-practices  counterexample  multi  pdf  papers  nibble  examples  fields  polynomials  lecture-notes  yoga  probability  estimate  magnitude  hacker  adversarial  CAS  lattice  discrete
8 days ago by nhaliday
Merkle tree - Wikipedia
In cryptography and computer science, a hash tree or Merkle tree is a tree in which every non-leaf node is labelled with the hash of the labels or values (in case of leaves) of its child nodes.
concept  cs  data-structures  bitcoin  cryptocurrency  blockchain  atoms  protocol  wiki  reference  nibble  hashing  ideas  crypto  rigorous-crypto
june 2017 by nhaliday
MinHash - Wikipedia
- goal: compute Jaccard coefficient J(A, B) = |A∩B| / |A∪B| in sublinear space
- idea: pick random injective hash function h, define h_min(S) = argmin_{x in S} h(x), and note that Pr[h_min(A) = h_min(B)] = J(A, B)
- reduce variance w/ Chernoff bound
algorithms  data-structures  sublinear  hashing  wiki  reference  random  tcs  nibble  measure  metric-space  metrics  similarity  PAC  intersection  intersection-connectedness
february 2017 by nhaliday
Count–min sketch - Wikipedia
- estimates frequency vector (f_i)
- idea:
d = O(log 1/δ) hash functions h_j: [n] -> [w] (w = O(1/ε))
d*w counters a[r, c]
for each event i, increment counters a[1, h_1(i)], a[2, h_2(i)], ..., a[d, h_d(i)]
estimate for f_i is min_j a[j, h_j(i)]
- never underestimates but upward-biased
- pf: Markov to get constant probability of success, then exponential decrease with repetition
lecture notes: http://theory.stanford.edu/~tim/s15/l/l2.pdf
- note this can work w/ negative updates. just use median instead of min. pf still uses markov on the absolute value of error.
algorithms  data-structures  sublinear  hashing  wiki  reference  bias-variance  approximation  random  tcs  multi  stanford  lecture-notes  pdf  tim-roughgarden  nibble  pigeonhole-markov  PAC
february 2017 by nhaliday

bundles : abstracttcs

Copy this bookmark:

description:

tags: