jm + searching   6

The Bkd Tree
good explanation of the Bkd-tree, a new data structure for searching multidimensional data
search  lucene  bkd-trees  searching  data-structures 
january 2016 by jm
Efficient substring searching
This is a couple of years old, but I like this:
'Turbo Boyer-Moore is disappointing, its name doesn’t do it justice. In academia constant overhead doesn’t matter, but here we see that it matters a lot in practice. Turbo Boyer-Moore’s inner loop is so complex that we think we’re better off using the original Boyer-Moore.'

A good demo of how an O(n) algorithm with a large constant factor can be slower in practice than an O(mn) algorithm with a small one.
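To give a feel for how small a practical inner loop can be, here is a minimal sketch of the Horspool simplification of Boyer-Moore. It is neither the Turbo variant nor the full original algorithm discussed in the linked article, and the function name `horspool_search` is made up for this note. Its worst case is O(mn), but the loop body is only a right-to-left compare plus one table lookup.

```python
def horspool_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1.

    Horspool's simplification of Boyer-Moore: only a bad-character
    shift table, no good-suffix rule and no Turbo bookkeeping.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1

    # For every character except the last one in the pattern, record the
    # distance from its rightmost occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    pos = 0
    while pos <= n - m:
        # Compare the pattern against the current window, right to left.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            return pos  # full match
        # Shift by the table entry for the window's last character;
        # characters not in the table allow a full-length jump.
        pos += shift.get(text[pos + m - 1], m)
    return -1
```

The result should agree with Python's built-in `str.find` for the same arguments, which makes it easy to sanity-check.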
algorithms  search  strings  coding  big-o  string-search  searching 
march 2014 by jm
How the search for flight AF447 used Bayesian inference
Via jgc, the search for the downed Air France flight was optimized using this technique:

'Metron’s approach to this search planning problem is rooted in classical Bayesian inference,
which allows organization of available data with associated uncertainties and computation of the
Probability Distribution Function (PDF) for target location given these data. In following this
approach, the first step was to gather the available information about the location of the impact site
of the aircraft. This information was sometimes contradictory and filled with ambiguities and
uncertainties. Using a Bayesian approach we organized this material into consistent scenarios,
quantified the uncertainties with probability distributions, weighted the relative likelihood of each
scenario, and performed a simulation to produce a prior PDF for the location of the wreck.'
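The core Bayesian step behind this kind of search planning can be sketched in a few lines. This is not Metron's model: the three-cell grid, the prior values, and the detection probability below are invented for illustration, and `update_after_failed_search` is a hypothetical name. The idea is that an unsuccessful search of one cell lowers that cell's probability and raises all the others.

```python
def update_after_failed_search(prior, searched_cell, p_detect):
    """Posterior over cells after searching one cell and finding nothing.

    prior         -- list of probabilities, one per cell, summing to 1
    searched_cell -- index of the cell that was searched
    p_detect      -- chance of finding the target if it really is there
    """
    # Likelihood of "not found": (1 - p_detect) in the searched cell,
    # 1 everywhere else (a search there says nothing about other cells).
    unnormalized = [
        p * (1.0 - p_detect) if i == searched_cell else p
        for i, p in enumerate(prior)
    ]
    total = sum(unnormalized)
    # Bayes' rule: renormalize so the posterior sums to 1.
    return [u / total for u in unnormalized]


# Illustrative numbers only.
prior = [0.5, 0.3, 0.2]
posterior = update_after_failed_search(prior, searched_cell=0, p_detect=0.8)
```

With these made-up numbers, the searched cell drops from the most likely location to the least likely, and the second cell becomes the best place to look next.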
metron  bayes  bayesian-inference  machine-learning  statistics  via:jgc  air-france  disasters  probability  inference  searching 
march 2014 by jm
feedback loop n-gram analyzer
'a simple parser of ARF compliant FBL complaints, which normalizes the email complaints and generates a 6-tuple n-gram version of the message. These n-grams are stored in a Redis database, keyed by the file in which they can be found. An inverse index also exists that allow you to find all messages containing a particular n-gram word.'
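A rough sketch of the two-way index described there, under some loud assumptions: "6-tuple n-gram" is read here as word 6-grams, plain Python dicts stand in for Redis, and ARF parsing and email normalization are skipped entirely. The names `word_ngrams` and `NgramIndex` are hypothetical, not taken from the tool.

```python
from collections import defaultdict


def word_ngrams(text, n=6):
    """Return the set of word n-grams (as tuples) in a piece of text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


class NgramIndex:
    """In-memory stand-in for the Redis store: forward and inverse index."""

    def __init__(self, n=6):
        self.n = n
        self.by_message = {}                # message id -> set of n-grams
        self.by_ngram = defaultdict(set)    # n-gram -> set of message ids

    def add(self, message_id, text):
        grams = word_ngrams(text, self.n)
        self.by_message[message_id] = grams
        for gram in grams:
            self.by_ngram[gram].add(message_id)

    def messages_containing(self, gram):
        # Use .get so a lookup never creates an empty entry in the index.
        return set(self.by_ngram.get(tuple(gram), set()))


index = NgramIndex()
index.add("a.eml", "buy cheap pills online today only while stocks last")
index.add("b.eml", "hello buy cheap pills online today only friend")
shared = index.messages_containing(
    ["buy", "cheap", "pills", "online", "today", "only"]
)
```

Here `shared` contains both message ids, since the two made-up messages have that 6-gram in common; this overlap is the raw material for a similarity measure between complaints.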
anti-spam  spam  fbl  feedback  filtering  n-grams  similarity  hashing  redis  searching 
september 2011 by jm
Dutch grepping Facebook for welfare fraud
'The [Dutch] councils are working with a specialist Amsterdam research firm, using the type of computer software previously deployed only in counterterrorism, monitoring [LinkedIn, Facebook and Twitter] traffic for keywords and cross-referencing any suspicious information with digital lists of social welfare recipients.

Among the giveaway terms, apparently, are “holiday” and “new car”. If the automated software finds a match between one of these terms and a person claiming social welfare payments, the information is passed on to investigators to gather real-life evidence.' With a 30% false positive rate, apparently -- let's hope those investigations aren't too intrusive!
grep  dutch  holland  via:tjmcintyre  privacy  facebook  twitter  linkedin  welfare  dole  fraud  false-positives  searching 
september 2011 by jm
