jm + search   42

Google used a Baidu front-end to scrape user searches without consent
The engineers used the data they pulled from [acquired Baidu front-end site] to learn about the kinds of things that people located in mainland China routinely search for in Mandarin. This helped them to build a prototype of Dragonfly. The engineers used the sample queries from [the site], for instance, to review lists of websites Chinese people would see if they typed the same word or phrase into Google. They then used a tool they called “BeaconTower” to check whether any websites in the Google search results would be blocked by China’s internet censorship system, known as the Great Firewall. Through this process, the engineers compiled a list of thousands of banned websites, which they integrated into the Dragonfly search platform so that it would purge links to websites prohibited in China, such as those of the online encyclopedia Wikipedia and British news broadcaster BBC.

Under normal company protocol, analysis of people’s search queries is subject to tight constraints and should be reviewed by the company’s privacy staff, whose job is to safeguard user rights. But the privacy team only found out about the data access after The Intercept revealed it, and were “really pissed,” according to one Google source.
china  search  tech  google  privacy  baidu  interception  censorship  great-firewall  dragonfly 
4 weeks ago by jm
Image comparison algorithms
Awesome StackOverflow answer for detecting "similar" images -- promising approach to reimplement ffffound's similarity feature in mltshp, maybe
algorithms  hashing  comparison  diff  images  similarity  search  ffffound  mltshp 
april 2017 by jm
How Google Book Search Got Lost – Backchannel
There are plenty of other explanations for the dampening of Google’s ardor: The bad taste left from the lawsuits. The rise of shiny and exciting new ventures with more immediate payoffs. And also: the dawning realization that Scanning All The Books, however useful, might not change the world in any fundamental way.
books  reading  google  library  lawsuits  legal  scanning  book-search  search 
april 2017 by jm
Building a Regex Search Engine for DNA | Hacker News
The original post is pretty mediocre -- a search engine which handles a corpus of "thousands" of plasmids from "a scientist's personal library", and which doesn't handle fuzzy matches? I think that's called grep -- but the HN comments are good
grep  regular-expressions  hacker-news  strings  dna  genomics  search  elasticsearch 
april 2016 by jm
The Bkd Tree
good explanation of this new data structure for searching multidimensional data
search  lucene  bkd-trees  searching  data-structures 
january 2016 by jm
User data plundering by Android and iOS apps is as rampant as you suspected
One app, meanwhile, sent the medical search terms "herpes" and "interferon" to five domains, although those domains didn't receive other personal information.
privacy  security  google  tracking  mobile  phones  search  pii 
november 2015 by jm
Elasticsearch and data loss
"@alexbfree @ThijsFeryn [ElasticSearch is] fine as long as data loss is acceptable. We lose ~1% of all writes on average."
elasticsearch  data-loss  reliability  data  search  aphyr  jepsen  testing  distributed-systems  ops 
october 2015 by jm
SQL on Kafka using PipelineDB
this is quite nice. PipelineDB allows direct hookup of a Kafka stream, and will ingest durably and reliably, and provide SQL views computed over a sliding window of the stream.
logging  sql  kafka  pipelinedb  streaming  sliding-window  databases  search  querying 
september 2015 by jm
Choco is [FOSS] dedicated to Constraint Programming. It is a Java library written under BSD license. It aims at describing hard combinatorial problems in the form of Constraint Satisfaction Problems and solving them with Constraint Programming techniques. The user models its problem in a declarative way by stating the set of constraints that need to be satisfied in every solution. Then, Choco solves the problem by alternating constraint filtering algorithms with a search mechanism. [...]
Choco is among the fastest CP solvers on the market. In 2013 and 2014, Choco has been awarded many medals at the MiniZinc challenge that is the world-wide competition of constraint-programming solvers.
choco  constraint-programming  solving  search  combinatorial  algorithms 
august 2015 by jm
Google Flights
oh look, Google has a flight search engine! I had no idea
google  flights  travel  search  holidays 
july 2015 by jm
'Simplistic interactive filtering tool' -- live incremental-search filtering in a terminal window
cli  shell  terminal  tools  go  peco  interactive  incremental-search  search  ui  unix 
june 2015 by jm
Levenshtein automata can be simple and fast
Nice algorithm for fuzzy text search with a limited Levenshtein edit distance using a DFA
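A rough sketch of the idea, not the linked post's code: the automaton state is one row of the edit-distance table, capped at max_edits + 1 so that only finitely many states exist (which is what lets it be treated as a DFA). The names `LevenshteinAutomaton` and `fuzzy_match` are made up for illustration.

```python
class LevenshteinAutomaton:
    """Steps one row of the edit-distance table per input character."""

    def __init__(self, pattern: str, max_edits: int):
        self.pattern = pattern
        self.max_edits = max_edits

    def start(self):
        # Row for empty input: i deletions are needed to reach pattern[:i].
        cap = self.max_edits + 1
        return tuple(min(i, cap) for i in range(len(self.pattern) + 1))

    def step(self, state, ch):
        new = [state[0] + 1]
        for i, pattern_ch in enumerate(self.pattern):
            cost = 0 if pattern_ch == ch else 1
            new.append(min(new[i] + 1,         # insertion
                           state[i] + cost,    # match or substitution
                           state[i + 1] + 1))  # deletion
        # Cap every entry so the number of distinct states stays finite.
        cap = self.max_edits + 1
        return tuple(min(v, cap) for v in new)

    def is_match(self, state):
        return state[-1] <= self.max_edits

    def can_match(self, state):
        # If every entry exceeds the budget, no continuation can match.
        return min(state) <= self.max_edits


def fuzzy_match(pattern: str, word: str, max_edits: int) -> bool:
    automaton = LevenshteinAutomaton(pattern, max_edits)
    state = automaton.start()
    for ch in word:
        state = automaton.step(state, ch)
        if not automaton.can_match(state):
            return False  # dead state: bail out early
    return automaton.is_match(state)
```

The early exit on a dead state is what makes this useful against a sorted term dictionary: whole prefixes can be skipped once the automaton can no longer match.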
dfa  algorithms  levenshtein  text  edit-distance  fuzzy-search  search  python 
june 2015 by jm
Memory Layouts for Binary Search
Key takeaway:
Nearly universally, B-trees win when the data gets big enough.
caches  cpu  performance  optimization  memory  binary-search  b-trees  algorithms  search  memory-layout 
may 2015 by jm
Your Google Algorithm Cheat Sheet: Panda, Penguin, and Hummingbird
Interesting that GOOG are still doing these big-bang releases -- I guess crunching the data to come up with new weights/rules is a heavyweight, time-consuming process
google  search  ranking  releases  panda  penguin  hummingbird  weighting 
may 2015 by jm
Sirius: An open end-to-end voice and vision personal assistant and its implications for future warehouse scale computers
How to build an Intelligent Personal Assistant:

'Sirius is an open end-to-end standalone speech and vision based intelligent personal assistant (IPA) similar to Apple’s Siri, Google’s Google Now, Microsoft’s Cortana, and Amazon’s Echo. Sirius implements the core functionalities of an IPA including speech recognition, image matching, natural language processing and a question-and-answer system. Sirius is developed by Clarity Lab at the University of Michigan. Sirius is published at the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 2015.'
sirius  siri  cortana  google-now  echo  ok-google  ipa  assistants  search  video  audio  speech  papers  clarity  nlp  wikipedia 
april 2015 by jm
Ag: faster than Ack
Some nice performance tricks; I particularly like the use of sljit:
Ag uses Pthreads to take advantage of multiple CPU cores and search files in parallel.
Files are mmap()ed instead of read into a buffer.
Literal string searching uses Boyer-Moore strstr.
Regex searching uses PCRE's JIT compiler (if Ag is built with PCRE >=8.21).
Ag calls pcre_study() before executing the same regex on every file.
Instead of calling fnmatch() on every pattern in your ignore files, non-regex patterns are loaded into arrays and binary searched.
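The last trick can be sketched roughly as follows. This is an illustration in Python, not Ag's actual C code; `build_ignore_table` and `is_ignored` are hypothetical names, and treating any pattern without `*`, `?` or `[` as a literal is an assumption.

```python
import fnmatch
from bisect import bisect_left

GLOB_CHARS = "*?["


def build_ignore_table(patterns):
    """Split ignore patterns into a sorted list of literals and a list of globs."""
    literals = sorted(p for p in patterns if not any(c in p for c in GLOB_CHARS))
    globs = [p for p in patterns if any(c in p for c in GLOB_CHARS)]
    return literals, globs


def is_ignored(name, literals, globs):
    # Literal names: one O(log n) binary search instead of n fnmatch() calls.
    i = bisect_left(literals, name)
    if i < len(literals) and literals[i] == name:
        return True
    # Only the (usually few) real glob patterns still need fnmatch.
    return any(fnmatch.fnmatchcase(name, g) for g in globs)
```

Usage: build the table once per ignore file, then call `is_ignored(filename, literals, globs)` for each directory entry.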
jit  cli  grep  search  ack  ag  unix  pcre  sljit  boyer-moore  tools 
march 2015 by jm
A desktop app for finding and inserting GIFs into any conversation

Oh yes.
animated  gif  search  pictures  slack  ani-gif  via:bwalsh 
november 2014 by jm
Building a complete Tweet index
Twitter's new massive-scale twitter search backend. Sharding galore
architecture  search  twitter  sharding  earlybird 
november 2014 by jm
"A command-line power tool for Twitter." It really is -- much better timeline searchability than the "real" Twitter UI, for example
twitter  ruby  github  cli  tools  unix  search 
october 2014 by jm
Google forced to e-forget a company worldwide
Here we go.... Canadian company wins case to censor search results for its competitors.
When Google argued that Canadian law couldn't be applied to the entire world, the court responded by citing British Columbia's Law and Equity Act, which grants broad power for a court to issue injunctions when it's "just or convenient that the order should be made."

Google also tried to argue against the injunction on the basis of it amounting to censorship. The court responded that there are already entire categories of content that get censored, such as child abuse imagery.

Will this be the first of a new wave of requests for company website take-downs?

Via stx.
canada  via:stx  censorship  google  search  takedowns  datalink  equustek  gw1000  hardware 
june 2014 by jm
Ryanair drops out of top Google flight search results after website overhaul | Business
They've done the classic website-redesign screwup -- omitted redirects from the old URLs.
Sam Silverwood-Cope, director of Intelligent Positioning, said: "They've ignored the legacy of the old [site]. It's quite startling. They are doing it just before their busiest time of the year." A change in [URLs] without proper redirects means many results found by Google now simply return error pages, he added. "Unless redirects get put in pretty soon, the position is going to get worse and worse."
ryanair  inept  fail  funny  via:christinebohan  web  google  search  redirects 
april 2014 by jm
Efficient substring searching
This is a couple of years old, but I like this:
Turbo Boyer-Moore is disappointing, its name doesn’t do it justice. In academia constant overhead doesn’t matter, but here we see that it matters a lot in practice. Turbo Boyer-Moore’s inner loop is so complex that we think we’re better off using the original Boyer-Moore.

A good demo of how large values of O(n) can be slower than small values of O(mn).
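To make the constant-overhead point concrete, here is a sketch of Boyer-Moore-Horspool, a simplified variant of Boyer-Moore (not the original algorithm and not the code benchmarked in the linked post). Its worst case is O(mn), but the inner loop is tiny: one comparison and one table lookup per alignment. The function name is made up for illustration.

```python
def horspool_find(haystack: str, needle: str) -> int:
    """Return the index of the first occurrence of needle, or -1."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    if m > n:
        return -1
    # For each character except the needle's last, how far the window may
    # slide when that character is aligned with the window's final position.
    shift = {ch: m - 1 - i for i, ch in enumerate(needle[:-1])}
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            return pos
        # Characters absent from the needle allow a full-length skip.
        pos += shift.get(haystack[pos + m - 1], m)
    return -1
```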
algorithms  search  strings  coding  big-o  string-search  searching 
march 2014 by jm
FM-index: a compressed full-text substring index based on the Burrows-Wheeler transform, with some similarities to the suffix array. It was created by Paolo Ferragina and Giovanni Manzini, who describe it as an opportunistic data structure as it allows compression of the input text while still permitting fast substring queries. The name stands for 'Full-text index in Minute space'. It can be used to efficiently find the number of occurrences of a pattern within the compressed text, as well as locate the position of each occurrence. Both the query time and storage space requirements are sublinear with respect to the size of the input data.

kragen notes 'gene sequencing is using [them] in production'.
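A toy sketch of the counting query (backward search over the BWT). It is deliberately naive: the transform is built by sorting all rotations and the occurrence table is stored uncompressed, so it shows the query logic only, not the space savings. The function names are made up, and the text is assumed not to contain the "\0" sentinel.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (fine for toy inputs only)."""
    text += "\0"  # sentinel that sorts before every other character
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm(text: str):
    last = bwt(text)
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    # first_row[c]: number of characters in the text strictly smaller than c.
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]
    # occ[c][i]: occurrences of c in last[:i].
    occ = {ch: [0] * (len(last) + 1) for ch in counts}
    for i, ch in enumerate(last):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first_row, occ, len(last)


def count(pattern: str, fm) -> int:
    """Number of occurrences of a non-empty pattern in the indexed text."""
    first_row, occ, n = fm
    lo, hi = 0, n  # half-open range of matching rows in the sorted rotations
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + occ[ch][lo]
        hi = first_row[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

Each pattern character costs two table lookups, so the count takes time proportional to the pattern length regardless of how long the indexed text is.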
sequencing  bioinformatics  algorithms  bowtie  fm-index  indexing  compression  search  burrows-wheeler  bwt  full-text-search 
march 2014 by jm
Interview with the Github Elasticsearch Team
good background on Github's Elasticsearch scaling efforts. Some rather horrific split-brain problems under load, and crashes due to OpenJDK bugs (sounds like OpenJDK *still* isn't ready for production). painful
elasticsearch  github  search  ops  scaling  split-brain  outages  openjdk  java  jdk  jvm 
september 2013 by jm
Introducing Kale « Code as Craft
Etsy have implemented a tool to perform auto-correlation of service metrics, and detection of deviation from historic norms:
at Etsy, we really love to make graphs. We graph everything! Anywhere we can slap a StatsD call, we do. As a result, we’ve found ourselves with over a quarter million distinct metrics. That’s far too many graphs for a team of 150 engineers to watch all day long! And even if you group metrics into dashboards, that’s still an awful lot of dashboards if you want complete coverage. Of course, if a graph isn’t being watched, it might misbehave and no one would know about it. And even if someone caught it, lots of other graphs might be misbehaving in similar ways, and chances are low that folks would make the connection.

We’d like to introduce you to the Kale stack, which is our attempt to fix both of these problems. It consists of two parts: Skyline and Oculus. We first use Skyline to detect anomalous metrics. Then, we search for that metric in Oculus, to see if any other metrics look similar. At that point, we can make an informed diagnosis and hopefully fix the problem.

It'll be interesting to see if they can get this working well. I've found it can be tricky to get working with low false positives, without massive volume to "smooth out" spikes caused by normal activity. Amazon had one particularly successful version driving severity-1 order drop alarms, but it used massive event volumes and still had periodic false positives. Skyline looks like it will alarm on a single anomalous data point, and in the comments Abe notes "our algorithms err on the side of noise and so alerting would be very noisy."
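A minimal sketch of the trade-off, assuming a plain three-sigma test (this is not Skyline's actual algorithm set; the function names and thresholds are made up). `is_anomalous` fires on a single outlying point, while `should_alarm` waits for several consecutive outliers, giving up detection speed for fewer false positives.

```python
from statistics import mean, stdev


def is_anomalous(series, sigmas=3.0):
    """True if the latest point is more than `sigmas` deviations from the history."""
    history, latest = series[:-1], series[-1]
    if len(history) < 2:
        return False
    mu, sd = mean(history), stdev(history)
    return abs(latest - mu) > sigmas * sd


def should_alarm(series, sigmas=3.0, min_consecutive=3):
    """Alarm only if the last `min_consecutive` points all deviate from the baseline."""
    baseline = series[:-min_consecutive]
    recent = series[-min_consecutive:]
    if len(baseline) < 2:
        return False
    mu, sd = mean(baseline), stdev(baseline)
    return all(abs(x - mu) > sigmas * sd for x in recent)
```

With a low-volume metric the baseline deviation is itself noisy, which is the "smoothing" problem mentioned above.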
etsy  monitoring  service-metrics  alarming  deviation  correlation  data  search  graphs  oculus  skyline  kale  false-positives 
june 2013 by jm
Lucene 4 - Revisiting Problems For Speed [slides]
A presentation from Simon Willnauer on optimization work performed on Lucene in 2011. The most interesting part is the work done to replace an O(n^2) FuzzyQuery fuzzy-match algorithm with an FSM trie, which is extremely cool -- benchmarked at 214 times faster!
benchmarks  slides  lucene  search  fuzzy-matching  text-matching  strings  algorithms  coding  fsm  tries 
april 2013 by jm
FastBit: An Efficient Compressed Bitmap Index Technology
an [LGPL] open-source data processing library following the spirit of NoSQL movement. It offers a set of searching functions supported by compressed bitmap indexes. It treats user data in the column-oriented manner similar to well-known database management systems such as Sybase IQ, MonetDB, and Vertica. It is designed to accelerate user's data selection tasks without imposing undue requirements. In particular, the user data is NOT required to be under the control of FastBit software, which allows the user to continue to use their existing data analysis tools.

The key technology underlying the FastBit software is a set of compressed bitmap indexes. In database systems, an index is a data structure to accelerate data accesses and reduce the query response time. Most of the commonly used indexes are variants of the B-tree, such as B+-tree and B*-tree. FastBit implements a set of alternative indexes called compressed bitmap indexes. Compared with B-tree variants, these indexes provide very efficient searching and retrieval operations, but are somewhat slower to update after a modification of an individual record.

A key innovation in FastBit is the Word-Aligned Hybrid compression (WAH) for the bitmaps.[...] Another innovation in FastBit is the multi-level bitmap encoding methods.
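A toy illustration of an equality-encoded bitmap index, using Python integers as uncompressed bitmaps. It is not FastBit's API and has no WAH compression; `BitmapIndex` and `rows` are hypothetical names. Queries across columns become bitwise AND/OR, while changing one record means touching the bitmaps of both the old and the new value.

```python
class BitmapIndex:
    """One bitmap per distinct column value; bit i is set if row i has that value."""

    def __init__(self, values):
        self.size = len(values)
        self.bitmaps = {}
        for row, value in enumerate(values):
            self.bitmaps[value] = self.bitmaps.get(value, 0) | (1 << row)

    def equals(self, value) -> int:
        # Unknown values simply match no rows.
        return self.bitmaps.get(value, 0)


def rows(bitmap: int):
    """Expand a bitmap into the list of matching row numbers."""
    out, row = [], 0
    while bitmap:
        if bitmap & 1:
            out.append(row)
        bitmap >>= 1
        row += 1
    return out
```

Usage, with made-up columns: `rows(colour.equals("red") & size.equals("M"))` returns the rows that are both red and medium.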
fastbit  nosql  algorithms  indexing  search  compressed-bitmaps  indexes  wah  bitmaps  compression 
april 2013 by jm
DuckDuckGo Architecture - 1 Million Deep Searches a Day and Growing
thumbs-up for DNSMadeEasy's Global Traffic Director anycast-based geographically-segmented DNS service, in particular
dns  architecture  scalability  search  duckduckgo  geoip  anycast 
january 2013 by jm
'Our goal is to create the world's fastest extendable, non-transactional time series database for big data (you know, for kids)! Log file indexing is our initial focus. For example append only ASCII files produced by libraries like Log4J, or containing FIX messages or JSON objects. Occursions was built by a small team sick of creating hacks to remotely copy and/or grep through tons of large log files. We use it to index around a terabyte of new log data per day. Occursions asynchronously tails log files and indexes the individual lines in each log file as each line is written to disk so you don't even have to wait for a second after an event happens to search for it. Occursions uses custom disk backed data structures to create and search its indexes so it is very efficient at using CPU, memory and disk.'
logs  search  tsd  big-data  log4j  via:proggit 
march 2012 by jm
Near Neighbor Search in High Dimensional Data [PDF]
Detect near-duplicates; would be good for future Razor-like efficient near-duplicate detection. (slides)
slides  algorithms  email  performance  programming  near-neighbour-search  search 
february 2012 by jm
Turbocharging Solr Index Replication with BitTorrent
Etsy now replicating their multi-GB search index across the search farm using BitTorrent. Why not Multicast? 'multicast rsync caused an epic failure for our network, killing the entire site for several minutes. The multicast traffic saturated the CPU on our core switches causing all of Etsy to be unreachable.' fun!
etsy  multicast  sev1  bittorrent  search  solr  rsync  scaling  outages 
february 2012 by jm
The first Irish case on defamation via autocomplete
Google Instant has picked up people searching for 'Ballymascanlon hotel receivership' and is now offering this as an autocomplete option -- cue defamation lawsuit. Defamation via machine learning
machine-learning  defamation  google  google-instant  search  ballymascanlon  hotels  autocomplete  law-enforcement 
june 2011 by jm
Lucene Utilities and Bloom Filters - Greplin:tech
'Storing 50,000 2.5KB items in a traditional hash set requires over 125MB, but if you're willing to accept a 1-in-10,000 false positive rate on lookups, [this] bloom filter requires under 500KB' - interesting variation on the basic concept.  Java, Apache-licensed
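For scale, the textbook sizing formula for a plain Bloom filter (not Greplin's variation, whose internals aren't described here) gives about 958,506 bits, roughly 120 KB, and 13 hash functions for 50,000 items at a 1-in-10,000 false-positive rate, which is comfortably under the quoted 500KB. A sketch, with the class name and the SHA-256 double-hashing scheme as assumptions:

```python
import hashlib
import math


def bloom_parameters(n_items: int, fp_rate: float):
    """Optimal bit count and hash count for a standard Bloom filter."""
    bits = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / n_items * math.log(2)))
    return bits, hashes


class BloomFilter:
    def __init__(self, n_items: int, fp_rate: float):
        self.bits, self.hashes = bloom_parameters(n_items, fp_rate)
        self.array = bytearray((self.bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing: derive all probe positions from one SHA-256 digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.bits for i in range(self.hashes)]

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.array[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.array[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

The memory cost depends only on the item count and the error rate, never on the 2.5KB item size, which is where the saving over a hash set comes from.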
search  bloom-filters  greplin  open-source  apache  false-positives  from delicious
april 2011 by jm
extensive. the NSFW words that Google Instant won't search for (via Waxy)
nsfw  censorship  filtering  google  keywords  search  blacklist  google-instant  from delicious
september 2010 by jm
Interpolation search
neat search algo, via Jeremy Zawodny; can be more efficient than binary search (O(log log n) on average, given uniformly distributed keys) for indexed, ordered arrays, at the cost of more computation per iteration
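A quick sketch for a sorted list of integers (function name made up). Instead of probing the midpoint, it estimates where the target should sit by linear interpolation between the endpoints, which is the extra arithmetic per iteration.

```python
def interpolation_search(arr, target):
    """Return an index of target in the sorted list arr, or -1 if absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi and arr[lo] <= target <= arr[hi]:
        if arr[hi] == arr[lo]:
            # All remaining keys are equal; avoid dividing by zero.
            return lo if arr[lo] == target else -1
        # Guess the position from the target's value relative to the endpoints.
        pos = lo + (target - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
        if arr[pos] == target:
            return pos
        if arr[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1
```

On badly skewed key distributions the guesses can be poor and the search can degrade to O(n), so binary search remains the safer default.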
algorithms  programming  search  via:jzawodny  from delicious
july 2010 by jm
A fast, fuzzy, full-text index using Redis
quite easy, using a Metaphone sound-like indexing scheme to provide the fuzz
metaphone  sounds-like  indexing  python  redis  search  full-text  fuzzy  from delicious
may 2010 by jm
Search results for on Delicious
wow, you can search a time period for everyone who bookmarked pages on a specific site (via Britta)
delicious  search  nifty  tools  egosurfing  via:britta  from delicious
february 2010 by jm
nifty; Apache-licensed distributed, RESTful, JSON-over-HTTP, schemaless search server with multi-tenancy
search  distributed  rest  json  apache  elasticsearch  http  from delicious
february 2010 by jm
