
The Case for Learned Index Structures
'Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show that, by using neural nets, we are able to outperform cache-optimized B-Trees by up to 70% in speed while saving an order of magnitude in memory over several real-world data sets. More importantly though, we believe that the idea of replacing core components of a data management system with learned models has far-reaching implications for future systems designs and that this work just provides a glimpse of what might be possible.'
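To make the "index as model" premise concrete, here is a minimal sketch, not the paper's actual architecture (which uses a recursive hierarchy of models): a single linear model fit over the whole key set predicts a record's position in the sorted array, and the worst-case prediction error recorded at build time bounds a local binary search so lookups stay exact.

```python
# Hypothetical one-stage "learned index": fit position ~ a*key + b over the
# full sorted key set, then correct the prediction with a bounded search.
from bisect import bisect_left

class LearnedIndex:
    def __init__(self, keys):
        self.keys = keys  # must be sorted
        n = len(keys)
        # Least-squares fit of position against key, trained on all keys.
        mean_k = sum(keys) / n
        mean_p = (n - 1) / 2
        var_k = sum((k - mean_k) ** 2 for k in keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
        self.a = cov / var_k if var_k else 0.0
        self.b = mean_p - self.a * mean_k
        # Record the worst-case prediction error; it bounds every lookup.
        self.max_err = max(abs(self._predict(k) - i) for i, k in enumerate(keys))

    def _predict(self, key):
        return min(max(int(self.a * key + self.b), 0), len(self.keys) - 1)

    def lookup(self, key):
        # Binary-search only within [prediction - max_err, prediction + max_err].
        p = self._predict(key)
        lo = max(p - self.max_err, 0)
        hi = min(p + self.max_err + 1, len(self.keys))
        i = bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None
```

The model's footprint is two floats plus an error bound, versus a B-Tree's inner nodes, which is where the memory savings come from; the trade-off, as Henry Robinson notes below, is that this is trained on the entire static dataset and supports no cheap updates.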

Excellent follow-up thread from Henry Robinson:

'The fact that the learned representation is more compact is very neat. But also it's not really a surprise that, given the entire dataset, we can construct a more compact function than a B-tree which is *designed* to support efficient updates.' [...] 'given that the model performs best when trained on the whole data set - I strongly doubt B-trees are the best we can do with the current state-of-the art.'
data-structures  ml  google  b-trees  storage  indexes  deep-learning  henry-robinson 
5 weeks ago by jm
Henry Robinson on testing and fault discovery in distributed systems

'Let's talk about finding bugs in distributed systems for a bit.
These chaos monkey-style fault testing systems are all well and good, but by being application-independent they're a very blunt instrument.
In particular, they make it hard to search the fault space for bugs in a directed manner, because they don't 'know' what the system is doing.
Application-aware scripting of faults in distributed systems seems to be rarely used, but it allows you to directly stress problem areas.
For example, if a bug manifests itself only when one RPC returns after some timeout, it's hard to narrow that down with iptables manipulation.
But allow a script to hook into RPC invocations (and other trace points, like DTrace's probes), and you can script very specific faults.
That way you can simulate cross-system integration failures, *and* write reproducible tests for the bugs they expose!
Anyhow, I've been doing this in Impala, and it's been very helpful. Haven't seen much evidence elsewhere.'
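The hook-based approach Robinson describes can be sketched as follows. This is an illustrative toy, not Impala's actual mechanism: every RPC invocation passes through a fault injector first, so a test script can target one specific method with a delay or error rather than blasting the whole network with iptables. All names here (`FaultInjector`, `rpc_call`) are hypothetical.

```python
# Hypothetical application-aware fault injection: tests register scripted
# faults keyed by RPC method name, and the RPC layer fires them on invocation.
import time

class FaultInjector:
    def __init__(self):
        self.rules = []  # (method_name, fault_fn) pairs registered by a test

    def on(self, method, fault_fn):
        self.rules.append((method, fault_fn))

    def fire(self, method):
        # Trace point: run every scripted fault registered for this method.
        for m, fault in self.rules:
            if m == method:
                fault()

faults = FaultInjector()

def rpc_call(method, handler, *args):
    # Every RPC passes through the injector before the real handler runs,
    # so a fault is tied to a specific call site, not the network layer.
    faults.fire(method)
    return handler(*args)

# A test can now reproduce "this RPC returns only after a timeout":
# faults.on("FetchResults", lambda: time.sleep(0.05))
```

Because the fault is scripted at a named trace point, the same registration reproduces the failure deterministically, which is what makes the bugs it exposes testable.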
henry-robinson  testing  fault-discovery  rpc  dtrace  tracing  distributed-systems  timeouts  chaos-monkey  impala 
september 2015 by jm
