jm + data-structures   72

US immigration asking tech interview trivia questions now
what the absolute fuck. Celestine Omin on Twitter: "I was just asked to balance a Binary Search Tree by JFK's airport immigration. Welcome to America."
twitter  celestine-omin  us-politics  immigration  tests  interviews  bst  trees  data-structures  algorithms 
4 weeks ago by jm
tdunning/t-digest
A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. The t-digest algorithm is also very parallel friendly making it useful in map-reduce and parallel streaming applications.

The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to product a data structure that is related to the Q-digest. This t-digest data structure can be used to estimate quantiles or compute other rank statistics. The advantage of the t-digest over the Q-digest is that the t-digest can handle floating point values while the Q-digest is limited to integers. With small changes, the t-digest can handle any values from any ordered set that has something akin to a mean. The accuracy of quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by Q-digests in spite of the fact that t-digests are more compact when stored on disk.


Super-nice feature is that it's mergeable, so amenable to parallel usage across multiple hosts if required. Java implementation, ASL licensing.
data-structures  algorithms  java  t-digest  statistics  quantiles  percentiles  aggregation  digests  estimation  ranking 
december 2016 by jm
Highly Available Counters Using Cassandra
solid discussion of building HA counters using CRDTs and similar eventually-consistent data structures
crdts  algorithms  data-structures  cassandra  ha  counters 
september 2016 by jm
Ratas - A hierarchical timer wheel
excellent explanation and benchmarks of a timer wheel implementation
timer-wheels  timing-wheels  algorithms  c  linux  timers  data-structures 
august 2016 by jm
BTrDB: Optimizing Storage System Design for Timeseries Processing
interesting, although they punt to Ceph for storage and miss out the chance to make a CRDT
storage  trees  data-structures  timeseries  delta-delta-coding  encoding  deltas 
may 2016 by jm
Darts, Dice, and Coins
Earlier this year, I asked a question on Stack Overflow about a data structure for loaded dice. Specifically, I was interested in answering this question: "You are given an n-sided die where side i has probability pi of being rolled. What is the most efficient data structure for simulating rolls of the die?"

This data structure could be used for many purposes. For starters, you could use it to simulate rolls of a fair, six-sided die by assigning probability 1616 to each of the sides of the die, or a to simulate a fair coin by simulating a two-sided die where each side has probability 1212 of coming up. You could also use this data structure to directly simulate the total of two fair six-sided dice being thrown by having an 11-sided die (whose faces were 2, 3, 4, ..., 12), where each side was appropriately weighted with the probability that this total would show if you used two fair dice. However, you could also use this data structure to simulate loaded dice. For example, if you were playing craps with dice that you knew weren't perfectly fair, you might use the data structure to simulate many rolls of the dice to see what the optimal strategy would be. You could also consider simulating an imperfect roulette wheel in the same way.

Outside the domain of game-playing, you could also use this data structure in robotics simulations where sensors have known failure rates. For example, if a range sensor has a 95% chance of giving the right value back, a 4% chance of giving back a value that's too small, and a 1% chance of handing back a value that's too large, you could use this data structure to simulate readings from the sensor by generating a random outcome and simulating the sensor reading in that case.

The answer I received on Stack Overflow impressed me for two reasons. First, the solution pointed me at a powerful technique called the alias method that, under certain reasonable assumptions about the machine model, is capable of simulating rolls of the die in O(1)O(1) time after a simple preprocessing step. Second, and perhaps more surprisingly, this algorithm has been known for decades, but I had not once encountered it! Considering how much processing time is dedicated to simulation, I would have expected this technique to be better- known. A few quick Google searches turned up a wealth of information on the technique, but I couldn't find a single site that compiled together the intuition and explanation behind the technique.


(via Marc Brooker)
via:marcbrooker  algorithms  probability  algorithm  coding  data-structures  alias  dice  random 
april 2016 by jm
Elias gamma coding
'used most commonly when coding integers whose upper-bound cannot be determined beforehand.'
data-structures  algorithms  elias-gamma-coding  encoding  coding  numbers  integers 
april 2016 by jm
Hashed Wheel Timer
nice java impl of this efficient data structure, broken out from Project Reactor
scalability  java  timers  hashed-wheel-timers  algorithms  data-structures 
march 2016 by jm
Conversant ConcurrentQueue and Disruptor BlockingQueue
'Disruptor is the highest performing intra-thread transfer mechanism available in Java. Conversant Disruptor is the highest performing implementation of this type of ring buffer queue because it has almost no overhead and it exploits a particularly simple design.

Conversant has been using this in production since 2012 and the performance is excellent. The BlockingQueue implementation is very stable, although we continue to tune and improve it. The latest release, 1.2.4, is 100% production ready.

Although we have been working on it for a long time, we decided to open source our BlockingQueue this year to contribute something back to the community. ... its a drop in for BlockingQueue, so its a very easy test. Conversant Disruptor will crush ArrayBlockingQueue and LinkedTransferQueue for thread to thread transfers.

In our system, we noticed a 10-20% reduction in overall system load and latency when we introduced it.'
disruptor  blocking-queues  queues  queueing  data-structures  algorithms  java  conversant  concurrency  performance 
march 2016 by jm
The Bkd Tree
good explanation of this new data structure for searching multidimensional data
search  lucene  bkd-trees  searching  data-structures 
january 2016 by jm
The New InfluxDB Storage Engine: A Time Structured Merge Tree
The new engine has similarities with LSM Trees (like LevelDB and Cassandra’s underlying storage). It has a write ahead log, index files that are read only, and it occasionally performs compactions to combine index files. We’re calling it a Time Structured Merge Tree because the index files keep contiguous blocks of time and the compactions merge those blocks into larger blocks of time. Compression of the data improves as the index files are compacted. Once a shard becomes cold for writes it will be compacted into as few files as possible, which yield the best compression.
influxdb  storage  lsm-trees  leveldb  tsm-trees  data-structures  algorithms  time-series  tsd  compression 
october 2015 by jm
qp tries: smaller and faster than crit-bit tries
interesting new data structure from Tony Finch. "Some simple benchmarks say qp tries have about 1/3 less memory overhead and are about 10% faster than crit-bit tries."
crit-bit  popcount  bits  bitmaps  tries  data-structures  via:fanf  qp-tries  crit-bit-tries  hacks  memory 
october 2015 by jm
Large Java HashMap performance overview
Large HashMap overview: JDK, FastUtil, Goldman Sachs, HPPC, Koloboke, Trove – January 2015 version
java  performance  hashmap  hashmaps  optimization  fastutil  hppc  jdk  koloboke  trove  data-structures 
september 2015 by jm
Algorithmist
The Algorithmist is a resource dedicated to anything algorithms - from the practical realm, to the theoretical realm. There are also links and explanation to problemsets.


A wiki for algorithms. Not sure if this is likely to improve on Wikipedia, which of course covers the same subject matter quite well, though
algorithms  reference  wikis  coding  data-structures 
september 2015 by jm
Mining High-Speed Data Streams: The Hoeffding Tree Algorithm
This paper proposes a decision tree learner for data streams, the Hoeffding Tree algorithm, which comes with the guarantee that the learned decision tree is asymptotically nearly identical to that of a non-incremental learner using infinitely many examples. This work constitutes a significant step in developing methodology suitable for modern ‘big data’ challenges and has initiated a lot of follow-up research. The Hoeffding Tree algorithm has been covered in various textbooks and is available in several public domain tools, including the WEKA Data Mining platform.
hoeffding-tree  algorithms  data-structures  streaming  streams  cep  decision-trees  ml  learning  papers 
august 2015 by jm
MappedBus
a Java based low latency, high throughput message bus, built on top of a memory mapped file; inspired by Java Chronicle with the main difference that it's designed to efficiently support multiple writers – enabling use cases where the order of messages produced by multiple processes are important. MappedBus can be also described as an efficient IPC mechanism which enable several Java programs to communicate by exchanging messages.
ipc  java  jvm  mappedbus  low-latency  mmap  message-bus  data-structures  queue  message-passing 
may 2015 by jm
Five Takeaways on the State of Natural Language Processing
Good overview of the state of the art in NLP nowadays. I particularly like word2vec interesting:
Embedding words as real-numbered vectors using a skip-gram, negative-sampling model (word2vec code) was mentioned in nearly every talk I attended. Either companies are using various word2vec implementations directly or they are building diffs off of the basic framework. Trained on large corpora, the vector representations encode concepts in a large dimensional space (usually 200-300 dim).


Quite similar to some tokenization approaches we experimented with in SpamAssassin, so I don't find this too surprising....
word2vec  nlp  tokenization  machine-learning  language  parsing  doc2vec  skip-grams  data-structures  feature-extraction  via:lemonodor 
may 2015 by jm
hyperlogsandwich
A probabilistic data structure for frequency/k-occurrence cardinality estimation of multisets. Sample implementation


(via Patrick McFadin)
via:patrickmcfadin  hyperloglog  cardinality  data-structures  algorithms  hyperlogsandwich  counting  estimation  lossy  multisets 
may 2015 by jm
Rob Pike's 5 rules of optimization
these are great. I've run into rule #3 ("fancy algorithms are slow when n is small, and n is usually small") several times...
twitter  rob-pike  via:igrigorik  coding  rules  laws  optimization  performance  algorithms  data-structures  aphorisms 
april 2015 by jm
ben-manes/caffeine
'Caffeine is a Java 8 based concurrency library that provides specialized data structures, such as a high performance cache.'
cache  java8  java  guava  caching  concurrency  data-structures  coding 
march 2015 by jm
"Cuckoo Filter: Practically Better Than Bloom"
'We propose a new data structure called the cuckoo filter that can replace Bloom filters for approximate set membership
tests. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than
Bloom filters. For applications that store many items and target moderately low false positive rates, cuckoo filters have
lower space overhead than space-optimized Bloom filters. Our experimental results also show that cuckoo filters outperform previous data structures that extend Bloom filters to support deletions substantially in both time and space.'
algorithms  paper  bloom-filters  cuckoo-filters  cuckoo-hashing  data-structures  false-positives  big-data  probabilistic  hashing  set-membership  approximation 
march 2015 by jm
JCTools
Java Concurrency Tools for the JVM. This project aims to offer some concurrent data structures currently missing from the JDK:

Bounded lock free queues
SPSC/MPSC/SPMC/MPMC variations for concurrent queues
Alternative interfaces for queues (experimental)
Offheap concurrent ring buffer for ITC/IPC purposes (experimental)
Executor (planned)
concurrency  lock-free  data-structures  queues  jvm  java 
january 2015 by jm
AWS re:Invent 2014 | (SPOT302) Under the Covers of AWS: Its Core Distributed Systems - YouTube
This is a really solid talk -- not surprising, alv@ is one of the speakers!
"AWS and Amazon.com operate some of the world's largest distributed systems infrastructure and applications. In our past 18 years of operating this infrastructure, we have come to realize that building such large distributed systems to meet the durability, reliability, scalability, and performance needs of AWS requires us to build our services using a few common distributed systems primitives. Examples of these primitives include a reliable method to build consensus in a distributed system, reliable and scalable key-value store, infrastructure for a transactional logging system, scalable database query layers using both NoSQL and SQL APIs, and a system for scalable and elastic compute infrastructure.

In this session, we discuss some of the solutions that we employ in building these primitives and our lessons in operating these systems. We also cover the history of some of these primitives -- DHTs, transactional logging, materialized views and various other deep distributed systems concepts; how their design evolved over time; and how we continue to scale them to AWS. "


Slides: http://www.slideshare.net/AmazonWebServices/spot302-under-the-covers-of-aws-core-distributed-systems-primitives-that-power-our-platform-aws-reinvent-2014
scale  scaling  aws  amazon  dht  logging  data-structures  distcomp  via:marc-brooker  dynamodb  s3 
november 2014 by jm
Roaring Bitmaps
Bitsets, also called bitmaps, are commonly used as fast data structures. Unfortunately, they can use too much memory. To compensate, we often use compressed bitmaps. Roaring bitmaps are compressed bitmaps which tend to outperform conventional compressed bitmaps such as WAH, EWAH or Concise. In some instances, they can be hundreds of times faster and they often offer significantly better compression.

Roaring bitmaps are used in Apache Lucene (as of version 5.0 using an independent implementation) and Apache Spark (as of version 1.2).
bitmaps  bitsets  sets  data-structures  bits  compression  lucene  spark  daniel-lemire  algorithms 
november 2014 by jm
WriterReaderPhaser
A nice new concurrency primitive from Gil Tene:
Have you ever had a need for logging or analyzing data that is actively being updated? Have you ever wanted to do that without stalling the writers (recorders) in any way? If so, then WriterReaderPhaser is for you.  I'm not talking about logging messages or text lines here.  I'm talking about data.  Data larger than one word of memory.  Data that holds actual interesting state. Data that keeps being updated, but needs to be viewed in a stable and coherent way for analysis or logging.  Data like frame buffers. Data like histograms.  Data like usage counts. Data that changes.


see also Left-Right: http://concurrencyfreaks.blogspot.ie/2013/12/left-right-concurrency-control.html
phasers  data-structures  concurrency  primitives  algorithms  performance  wait-free 
november 2014 by jm
Netty: Using as a generic library
Some cool stuff that comes along with Netty: an improved ByteBuffer, a thread-local object pool, a hashed-wheel Timer, and some nice mechanical-sympathy utils.
mechanical-sympathy  netty  java  bytebuffer  object-pools  data-structures  hashed-wheel-timer  algorithms  timers 
november 2014 by jm
RICON 2014: CRDTs
Carlos Baquero presents several operation, state-based CRDTs for use in AP systems like Voldemort and Riak
ap  cap-theorem  crdts  ricon  carlos-baquero  data-structures  distcomp 
october 2014 by jm
Cuckoo Filters
'In many networking systems, Bloom filters are used for high-speed set membership tests. They permit a small fraction of false positive answers with very good space efficiency. However, they do not permit deletion of items from the set, and previous attempts to extend “standard” Bloom filters to support deletion all degrade either space or performance. We propose a new data structure called the cuckoo filter that can replace Bloom filters for approximate set member- ship tests. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than Bloom filters. For applications that store many items and target moderately low false positive rates, cuckoo filters have lower space overhead than space-optimized Bloom filters. Our experimental results also show that cuckoo filters out-perform previous data structures that extend Bloom filters to support deletions substantially in both time and space.'
algorithms  cs  coding  cuckoo-filters  bloom-filters  sets  data-structures 
october 2014 by jm
"Invertible Bloom Lookup Tables" [paper]
'We present a version of the Bloom filter data structure that supports not only the insertion, deletion, and lookup of key-value pairs, but also allows a complete listing of the pairs it contains with high probability, as long the number of key- value pairs is below a designed threshold. Our structure allows the number of key-value pairs to greatly exceed this threshold during normal operation. Exceeding the threshold simply temporarily prevents content listing and reduces the probability of a successful lookup. If entries are later deleted to return the structure below the threshold, everything again functions appropriately. We also show that simple variations of our structure are robust to certain standard errors, such as the deletion of a key without a corresponding insertion or the insertion of two distinct values for a key. The properties of our structure make it suitable for several applications, including database and networking applications that we highlight.'
iblt  bloom-filters  data-structures  performance  algorithms  coding  papers  probabilistic 
september 2014 by jm
OpenHFT/hftc · GitHub
This is a yet another Java collections library of primitive specializations. Java 6+. Apache 2.0 license. Currently only hash sets and hash maps are implemented.
openhft  performance  java  jvm  collections  asl  hashsets  hashmaps  data-structures 
august 2014 by jm
"Replicated abstract data types: Building blocks for collaborative applications"
cited at https://news.ycombinator.com/item?id=7737423 as 'one of my favorite papers on CRDTs and provides practical pseudocode for learning how to implement CRDTs yourself', in a discussion on cemerick's "Distributed Systems and the End of the API": http://writings.quilt.org/2014/05/12/distributed-systems-and-the-end-of-the-api/
distcomp  networking  distributed  crdts  algorithms  text  data-structures  cap 
may 2014 by jm
"A New Data Structure For Cumulative Frequency Tables"
paper by Peter M Fenwick, 1993. 'A new method (the ‘binary indexed tree’) is presented for maintaining the cumulative frequencies which are needed to support dynamic arithmetic data compression. It is based on a decomposition of the cumulative frequencies into portions which parallel the binary representation of the index of the table element (or symbol). The operations to traverse the data structure are based on the binary coding of the index. In comparison with previous methods, the binary indexed tree is faster, using more compact data and simpler code. The access time for all operations is either constant or proportional to the logarithm of the table size. In conjunction with the compact data structure, this makes the new method particularly suitable for large symbol alphabets.'

via Jakob Buchgraber, who's implementing it right now in Netty ;)
netty  frequency-tables  data-structures  algorithms  coding  binary-tree  indexing  compression  symbol-alphabets 
may 2014 by jm
Notes On Concurrent Ring Buffer Queue Mechanics
great notes from Nitsan Wakart, who's been hacking on ringbuffers a lot in JAQ
jaq  nitsanw  atomic  concurrency  data-structures  ring-buffers  queueing  queues  algorithms 
april 2014 by jm
[#1259] Add optimized queue for SCMP pattern and use it in NIO and nativ... · 6efac61 · netty/netty
Interesting -- Netty has imported an optimized ASL2-licensed MPSC queue implementation from Akka (presumably for performance raisins)
performance  optimization  open-source  mpsc  queues  data-structures  netty  akka  java 
march 2014 by jm
Sux
Some basic succinct data structures. [...] The main highlights are:
a novel, broadword-based implementation of rank/select queries for up to 264 bits that is highly competitive with known 32-bit implementations on 64-bit architectures (additional space required is 25% for ranking and 12.5%-37.5% for selection);
several Java structures using the Elias–Fano representation of monotone sequences for storing pointers, variable-length bit arrays, etc.
Java code implementing minimal perfect hashing using around 2.68 bits per element (also using some broadword ideas);
a few Java implementations of monotone minimal perfect hashing.
Sux is free software distributed under the GNU Lesser General Public License.
sux  succinct  data-structures  bits  compression  space  coding 
january 2014 by jm
Mathematical Purity in Distributed Systems: CRDTs Without Fear
Via Tony Finch. Funnily enough, the example describes Swrve: mobile game analytics, backed by a CRDT-based eventually consistent data store ;)
storage  crdts  semilattice  idempotency  commutativity  data-structures  distcomp  eventual-consistency 
january 2014 by jm
HdrHistogram by giltene
A Histogram that supports recording and analyzing sampled data value counts across a configurable integer value range with configurable value precision within the range. Value precision is expressed as the number of significant digits in the value recording, and provides control over value quantization behavior across the value range and the subsequent value resolution at any given level.
hdr  histogram  data-structures  coding  gil-tene  sampling  measuring 
october 2013 by jm
java.util.stream.SpinedBuffer
interesting new data structure, pending addition in Java 8. Basically an array of arrays which presents the API of a single List.
An ordered collection of elements. Elements can be added, but not removed. Goes through a building phase, during which elements can be added, and a traversal phase, during which elements can be traversed in order but no further modifications are possible.
spinedbuffer  data-structures  algorithms  java  jdk  jvm  java-8  arrays  lists 
october 2013 by jm
SPSC revisited part III - FastFlow + Sparse Data
holy moly. This is some heavily-optimized mechanical-sympathy Java code. By using a sparse data structure, cache-aligned fields, and wait-free low-level CAS concurrency primitives via sun.misc.Unsafe, a single-producer/single-consumer queue implementation goes pretty damn fast compared to the current state of the art
nitsanw  optimization  concurrency  java  jvm  cas  spsc  queues  data-structures  algorithms 
october 2013 by jm
_An Improved Construction For Counting Bloom Filters_
'A counting Bloom filter (CBF) generalizes a Bloom filter data structure so as to allow membership queries on a set that can be changing dynamically via insertions and deletions. As with a Bloom filter, a CBF obtains space savings by allowing false positives. We provide a simple hashing-based alternative based on d-left hashing called a d-left CBF (dlCBF). The dlCBF offers the same functionality as a CBF, but uses less space, generally saving a factor of two or more. We describe the construction of dlCBFs, provide an analysis, and demonstrate their effectiveness experimentally'
bloom-filter  data-structures  algorithms  counting  cbf  storage  false-positives  d-left-hashing  hashing 
september 2013 by jm
"Scalable Eventually Consistent Counters over Unreliable Networks" [paper, pdf]

Counters are an important abstraction in distributed computing, and
play a central role in large scale geo-replicated systems, counting events such as web page impressions or social network "likes". Classic distributed counters, strongly consistent, cannot be made both available and partition-tolerant, due to the CAP Theorem, being unsuitable to large scale scenarios.

This paper defi nes Eventually Consistent Distributed Counters (ECDC) and presents an implementation of the concept, Hando ff Counters, that is scalable and works over unreliable networks. By giving up the sequencer aspect of classic distributed counters, ECDC implementations can be made AP in the CAP design space, while retaining the essence of counting. Handoff Counters are the first CRDT (Conflict-free Replicated Data Type) based mechanism that overcomes the identity explosion problem in naive CRDTs, such as G-Counters (where state size is linear in the number of independent actors that ever incremented the counter), by managing identities towards avoiding global propagation, and garbage collecting temporary entries. The approach used in Hando ff Counters is not restricted to counters, being more generally applicable to other data types with associative and commutative operations.
pdf  papers  eventual-consistency  counters  distributed-systems  distcomp  cap-theorem  ecdc  handoff-counters  crdts  data-structures  g-counters 
august 2013 by jm
hlld
a high-performance C server which is used to expose HyperLogLog sets and operations over them to networked clients. It uses a simple ASCII protocol which is human readable, and similar to memcached.

HyperLogLog's are a relatively new sketching data structure. They are used to estimate cardinality, i.e. the unique number of items in a set. They are based on the observation that any bit in a "good" hash function is indepedenent of any other bit and that the probability of getting a string of N bits all set to the same value is 1/(2^N). There is a lot more in the math, but that is the basic intuition. What is even more incredible is that the storage required to do the counting is log(log(N)). So with a 6 bit register, we can count well into the trillions. For more information, its best to read the papers referenced at the end. TL;DR: HyperLogLogs enable you to have a set with about 1.6% variance, using 3280 bytes, and estimate sizes in the trillions.


(via:cscotta)
hyper-log-log  hlld  hll  data-structures  memcached  daemons  sketching  estimation  big-data  cardinality  algorithms  via:cscotta 
june 2013 by jm
Big Memory, Part 4
good microbenchmarking of a bunch of Java collections; Trove, fastutil, PCJ, mahout-collections, hppc
java  collections  benchmarks  performance  speed  coding  data-structures  optimization 
june 2013 by jm
fastutil
fastutil extends the Java™ Collections Framework by providing type-specific maps, sets, lists and queues with a small memory footprint and fast access and insertion; provides also big (64-bit) arrays, sets and lists, and fast, practical I/O classes for binary and text files. It is free software distributed under the Apache License 2.0. It requires Java 6 or newer.


used by Facebook (along with Apache Giraph, Netty, Unsafe) to speed up "weekend Hive jobs" to "coffee breaks". http://www.slideshare.net/nitayj/2013-0603-berlin-buzzwords
via:highscalability  facebook  giraph  optimization  java  speed  fastutil  collections  data-structures 
june 2013 by jm
“Call Me Maybe: Carly Rae Jepsen and the Perils of Network Partitions”
Aphyr's epic RICON talk, exploring distributed-database failure modes through music. and what a lot of fail there is!

Bottom line: CRDTs win
crdts  data-structures  storage  ricon  apyhr  failures  network  partitions  puns  slides 
may 2013 by jm
Hey Judy, don't make it bad
Github get good results using Judy arrays to replace a Ruby hash. However: the whole blog post is a bit dodgy to me. It feels like there are much better ways to fix the problem:

1. the big one: don't do GC-heavy activity in the front-end web servers. Split that language-classification code into a separate service. Write its results to a cache and don't re-query needlessly.
2. why isn't this benchmarked against a C/C++ hash? it's only 36000 entries, loaded once at startup. lookups against that should be blisteringly fast even with the basic data structures, and that would also be outside the Ruby heap so avoid the GC overhead. Feels like the use of a Judy array was a "because I want to" decision.
3. personally, I'd have preferred they spend time fixing their uptime problems....

See also https://news.ycombinator.com/item?id=5639013 for more kvetching.
ruby  github  gc  judy-arrays  linguist  hashes  data-structures 
may 2013 by jm
Lectures in Advanced Data Structures (6.851)
Good lecture notes on the current state of the art in data structure research.
Data structures play a central role in modern computer science. You interact with data structures even more often than with algorithms (think Google, your mail server, and even your network routers). In addition, data structures are essential building blocks in obtaining efficient algorithms. This course covers major results and current directions of research in data structures:

TIME TRAVEL We can remember the past efficiently (a technique called persistence), but in general it's difficult to change the past and see the outcomes on the present (retroactivity). So alas, Back To The Future isn't really possible.
GEOMETRY When data has more than one dimension (e.g. maps, database tables).
DYNAMIC OPTIMALITY Is there one binary search tree that's as good as all others? We still don't know, but we're close.
MEMORY HIERARCHY Real computers have multiple levels of caches. We can optimize the number of cache misses, often without even knowing the size of the cache.
HASHING Hashing is the most used data structure in computer science. And it's still an active area of research.
INTEGERS Logarithmic time is too easy. By careful analysis of the information you're dealing with, you can often reduce the operation times substantially, sometimes even to constant. We will also cover lower bounds that illustrate when this is not possible.
DYNAMIC GRAPHS A network link went down, or you just added or deleted a friend in a social network. We can still maintain essential information about the connectivity as it changes.
STRINGS Searching for phrases in giant text (think Google or DNA).
SUCCINCT Most “linear size” data structures you know are much larger than they need to be, often by an order of magnitude. Some data structures require almost no space beyond the raw data but are still fast (think heaps, but much cooler).


(via Tim Freeman)
data-structures  lectures  mit  video  data  algorithms  coding  csail  strings  integers  hashing  sorting  bst  memory 
april 2013 by jm
The Bw-Tree: A B-tree for New Hardware - Microsoft Research
The emergence of new hardware and platforms has led to reconsideration of how data management systems are designed. However, certain basic functions such as key indexed access to records remain essential. While we exploit the common architectural layering of prior systems, we make radically new design decisions about each layer. Our new form of B tree, called the Bw-tree achieves its very high performance via a latch-free approach that effectively exploits the processor caches of modern multi-core chips. Our storage manager uses a unique form of log structuring that blurs the distinction between a page and a record store and works well with flash storage. This paper describes the architecture and algorithms for the Bw-tree, focusing on the main memory aspects. The paper includes results of our experiments that demonstrate that this fresh approach produces outstanding performance.
bw-trees  database  paper  toread  research  algorithms  microsoft  sql  sql-server  b-trees  data-structures  storage  cache-friendly  mechanical-sympathy 
april 2013 by jm
javaewah
The bit array data structure is implemented in Java as the BitSet class. Unfortunately, this fails to scale without compression. JavaEWAH is a word-aligned compressed variant of the Java bitset class. It uses a 64-bit run-length encoding (RLE) compression scheme. We trade-off some compression for better processing speed. We also have a 32-bit version which compresses better, but is not as fast.

In general, the goal of word-aligned compression is not to achieve the best compression, but rather to improve query processing time. Hence, we try to save CPU cycles, maybe at the expense of storage. However, the EWAH scheme we implemented is always more efficient storage-wise than an uncompressed bitmap (as implemented in the BitSet class). Unlike some alternatives, javaewah does not rely on a patented scheme.
javaewah  wah  rle  compression  bitmaps  bitmap-indexes  bitset  algorithms  data-structures 
april 2013 by jm
Single Producer/Consumer lock free Queue step by step
great dissection of Martin "Disruptor" Thompson's lock-free single-producer/single-consumer queue data structure, with benchmark results showing crazy speedups. This is particularly useful since it's a data structure that can be used to provide good lock-free speedups without adopting the entire Disruptor design pattern.
disruptor  coding  java  jvm  martin-thompson  lock-free  volatile  atomic  queue  data-structures 
march 2013 by jm
Exponentially decaying lists
'log scale for lists; Decaying lists allow to manage large range of values. A decaying list grows logarithmically with the number of items. It follows that some items are dropped when other are inserted.' (via Tony Finch)
via:fanf  clojure  algorithms  decay  backoff  half-life  data-structures 
february 2013 by jm
C++ B-Tree
a new C++ template library from Google which implements an in-memory B-Tree container type, suitable for use as a drop-in replacement for std::map, set, multimap and multiset. Lower memory use, and reportedly faster due to better cache-friendliness
c++  google  data-structures  containers  b-trees  stl  map  set  open-source 
february 2013 by jm
A Non-Blocking HashTable by Dr. Cliff Click : programming
Proggit discovers the NonBlockingHashMap. This comment from Boundary's cscotta is particularly interesting: "The code is intricate and curiously-formatted, but NBHM is quite excellent. The majority of our analytics platform is backed by NBHMs updated rapidly in parallel. Cliff's a great, friendly, approachable guy; if you have any specific questions about the approaches or implementation, he may be happy to answer."
data-structures  algorithms  non-blocking  concurrency  threading  multicore  cliff-click  azul  maps  java  boundary 
january 2013 by jm
Efficient In-Memory Indexing with Generalized Prefix Trees [PDF]
'Efficient data structures for in-memory indexing gain in importance due to
(1) the exponentially increasing amount of data, (2) the growing main-memory capacity, and (3) the gap between main-memory and CPU speed. In consequence, there are
high performance demands for in-memory data structures. Such index structures are
used -- with minor changes -- as primary or secondary indices in almost every DBMS.
Typically, tree-based or hash-based structures are used, while structures based on
prefix-trees (tries) are neglected in this context. For tree-based and hash-based structures, the major disadvantages are inherently caused by the need for reorganization
and key comparisons. In contrast, the major disadvantage of trie-based structures in
terms of high memory consumption (created and accessed nodes) could be improved.
In this paper, we argue for reconsidering prefix trees as in-memory index structures
and we present the generalized trie, which is a prefix tree with variable prefix length
for indexing arbitrary data types of fixed or variable length. The variable prefix length
enables the adjustment of the trie height and its memory consumption. Further, we
introduce concepts for reducing the number of created and accessed trie levels. This
trie is order-preserving and has deterministic trie paths for keys, and hence, it does
not require any dynamic reorganization or key comparisons. Finally, the generalized
trie yields improvements compared to existing in-memory index structures, especially
for skewed data. In conclusion, the generalized trie is applicable as general-purpose
in-memory index structure in many different OLTP or hybrid (OLTP and OLAP) data
management systems that require balanced read/write performance.' (via Tony Finch)
via:fanf  prefix-trees  tries  data-structures 
january 2013 by jm
The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases [PDF]
'Main memory capacities have grown up to a point where most databases fit into RAM. For main-memory database systems, index structure performance is a critical bottleneck. Traditional in-memory data structures like balanced binary search trees are not efficient on modern hardware, because they do not optimally utilize on-CPU caches. Hash tables, also often used for main-memory indexes, are fast but only support point queries. To overcome these shortcomings, we present ART, an adaptive radix tree (trie) for efficient indexing in main memory. Its lookup performance surpasses highly tuned, read-only search trees, while supporting very efficient insertions and deletions as well. At the same time, ART is very space efficient and solves the problem of excessive worst-case space consumption, which plagues most radix trees, by adaptively choosing compact and efficient data structures for internal nodes. Even though ART’s performance is comparable to hash tables, it maintains the data in sorted order, which enables additional operations like range scan and prefix lookup.' (via Tony Finch)
via:fanf  data-structures  trees  indexing  cache-aware  tries 
january 2013 by jm
HAT-trie: A Cache-conscious Trie-based Data Structure for Strings [PDF]
'Tries are the fastest tree-based data structures for managing strings in-memory, but are space-intensive. The burst-trie is almost as fast but reduces space by collapsing trie-chains into buckets. This is not however, a cache-conscious approach and can lead to poor performance on current processors. In this paper, we introduce the HAT-trie, a cache-conscious trie-based data structure that is formed by carefully combining existing components. We evaluate performance using several real-world datasets and against other highperformance data structures. We show strong improvements in both time and space; in most cases approaching that of the cache-conscious hash table. Our HAT-trie is shown to be the most efficient trie-based data structure for managing variable-length strings in-memory while maintaining sort order.' (via Tony Finch)
via:fanf  data-structures  tries  cache-aware  trees 
january 2013 by jm
Efficient concurrent long set and map
An ordered set and map data structure and algorithm for long keys and values, supporting concurrent reads by multiple threads and updates by a single thread.


Some good stuff in the linked blog posts about Clojure's PersistentHashMap and PersistentVector data structures, too.
arrays  java  tries  data-structures  persistent  clojure  concurrent  set  map 
december 2012 by jm
AnandTech - The Intel SSD DC S3700: Intel's 3rd Generation Controller Analyzed
Interesting trend; Intel moved from a btree to an array-based data structure for their logical-block address indirection map, in order to reduce worst-case latencies (via Martin Thompson)
latency  intel  via:martin-thompson  optimization  speed  p99  data-structures  arrays  btrees  ssd  hardware 
november 2012 by jm
experimental CPU-cache-aware hash table implementations in Cloudera's Impala
via Todd Lipcon -- https://twitter.com/tlipcon/status/261113382642532352

'another cool piece of cloudera impala source: cpu-cache-aware hash table implementations by @jackowayed'. 'L1-sized hash table that hopes to use cache well. Each bucket is a chunk list of tuples. Each chunk is a cache line.'
hashing  hash-tables  data-structures  performance  c++  l1  cache  cpu 
october 2012 by jm
ElementCostInDataStructures
"The cost per element in major data structures offered by Java and Guava (r11)]." A very useful reference!

Ever wondered what's the cost of adding each entry to a HashMap? Or one new element in a TreeSet? Here are the answers: the cost per-entry for each well-known structure in Java and Guava. You can use this to estimate the cost of a structure, like this: if the per-entry cost of a structure is 32 bytes, and your structure contains 1024 elements, the structure's footprint will be around 32 kilobytes. Note that non-tree mutable structures are amortized (adding an element might trigger a resize, and be expensive, otherwise it would be cheap), making the measurement of the "average per element cost" measurement hard, but you can expect that the real answers are close to what is reported below.
java  coding  guava  reference  memory  cost  performance  data-structures 
october 2012 by jm
Cliff Click's 2008 JavaOne talk about the NonBlockingHashTable
I'm a bit late to this data structure -- highly scalable, nearly lock-free, benchmarks very well (except with the G1 GC): http://edwwang.com/blog/2012/02/10/concurrent-hashmap-benchmark/ .

Having said that, it doesn't cope well with frequently-changing unique keys: http://sourceforge.net/tracker/?func=detail&aid=3563980&group_id=194172&atid=948362 .

More background at: http://www.azulsystems.com/blog/cliff/2007-03-26-non-blocking-hashtable and http://www.azulsystems.com/blog/cliff/2007-04-01-non-blocking-hashtable-part-2

This was used in Cassandra for a while, although I think the above bug may have caused its removal?
nonblockinghashtable  data-structures  hashmap  concurrency  scaling  java  jvm 
october 2012 by jm
SnapTree benchmarks
nice concurrent Map data structure for the JVM; beats out ConcurrentHashMap, ConcurrentLinkedHashMap from guava, ConcurrentSkipListMap under both CMS and G1 garbage collectors.
concurrency  benchmarks  hashmap  map  data-structures  java  jvm  snaptree 
september 2012 by jm
Avoiding Hash Lookups in a Ruby Implementation
'If I were to sum up the past 6 years I've spent optimizing JRuby it would be with the following phrase: Get Rid Of Hash Lookups.'

This has been a particular theme of some recent optimization hacks I've been working on. Hashes may be O(1) to read, on average, but that doesn't necessarily mean they're the right tool for performance...

(via Declan McGrath)
via:declanmcgrath  hash  optimization  ruby  performance  jruby  hashing  data-structures  big-o  optimisation 
september 2012 by jm
Open Data Structures
A free-as-in-speech as well as -beer textbook of data structures, covering a great range, including some I hadn't heard of before. Here's the full list: ArrayStack, FastArrayStack, ArrayQueue, ArrayDeque, DualArrayDeque, RootishArrayStack, SLList, DLList,
SEList, SkiplistSSet, SkiplistList, ChainedHashTable, LinearHashTable, BinaryTree, BinarySearchTree, Treap, ScapegoatTree, RedBlackTree, BinaryHeap, MeldableHeap, AdjacencyMatrix, AdjacencyLists, BinaryTrie, XFastTrie, and YFastTrie
algorithms  books  data-structures  computer-science  coding  tries  skiplists  arrays  queues  heap  trees  graphs  hashtables 
may 2012 by jm
Google Guava BloomFIlter
neat, Guava now has a builtin Bloom filter implementation using the murmur hash. that'll potentially save a little hassle in the future
guava  coding  java  bloom-filters  data-structures  sets 
march 2012 by jm
Golomb-coded sets
'a probabilistic data structure conceptually similar to a Bloom filter, but with a more compact in-memory representation, and a slower query time.' could come in handy
gcs  bloom-filters  probabilistic  data-structures  memory  algorithms 
september 2011 by jm

related tags

aggregation  akka  algorithm  algorithms  alias  amazon  ap  aphorisms  approximation  apyhr  arrays  asl  atomic  aws  azul  b-trees  backoff  benchmarks  big-data  big-o  binary-tree  bitmap-indexes  bitmaps  bits  bitset  bitsets  bkd-trees  blocking-queues  bloom-cookies  bloom-filter  bloom-filters  books  boundary  bst  btrees  bw-trees  bytebuffer  c  c++  cache  cache-aware  cache-friendly  caching  cap  cap-theorem  cardinality  carlos-baquero  cas  cassandra  cbf  celestine-omin  cep  cliff-click  clojure  coding  collections  commutativity  compression  computer-science  concurrency  concurrent  containers  conversant  cookies  cost  count-min  counters  counting  cpu  crdts  crit-bit  crit-bit-tries  cs  csail  cuckoo-filters  cuckoo-hashing  d-left-hashing  daemons  daniel-lemire  data  data-structures  database  decay  decision-trees  delta-delta-coding  deltas  dht  dice  digests  disruptor  distcomp  distributed  distributed-systems  doc2vec  dynamodb  ecdc  elias-gamma-coding  encoding  estimation  eventual-consistency  facebook  failures  false-positives  fastutil  feature-extraction  frequency  frequency-tables  g-counters  gc  gcs  geek  gil-tene  giraph  github  google  graphs  guava  ha  hacker-news  hacks  half-life  handoff-counters  hardware  hash  hash-tables  hashed-wheel-timer  hashed-wheel-timers  hashes  hashing  hashmap  hashmaps  hashsets  hashtables  hdr  heap  histogram  hll  hlld  hoeffding-tree  hppc  http  hyper-log-log  hyperloglog  hyperlogsandwich  iblt  idempotency  immigration  indexing  influxdb  integers  intel  interviews  ipc  jaq  java  java-8  java8  javaewah  jdk  jruby  judy-arrays  jvm  koloboke  l1  language  latency  laws  learning  lectures  leveldb  linguist  linux  lists  lock-free  logging  lossy  low-latency  lsm-trees  lucene  machine-learning  map  mappedbus  maps  martin-thompson  measuring  mechanical-sympathy  memcached  memory  message-bus  message-passing  microsoft  minhash  mit  ml  mmap  mpsc  multicore  multisets  netty  network  networking  nitsanw  nlp  non-blocking  nonblockinghashtable  numbers  object-pools  open-source  openhft  optimisation  optimization  p99  paper  papers  parsing  partitions  pdf  percentiles  performance  persistent  personalization  phasers  popcount  prefix-trees  primitives  privacy  probabilistic  probability  programming  puns  qp-tries  quantiles  queue  queueing  queues  random  ranking  redis  reference  research  ricon  ring-buffers  rle  rob-pike  ruby  rules  s3  sampling  scalability  scale  scaling  search  searching  semilattice  set  set-membership  sets  sketches  sketching  skip-grams  skiplists  slides  snaptree  sorting  space  spark  speed  spinedbuffer  spsc  sql  sql-server  ssd  statistics  stl  storage  streaming  streams  strings  succinct  succinct-encoding  sux  symbol-alphabets  t-digest  tests  text  threading  time-series  timer-wheels  timers  timeseries  timing-wheels  tokenization  toread  trees  tries  trove  tsd  tsm-trees  twitter  us-politics  user-tracking  via:cscotta  via:declanmcgrath  via:fanf  via:highscalability  via:igrigorik  via:lemonodor  via:marc-brooker  via:marcbrooker  via:martin-thompson  via:patrickmcfadin  video  volatile  wah  wait-free  wikis  word2vec 

Copy this bookmark:



description:


tags: