jm + key-value-stores   6

MemC3: Compact and concurrent Memcache with dumber caching and smarter hashing
An improved hashing algorithm called optimistic cuckoo hashing, and a CLOCK-based eviction algorithm that works in tandem with it. They are evaluated in the context of Memcached, where combined they give up to a 30% memory usage reduction and up to a 3x improvement in queries per second as compared to the default Memcached implementation on read-heavy workloads with small objects (as is typified by Facebook workloads).
memcached  performance  key-value-stores  storage  databases  cuckoo-hashing  algorithms  concurrency  caching  cache-eviction  memory  throughput 
november 2016 by jm
a distributed key/value datastore which supports ACID transactional semantics and versioned values as first-class features. The primary design goal is global consistency and survivability, hence the name. Cockroach aims to tolerate disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention. Cockroach nodes are symmetric; a design goal is one binary with minimal configuration and no required auxiliary services.

Cockroach implements a single, monolithic sorted map from key to value where both keys and values are byte strings (not unicode). Cockroach scales linearly (theoretically up to 4 exabytes (4E) of logical data). The map is composed of one or more ranges and each range is backed by data stored in RocksDB (a variant of LevelDB), and is replicated to a total of three or more cockroach servers. Ranges are defined by start and end keys. Ranges are merged and split to maintain total byte size within a globally configurable min/max size interval. Range sizes default to target 64M in order to facilitate quick splits and merges and to distribute load at hotspots within a key range. Range replicas are intended to be located in disparate datacenters for survivability (e.g. { US-East, US-West, Japan }, { Ireland, US-East, US-West}, { Ireland, US-East, US-West, Japan, Australia }).

Single mutations to ranges are mediated via an instance of a distributed consensus algorithm to ensure consistency. We’ve chosen to use the Raft consensus algorithm. All consensus state is stored in RocksDB.

A single logical mutation may affect multiple key/value pairs. Logical mutations have ACID transactional semantics. If all keys affected by a logical mutation fall within the same range, atomicity and consistency are guaranteed by Raft; this is the fast commit path. Otherwise, a non-locking distributed commit protocol is employed between affected ranges.

Cockroach provides snapshot isolation (SI) and serializable snapshot isolation (SSI) semantics, allowing externally consistent, lock-free reads and writes--both from an historical snapshot timestamp and from the current wall clock time. SI provides lock-free reads and writes but still allows write skew. SSI eliminates write skew, but introduces a performance hit in the case of a contentious system. SSI is the default isolation; clients must consciously decide to trade correctness for performance. Cockroach implements a limited form of linearalizability, providing ordering for any observer or chain of observers.

This looks nifty. One to watch.
cockroachdb  databases  storage  georeplication  raft  consensus  acid  go  key-value-stores  rocksdb 
may 2014 by jm
MICA: A Holistic Approach To Fast In-Memory Key-Value Storage [paper]
Very interesting new approach to building a scalable in-memory K/V store. As Rajiv Kurian notes on the mechanical-sympathy list:

'The basic idea is that each core is responsible for a portion of the key-space and requests are forwarded to the right core, avoiding multiple-writer scenarios. This is opposed to designs like memcache which uses locks and shared memory.

Some of the things I found interesting: The single writer design is taken to an extreme. Clients assist the partitioning of requests, by calculating hashes before submitting GET requests. It uses Intel DPDK instead of sockets to forward packets to the right core, without processing the packet on any core. Each core is paired with a dedicated RX/TX queue. The design for a lossy cache is simple but interesting. It does things like replacing a hash slot (instead of chaining) etc. to take advantage of the lossy nature of caches. There is a lossless design too. A bunch of tricks to optimize for memory performance. This includes pre-allocation, design of the hash indexes, prefetching tricks etc. There are some other concurrency tricks that were interesting. Handling dangling pointers was one of them.'

Source code here:
mica  in-memory  memory  ram  key-value-stores  storage  smp  dpdk  multicore  memcached  concurrency 
april 2014 by jm
' A persistent key-value store for fast storage environments', ie. BerkeleyDB/LevelDB competitor, from Facebook.
RocksDB builds on LevelDB to be scalable to run on servers with many CPU cores, to efficiently use fast storage, to support IO-bound, in-memory and write-once workloads, and to be flexible to allow for innovation.

We benchmarked LevelDB and found that it was unsuitable for our server workloads. Thebenchmark results look awesome at first sight, but we quickly realized that those results were for a database whose size was smaller than the size of RAM on the test machine - where the entire database could fit in the OS page cache. When we performed the same benchmarks on a database that was at least 5 times larger than main memory, the performance results were dismal.

By contrast, we've published the RocksDB benchmark results for server side workloads on Flash. We also measured the performance of LevelDB on these server-workload benchmarks and found that RocksDB solidly outperforms LevelDB for these IO bound workloads. We found that LevelDB's single-threaded compaction process was insufficient to drive server workloads. We saw frequent write-stalls with LevelDB that caused 99-percentile latency to be tremendously large. We found that mmap-ing a file into the OS cache introduced performance bottlenecks for reads. We could not make LevelDB consume all the IOs offered by the underlying Flash storage.

Lots of good discussion at too.
flash  ssd  rocksdb  databases  storage  nosql  facebook  bdb  disk  key-value-stores  lsm  leveldb 
november 2013 by jm
HyperLevelDB: A High-Performance LevelDB Fork
'HyperLevelDB improves on LevelDB in two key ways:
Improved parallelism: HyperLevelDB uses more fine-grained locking internally to provide higher throughput for multiple writer threads.
Improved compaction: HyperLevelDB uses a different method of compaction that achieves higher throughput for write-heavy workloads, even as the database grows.'
leveldb  storage  key-value-stores  persistence  unix  libraries  open-source 
june 2013 by jm
Riakking Complex Data Types
interesting details about Riak's support for secondary indexes. Not quite SQL, but still more powerful than plain old K/V storage (via dehora)
via:dehora  riak  indexes  storage  nosql  key-value-stores  2i  range-queries 
march 2013 by jm

Copy this bookmark: