Slicer: Auto-sharding for datacenter applications
Paper from Google describing one of their internal building block services:
A general-purpose sharding service. I normally think of sharding as something that happens within a (typically data) service, not as a general-purpose infrastructure service. What exactly is Slicer then? It has two key components: a data plane that acts as an affinity-aware load balancer, with affinity managed based on application-specified keys; and a control plane that monitors load and instructs application processes as to which keys they should be serving at any one point in time. In this way, the decisions regarding how to balance keys across application instances can be outsourced to the Slicer service rather than building this logic over and over again for each individual back-end service. Slicer is focused exclusively on the problem of balancing load across a given set of backend tasks; other systems are responsible for adding and removing tasks.
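A minimal sketch of the idea in Python (not Google's actual API; the class and method names here are illustrative): the control plane assigns ranges of a hashed key space to backend tasks, and the data plane routes each request to whichever task currently owns its key.

import bisect
import hashlib

class SlicerSketch:
    def __init__(self, tasks):
        # Control-plane state: list of (range_end, task) covering the hash space.
        self.assignments = self._rebalance(tasks)

    def _rebalance(self, tasks):
        # Evenly split the 64-bit hashed key space among tasks; the real system
        # moves range boundaries based on observed per-key load.
        step = 2**64 // len(tasks)
        return [((i + 1) * step, t) for i, t in enumerate(tasks)]

    def task_for_key(self, app_key: str):
        # Data-plane routing: hash the application-specified key, then find the
        # task whose range covers that hash.
        h = int.from_bytes(hashlib.md5(app_key.encode()).digest()[:8], "big")
        ends = [end for end, _ in self.assignments]
        idx = bisect.bisect_left(ends, h)
        return self.assignments[min(idx, len(self.assignments) - 1)][1]

# e.g. SlicerSketch(["task-0", "task-1", "task-2"]).task_for_key("user:1234")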

google  sharding  slicer  architecture  papers 
december 2016 by jm
Building a complete Tweet index
Twitter's new massive-scale Tweet search backend. Sharding galore.
architecture  search  twitter  sharding  earlybird 
november 2014 by jm
Jump Consistent Hash: A Fast, Minimal Memory, Consistent Hash Algorithm
'a fast, minimal memory, consistent hash algorithm that can be expressed in about 5 lines of code. In comparison to the algorithm of Karger et al., jump consistent hash requires no storage, is faster, and does a better job of evenly dividing the key space among the buckets and of evenly dividing the workload when the number of buckets changes. Its main limitation is that the buckets must be numbered sequentially, which makes it more suitable for data storage applications than for distributed web caching.'
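The algorithm itself, transcribed from the paper's C++ into Python (a faithful sketch of the published algorithm, not the Guava implementation):

def jump_consistent_hash(key: int, num_buckets: int) -> int:
    # Maps a 64-bit key to a bucket in [0, num_buckets), per Lamping & Veach.
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF  # 64-bit LCG step
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b

Growing from n to n+1 buckets moves only about 1/(n+1) of the keys in expectation, which is the consistent-hashing property, without storing any ring or bucket table.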

Implemented in Guava. This is also noteworthy:

'Google has not applied for patent protection for this algorithm, and, as of this writing, has no plans to. Rather, it wishes to contribute this algorithm to the community.'
hashing  consistent-hashing  google  guava  memory  algorithms  sharding 
june 2014 by jm
Shuffle Sharding
Colm MacCarthaigh writes about a simple sharding/load-balancing algorithm which uses randomized instance selection and optional additional compartmentalization. See also: consistent hashing.
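A rough sketch of the idea (my own illustration, not Colm's exact construction): derive a deterministic per-customer subset of instances, so that any two customers overlap on only a few instances and one customer's poison-pill traffic can't exhaust everyone's capacity.

import hashlib
import random

def shuffle_shard(customer_id: str, instances: list, shard_size: int) -> list:
    # Seed a PRNG from the customer id so the same customer always gets the
    # same subset; distinct customers get mostly-disjoint subsets.
    seed = int.from_bytes(hashlib.sha256(customer_id.encode()).digest()[:8], "big")
    return random.Random(seed).sample(instances, shard_size)

# e.g. shuffle_shard("customer-42", [f"host-{i}" for i in range(16)], 4)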
hashing  load-balancing  sharding  partitions  dist-sys  distcomp  architecture  coding 
april 2014 by jm
Scalable Atomic Visibility with RAMP Transactions
Great new distcomp protocol work from Peter Bailis et al:
We’ve developed three new algorithms—called Read Atomic Multi-Partition (RAMP) Transactions—for ensuring atomic visibility in partitioned (sharded) databases: either all of a transaction’s updates are observed, or none are. [...]

How they work: RAMP transactions allow readers and writers to proceed concurrently. Operations race, but readers autonomously detect the races and repair any non-atomic reads. The write protocol ensures readers never stall waiting for writes to arrive.

Why they scale: Clients can’t cause other clients to stall (via synchronization independence) and clients only have to contact the servers responsible for items in their transactions (via partition independence). As a consequence, there’s no mutual exclusion or synchronous coordination across servers.

The end result: RAMP transactions outperform existing approaches across a variety of workloads, and, for a workload of 95% reads, RAMP transactions scale to over 7 million ops/second on 100 servers at less than 5% overhead.
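A single-process sketch of the RAMP-Fast flavour described in the paper (dicts stand in for partitioned servers; the names are illustrative): each version carries the set of sibling keys written in the same transaction, and readers use that metadata to detect fractured reads and repair them in a second round rather than blocking on writers.

from dataclasses import dataclass

@dataclass
class Version:
    value: object
    ts: int                              # transaction timestamp
    siblings: frozenset = frozenset()    # other keys written in the same txn

class RampFastSketch:
    def __init__(self):
        self.versions = {}      # key -> {ts: Version}, includes prepared-but-uncommitted
        self.last_commit = {}   # key -> highest committed timestamp

    def write_all(self, items, ts):
        keys = frozenset(items)
        # Phase 1 (prepare): install versions tagged with their sibling keys.
        for k, v in items.items():
            self.versions.setdefault(k, {})[ts] = Version(v, ts, keys - {k})
        # Phase 2 (commit): advance each key's committed pointer.
        for k in items:
            self.last_commit[k] = max(self.last_commit.get(k, -1), ts)

    def read_all(self, keys):
        # Round 1: read the latest committed version of each key
        # (assumes every key has at least one committed write).
        got = {k: self.versions[k][self.last_commit[k]] for k in keys}
        # Detect fractured reads: a sibling's metadata says a newer version exists.
        required = {}
        for ver in got.values():
            for sib in ver.siblings:
                if sib in got and got[sib].ts < ver.ts:
                    required[sib] = max(required.get(sib, -1), ver.ts)
        # Round 2: repair by fetching the exact missing versions, which were
        # already installed by the prepare phase, so readers never stall on writers.
        for k, ts in required.items():
            got[k] = self.versions[k][ts]
        return {k: v.value for k, v in got.items()}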
scale  synchronization  databases  distcomp  distributed  ramp  transactions  scalability  peter-bailis  protocols  sharding  concurrency  atomic  partitions 
april 2014 by jm
Building a Balanced Universe - EVE Community
Good blog post about EVE's algorithm to load-balance a 3D map of star systems.
eve  eve-online  algorithms  3d  space  load-balancing  sharding  games 
december 2013 by jm
Amazon Route 53 Infima
Colm MacCarthaigh has open sourced Infima, 'a library for managing service-level fault isolation using Amazon Route 53'.
Infima provides a Lattice container framework that allows you to categorize each endpoint along one or more fault-isolation dimensions such as availability-zone, software implementation, underlying datastore or any other common point of dependency endpoints may share.

Infima also introduces a new ShuffleShard sharding type that can exponentially increase the endpoint-level isolation between customer/object access patterns or any other identifier you choose to shard on.

Both Infima Lattices and ShuffleShards can also be automatically expressed in Route 53 DNS failover configurations using AnswerSet and RubberTree.
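A toy sketch of the lattice idea in Python (this is not Infima's Java API; class and method names are made up): endpoints are bucketed by their coordinates along each fault-isolation dimension, so you can reason about what capacity survives when one cell fails.

class EndpointLattice:
    def __init__(self, dimensions):
        self.dimensions = dimensions   # e.g. ["availability-zone", "datastore"]
        self.cells = {}                # coordinate tuple -> list of endpoints

    def add(self, endpoint, coords):
        key = tuple(coords[d] for d in self.dimensions)
        self.cells.setdefault(key, []).append(endpoint)

    def surviving(self, dimension, failed_value):
        # Endpoints left if everything with this coordinate value fails,
        # e.g. surviving("availability-zone", "us-east-1a").
        i = self.dimensions.index(dimension)
        return [e for key, eps in self.cells.items() if key[i] != failed_value for e in eps]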
infima  colmmacc  dns  route-53  fault-tolerance  failover  multi-az  sharding  service-discovery 
november 2013 by jm
Pinterest's follower graph store, built on Redis
This is a good, high-availability Redis configuration; sharded by userid across 8192 shards, with a Redis master/slave pair of instances for each set of N shards. I like their use of two redundancy systems -- hot slave and backup snapshots:
We run our cluster in a Redis master-slave configuration, and the slaves act as hot backups. Upon a master failure, we failover the slave as the new master and either bring up a new slave or reuse the old master as the new slave. We rely on ZooKeeper to make this as quick as possible.

Each master Redis instance (and slave instance) is configured to write to AOF on Amazon EBS. This ensures that if the Redis instances terminate unexpectedly then the loss of data is limited to 1 second of updates. The slave Redis instances also perform BGsave hourly which is then loaded to a more permanent store (Amazon S3). This copy is also used by Map Reduce jobs for analytics.

As a production system, there are many failure modes we need to guard against. As mentioned, if the master host is down, we will manually failover to the slave. If a single master Redis instance reboots, monit restart restores from AOF, implying a 1 second window of data loss on the shards on that instance. If the slave host goes down, we bring up a replacement. If a single slave Redis instance goes down, we rely on monit to restart using the AOF data. Because we may encounter AOF or BGsave file corruption, we BGsave and copy hourly backups to S3. Note that large file sizes can cause BGsave-induced delays, but in our cluster this is mitigated by smaller Redis data due to the sharding scheme.
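The key-to-shard-to-instance mapping amounts to modular sharding over a fixed virtual shard count; a minimal sketch, where the 8192 shard count comes from the post and the other constants and the plain modulo mapping are illustrative assumptions:

NUM_SHARDS = 8192               # fixed virtual shard count (from the post)
SHARDS_PER_INSTANCE = 1024      # assumption: shards held by each Redis instance
INSTANCES_PER_HOST = 8          # assumption: Redis instances per host

def shard_for_user(user_id: int) -> int:
    return user_id % NUM_SHARDS

def instance_for_shard(shard: int) -> int:
    return shard // SHARDS_PER_INSTANCE

def host_for_instance(instance: int) -> int:
    return instance // INSTANCES_PER_HOST

# A follower-graph lookup for a user goes: user id -> shard -> Redis instance
# -> (master host, hot-slave host) pair.

Fixing the shard count up front means rebalancing is done by moving whole shards between instances rather than rehashing individual keys.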
graph  redis  architecture  ha  high-availability  design  redundancy  sharding 
july 2013 by jm
High Scalability - Scaling Pinterest - From 0 to 10s of Billions of Page Views a Month in Two Years
wow, Pinterest have a pretty hardcore architecture. Sharding to the max. This is scary stuff for me:
a [Cassandra-style] Cluster Management Algorithm is a SPOF. If there’s a bug it impacts every node. This took them down 4 times.

yeah, so, eek ;)
clustering  sharding  architecture  aws  scalability  scaling  pinterest  via:matt-sergeant  redis  mysql  memcached 
april 2013 by jm
Foursquare MongoDB outage post mortem
MongoDB was set up to write to RAM where possible, deferring writes to disk -- but then the DB size exceeded RAM size and the disk was hit, imposing a massive slowdown, immediately creating a huge backlog and bringing the site down (via Nelson)
via:nelson  mongodb  sharding  nosql  ouch  outage  foursquare  sysadmin  ops  from delicious
october 2010 by jm
