jm + cap (28 bookmarks)

Don't Settle For Eventual Consistency
Quite an argument. Not sure I agree, but worth a bookmark anyway...
With an AP system, you are giving up consistency, and not really gaining anything in terms of effective availability, the type of availability you really care about. Some might think you can regain strong consistency in an AP system by using strict quorums (where the number of nodes written + number of nodes read > number of replicas). Cassandra calls this “tunable consistency”. However, Kleppmann has shown that even with strict quorums, inconsistencies can result. So when choosing (algorithmic) availability over consistency, you are giving up consistency for not much in return, as well as gaining complexity in your clients when they have to deal with inconsistencies.
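For reference, the quorum arithmetic under discussion is simple to state; a minimal Python sketch (mine, not from the article) of the R + W > N overlap condition:

    # Illustrative only: the R + W > N condition behind "strict quorums".
    # Kleppmann's point is that even when it holds, reads can still observe
    # non-linearizable results (e.g. around partial writes and read repair).
    def is_strict_quorum(n_replicas, w, r):
        """True if every read quorum must overlap every write quorum."""
        return w + r > n_replicas

    # Cassandra-style "tunable consistency" settings for 3 replicas:
    print(is_strict_quorum(3, w=2, r=2))  # True  -- overlapping quorums
    print(is_strict_quorum(3, w=1, r=1))  # False -- a read may miss the latest write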
cap-theorem  databases  storage  cap  consistency  cp  ap  eventual-consistency 
21 days ago by jm
Jepsen: RethinkDB 2.1.5
A good review of RethinkDB! Hopefully not just because this test is contract work on behalf of the RethinkDB team ;)
I’ve run hundreds of tests against RethinkDB at majority/majority, at various timescales, request rates, concurrencies, and with different types of failures. Consistent with the documentation, I have never found a linearization failure with these settings. If you use hard durability, majority writes, and majority reads, single-document ops in RethinkDB appear safe.
rethinkdb  databases  stores  storage  ops  availability  cap  jepsen  tests  replication 
january 2016 by jm
Existential Consistency: Measuring and Understanding Consistency at Facebook
The metric is termed φ(P)-consistency, and is actually very simple. A read for the same data is sent to all replicas in P, and φ(P)-consistency is defined as the frequency with which that read returns the same result from all replicas. φ(G)-consistency applies this metric globally, and φ(R)-consistency applies it within a region (cluster). Facebook have been tracking this metric in production since 2012.
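A minimal sketch of how such a metric can be computed (the function and input shape are my assumptions, not Facebook's code):

    # Hypothetical phi(P)-consistency calculation: send the same read to every
    # replica in P and record whether all replicas returned the same value.
    def phi_consistency(samples):
        """samples: one list per sampled read, holding the value each replica in P returned."""
        agreeing = sum(1 for values in samples if len(set(values)) == 1)
        return agreeing / len(samples)

    # Three sampled reads against a 3-replica set; one read saw a stale replica.
    print(phi_consistency([["a", "a", "a"], ["b", "b", "b"], ["c", "c", "old"]]))  # ~0.67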
facebook  eventual-consistency  consistency  metrics  papers  cap  distributed-computing 
october 2015 by jm
Call me Maybe: Chronos
Chronos (the Mesos distributed scheduler) comes out looking pretty crappy here
aphyr  mesos  chronos  cron  scheduling  outages  ops  jepsen  testing  partitions  cap 
august 2015 by jm
Please stop calling databases CP or AP
In his excellent blog post [...] Jeff Hodges recommends that you use the CAP theorem to critique systems. A lot of people have taken that advice to heart, describing their systems as “CP” (consistent but not available under network partitions), “AP” (available but not consistent under network partitions), or sometimes “CA” (meaning “I still haven’t read Coda’s post from almost 5 years ago”).

I agree with all of Jeff’s other points, but with regard to the CAP theorem, I must disagree. The CAP theorem is too simplistic and too widely misunderstood to be of much use for characterizing systems. Therefore I ask that we retire all references to the CAP theorem, stop talking about the CAP theorem, and put the poor thing to rest. Instead, we should use more precise terminology to reason about our trade-offs.
cap  databases  storage  distcomp  ca  ap  cp  zookeeper  consistency  reliability  networking 
may 2015 by jm
Call me maybe: Aerospike
'Aerospike offers phenomenal latencies and throughput -- but in terms of data safety, its strongest guarantees are similar to Cassandra or Riak in Last-Write-Wins mode. It may be a safe store for immutable data, but updates to a record can be silently discarded in the event of network disruption. Because Aerospike’s timeouts are so aggressive -- on the order of milliseconds -- even small network hiccups are sufficient to trigger data loss. If you are an Aerospike user, you should not expect “immediate”, “read-committed”, or “ACID consistency”; their marketing material quietly assumes you have a magical network, and I assure you this is not the case. It’s certainly not true in cloud environments, and even well-managed physical datacenters can experience horrible network failures.'
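For anyone unfamiliar with last-write-wins, a toy sketch (my own, not Aerospike's code) of why a concurrent update can vanish without an error:

    # Toy last-write-wins resolution: the version with the higher timestamp
    # survives; the conflicting update is silently dropped.
    def lww_merge(a, b):
        """a, b: (timestamp, value) versions of the same record."""
        return a if a[0] >= b[0] else b

    # Two clients update the same record on opposite sides of a partition.
    left  = (1700000001.002, {"balance": 90})   # deducted 10
    right = (1700000001.005, {"balance": 150})  # added 50
    print(lww_merge(left, right))  # the deduction is lost, and nobody is told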
aerospike  outages  cap  testing  jepsen  aphyr  databases  storage  reliability 
may 2015 by jm
Call me maybe: Elasticsearch 1.5.0
tl;dr: Elasticsearch still hoses data integrity on partition, badly
elasticsearch  reliability  data  storage  safety  jepsen  testing  aphyr  partition  network-partitions  cap 
may 2015 by jm
Why You Shouldn’t Use ZooKeeper for Service Discovery
In CAP terms, ZooKeeper is CP, meaning that it’s consistent in the face of partitions, not available. For many things that ZooKeeper does, this is a necessary trade-off. Since ZooKeeper is first and foremost a coordination service, having an eventually consistent design (being AP) would be a horrible design decision. Its core consensus algorithm, Zab, is therefore all about consistency. For coordination, that’s great. But for service discovery it’s better to have information that may contain falsehoods than to have no information at all. It is much better to know what servers were available for a given service five minutes ago than to have no idea what things looked like due to a transient network partition. The guarantees that ZooKeeper makes for coordination are the wrong ones for service discovery, and it hurts you to have them.

Yes! I've been saying this for months -- good to see others concurring.
architecture  zookeeper  eureka  outages  network-partitions  service-discovery  cap  partitions 
december 2014 by jm
Zookeeper: not so great as a highly-available service registry
Turns out ZK isn't a good choice as a service discovery system, if you want to be able to use that service discovery system while partitioned from the rest of the ZK cluster:
I went into one of the instances and quickly did an iptables DROP on all packets coming from the other two instances.  This would simulate an availability zone continuing to function, but that zone losing network connectivity to the other availability zones.  What I saw was that the two other instances noticed the first server “going away”, but they continued to function as they still saw a majority (66%).  More interestingly the first instance noticed the other two servers “going away”, dropping the ensemble availability to 33%.  This caused the first server to stop serving requests to clients (not only writes, but also reads).

So: within that offline AZ, service discovery *reads* (as well as writes) stopped working due to a lack of ZK quorum. This is quite a feasible outage scenario for EC2, by the way, since (at least when I was working there) the network links between AZs, and the links with the external internet, were not 100% overlapping.

In other words, if you want a highly-available service discovery system in the face of network partitions, you want an AP service discovery system, rather than a CP one -- and ZK is a CP system.

Another risk, noted on the Netflix Eureka mailing list at :

ZooKeeper, while tolerant against single node failures, doesn't react well to long partitioning events. For us, it's vastly more important that we maintain an available registry than a necessarily consistent registry. If us-east-1d sees 23 nodes, and us-east-1c sees 22 nodes for a little bit, that's OK with us.

I guess this means that a long partition can trigger SESSION_EXPIRED state, resulting in ZK client libraries requiring a restart/reconnect to fix. I'm not entirely clear what happens to the ZK cluster itself in this scenario though.

Finally, Pinterest ran into other issues relying on ZK for service discovery and registration, described at ; sounds like this was mainly around load and the "thundering herd" overload problem. Their workaround was to decouple ZK availability from their services' availability, by building a Smartstack-style sidecar daemon on each host which tracked/cached ZK data.
zookeeper  service-discovery  ops  ha  cap  ap  cp  service-registry  availability  ec2  aws  network  partitions  eureka  smartstack  pinterest 
november 2014 by jm
[KAFKA-1555] provide strong consistency with reasonable availability
Major improvements for Kafka consistency coming in 0.8.2; replication to multiple in-sync replicas, controlled by a new "min.isr" setting
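A sketch of how a producer opts into this (assumes the kafka-python client; the broker-side setting shipped under the name min.insync.replicas, so check the exact name for your version):

    # Hedged sketch: acks="all" on the producer plus a broker/topic-level
    # minimum in-sync-replica count is what turns replication into a
    # stronger durability guarantee.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="broker:9092",
        acks="all",  # wait for all in-sync replicas to acknowledge the write
    )
    producer.send("events", b"payload")
    producer.flush()
    # Topic/broker config (assumed name): min.insync.replicas=2
    # -- reject writes when fewer than 2 replicas are in sync.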
kafka  replication  cap  consistency  streams 
october 2014 by jm
Mnesia and CAP
A common “trick” is to claim:

'We assume network partitions can’t happen. Therefore, our system is CA according to the CAP theorem.'

This is a nice little twist. By asserting network partitions cannot happen, you just made your system into one which is not distributed. Hence the CAP theorem doesn’t even apply to your case and anything can happen. Your system may be linearizable. Your system might have good availability. But the CAP theorem doesn’t apply. [...]
In fact, any well-behaved system will be “CA” as long as there are no partitions. This makes the statement of a system being “CA” very weak, because it doesn’t put honesty first. It tries to avoid the hard question, which is how the system operates under failure. By assuming no network partitions, you assume perfect information knowledge in a distributed system. This isn’t the physical reality.
cap  erlang  mnesia  databases  storage  distcomp  reliability  ca  postgres  partitions 
october 2014 by jm
"Perspectives On The CAP Theorem" [pdf]
"We cannot achieve [CAP theorem] consistency and availability in a partition-prone network."
papers  cap  distcomp  cap-theorem  consistency  availability  partitions  network  reliability 
september 2014 by jm
"Replicated abstract data types: Building blocks for collaborative applications"
cited at as 'one of my favorite papers on CRDTs and provides practical pseudocode for learning how to implement CRDTs yourself', in a discussion on cemerick's "Distributed Systems and the End of the API":
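For a flavour of what implementing a CRDT yourself looks like, a minimal grow-only counter sketch (mine, not pseudocode from the paper):

    # Minimal G-Counter CRDT: each replica increments only its own slot, and
    # merge takes the element-wise max, so replicas converge no matter in
    # which order they exchange state.
    class GCounter:
        def __init__(self, replica_id, n_replicas):
            self.replica_id = replica_id
            self.counts = [0] * n_replicas

        def increment(self, amount=1):
            self.counts[self.replica_id] += amount

        def merge(self, other):
            self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

        def value(self):
            return sum(self.counts)

    a, b = GCounter(0, 2), GCounter(1, 2)
    a.increment(); b.increment(); b.increment()
    a.merge(b); b.merge(a)
    print(a.value(), b.value())  # 3 3 -- both replicas converge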
distcomp  networking  distributed  crdts  algorithms  text  data-structures  cap 
may 2014 by jm
ZooKeeper Resilience at Pinterest
essentially decoupling the client services from ZK using a local daemon on each client host; very similar to Airbnb's Smartstack. This is a bit of an indictment of ZK's usability though
ops  architecture  clustering  network  partitions  cap  reliability  smartstack  airbnb  pinterest  zookeeper 
march 2014 by jm
Blockade
'Testing applications under slow or flaky network conditions can be difficult and time consuming. Blockade aims to make that easier. A config file defines a number of docker containers and a command line tool makes introducing controlled network problems simple.'

Open-source release from Dell's Cloud Manager team (ex-Enstratius), inspired by aphyr's Jepsen. Simulates packet loss using "tc netem", so no ability to e.g. drop packets on certain flows or certain ports. Still, looks very usable -- great stuff.
testing  docker  networking  distributed  distcomp  enstratius  jepsen  network  outages  partitions  cap  via:lusis 
february 2014 by jm
Beating the CAP Theorem Checklist
'Your ( ) tweet ( ) blog post ( ) marketing material ( ) online comment
advocates a way to beat the CAP theorem. Your idea will not work. Here is why
it won't work:'

lovely stuff, via Bill De hOra
via:dehora  funny  cap  cs  distributed-systems  distcomp  networking  partitions  state  checklists 
august 2013 by jm
The CAP FAQ by henryr
No subject appears to be more controversial to distributed systems engineers than the oft-quoted, oft-misunderstood CAP theorem. The purpose of this FAQ is to explain what is known about CAP, so as to help those new to the theorem get up to speed quickly, and to settle some common misconceptions or points of disagreement.
database  distributed  nosql  cap  consistency  cap-theorem  faqs 
june 2013 by jm
The network is reliable
Aphyr and Peter Bailis collect an authoritative list of known network partition and outage cases from published post-mortem data:

This post is meant as a reference point -- to illustrate that, according to a wide range of accounts, partitions occur in many real-world environments. Processes, servers, NICs, switches, local and wide area networks can all fail, and the resulting economic consequences are real. Network outages can suddenly arise in systems that are stable for months at a time, during routine upgrades, or as a result of emergency maintenance. The consequences of these outages range from increased latency and temporary unavailability to inconsistency, corruption, and data loss. Split-brain is not an academic concern: it happens to all kinds of systems -- sometimes for days on end. Partitions deserve serious consideration.

I honestly cannot understand people who didn't think this was the case. 3 years reading (and occasionally auto-cutting) Amazon's network-outage tickets as part of AWS network monitoring will do that to you I guess ;)
networking  outages  partition  cap  failure  fault-tolerance 
june 2013 by jm
Call me maybe: Carly Rae Jepsen and the perils of network partitions
Kyle "aphyr" Kingsbury expands on his slides demonstrating the real-world failure scenarios that arise during some kinds of partitions (specifically, the TCP-hang, no clear routing failure, network partition scenario). Great set of blog posts clarifying CAP
distributed  network  databases  cap  nosql  redis  mongodb  postgresql  riak  crdt  aphyr 
may 2013 by jm
CAP Confusion: Problems with ‘partition tolerance’
Another good clarification about CAP which resurfaced during last week's discussion:
So what causes partitions? Two things, really. The first is obvious – a network failure, for example due to a faulty switch, can cause the network to partition. The other is less obvious, but fits with the definition [...]: machine failures, either hard or soft. In an asynchronous network, i.e. one where processing a message could take unbounded time, it is impossible to distinguish between machine failures and lost messages. Therefore a single machine failure partitions it from the rest of the network. A correlated failure of several machines partitions them all from the network. Not being able to receive a message is the same as the network not delivering it. In the face of sufficiently many machine failures, it is still impossible to maintain availability and consistency, not because two writes may go to separate partitions, but because the failure of an entire ‘quorum’ of servers may render some recent writes unreadable.

(sorry, catching up on old interesting things posted last week...)
failure  scalability  network  partitions  cap  quorum  distributed-databases  fault-tolerance 
may 2013 by jm
Alex Feinberg's response to Damien Katz' anti-Dynamoish/pro-Couchbase blog post
Insightful response, worth bookmarking. (the original post is at ).
while you are saving on read traffic (online reads only go to the master), you are now decreasing availability (contrary to your stated goal), and increasing system complexity.
You also do hurt performance by requiring all writes and reads to be serialized through a single node: unless you plan to have a leader election whenever the node fails to meet a read SLA (which is going to result in a disaster -- I am speaking from personal experience), you will have to accept that you're bottlenecked by a single node. With a Dynamo-style quorum (for either reads or writes), a single straggler will not reduce whole-cluster latency.
The core point of Dynamo is low latency, availability and handling of all kinds of partitions: whether clean partitions (long term single node failures), transient failures (garbage collection pauses, slow disks, network blips, etc...), or even more complex dependent failures.
The reality, of course, is that availability is neither the sole nor the principal concern of every system. It's perfectly fine to trade off availability for other goals -- you just need to be aware of that trade off.
cap  distributed-databases  databases  quorum  availability  scalability  damien-katz  alex-feinberg  partitions  network  dynamo  riak  voldemort  couchbase 
may 2013 by jm
Riak, CAP, and eventual consistency
Good (albeit draft) write-up of the implications of CAP, allow_mult, and last_write_wins conflict-resolution policies in Riak:
As Brewer's CAP theorem established, distributed systems have to make hard choices. Network partition is inevitable. Hardware failure is inevitable. When a partition occurs, a well-behaved system must choose its behavior from a spectrum of options ranging from "stop accepting any writes until the outage is resolved" (thus maintaining absolute consistency) to "allow any writes and worry about consistency later" (to maximize availability). Riak leans toward the availability end of the spectrum, but allows the operator and even the developer to tune read and write requests to better meet the business needs for any given set of data.
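With allow_mult, the application is handed the conflicting sibling versions and must merge them itself; the classic illustration (a generic sketch, not Riak client code) is a set-union merge in the style of the Dynamo shopping cart:

    # Application-level sibling resolution under allow_mult: the store returns
    # every conflicting version ("sibling") and the client merges them. With a
    # set-valued record, union is a safe merge (deletes need more care).
    def resolve_siblings(siblings):
        merged = set()
        for sibling in siblings:
            merged |= sibling
        return merged

    # Two writes accepted on opposite sides of a partition become two siblings.
    print(resolve_siblings([{"milk", "eggs"}, {"milk", "bread"}]))
    # {'milk', 'eggs', 'bread'} -- neither write is silently lost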
riak  cap  eventual-consistency  distcomp  distributed-systems  partition  last-write-wins  voldemort  allow_mult 
april 2013 by jm
Eventual Consistency Today: Limitations, Extensions, and Beyond - ACM Queue
Good overview of the current state of eventually-consistent data store research, covering CALM and CRDTs, from Peter Bailis and Ali Ghodsi
eventual-consistency  data  storage  horizontal-scaling  research  distcomp  distributed-systems  via:martin-thompson  crdts  calm  acid  cap 
april 2013 by jm
Notes on Distributed Systems for Young Bloods
'Below is a list of some lessons I’ve learned as a distributed systems engineer that are worth being told to a new engineer. Some are subtle, and some are surprising, but none are controversial. This list is for the new distributed systems engineer to guide their thinking about the field they are taking on. It’s not comprehensive, but it’s a good beginning.' This is a pretty nice list, a little over-stated, but that's the format. I particularly like the following: 'Exploit data-locality'; 'Learn to estimate your capacity'; 'Metrics are the only way to get your job done'; 'Use percentiles, not averages'; 'Extract services'.
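The "percentiles, not averages" point is easy to demonstrate with a quick sketch (mine, not from the post):

    # A handful of slow requests barely move the mean but dominate the tail
    # latency that users actually experience.
    import statistics

    latencies_ms = [10] * 98 + [900, 1200]  # 98 fast requests, 2 very slow ones
    mean = statistics.mean(latencies_ms)
    p99 = sorted(latencies_ms)[int(0.99 * len(latencies_ms)) - 1]
    print(f"mean={mean:.1f}ms p99={p99}ms")  # mean ~31ms, p99 = 900ms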
systems  distributed  distcomp  cap  metrics  coding  guidelines  architecture  backpressure  design  twitter 
january 2013 by jm
Spanner: Google's Globally-Distributed Database [PDF]

Abstract: Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.

To appear in:
OSDI'12: Tenth Symposium on Operating System Design and Implementation, Hollywood, CA, October, 2012.
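The clock-uncertainty API (TrueTime) mentioned in the abstract is the interesting part; a toy sketch of the commit-wait idea it enables, assuming only that the clock can report an error bound:

    # Toy Spanner-style commit wait: given a clock that exposes an uncertainty
    # interval, a transaction picks a commit timestamp and then waits until
    # that timestamp is unambiguously in the past before making its writes
    # visible -- the basis of the paper's external consistency.
    import time

    CLOCK_EPSILON_S = 0.007  # assumed uncertainty bound, e.g. ~7ms

    def now_interval():
        t = time.time()
        return t - CLOCK_EPSILON_S, t + CLOCK_EPSILON_S  # [earliest, latest]

    def commit_wait(commit_ts):
        while now_interval()[0] <= commit_ts:  # block until even the earliest
            time.sleep(0.001)                  # possible "now" is past commit_ts

    commit_ts = now_interval()[1]  # a timestamp guaranteed not earlier than now
    commit_wait(commit_ts)         # wait out the uncertainty before replying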
database  distributed  google  papers  toread  pdf  scalability  distcomp  transactions  cap  consistency 
september 2012 by jm
Ask For Forgiveness Programming - Or How We'll Program 1000 Cores
Nifty concept from IBM Research's David Ungar -- "race-and-repair". Simply put, allow lock-free lossy/inconsistent calculation, and backfill later, using concepts like "freshener" threads, to reconcile inconsistencies. This is a familiar concept in distributed computing nowadays thanks to CAP, but I hadn't heard it being applied to single-host multicore parallel programming before -- I can already think of an application in our codebase...
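A toy sketch of the idea (mine, not Ungar's code): workers bump a shared total with no locking, while a background "freshener" periodically recomputes the true value and repairs it:

    # Toy "race-and-repair": worker threads do lock-free (and therefore lossy)
    # updates to a shared total, while a freshener thread periodically
    # recomputes the correct answer from the source data and backfills it.
    import threading, time

    items = list(range(1000))
    state = {"total": 0}

    def worker(chunk):
        for x in chunk:
            state["total"] += x          # racy read-modify-write, on purpose

    def freshener(stop):
        while not stop.is_set():
            time.sleep(0.01)
            state["total"] = sum(items)  # reconcile any lost updates

    stop = threading.Event()
    threading.Thread(target=freshener, args=(stop,), daemon=True).start()
    workers = [threading.Thread(target=worker, args=(items[i::4],)) for i in range(4)]
    for t in workers: t.start()
    for t in workers: t.join()
    time.sleep(0.02); stop.set()
    print(state["total"] == sum(items))  # True once the freshener has run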
race-and-repair  concurrency  coding  ibm  parallelism  parallel  david-ungar  cap  multicore 
april 2012 by jm
How to beat the CAP theorem
Nathan "Storm" Marz on building a dual realtime/batch stack. This lines up with something I've been building in work, so I'm happy ;)
nathan-marz  realtime  batch  hadoop  storm  big-data  cap 
october 2011 by jm

