jm + consensus   8

October 21 post-incident analysis | The GitHub Blog
A network outage caused a split-brain scenario, and their failover system allowed writes to occur in both
regional databases. As a result, once the outage was repaired it was impossible to reconcile the writes automatically.

Embarrassingly, this exact scenario was called out in their previous blog post about their Raft-based failover system at https://githubengineering.com/mysql-high-availability-at-github/ --

"In a data center isolation scenario, and assuming a master is in the isolated DC, apps in that DC are still able to write to the master. This may result in state inconsistency once network is brought back up. We are working to mitigate this split-brain by implementing a reliable STONITH from within the very isolated DC. As before, some time will pass before bringing down the master, and there could be a short period of split-brain. The operational cost of avoiding split-brains altogether is very high."

Failover is hard.
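The mitigation described in the quote -- a master in an isolated DC fencing itself off once it notices it has lost its peers -- might look roughly like the sketch below. Purely illustrative: the peer list, TCP health check, and quorumReachable helper are hypothetical, not GitHub's actual orchestrator/Raft tooling.

package main

import (
	"fmt"
	"net"
	"time"
)

// quorumReachable reports whether this node can reach a strict majority of
// the cluster (counting itself). The TCP dial is a stand-in health check.
func quorumReachable(peers []string) bool {
	reachable := 0
	for _, p := range peers {
		conn, err := net.DialTimeout("tcp", p, 500*time.Millisecond)
		if err == nil {
			conn.Close()
			reachable++
		}
	}
	return (reachable+1)*2 > len(peers)+1
}

func main() {
	// Hypothetical replicas in other datacenters.
	peers := []string{"10.0.1.2:3306", "10.0.2.2:3306"}
	writable := true
	for range time.Tick(5 * time.Second) {
		if !quorumReachable(peers) {
			// Self-fence: stop accepting writes rather than risk split-brain.
			writable = false
			fmt.Println("lost quorum; demoting to read-only")
		} else if !writable {
			fmt.Println("quorum restored; staying read-only until an operator intervenes")
		}
	}
}
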
github  fail  outages  failover  replication  consensus  ops 
17 days ago by jm
Stellar/Ripple suffer a failure of their consensus system, resulting in a split-brain failure
Prof. Mazières’s research indicated some risk that consensus could fail, though we were not certain if the required circumstances for such a failure were realistic. This week, we discovered the first instance of a consensus failure. On Tuesday night, the nodes on the network began to disagree and caused a fork of the ledger. The majority of the network was on ledger chain A. At some point, the network decided to switch to ledger chain B. This caused the rollback of a few hours of transactions that had only been recorded on chain A. We were able to replay most of these rolled back transactions on chain B to minimize the impact. However, in cases where an account had already sent a transaction on chain B the replay wasn’t possible.
consensus  distcomp  stellar  ripple  split-brain  postmortems  outages  ledger-fork  payment 
december 2014 by jm
"Ark: A Real-World Consensus Implementation" [paper]
"an implementation of a consensus algorithm similar to Paxos and Raft, designed as an improvement over the existing consensus algorithm used by MongoDB and TokuMX."

It'll be interesting to see how this gets on in review from the distributed-systems community. The phrase "similar to Paxos and Raft" is both worrying and promising ;)
paxos  raft  consensus  algorithms  distsys  distributed  leader-election  mongodb  tokumx 
july 2014 by jm
"In Search of an Understandable Consensus Algorithm"
Diego Ongaro and John Ousterhout, USENIX ATC 2014 -- this paper on the Raft algorithm won the best paper award. (via Eoin Brazil)
raft  consensus  algorithms  distcomp  john-ousterhout  via:eoinbrazil  usenix  atc  papers  paxos 
july 2014 by jm
CockroachDB
a distributed key/value datastore which supports ACID transactional semantics and versioned values as first-class features. The primary design goal is global consistency and survivability, hence the name. Cockroach aims to tolerate disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention. Cockroach nodes are symmetric; a design goal is one binary with minimal configuration and no required auxiliary services.

Cockroach implements a single, monolithic sorted map from key to value where both keys and values are byte strings (not unicode). Cockroach scales linearly (theoretically up to 4 exabytes (4E) of logical data). The map is composed of one or more ranges and each range is backed by data stored in RocksDB (a variant of LevelDB), and is replicated to a total of three or more cockroach servers. Ranges are defined by start and end keys. Ranges are merged and split to maintain total byte size within a globally configurable min/max size interval. Range sizes default to target 64M in order to facilitate quick splits and merges and to distribute load at hotspots within a key range. Range replicas are intended to be located in disparate datacenters for survivability (e.g. { US-East, US-West, Japan }, { Ireland, US-East, US-West}, { Ireland, US-East, US-West, Japan, Australia }).

Single mutations to ranges are mediated via an instance of a distributed consensus algorithm to ensure consistency. We’ve chosen to use the Raft consensus algorithm. All consensus state is stored in RocksDB.

A single logical mutation may affect multiple key/value pairs. Logical mutations have ACID transactional semantics. If all keys affected by a logical mutation fall within the same range, atomicity and consistency are guaranteed by Raft; this is the fast commit path. Otherwise, a non-locking distributed commit protocol is employed between affected ranges.

Cockroach provides snapshot isolation (SI) and serializable snapshot isolation (SSI) semantics, allowing externally consistent, lock-free reads and writes--both from an historical snapshot timestamp and from the current wall clock time. SI provides lock-free reads and writes but still allows write skew. SSI eliminates write skew, but introduces a performance hit in the case of a contentious system. SSI is the default isolation; clients must consciously decide to trade correctness for performance. Cockroach implements a limited form of linearizability, providing ordering for any observer or chain of observers.


This looks nifty. One to watch.
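A rough sketch of the range-based data model described above, just to make the idea concrete. The Range type, the in-memory lookup, and the split check are hypothetical illustrations, not Cockroach's actual code -- real ranges are backed by RocksDB and replicated via Raft.

package main

import (
	"bytes"
	"fmt"
)

// Range covers the byte-string keys in [Start, End) and tracks how much
// data it holds. A nil End marks the unbounded final range.
type Range struct {
	Start, End []byte
	SizeBytes  int64
}

// The 64MB split target mentioned in the description above.
const maxRangeBytes = 64 << 20

// rangeFor returns the range whose [Start, End) interval contains key.
func rangeFor(ranges []Range, key []byte) *Range {
	for i := range ranges {
		r := &ranges[i]
		if bytes.Compare(key, r.Start) >= 0 &&
			(len(r.End) == 0 || bytes.Compare(key, r.End) < 0) {
			return r
		}
	}
	return nil
}

// needsSplit reports whether a range has grown past the configured maximum.
func needsSplit(r *Range) bool { return r.SizeBytes > maxRangeBytes }

func main() {
	ranges := []Range{
		{Start: []byte(""), End: []byte("m")},
		{Start: []byte("m"), End: nil},
	}
	r := rangeFor(ranges, []byte("users/42"))
	r.SizeBytes += 128 // pretend we wrote a 128-byte value
	fmt.Printf("key lands in range [%q, %q), split needed: %v\n",
		r.Start, r.End, needsSplit(r))
}
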
cockroachdb  databases  storage  georeplication  raft  consensus  acid  go  key-value-stores  rocksdb 
may 2014 by jm
Replicant: Replicated State Machines Made Easy
The next time you reach for ZooKeeper, ask yourself whether it provides the primitive you really need. If ZooKeeper's filesystem and znode abstractions truly meet your needs, great. But the odds are, you'll be better off writing your application as a replicated state machine.
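For a feel of what "your application as a replicated state machine" means, here's a toy sketch: a deterministic key/value machine whose only input is a log of commands, with a plain slice standing in for the consensus-ordered log that Paxos/Raft (or Replicant) would provide.

package main

import "fmt"

// Command is one entry in the replicated log.
type Command struct {
	Op    string // "put" or "delete"
	Key   string
	Value string
}

// KVStateMachine is the application: a deterministic state machine whose
// state is entirely a function of the commands applied, in log order.
type KVStateMachine struct {
	data map[string]string
}

func NewKVStateMachine() *KVStateMachine {
	return &KVStateMachine{data: make(map[string]string)}
}

// Apply must be deterministic: every replica that applies the same log
// prefix ends up in exactly the same state.
func (m *KVStateMachine) Apply(c Command) {
	switch c.Op {
	case "put":
		m.data[c.Key] = c.Value
	case "delete":
		delete(m.data, c.Key)
	}
}

func main() {
	log := []Command{
		{Op: "put", Key: "leader", Value: "node-1"},
		{Op: "put", Key: "config", Value: "v2"},
		{Op: "delete", Key: "leader"},
	}
	// Each replica replays the same log and converges to the same state.
	replica := NewKVStateMachine()
	for _, c := range log {
		replica.Apply(c)
	}
	fmt.Println(replica.data)
}
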
zookeeper  paxos  replicant  replication  consensus  state-machines  distcomp 
december 2013 by jm
mgodave/barge
A JVM Implementation of the Raft Consensus Protocol

Looks very alpha, but one to watch.
via:sbtourist  raft  jvm  java  consensus  distributed-computing 
november 2013 by jm
