jm + partition   4

Call me maybe: Elasticsearch 1.5.0
tl;dr: Elasticsearch still hoses data integrity on partition, badly
elasticsearch  reliability  data  storage  safety  jepsen  testing  aphyr  partition  network-partitions  cap 
may 2015 by jm
Call me maybe: Elasticsearch
Wow, these are terrible results. From the sounds of it, ES just cannot deal with realistic outage scenarios and is liable to suffer catastrophic damage in reasonably-common partitions.
If you are an Elasticsearch user (as I am): good luck. Some people actually advocate using Elasticsearch as a primary data store; I think this is somewhat less than advisable at present. If you can, store your data in a safer database, and feed it into Elasticsearch gradually. Have processes in place that continually traverse the system of record, so you can recover from ES data loss automatically.
elasticsearch  ops  storage  databases  jepsen  partition  network  outages  reliability 
june 2014 by jm
The network is reliable
Aphyr and Peter Bailis collect an authoritative list of known network partition and outage cases from published post-mortem data:

This post is meant as a reference point -- to illustrate that, according to a wide range of accounts, partitions occur in many real-world environments. Processes, servers, NICs, switches, local and wide area networks can all fail, and the resulting economic consequences are real. Network outages can suddenly arise in systems that are stable for months at a time, during routine upgrades, or as a result of emergency maintenance. The consequences of these outages range from increased latency and temporary unavailability to inconsistency, corruption, and data loss. Split-brain is not an academic concern: it happens to all kinds of systems -- sometimes for days on end. Partitions deserve serious consideration.


I honestly cannot understand people who didn't think this was the case. 3 years reading (and occasionally auto-cutting) Amazon's network-outage tickets as part of AWS network monitoring will do that to you I guess ;)
networking  outages  partition  cap  failure  fault-tolerance 
june 2013 by jm
Riak, CAP, and eventual consistency
Good (albeit draft) write-up of the implications of CAP, allow_mult, and last_write_wins conflict-resolution policies in Riak:
As Brewer's CAP theorem established, distributed systems have to make hard choices. Network partition is inevitable. Hardware failure is inevitable. When a partition occurs, a well-behaved system must choose its behavior from a spectrum of options ranging from "stop accepting any writes until the outage is resolved" (thus maintaining absolute consistency) to "allow any writes and worry about consistency later" (to maximize availability). Riak leans toward the availability end of the spectrum, but allows the operator and even the developer to tune read and write requests to better meet the business needs for any given set of data.
riak  cap  eventual-consistency  distcomp  distributed-systems  partition  last-write-wins  voldemort  allow_mult 
april 2013 by jm

Copy this bookmark:



description:


tags: