jm + network-partitions   7

Call me maybe: Elasticsearch 1.5.0
tl;dr: Elasticsearch still hoses data integrity on partition, badly
elasticsearch  reliability  data  storage  safety  jepsen  testing  aphyr  partition  network-partitions  cap 
may 2015 by jm
You Cannot Have Exactly-Once Delivery
Cut out and keep:
Within the context of a distributed system, you cannot have exactly-once message delivery. Web browser and server? Distributed. Server and database? Distributed. Server and message queue? Distributed. You cannot have exactly-once delivery semantics in any of these situations.
distributed  distcomp  exactly-once-delivery  networking  outages  network-partitions  byzantine-generals  reference 
march 2015 by jm
Why You Shouldn’t Use ZooKeeper for Service Discovery
In CAP terms, ZooKeeper is CP, meaning that it’s consistent in the face of partitions, not available. For many things that ZooKeeper does, this is a necessary trade-off. Since ZooKeeper is first and foremost a coordination service, having an eventually consistent design (being AP) would be a horrible design decision. Its core consensus algorithm, Zab, is therefore all about consistency. For coordination, that’s great. But for service discovery it’s better to have information that may contain falsehoods than to have no information at all. It is much better to know what servers were available for a given service five minutes ago than to have no idea what things looked like due to a transient network partition. The guarantees that ZooKeeper makes for coordination are the wrong ones for service discovery, and it hurts you to have them.

Yes! I've been saying this for months -- good to see others concurring.
architecture  zookeeper  eureka  outages  network-partitions  service-discovery  cap  partitions 
december 2014 by jm
Why We Didn’t Use Kafka for a Very Kafka-Shaped Problem
A good story of when Kafka _didn't_ fit the use case:
We came up with a complicated process of app-level replication for our messages into two separate Kafka clusters. We would then do end-to-end checking of the two clusters, detecting dropped messages in each cluster based on messages that weren’t in both.

It was ugly. It was clearly going to be fragile and error-prone. It was going to be a lot of app-level replication and horrible heuristics to see when we were losing messages and at least alert us, even if we couldn’t fix every failure case.

Despite us building a Kafka prototype for our ETL — having an existing investment in it — it just wasn’t going to do what we wanted. And that meant we needed to leave it behind, rewriting the ETL prototype.
cassandra  java  kafka  scala  network-partitions  availability  multi-region  multi-az  aws  replication  onlive 
november 2014 by jm
SmartStack vs. Consul
One of the SmartStack developers at AirBNB responds to's comments. FWIW, we use SmartStack in Swrve and it works pretty well...
smartstack  airbnb  ops  consul  serf  load-balancing  availability  resiliency  network-partitions  outages 
may 2014 by jm
Resiliency And Elasticsearch
Blog post from the ES team. They use "evil tests" -- basically unit/system tests, particularly using randomized error-injecting mock infrastructure. Good practices; I've done the same myself quite recently for Swrve's realtime infrastructure
elasticsearch  resiliency  network-partitions  reliability  testing  mocking  error-injection 
april 2014 by jm
Twilio Billing Incident Post-Mortem
At 1:35 AM PDT on July 18, a loss of network connectivity caused all billing redis-slaves to simultaneously disconnect from the master. This caused all redis-slaves to reconnect and request full synchronization with the master at the same time. Receiving full sync requests from each redis-slave caused the master to suffer extreme load, resulting in performance degradation of the master and timeouts from redis-slaves to redis-master.
By 2:39 AM PDT the host’s load became so extreme, services relying on redis-master began to fail. At 2:42 AM PDT, our monitoring system alerted our on-call engineering team of a failure in the Redis cluster. Observing extreme load on the host, the redis process on redis-master was misdiagnosed as requiring a restart to recover. This caused redis-master to read an incorrect configuration file, which in turn caused Redis to attempt to recover from a non-existent AOF file, instead of the binary snapshot. As a result of that failed recovery, redis-master dropped all balance data. In addition to forcing recovery from a non-existent AOF, an incorrect configuration also caused redis-master to boot as a slave of itself, putting it in read-only mode and preventing the billing system from updating account balances.

See also for antirez' response.

Here's the takeaways I'm getting from it:

1. network partitions happen in production, and cause cascading failures. this is a great demo of that.

2. don't store critical data in Redis. this was the case for Twilio -- as far as I can tell they were using Redis as a front-line cache for billing data -- but it's worth saying anyway. ;)

3. Twilio were just using Redis as a cache, but a bug in their code meant that the writes to the backing SQL store were not being *read*, resulting in repeated billing and customer impact. In other words, it turned a (fragile) cache into the authoritative store.

4. they should probably have designed their code so that write failures would not result in repeated billing for customers -- that's a bad failure path.

Good post-mortem anyway, and I'd say their customers are a good deal happier to see this published, even if it contains details of the mistakes they made along the way.
redis  caching  storage  networking  network-partitions  twilio  postmortems  ops  billing  replication 
july 2013 by jm

Copy this bookmark: