jm + failover   12

October 21 post-incident analysis | The GitHub Blog
A network outage caused a split-brain scenario, and their failover system allowed writes to occur in both regional databases. Once the outage was repaired, it was impossible to reconcile those writes automatically.

Embarrassingly, this exact scenario was called out in their previous blog post about their Raft-based failover system:

"In a data center isolation scenario, and assuming a master is in the isolated DC, apps in that DC are still able to write to the master. This may result in state inconsistency once network is brought back up. We are working to mitigate this split-brain by implementing a reliable STONITH from within the very isolated DC. As before, some time will pass before bringing down the master, and there could be a short period of split-brain. The operational cost of avoiding split-brains altogether is very high."

Failover is hard.
github  fail  outages  failover  replication  consensus  ops 
17 days ago by jm
Cross-Region Read Replicas for Amazon Aurora
Creating a read replica in another region also creates an Aurora cluster in the region. This cluster can contain up to 15 more read replicas, with very low replication lag (typically less than 20 ms) within the region (between regions, latency will vary based on the distance between the source and target). You can use this model to duplicate your cluster and read replica setup across regions for disaster recovery. In the event of a regional disruption, you can promote the cross-region replica to be the master. This will allow you to minimize downtime for your cross-region application. This feature applies to unencrypted Aurora clusters.
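For the record, setting one of these up (and promoting it during a regional outage) is only a couple of API calls. A rough boto3 sketch; the cluster identifiers, region and instance class below are made up, the calls themselves are the standard RDS API:

    import boto3

    # Hypothetical source-cluster ARN in the primary region.
    SOURCE_CLUSTER_ARN = "arn:aws:rds:us-east-1:123456789012:cluster:orders-primary"

    rds = boto3.client("rds", region_name="eu-west-1")  # the target (replica) region

    # Create the cross-region read-replica cluster from the source Aurora cluster.
    rds.create_db_cluster(
        DBClusterIdentifier="orders-replica",
        Engine="aurora",
        ReplicationSourceIdentifier=SOURCE_CLUSTER_ARN,
    )

    # A cluster has no capacity of its own; add at least one reader instance.
    rds.create_db_instance(
        DBInstanceIdentifier="orders-replica-1",
        DBInstanceClass="db.r3.large",
        Engine="aurora",
        DBClusterIdentifier="orders-replica",
    )

    # During a regional disruption: break replication and promote the replica
    # cluster to a standalone, writable cluster in the surviving region.
    rds.promote_read_replica_db_cluster(DBClusterIdentifier="orders-replica")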
aws  mysql  databases  storage  replication  cross-region  failover  reliability  aurora 
june 2016 by jm
Chaos Engineering Upgraded
some details on Netflix's Chaos Monkey, Chaos Kong and other aspects of their availability/failover testing
architecture  aws  netflix  ops  chaos-monkey  chaos-kong  testing  availability  failover  ha 
september 2015 by jm
Uber Goes Unconventional: Using Driver Phones as a Backup Datacenter - High Scalability
Initially I thought they were just tracking client state on the phone, but it actually sounds like they're replicating other users' state, too. Mad stuff! Must cost a fortune in additional data transfer costs...
scalability  failover  multi-dc  uber  replication  state  crdts 
september 2015 by jm
Aurora for MySQL is coming
'Anurag@AWS posts a quite interesting comment on Aurora failover: We asynchronously write to 6 copies and ack the write when we see four completions. So, traditional 4/6 quorums with synchrony as you surmised. Now, each log record can end up with an independent quorum from any other log record, which helps with jitter, but introduces some sophistication in recovery protocols. We peer to peer to fill in holes. We also will repair bad segments in the background, and downgrade to a 3/4 quorum if unable to place in an AZ for any extended period. You need a pretty bad failure to get a write outage.' (via High Scalability)
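Nothing like Aurora's actual storage code, obviously, but as a toy illustration of the 4/6 write-quorum idea in that comment (write each log record to all six copies asynchronously, ack as soon as four confirm, and let a read quorum of three guarantee overlap):

    import concurrent.futures

    REPLICAS = 6       # six storage copies, two per AZ
    WRITE_QUORUM = 4   # ack once 4/6 confirm; a read quorum of 3 then always overlaps

    def quorum_write(log_record, storage_nodes, timeout=1.0):
        """storage_nodes: six callables node(record) -> bool, one per copy."""
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=REPLICAS)
        futures = [pool.submit(node, log_record) for node in storage_nodes]
        acks = 0
        try:
            for fut in concurrent.futures.as_completed(futures, timeout=timeout):
                try:
                    if fut.result():
                        acks += 1
                except Exception:
                    pass   # a failed copy simply doesn't count towards the quorum
                if acks >= WRITE_QUORUM:
                    return True   # acknowledged; slower copies catch up in the background
        except concurrent.futures.TimeoutError:
            pass
        finally:
            pool.shutdown(wait=False)   # never block the ack on the slowest replicas
        return False   # quorum not reached: retry, or repair the bad segments

The "independent quorum per log record" bit falls out naturally: the four acks can come from a different subset of the six copies for every record, which is exactly why recovery has to fill holes peer-to-peer.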
via:highscalability  mysql  aurora  failover  fault-tolerance  aws  replication  quorum 
december 2014 by jm
DynamoDB Streams
This is pretty awesome. All changes to a DynamoDB table can be consumed as a change stream, MySQL-replication-style, via a Kinesis-style API.

The nice bit is that there's a solid way to ensure readers won't get overwhelmed by the stream volume (since DynamoDB tables are IOPS-rate-limited), and a solid way to re-read missed updates (since it's a Kafka-style windowed persistent stream). With this you have a pretty reliable way to ensure you're not going to suffer data loss.
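A minimal sketch of consuming the change stream with the low-level boto3 DynamoDB Streams API; the stream ARN below is hypothetical, and a production consumer would keep polling (or use the Kinesis Client Library adapter) rather than stop when it catches up:

    import boto3

    streams = boto3.client("dynamodbstreams")

    # Hypothetical ARN; in practice it comes from describe_table()["Table"]["LatestStreamArn"]
    # once streams are enabled on the table.
    STREAM_ARN = "arn:aws:dynamodb:us-east-1:123456789012:table/orders/stream/2014-11-25T00:00:00.000"

    desc = streams.describe_stream(StreamArn=STREAM_ARN)["StreamDescription"]

    for shard in desc["Shards"]:
        it = streams.get_shard_iterator(
            StreamArn=STREAM_ARN,
            ShardId=shard["ShardId"],
            ShardIteratorType="TRIM_HORIZON",   # replay from the oldest retained change
        )["ShardIterator"]

        while it:
            resp = streams.get_records(ShardIterator=it, Limit=100)
            for record in resp["Records"]:
                # each record describes one item-level change (INSERT/MODIFY/REMOVE)
                print(record["eventName"], record["dynamodb"].get("Keys"))
            if not resp["Records"]:
                break   # caught up; a real consumer would keep polling here
            it = resp.get("NextShardIterator")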
iops  dynamodb  aws  kinesis  reliability  replication  multi-az  multi-region  failover  streaming  kafka 
november 2014 by jm
Game Day Exercises at Stripe: Learning from `kill -9`
We’ve started running game day exercises at Stripe. During a recent game day, we tested failing over a Redis cluster by running kill -9 on its primary node, and ended up losing all data in the cluster. We were very surprised by this, but grateful to have found the problem in testing. This result and others from this exercise convinced us that game days like these are quite valuable, and we would highly recommend them for others.

Excellent post. Game days are a great idea. Also: massive Redis clustering fail
game-days  redis  testing  stripe  outages  ops  kill-9  failover 
october 2014 by jm
Amazon Route 53 Infima
Colm MacCárthaigh has open sourced Infima, 'a library for managing service-level fault isolation using Amazon Route 53'.
Infima provides a Lattice container framework that allows you to categorize each endpoint along one or more fault-isolation dimensions such as availability-zone, software implementation, underlying datastore or any other common point of dependency endpoints may share.

Infima also introduces a new ShuffleShard sharding type that can exponentially increase the endpoint-level isolation between customer/object access patterns or any other identifier you choose to shard on.

Both Infima Lattices and ShuffleShards can also be automatically expressed in Route 53 DNS failover configurations using AnswerSet and RubberTree.
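Infima itself is a Java library, but the core ShuffleShard idea is easy to sketch (this is an illustrative toy, not Infima's API): hash each customer identifier to a deterministic, pseudo-random subset of endpoints, so one customer's "poison" traffic can only ever hit its own small shard of the fleet.

    import hashlib
    import random

    def shuffle_shard(identifier, endpoints, shard_size=2):
        """Deterministically pick `shard_size` endpoints for this identifier."""
        seed = int.from_bytes(hashlib.sha256(identifier.encode()).digest()[:8], "big")
        return random.Random(seed).sample(sorted(endpoints), shard_size)

    endpoints = ["10.0.0.%d" % i for i in range(1, 9)]   # 8 hypothetical endpoints
    print(shuffle_shard("customer-a", endpoints))        # e.g. ['10.0.0.3', '10.0.0.7']
    print(shuffle_shard("customer-b", endpoints))        # very likely a different pair

With 8 endpoints and a shard size of 2 there are 28 possible shards, and the number of distinct shards grows combinatorially with shard size, which is the "exponential" increase in isolation the announcement refers to.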
infima  colmmacc  dns  route-53  fault-tolerance  failover  multi-az  sharding  service-discovery 
november 2013 by jm
Building a Modern Website for Scale (QCon NY 2013) [slides]
some great scalability ideas from LinkedIn. Particularly interesting are the best practices suggested for scaling web services:

1. store client-call timeouts and SLAs in Zookeeper for each REST endpoint;
2. isolate backend calls using async/threadpools;
3. cancel work on failures;
4. avoid sending requests to GC'ing hosts;
5. rate limits on the server.

#4 is particularly cool. They do this using a "GC scout" request before every "real" request: a cheap TCP request to a dedicated "scout" Netty port, which replies near-instantly. If it comes back with a 1-packet response within 1 millisecond, send the real request; otherwise fail over immediately to the next host in the failover set.

There's still a potential race condition where the "GC scout" response comes back quickly, but a GC starts just before the "real" request is issued. Still, the incidence of a GC blocking a request is probably massively reduced.

It also helps against packet loss on the rack or at the server host, since a lost packet means one of the scout's TCP packets is dropped, and the TCP retransmit timeout will certainly be higher than 1ms, so the deadline is missed. (UDP would probably work just as well, for this reason.) However, in the case of packet loss in the client's own network vicinity, it is vital to still attempt the request against the final host in the failover set regardless of a GC-scout failure, otherwise all requests may be skipped.

The GC-scout system also helps shift request load away from heavily-loaded hosts, or hosts performing poorly for other reasons: they'll miss their 1 millisecond deadline and the request will be shunted off elsewhere.

For service APIs with real low-latency requirements, this is a great idea.
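The slides don't give code, but the scout check boils down to roughly the following (the scout port number and the exact 1 ms budget here are illustrative):

    import socket

    SCOUT_TIMEOUT = 0.001   # the 1 ms scout budget

    def host_is_responsive(host, scout_port=7999):
        """Probe the dedicated scout port; a GC-paused, overloaded or unreachable
        host won't get a reply back within the deadline."""
        try:
            with socket.create_connection((host, scout_port), timeout=SCOUT_TIMEOUT) as s:
                s.settimeout(SCOUT_TIMEOUT)
                s.sendall(b"ping\n")
                return bool(s.recv(16))   # any single-packet reply in time counts
        except OSError:
            return False

    def call_with_gc_scout(failover_hosts, send_request):
        """Skip hosts that fail the scout check, but always try the final host
        unconditionally, so client-side packet loss can't drop the request."""
        for host in failover_hosts[:-1]:
            if host_is_responsive(host):
                return send_request(host)
        return send_request(failover_hosts[-1])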
gc-scout  gc  java  scaling  scalability  linkedin  qcon  async  threadpools  rest  slas  timeouts  networking  distcomp  netty  tcp  udp  failover  fault-tolerance  packet-loss 
june 2013 by jm
Is Your MySQL Buffer Pool Warm? Make It Sweat!
How Groupon are warming up a MySQL failover spare, using Percona tooling and a "tee" of the live in-flight queries. (via Dave Doran)
via:dave-doran  mysql  databases  warm-spares  spares  failover  groupon  percona  replication 
april 2013 by jm
Fault Tolerance in a High Volume, Distributed System
Netflix's "DependencyCommand", a resiliency system for SOA inter-service network calls, offering builtin support for threadpools, timeouts, retries and graceful failover. Very nice
netflix  architecture  concurrency  distributed  failover  ha  resiliency  fail-fast  failsafe  soa  fault-tolerance 
march 2012 by jm
