jm + replication   39

October 21 post-incident analysis | The GitHub Blog
A network outage caused a split-brain scenario, and their failover system allowed writes to occur in both regional databases. As a result, once the outage was repaired, it was impossible to reconcile the writes in an automated fashion.

Embarrassingly, this exact scenario was called out in their previous blog post about their Raft-based failover system at https://githubengineering.com/mysql-high-availability-at-github/ --

"In a data center isolation scenario, and assuming a master is in the isolated DC, apps in that DC are still able to write to the master. This may result in state inconsistency once network is brought back up. We are working to mitigate this split-brain by implementing a reliable STONITH from within the very isolated DC. As before, some time will pass before bringing down the master, and there could be a short period of split-brain. The operational cost of avoiding split-brains altogether is very high."

Failover is hard.
github  fail  outages  failover  replication  consensus  ops 
17 days ago by jm
Cross-Region Read Replicas for Amazon Aurora
Creating a read replica in another region also creates an Aurora cluster in the region. This cluster can contain up to 15 more read replicas, with very low replication lag (typically less than 20 ms) within the region (between regions, latency will vary based on the distance between the source and target). You can use this model to duplicate your cluster and read replica setup across regions for disaster recovery. In the event of a regional disruption, you can promote the cross-region replica to be the master. This will allow you to minimize downtime for your cross-region application. This feature applies to unencrypted Aurora clusters.
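
A rough sketch of that workflow with boto3 (the cluster identifiers, ARN and regions below are hypothetical, and the exact create_db_cluster parameters should be checked against the current RDS docs):

    import boto3

    SOURCE_CLUSTER_ARN = "arn:aws:rds:us-east-1:123456789012:cluster:prod-aurora"
    REPLICA_REGION = "us-west-2"
    REPLICA_CLUSTER_ID = "prod-aurora-replica"

    rds = boto3.client("rds", region_name=REPLICA_REGION)

    # Create the cross-region replica: an Aurora cluster in the target region
    # that replicates from the source cluster.
    rds.create_db_cluster(
        DBClusterIdentifier=REPLICA_CLUSTER_ID,
        Engine="aurora",
        ReplicationSourceIdentifier=SOURCE_CLUSTER_ARN,
    )

    # ...add DB instances to the replica cluster as needed. On a regional
    # disruption, promote the replica cluster to be a standalone master:
    rds.promote_read_replica_db_cluster(DBClusterIdentifier=REPLICA_CLUSTER_ID)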
aws  mysql  databases  storage  replication  cross-region  failover  reliability  aurora 
june 2016 by jm
Jepsen: RethinkDB 2.1.5
A good review of RethinkDB! Hopefully not just because this test is contract work on behalf of the RethinkDB team ;)
I’ve run hundreds of tests against RethinkDB at majority/majority, at various timescales, request rates, concurrencies, and with different types of failures. Consistent with the documentation, I have never found a linearization failure with these settings. If you use hard durability, majority writes, and majority reads, single-document ops in RethinkDB appear safe.
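
A rough sketch of those settings with the RethinkDB Python driver (host, database and table names are made up; the config/optarg spellings are as I recall them, so check the driver docs):

    import rethinkdb as r

    conn = r.connect(host="localhost", port=28015, db="test")

    # Table-level write acknowledgements: wait for a majority of replicas.
    r.table("accounts").config().update({"write_acks": "majority"}).run(conn)

    # Hard durability: the write is acknowledged only after it hits disk.
    r.table("accounts").insert({"id": 1, "balance": 100},
                               durability="hard").run(conn)

    # Majority reads: routed through the primary, confirmed by a majority.
    doc = r.table("accounts", read_mode="majority").get(1).run(conn)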
rethinkdb  databases  stores  storage  ops  availability  cap  jepsen  tests  replication 
january 2016 by jm
Uber Goes Unconventional: Using Driver Phones as a Backup Datacenter - High Scalability
Initially I thought they were just tracking client state on the phone, but it actually sounds like they're replicating other users' state, too. Mad stuff! Must cost a fortune in additional data transfer costs...
scalability  failover  multi-dc  uber  replication  state  crdts 
september 2015 by jm
S3's "s3-external-1.amazonaws.com" endpoint
public documentation of how to work around the legacy S3 multi-region replication behaviour in North America
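
For example, with boto3 you can point an S3 client at that endpoint explicitly (bucket and key names here are hypothetical):

    import boto3

    # s3-external-1 gives read-after-write consistency for new objects PUT
    # into the legacy US Standard (us-east-1) region, avoiding the older
    # cross-coast replicated endpoint behaviour.
    s3 = boto3.client("s3", region_name="us-east-1",
                      endpoint_url="https://s3-external-1.amazonaws.com")

    s3.put_object(Bucket="my-bucket", Key="new-object", Body=b"hello")
    print(s3.get_object(Bucket="my-bucket", Key="new-object")["Body"].read())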
aws  s3  eventual-consistency  consistency  us-east  replication  workarounds  legacy 
april 2015 by jm
Having Your Cake and Eating It Too: Jointly Optimal Erasure Codes for I/O, Storage, and Network-bandwidth | USENIX
Erasure codes, such as Reed-Solomon (RS) codes, are increasingly being deployed as an alternative to data-replication for fault tolerance in distributed storage systems. While RS codes provide significant savings in storage space, they can impose a huge burden on the I/O and network resources when reconstructing failed or otherwise unavailable data. A recent class of erasure codes, called minimum-storage-regeneration (MSR) codes, has emerged as a superior alternative to the popular RS codes, in that it minimizes network transfers during reconstruction while also being optimal with respect to storage and reliability. However, existing practical MSR codes do not address the increasingly important problem of I/O overhead incurred during reconstructions, and are, in general, inferior to RS codes in this regard. In this paper, we design erasure codes that are simultaneously optimal in terms of I/O, storage, and network bandwidth. Our design builds on top of a class of powerful practical codes, called the product-matrix-MSR codes. Evaluations show that our proposed design results in a significant reduction in the number of I/Os consumed during reconstructions (a 5x reduction for typical parameters), while retaining optimality with respect to storage, reliability, and network bandwidth.
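
To make the storage/I/O trade-off concrete, here's a back-of-the-envelope comparison of 3x replication against a hypothetical RS(k=10, m=4) code; the parameters are illustrative, not taken from the paper:

    # 3x replication: every block is stored three times, but a lost block is
    # rebuilt by copying a single surviving replica.
    replication_overhead = 3.0       # bytes stored per byte of user data
    replication_recovery_reads = 1   # blocks read to rebuild one lost block

    # Reed-Solomon with k=10 data blocks and m=4 parity blocks.
    k, m = 10, 4
    rs_overhead = (k + m) / k        # 1.4x storage
    rs_recovery_reads = k            # classic RS reads k blocks to rebuild one

    print(f"3x replication: {replication_overhead}x storage, "
          f"{replication_recovery_reads} block read per recovery")
    print(f"RS({k},{m}): {rs_overhead:.1f}x storage, "
          f"{rs_recovery_reads} block reads per recovery")
    # MSR codes keep the ~1.4x storage while cutting reconstruction network
    # traffic; this paper's contribution is doing the same for disk I/O.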
erasure-coding  reed-solomon  compression  reliability  reconstruction  replication  fault-tolerance  storage  bandwidth  usenix  papers 
february 2015 by jm
Personalization at Spotify using Cassandra
Lots and lots of good detail on the Spotify C* setup (via Bill de hOra)
via:dehora  spotify  cassandra  replication  storage  ops 
january 2015 by jm
"Incremental Stream Processing using Computational Conflict-free Replicated Data Types" [paper]
'Unlike existing alternatives, such as stream processing, that favor the execution of arbitrary application code, we want to capture much of the processing logic as a set of known operations over specialized Computational CRDTs, with particular semantics and invariants, such as min/max/average/median registers, accumulators, top-N sets, sorted sets/maps, and so on. Keeping state also allows the system to decrease the amount of propagated information. Preliminary results obtained in a single example show that Titan has a higher throughput when compared with state-of-the-art stream processing systems.'
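
A minimal sketch of one of those specialized types, a max register: its merge is commutative, associative and idempotent, so replicas can exchange state in any order and still converge. This illustrates the general idea only, not the paper's Titan implementation:

    class MaxRegister:
        """State-based CRDT holding the maximum value observed so far."""

        def __init__(self, value=float("-inf")):
            self.value = value

        def update(self, x):
            # Local update: keep the running maximum.
            self.value = max(self.value, x)

        def merge(self, other):
            # Merging replica states is just another max; order doesn't matter.
            self.value = max(self.value, other.value)

    a, b = MaxRegister(), MaxRegister()
    a.update(3); b.update(7); b.update(5)
    a.merge(b)   # a.value == 7, regardless of delivery order or repetition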
crdt  distributed  stream-processing  replication  titan  papers 
january 2015 by jm
Aurora for MySQL is coming
'Anurag@AWS posts a quite interesting comment on Aurora failover: We asynchronously write to 6 copies and ack the write when we see four completions. So, traditional 4/6 quorums with synchrony as you surmised. Now, each log record can end up with an independent quorum from any other log record, which helps with jitter, but introduces some sophistication in recovery protocols. We peer to peer to fill in holes. We also will repair bad segments in the background, and downgrade to a 3/4 quorum if unable to place in an AZ for any extended period. You need a pretty bad failure to get a write outage.' (via High Scalability)
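
The 4/6 write quorum itself is easy to sketch generically: send the log record to all six storage copies and acknowledge as soon as four succeed. This is an illustration of the quorum rule described above, not Aurora's actual storage protocol:

    import concurrent.futures

    REPLICAS = 6
    WRITE_QUORUM = 4   # acknowledge the write once 4 of the 6 copies have it

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=REPLICAS)

    def write_to_copy(copy_id, record):
        # Placeholder for sending the log record to one storage node.
        return copy_id

    def quorum_write(record):
        futures = [pool.submit(write_to_copy, i, record) for i in range(REPLICAS)]
        acked = 0
        for fut in concurrent.futures.as_completed(futures):
            if fut.exception() is None:
                acked += 1
            if acked >= WRITE_QUORUM:
                # Remaining copies complete (or are repaired) in the background.
                return True
        return False   # quorum not reached: treat as a write outage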
via:highscalability  mysql  aurora  failover  fault-tolerance  aws  replication  quorum 
december 2014 by jm
DynamoDB Streams
This is pretty awesome. All changes to a DynamoDB table can be streamed to a Kinesis stream, MySQL-replication-style.

The nice bit is that it has a solid way to ensure readers won't get overwhelmed by the stream volume (since ddb tables are IOPS-rate-limited), and Kinesis has a solid way to read missed updates (since it's a Kafka-style windowed persistent stream). With this you have a pretty reliable way to ensure you're not going to suffer data loss.
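
A rough consumer sketch using the boto3 'dynamodbstreams' client (the stream ARN is a made-up placeholder, and real code would resume from a checkpoint rather than TRIM_HORIZON):

    import boto3

    streams = boto3.client("dynamodbstreams")
    STREAM_ARN = ("arn:aws:dynamodb:us-east-1:123456789012:"
                  "table/orders/stream/2015-01-01T00:00:00.000")

    desc = streams.describe_stream(StreamArn=STREAM_ARN)
    for shard in desc["StreamDescription"]["Shards"]:
        it = streams.get_shard_iterator(
            StreamArn=STREAM_ARN,
            ShardId=shard["ShardId"],
            ShardIteratorType="TRIM_HORIZON",   # oldest retained record
        )["ShardIterator"]

        while it:
            resp = streams.get_records(ShardIterator=it, Limit=100)
            for record in resp["Records"]:
                print(record["eventName"], record["dynamodb"].get("Keys"))
            if not resp["Records"]:
                break
            it = resp.get("NextShardIterator")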
iops  dynamodb  aws  kinesis  reliability  replication  multi-az  multi-region  failover  streaming  kafka 
november 2014 by jm
Why We Didn’t Use Kafka for a Very Kafka-Shaped Problem
A good story of when Kafka _didn't_ fit the use case:
We came up with a complicated process of app-level replication for our messages into two separate Kafka clusters. We would then do end-to-end checking of the two clusters, detecting dropped messages in each cluster based on messages that weren’t in both.

It was ugly. It was clearly going to be fragile and error-prone. It was going to be a lot of app-level replication and horrible heuristics to see when we were losing messages and at least alert us, even if we couldn’t fix every failure case.

Despite us building a Kafka prototype for our ETL — having an existing investment in it — it just wasn’t going to do what we wanted. And that meant we needed to leave it behind, rewriting the ETL prototype.
cassandra  java  kafka  scala  network-partitions  availability  multi-region  multi-az  aws  replication  onlive 
november 2014 by jm
[KAFKA-1555] provide strong consistency with reasonable availability
Major improvements to Kafka consistency coming in 0.8.2: writes can require acknowledgement by multiple in-sync replicas, controlled by the new "min.insync.replicas" setting
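
The client-side half of that is requesting acknowledgement from all in-sync replicas. A sketch with kafka-python (broker address and topic are placeholders; the topic also needs min.insync.replicas configured on the broker side for the combination to hold):

    from kafka import KafkaProducer

    # acks='all' makes the leader wait for the full in-sync replica set before
    # acknowledging; combined with min.insync.replicas >= 2 on the topic, a
    # write is only acked once it is on at least two replicas.
    producer = KafkaProducer(bootstrap_servers=["broker1:9092"],
                             acks="all",
                             retries=5)

    producer.send("events", b"payload")
    producer.flush()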
kafka  replication  cap  consistency  streams 
october 2014 by jm
"The Tail at Scale"
by Jeffrey Dean and Luiz Andre Barroso, Google. A selection of Google's architectural mechanisms used to defeat 99th-percentile latency spikes: hedged requests, tied requests, micro-partitioning, selective replication, latency-induced probation, canary requests.
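
A bare-bones sketch of the hedged-request idea: issue the request to one replica, and if it hasn't answered within roughly the 95th-percentile latency, issue a second copy to another replica and take whichever answers first. Illustrative only; the paper adds refinements such as cancelling the losing request:

    import concurrent.futures

    HEDGE_AFTER_SECS = 0.05   # e.g. the observed p95 latency
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

    def fetch(replica, request):
        # Placeholder for an RPC to one replica.
        return replica, request

    def hedged_fetch(replicas, request):
        primary = pool.submit(fetch, replicas[0], request)
        try:
            return primary.result(timeout=HEDGE_AFTER_SECS)
        except concurrent.futures.TimeoutError:
            # Primary is slow: hedge to a second replica, take the first answer.
            hedge = pool.submit(fetch, replicas[1], request)
            done, _ = concurrent.futures.wait(
                [primary, hedge],
                return_when=concurrent.futures.FIRST_COMPLETED)
            return next(iter(done)).result()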
google  architecture  distcomp  soa  http  partitioning  replication  latency  99th-percentile  canary-requests  hedged-requests 
july 2014 by jm
Sirius by Comcast
At Comcast, our applications need convenient, low-latency access to important reference datasets. For example, our XfinityTV websites and apps need to use entertainment-related data to serve almost every API or web request to our datacenters: information like what year Casablanca was released, or how many episodes were in Season 7 of Seinfeld, or when the next episode of the Voice will be airing (and on which channel!).

We traditionally managed this information with a combination of relational databases and RESTful web services but yearned for something simpler than the ORM, HTTP client, and cache management code our developers dealt with on a daily basis. As main memory sizes on commodity servers continued to grow, however, we asked ourselves: How can we keep this reference data entirely in RAM, while ensuring it gets updated as needed and is easily accessible to application developers?

The Sirius distributed system library is our answer to that question, and we're happy to announce that we've made it available as an open source project. Sirius is written in Scala and uses the Akka actor system under the covers, but is easily usable by any JVM-based language.

Also includes a Paxos implementation with "fast follower" read-only slave replication. ASL2-licensed open source.

The only thing I can spot to be worried about is startup speed; they note that apps need to replay a log at startup to rebuild state, which in my experience can be slow if unoptimized.

Update: in a twitter conversation at https://twitter.com/jon_moore/status/459363751893139456 , Jon Moore indicated they haven't had problems with this even with 'datasets consuming 10-20GB of heap', and have 'benchmarked a 5-node Sirius ingest cluster up to 1k updates/sec write throughput.' That's pretty solid!
open-source  comcast  paxos  replication  read-only  datastores  storage  memory  memcached  redis  sirius  scala  akka  jvm  libraries 
april 2014 by jm
Replicant: Replicated State Machines Made Easy
The next time you reach for ZooKeeper, ask yourself whether it provides the primitive you really need. If ZooKeeper's filesystem and znode abstractions truly meet your needs, great. But the odds are, you'll be better off writing your application as a replicated state machine.
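
The core idea is small enough to sketch: a deterministic state machine driven by an agreed-upon command log. As long as every replica applies the same log in the same order, all replicas end up with the same state; the consensus layer (Paxos, Raft, or whatever agrees on the log) is elided here:

    class KVStateMachine:
        """Deterministic key-value state machine driven by a replicated log."""

        def __init__(self):
            self.state = {}
            self.applied_index = 0

        def apply(self, entry):
            op, key, value = entry
            if op == "set":
                self.state[key] = value
            elif op == "delete":
                self.state.pop(key, None)
            self.applied_index += 1

    # Any replica that applies this log, in this order, ends up identical.
    log = [("set", "a", 1), ("set", "b", 2), ("delete", "a", None)]
    replica1, replica2 = KVStateMachine(), KVStateMachine()
    for entry in log:
        replica1.apply(entry)
        replica2.apply(entry)
    assert replica1.state == replica2.state == {"b": 2}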
zookeeper  paxos  replicant  replication  consensus  state-machines  distcomp 
december 2013 by jm
3D-Print Your Own 20-Million-Year-Old Fossils
When I get my hands on a 3D printer, this will be high up my list of things to fabricate: a replica of a 20-million-year-old hominid skull.
With over 40 digitized fossils in their collection, you can explore 3D renders of fossils representing prehistoric animals, human ancestors, and even ancient tools. Captured using Autodesk software, an SLR camera, and often the original specimen (rather than a cast replica), these renderings bring us closer than most will ever get to holding ancient artifacts. And if you've got an additive manufacturing device at your disposal, you can even download Sketchfab plans to generate your own.
3d-printing  fossils  africa  history  hominids  replication  fabrication  sketchfab 
november 2013 by jm
Mike Hearn - Google+ - The packet capture shown in these new NSA slides shows…
The packet capture shown in these new NSA slides shows internal database replication traffic for the anti-hacking system I worked on for over two years. Specifically, it shows a database recording a user login.


This kind of confirms my theory that the majority of interesting traffic for the NSA/GCHQ MUSCULAR sniffing system would have been inter-DC replication. Was, since it sounds like that stuff's all changing now to use end-to-end crypto...
google  crypto  security  muscular  nsa  gchq  mike-hearn  replication  sniffing  spying  surveillance 
november 2013 by jm
Rapid read protection in Cassandra 2.0.2
Nifty new feature -- if a read request takes longer than the 99th-percentile latency for requests to that server, it'll be repeated against another replica. Unnecessary for Voldemort, of course, which queries all replicas anyway!
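
The knob is the table-level speculative_retry option; turning it on via the Python driver looks roughly like this (keyspace, table and host are placeholders, and the exact value syntax should be checked against your Cassandra version):

    from cassandra.cluster import Cluster

    session = Cluster(["cassandra-host"]).connect("my_keyspace")

    # Re-issue a read against another replica if the first replica is slower
    # than the table's observed 99th-percentile read latency.
    session.execute(
        "ALTER TABLE my_keyspace.users WITH speculative_retry = '99percentile'")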
cassandra  nosql  replication  distcomp  latency  storage 
october 2013 by jm
Call me maybe: Kafka
Aphyr takes a look at Kafka 0.8's replication with the Jepsen test suite. It doesn't go great. Jay Kreps responds here: http://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen
jay-kreps  kafka  replication  distributed-systems  distcomp  networking  reliability  fault-tolerance  jepsen 
september 2013 by jm
Information on Google App Engine's recent US datacenter relocations - Google Groups
or, really, 'why we had some glitches and outages recently'. A few interesting tidbits about GAE innards, though (via Bill de hOra)
gae  google  app-engine  outages  ops  paxos  eventual-consistency  replication  storage  hrd 
august 2013 by jm
_In Search of an Understandable Consensus Algorithm_, Diego Ongaro and John Ousterhout, Stanford

Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. In order to enhance understandability, Raft separates the key elements of consensus, such as leader election and log replication, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Raft also includes a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety. Results from a user study demonstrate that Raft is easier for students to learn than Paxos.
distributed  algorithms  paxos  raft  consensus-algorithms  distcomp  leader-election  replication  clustering 
august 2013 by jm
etcd
A highly-available key value store for shared configuration and service discovery. etcd is inspired by zookeeper and doozer, with a focus on:

Simple: curl'able user facing API (HTTP+JSON);
Secure: optional SSL client cert authentication;
Fast: benchmarked 1000s of writes/s per instance;
Reliable: Properly distributed using Raft;

Etcd is written in Go and uses the Raft consensus algorithm to manage a highly-available replicated log.

One of the core components of CoreOS -- http://coreos.com/ .
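
The "curl'able" API really is just HTTP+JSON; a quick sketch with Python's requests against the old v2 keys API, assuming a local etcd on its classic client port:

    import requests

    ETCD = "http://127.0.0.1:4001"   # early etcd port; later releases use 2379

    # Set a key...
    requests.put(f"{ETCD}/v2/keys/config/db_host",
                 data={"value": "db1.internal"})

    # ...and read it back; the write was committed through Raft to a majority
    # of the cluster before the PUT returned.
    resp = requests.get(f"{ETCD}/v2/keys/config/db_host").json()
    print(resp["node"]["value"])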
configuration  distributed  raft  ha  doozer  zookeeper  go  replication  consensus-algorithm  etcd  coreos 
august 2013 by jm
Twilio Billing Incident Post-Mortem
At 1:35 AM PDT on July 18, a loss of network connectivity caused all billing redis-slaves to simultaneously disconnect from the master. This caused all redis-slaves to reconnect and request full synchronization with the master at the same time. Receiving full sync requests from each redis-slave caused the master to suffer extreme load, resulting in performance degradation of the master and timeouts from redis-slaves to redis-master.
By 2:39 AM PDT the host’s load became so extreme, services relying on redis-master began to fail. At 2:42 AM PDT, our monitoring system alerted our on-call engineering team of a failure in the Redis cluster. Observing extreme load on the host, the redis process on redis-master was misdiagnosed as requiring a restart to recover. This caused redis-master to read an incorrect configuration file, which in turn caused Redis to attempt to recover from a non-existent AOF file, instead of the binary snapshot. As a result of that failed recovery, redis-master dropped all balance data. In addition to forcing recovery from a non-existent AOF, an incorrect configuration also caused redis-master to boot as a slave of itself, putting it in read-only mode and preventing the billing system from updating account balances.

See also http://antirez.com/news/60 for antirez' response.

Here are the takeaways I'm getting from it:

1. network partitions happen in production, and cause cascading failures. this is a great demo of that.

2. don't store critical data in Redis. this was the case for Twilio -- as far as I can tell they were using Redis as a front-line cache for billing data -- but it's worth saying anyway. ;)

3. Twilio were just using Redis as a cache, but a bug in their code meant that the writes to the backing SQL store were not being *read*, resulting in repeated billing and customer impact. In other words, it turned a (fragile) cache into the authoritative store.

4. they should probably have designed their code so that write failures would not result in repeated billing for customers -- that's a bad failure path.

Good post-mortem anyway, and I'd say their customers are a good deal happier to see this published, even if it contains details of the mistakes they made along the way.
redis  caching  storage  networking  network-partitions  twilio  postmortems  ops  billing  replication 
july 2013 by jm
'Copysets: Reducing the Frequency of Data Loss in Cloud Storage' [paper]
An improved replica-selection algorithm for replicated storage systems.

We present Copyset Replication, a novel general purpose replication technique that significantly reduces the frequency of data loss events. We implemented and evaluated Copyset Replication on two open source data center storage systems, HDFS and RAMCloud, and show it incurs a low overhead on all operations. Such systems require that each node’s data be scattered across several nodes for parallel data recovery and access. Copyset Replication presents a near optimal tradeoff between the number of nodes on which the data is scattered and the probability of data loss. For example, in a 5000-node RAMCloud cluster under a power outage, Copyset Replication reduces the probability of data loss from 99.99% to 0.15%. For Facebook’s HDFS cluster, it reduces the probability from 22.8% to 0.78%.
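
The core trick is easy to sketch: instead of scattering each chunk's replicas across randomly-chosen nodes (which makes almost every possible node triple a potential data-loss event), constrain placement to a small pre-generated set of copysets, so simultaneous failures only lose data if they wipe out an entire copyset. This is a simplified illustration; the paper's algorithm also controls scatter width via the number of permutations:

    import random

    def build_copysets(nodes, replication_factor, permutations=2):
        """Pre-generate copysets by chopping random permutations into groups."""
        copysets = []
        for _ in range(permutations):
            perm = random.sample(nodes, len(nodes))
            for i in range(0, len(perm) - replication_factor + 1,
                           replication_factor):
                copysets.append(perm[i:i + replication_factor])
        return copysets

    def place_replicas(primary, copysets):
        """Place a chunk on one of the copysets containing its primary node."""
        candidates = [cs for cs in copysets if primary in cs]
        return random.choice(candidates)

    nodes = [f"node{i}" for i in range(9)]
    copysets = build_copysets(nodes, replication_factor=3)
    print(place_replicas("node4", copysets))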
storage  cloud-storage  replication  data  reliability  fault-tolerance  copysets  replicas  data-loss 
july 2013 by jm
Facebook announce Wormhole
Over the last couple of years, we have built and deployed a reliable publish-subscribe system called Wormhole. Wormhole has become a critical part of Facebook's software infrastructure. At a high level, Wormhole propagates changes issued in one system to all systems that need to reflect those changes – within and across data centers.


Facebook's Kafka-alike, basically, although with some additional low-latency guarantees. FB appear to be using it for multi-region and multi-AZ replication. Proprietary.
pub-sub  scalability  facebook  realtime  low-latency  multi-region  replication  multi-az  wormhole 
june 2013 by jm
Is Your MySQL Buffer Pool Warm? Make It Sweat!
How Groupon are warming up a warm MySQL failover spare, using Percona tools and a "tee" of the live in-flight queries. (via Dave Doran)
via:dave-doran  mysql  databases  warm-spares  spares  failover  groupon  percona  replication 
april 2013 by jm
Running a Multi-Broker Apache Kafka 0.8 Cluster on a Single Node
an excellent writeup on Kafka 0.8's use and operation, including details of the new replication features
kafka  replication  queueing  distributed  ops 
april 2013 by jm
TouchDB's reverse-engineered write-up of the Couch replication protocol
There really isn’t a separate “protocol” per se for replication. Instead, replication uses CouchDB’s REST API and data model. It’s therefore a bit difficult to talk about replication independently of the rest of CouchDB. In this document I’ll focus on the algorithm used, and link to documentation of the APIs it invokes. The “protocol” is simply the set of those APIs operating over HTTP.
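
A skeletal version of that algorithm with Python's requests, using the standard CouchDB endpoints the write-up describes (_changes, _revs_diff, _bulk_docs); checkpoint storage, attachments and error handling are all omitted:

    import requests

    SOURCE = "http://source:5984/db"
    TARGET = "http://target:5984/db"

    def replicate_once(since=0):
        # 1. Ask the source which documents changed since the last checkpoint.
        changes = requests.get(f"{SOURCE}/_changes",
                               params={"since": since,
                                       "style": "all_docs"}).json()

        # 2. Ask the target which of those revisions it is missing.
        revs = {c["id"]: [ch["rev"] for ch in c["changes"]]
                for c in changes["results"]}
        missing = requests.post(f"{TARGET}/_revs_diff", json=revs).json()

        # 3. Fetch the missing revisions from the source and bulk-write them to
        #    the target with new_edits=false, preserving revision histories.
        docs = []
        for doc_id, info in missing.items():
            for rev in info["missing"]:
                docs.append(requests.get(f"{SOURCE}/{doc_id}",
                                         params={"rev": rev,
                                                 "revs": "true"}).json())
        requests.post(f"{TARGET}/_bulk_docs",
                      json={"docs": docs, "new_edits": False})

        # 4. The returned sequence would be saved as the new checkpoint.
        return changes["last_seq"]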
couchdb  protocols  touchdb  nosql  replication  sync  mvcc  revisions  rest 
april 2013 by jm
KDE's brush with git repository corruption: post-mortem
a barely-averted disaster... phew.

while we planned for the case of the server losing a disk or entirely biting the dust, or the total loss of the VM’s filesystem, we didn’t plan for the case of filesystem corruption, and the way the corruption affected our mirroring system triggered some very unforeseen and pathological conditions. [...] the corruption was perfectly mirrored... or rather, due to its nature, imperfectly mirrored. And all data on the anongit [mirrors] was lost.

One risk demonstrated: by trusting in mirroring, rather than a schedule of snapshot backups covering a wide time range, they nearly had a major outage. Silent data corruption, and code bugs, happen -- backups protect against this, but RAID, replication, and mirrors do not.

Another risk: they didn't have a rate limit on project-deletion, which resulted in the "anongit" mirrors deleting their (safe) data copies in response to the upstream corruption. Rate limiting to sanity-check automated changes is vital. What they should have had in place was described by the fix: 'If a new projects file is generated and is more than 1% different than the previous file, the previous file is kept intact (at 1500 repositories, that means 15 repositories would have to be created or deleted in the span of three minutes, which is extremely unlikely).'
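
That 1%-change sanity check is worth internalizing, and only takes a few lines (a generic sketch, not KDE's actual script):

    def safe_to_apply(old_projects, new_projects, max_change_ratio=0.01):
        """Refuse to propagate a projects list that changed suspiciously much."""
        if not old_projects:
            return True
        changed = len(set(old_projects) ^ set(new_projects))   # added + removed
        return changed / len(old_projects) <= max_change_ratio

    old = [f"repo{i}" for i in range(1500)]
    new = old[:-20]   # 20 repositories suddenly vanished (1.3% of 1500)
    assert not safe_to_apply(old, new)   # keep the old list, alert a human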
rate-limiting  case-studies  post-mortems  kde  git  data-corruption  risks  mirroring  replication  raid  bugs  backups  snapshots  sanity-checks  automation  ops 
march 2013 by jm
Running the Largest Hadoop DFS Cluster
Facebook's 1PB Hadoop cluster; features improved NameNode availability work and four levels of data aging, with reduced replication and Reed-Solomon RAID encoding for colder data
aging  data  facebook  hadoop  hdfs  reed-solomon  error-correction  replication  erasure-coding 
march 2013 by jm
Two Sides For Salvation « Code as Craft
Etsy's MySQL master-master pair configuration, and how it allows no-downtime schema changes
database  etsy  mysql  replication  schema  availability  downtime 
december 2012 by jm
How we use Redis at Bump
via Simon Willison. some nice ideas here, particularly using a replication slave to handle the potentially latency-impacting disk writes in AOF mode
queueing  redis  nosql  databases  storage  via:simonw  replication  bump 
july 2011 by jm
Maatkit
MySQL/PostgreSQL admin helper tools -- check replication status, archive, analyse logs, find deadlocks
sysadmin  db  mysql  replication  maatkit  dba  from delicious
october 2010 by jm

related tags

3d-printing  99th-percentile  africa  aging  akka  algorithms  app-engine  architecture  aurora  automation  availability  aws  backups  bandwidth  billing  bugs  bump  caching  canary-requests  cap  case-studies  cassandra  cdt  cloud-storage  clustering  coding  comcast  compression  configuration  consensus  consensus-algorithm  consensus-algorithms  consistency  copysets  coreos  couchdb  crdt  crdts  cross-region  crypto  data  data-corruption  data-loss  database  databases  datastores  db  dba  distcomp  distributed  distributed-systems  doozer  downtime  dynamo  dynamodb  ec2  erasure-coding  error-correction  etcd  etsy  eventual-consistency  fabrication  facebook  fail  failover  fault-tolerance  fossils  gae  gchq  git  github  gizzard  go  google  graph  groupon  ha  hadoop  hdfs  hedged-requests  history  hominids  hrd  http  inter-region  iops  java  jay-kreps  jepsen  jvm  kafka  kde  kinesis  latency  leader-election  legacy  libraries  linkedin  low-latency  maatkit  memcached  memory  merging  mike-hearn  mirroring  multi-az  multi-dc  multi-region  muscular  mvcc  mysql  network-partitions  networking  nosql  nsa  onlive  open-source  ops  outages  papers  partitioning  paxos  percona  post-mortems  postmortems  protocols  pub-sub  querying  queueing  quorum  raft  raid  rate-limiting  read-only  realtime  reconstruction  redis  reed-solomon  reliability  replicant  replicas  replication  rest  rethinkdb  revisions  riak  risks  s3  sanity-checks  scala  scalability  schema  security  set  set-cover  sharding  sirius  sketchfab  snapshots  sniffing  soa  spares  spotify  spying  sql  state  state-machines  storage  stores  stream-processing  streaming  streams  surveillance  sync  sysadmin  tests  timelines  titan  touchdb  transactions  twilio  twitter  uber  us-east  usenix  via:dave-doran  via:dehora  via:highscalability  via:simonw  voldemort  warm-spares  workarounds  wormhole  zookeeper 
