
Historic S3 data corruption due to a faulty load balancer
This came up in a discussion of using hashes for end-to-end data resiliency on the og-aws slack. Turns out AWS support staff wrote it up at the time:
We've isolated this issue to a single load balancer that was brought into service at 10:55pm PDT on Friday, 6/20 [2008].  It was taken out of service at 11am PDT Sunday, 6/22.  While it was in service it handled a small fraction of Amazon S3's total requests in the US.  Intermittently, under load, it was corrupting single bytes in the byte stream.  When the requests reached Amazon S3, if the Content-MD5 header was specified, Amazon S3 returned an error indicating the object did not match the MD5 supplied.  When no MD5 is specified, we are unable to determine if transmission errors occurred, and Amazon S3 must assume that the object has been correctly transmitted. Based on our investigation with both internal and external customers, the small amount of traffic received by this particular load balancer, and the intermittent nature of the above issue on this one load balancer, this appears to have impacted a very small portion of PUTs during this time frame.

One of the things we'll do is improve our logging of requests with MD5s, so that we can look for anomalies in their 400 error rates.  Doing this will allow us to provide more proactive notification on potential transmission issues in the future, for customers who use MD5s and those who do not. In addition to taking the actions noted above, we encourage all of our customers to take advantage of mechanisms designed to protect their applications from incorrect data transmission.  For all PUT requests, Amazon S3 computes its own MD5, stores it with the object, and then returns the computed MD5 as part of the PUT response code in the ETag.  By validating the ETag returned in the response, customers can verify that Amazon S3 received the correct bytes even if the Content MD5 header wasn't specified in the PUT request.  Because network transmission errors can occur at any point between the customer and Amazon S3, we recommend that all customers use the Content-MD5 header and/or validate the ETag returned on a PUT request to ensure that the object was correctly transmitted.  This is a best practice that we'll emphasize more heavily in our documentation to help customers build applications that can handle this situation.
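The validation step AWS describes is easy to sketch with Python's standard library. The upload itself is elided here; this just shows the hash comparison, and assumes a single-part PUT, since multipart-upload ETags are not a plain MD5 of the body:

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Hex MD5 digest, as S3 reports in the ETag for single-part PUTs.

    (The Content-MD5 request header, by contrast, carries the same digest
    base64-encoded.)
    """
    return hashlib.md5(data).hexdigest()

def verify_put(body: bytes, etag: str) -> bool:
    """Compare the ETag returned by a PUT against a locally computed MD5.

    S3 wraps the hex digest in double quotes, so strip them first.
    """
    return etag.strip('"') == md5_hex(body)

body = b"hello world"
# Simulated PUT response ETag -- a real S3 client would return this header.
etag = '"' + hashlib.md5(body).hexdigest() + '"'
assert verify_put(body, etag)
assert not verify_put(b"corrupted bytes", etag)
```

Had the affected customers run a check like this on every PUT response, the single-byte corruption would have been caught client-side even without sending Content-MD5.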
aws  s3  outages  postmortems  load-balancing  data-corruption  corruption  failure  md5  hashing  hashes 
yesterday by jm
Ironies of automation
Wow, this is a great paper recommendation from Adrian Colyer - 'Ironies of automation', Bainbridge, Automatica, Vol. 19, No. 6, 1983.
In an automated system, two roles are left to humans: monitoring that the automated system is operating correctly, and taking over control if it isn’t. An operator that doesn’t routinely operate the system will have atrophied skills if ever called on to take over.

Unfortunately, physical skills deteriorate when they are not used, particularly the refinements of gain and timing. This means that a formerly experienced operator who has been monitoring an automated process may now be an inexperienced one.

Not only are the operator’s skills declining, but the situations when the operator will be called upon are by their very nature the most demanding ones where something is deemed to be going wrong. Thus what we really need in such a situation is a more, not a lesser skilled operator! To generate successful strategies for unusual situations, an operator also needs good understanding of the process under control, and the current state of the system. The former understanding develops most effectively through use and feedback (which the operator may no longer be getting the regular opportunity for), the latter takes some time to assimilate.

(via John Allspaw)
via:allspaw  automation  software  reliability  debugging  ops  design  failsafe  failure  human-interfaces  ui  ux  outages 
13 days ago by jm
Should I create a separate Hystrix thread pool for each remote call?
Excellent advice on capacity planning and queueing theory, in the context of Hystrix. Should I use a single thread pool for all dependency callouts, or independent thread pools for each one?
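The core of the sizing advice boils down to Little's law. A rough sketch of the "peak RPS times p99 latency, plus headroom" rule of thumb, with hypothetical numbers:

```python
import math

def pool_size(peak_rps: float, p99_latency_s: float, headroom: int = 2) -> int:
    """Little's law: concurrent requests ~= arrival rate * time in system.

    Sizing each dependency's pool to its own peak is the bulkhead argument
    for separate pools per remote call -- one slow dependency can't exhaust
    the threads the others need.
    """
    return math.ceil(peak_rps * p99_latency_s) + headroom

# Hypothetical dependency: 30 requests/sec at peak, 200ms p99 latency.
print(pool_size(30, 0.2))  # → 8
```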
threadpools  pooling  hystrix  capacity  queue-theory  queueing  queues  failure  resilience  soa  microservices 
may 2016 by jm
How Facebook avoids failures
Great paper from Ben Maurer of Facebook in ACM Queue.
A "move-fast" mentality does not have to be at odds with reliability. To make these philosophies compatible, Facebook's infrastructure provides safety valves.

This is full of interesting techniques.

* Rapidly deployed configuration changes: Make everybody use a common configuration system; Statically validate configuration changes; Run a canary; Hold on to good configurations; Make it easy to revert.

* Hard dependencies on core services: Cache data from core services. Provide hardened APIs. Run fire drills.

* Increased latency and resource exhaustion: Controlled Delay (based on the anti-bufferbloat CoDel algorithm -- this is really cool); Adaptive LIFO (last-in, first-out) for queue busting; Concurrency Control (essentially a form of circuit breaker).

* Tools that Help Diagnose Failures: High-Density Dashboards with Cubism (horizon charts); What just changed?

* Learning from Failure: the DERP (!) methodology.
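The Controlled Delay technique above is worth a sketch. The idea from the paper: while the queue has drained recently, new requests get a generous timeout; once the queue has stayed non-empty for a full interval, new requests get a short timeout so a standing queue sheds load instead of growing. A minimal illustration (the interval and target values are the ones quoted in the paper, but the class itself is my own sketch, not Facebook's code):

```python
import time
from collections import deque

class CoDelQueue:
    """Sketch of FB's controlled-delay queueing, after the CoDel algorithm.

    interval_ms: how long the queue may stay non-empty before we decide
                 a standing queue has formed.
    target_ms:   the short timeout applied to new arrivals once it has.
    """
    def __init__(self, interval_ms=100, target_ms=5):
        self.interval = interval_ms / 1000.0
        self.target = target_ms / 1000.0
        self.queue = deque()
        self.last_empty = time.monotonic()

    def timeout_for_new_request(self) -> float:
        # Record each time we observe an empty queue.
        if not self.queue:
            self.last_empty = time.monotonic()
        # Overloaded = queue hasn't been empty for a full interval.
        overloaded = time.monotonic() - self.last_empty > self.interval
        return self.target if overloaded else self.interval
```

Adaptive LIFO composes well with this: under the same overload condition, serving the newest request first means you spend your short timeouts on requests that still have a live client waiting.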
ben-maurer  facebook  reliability  algorithms  codel  circuit-breakers  derp  failure  ops  cubism  horizon-charts  charts  dependencies  soa  microservices  uptime  deployment  configuration  change-management 
november 2015 by jm
Muxy
A proxy that mucks with your system and application context, operating at Layers 4 and 7, allowing you to simulate common failure scenarios from the perspective of an application under test, such as an API or a web application. If you are building a distributed system, Muxy can help you test your resilience and fault tolerance patterns.
proxy  distributed  testing  web  http  fault-tolerance  failure  injection  tcp  delay  resilience  error-handling 
september 2015 by jm
Call me maybe: RabbitMQ
We used Knossos and Jepsen to prove the obvious: RabbitMQ is not a lock service. That investigation led to a discovery hinted at by the documentation: in the presence of partitions, RabbitMQ clustering will not only deliver duplicate messages, but will also drop huge volumes of acknowledged messages on the floor. This is not a new result, but it may be surprising if you haven’t read the docs closely–especially if you interpreted the phrase “chooses Consistency and Partition Tolerance” to mean, well, either of those things.
rabbitmq  network  partitions  failure  cap-theorem  consistency  ops  reliability  distcomp  jepsen 
june 2014 by jm
Failure Friday: How We Ensure PagerDuty is Always Reliable
Basically, they run the kind of exercise which Jesse Robbins invented at Amazon -- "Game Days". Scarily, they do these on a Friday -- living dangerously!
game-days  testing  failure  devops  chaos-monkey  ops  exercises 
november 2013 by jm
Backblaze Blog » How long do disk drives last?
According to Backblaze's data, 80% of drives last 4 years, and the median lifespan is projected to be 6 years
backblaze  storage  disk  ops  mtbf  hardware  failure  lifespan 
november 2013 by jm
_Availability in Globally Distributed Storage Systems_ [pdf]
empirical BigTable and GFS failure numbers from Google are orders of magnitude higher than naïve independent-failure models. (via kragen)
via:kragen  failure  bigtable  gfs  statistics  outages  reliability 
september 2013 by jm
The network is reliable
Aphyr and Peter Bailis collect an authoritative list of known network partition and outage cases from published post-mortem data:

This post is meant as a reference point -- to illustrate that, according to a wide range of accounts, partitions occur in many real-world environments. Processes, servers, NICs, switches, local and wide area networks can all fail, and the resulting economic consequences are real. Network outages can suddenly arise in systems that are stable for months at a time, during routine upgrades, or as a result of emergency maintenance. The consequences of these outages range from increased latency and temporary unavailability to inconsistency, corruption, and data loss. Split-brain is not an academic concern: it happens to all kinds of systems -- sometimes for days on end. Partitions deserve serious consideration.

I honestly cannot understand people who didn't think this was the case. 3 years reading (and occasionally auto-cutting) Amazon's network-outage tickets as part of AWS network monitoring will do that to you I guess ;)
networking  outages  partition  cap  failure  fault-tolerance 
june 2013 by jm
CAP Confusion: Problems with ‘partition tolerance’
Another good clarification about CAP which resurfaced during last week's discussion:
So what causes partitions? Two things, really. The first is obvious – a network failure, for example due to a faulty switch, can cause the network to partition. The other is less obvious, but fits with the definition [...]: machine failures, either hard or soft. In an asynchronous network, i.e. one where processing a message could take unbounded time, it is impossible to distinguish between machine failures and lost messages. Therefore a single machine failure partitions it from the rest of the network. A correlated failure of several machines partitions them all from the network. Not being able to receive a message is the same as the network not delivering it. In the face of sufficiently many machine failures, it is still impossible to maintain availability and consistency, not because two writes may go to separate partitions, but because the failure of an entire ‘quorum’ of servers may render some recent writes unreadable.

(sorry, catching up on old interesting things posted last week...)
failure  scalability  network  partitions  cap  quorum  distributed-databases  fault-tolerance 
may 2013 by jm
RBS collapse details revealed - The Register
as noted in the gossip last week. 'The main batch scheduling software used by RBS is CA-7, said one source, a former RBS employee who left the company recently.' 'RBS do use CA-7 and do update all accounts overnight on a mainframe via thousands of batch jobs scheduled by CA-7 ... Backing out of a failed update to CA-7 really ought to have been a trivial matter for experienced operations and systems programming staff, especially if they knew that an update had been made. That this was not the case tends to imply that the criticisms of the policy to "offshore" also hold some water.'
outsourcing  failure  software  rbs  natwest  ulster-bank  ulster-blank  offshoring  downsizing  ca-7  upgrades 
june 2012 by jm
Microsoft's Azure Feb 29th, 2012 outage postmortem
'The leap day bug is that the GA calculated the valid-to date by simply taking the current date and adding one to its year. That meant that any GA that tried to create a transfer certificate on leap day set a valid-to date of February 29, 2013, an invalid date that caused the certificate creation to fail.' This caused cascading failures throughout the fleet. Ouch -- should have been spotted during code review
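The bug is trivial to reproduce. A minimal sketch (the GA's actual code is not public; this just illustrates the failure mode and one safe alternative):

```python
from datetime import date

def valid_to_buggy(today: date) -> date:
    """The failure mode described above: naively bumping the year.

    On Feb 29 this constructs Feb 29 of a non-leap year and raises
    ValueError -- the invalid date that made certificate creation fail.
    """
    return date(today.year + 1, today.month, today.day)

def valid_to_safe(today: date) -> date:
    """One safe alternative: fall back to Feb 28 when the date doesn't exist."""
    try:
        return date(today.year + 1, today.month, today.day)
    except ValueError:
        return date(today.year + 1, 2, 28)

try:
    valid_to_buggy(date(2012, 2, 29))
except ValueError:
    print("leap-day bug reproduced")

print(valid_to_safe(date(2012, 2, 29)))  # → 2013-02-28
```

The kind of thing a date-arithmetic lint, a leap-day test case, or indeed code review should catch.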
azure  dev  dates  leap-years  via:fanf  microsoft  outages  post-mortem  analysis  failure 
march 2012 by jm