jm + postmortem   6

S3 2017-02-28 outage post-mortem
The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.  One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.  
s3  postmortem  aws  post-mortem  outages  cms  ops 
march 2017 by jm Database Incident - 2017/01/31
Horrible, horrible postmortem doc. This is the kicker:
So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.

Reddit comments:
devops  backups  cloud  outage  incidents  postmortem  gitlab 
february 2017 by jm
Outage postmortem (2015-10-08 UTC) : Stripe: Help & Support
There was a breakdown in communication between the developer who requested the index migration and the database operator who deleted the old index. Instead of working on the migration together, they communicated in an implicit way through flawed tooling. The dashboard that surfaced the migration request was missing important context: the reason for the requested deletion, the dependency on another index’s creation, and the criticality of the index for API traffic. Indeed, the database operator didn’t have a way to check whether the index had recently been used for a query.

Good demo of how the Etsy-style chatops deployment approach would have helped avoid this risk.
stripe  postmortem  outages  databases  indexes  deployment  chatops  deploy  ops 
october 2015 by jm
Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region
Painful to read, but: tl;dr: monitoring oversight, followed by a transient network glitch triggering IPC timeouts, which increased load due to lack of circuit breakers, creating a cascading failure
aws  postmortem  outages  dynamodb  ec2  post-mortems  circuit-breakers  monitoring 
september 2015 by jm

Copy this bookmark: