post-mortem   274


Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region
Notes on why the S3 downtime happened:
0. Billing-report generation was running too slowly
1. Human error: too many units were taken down at once
2. The S3 index subsystem stopped working
3. S3 stopped working
4. Systems that depend on S3 stopped working (EC2, Lambda, ...)

The large subsystems had not been fully restarted in a very long time, so there was no operational experience of what problems a full restart would surface -> this lengthened the downtime
-> blast radius too large
-> too few safety features at critical points
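The last point above is the kind of safeguard AWS later described adding to its capacity-removal tooling: refuse to remove capacity too quickly or below a subsystem's minimum level. A minimal sketch of such a guard, assuming hypothetical names and thresholds (nothing here is from the actual AWS tool):

```python
# Hypothetical pre-flight check for a host-removal command.
# All names and limits are illustrative, not from AWS tooling.

class CapacityGuardError(Exception):
    """Raised when a removal request violates a safety limit."""

def plan_removal(active_hosts, requested, min_capacity, max_batch=2):
    """Validate a host-removal request before executing it.

    Rejects unknown hosts, batches larger than max_batch, and any
    removal that would leave fewer than min_capacity hosts running.
    Returns the sorted list of hosts that are safe to remove.
    """
    requested = set(requested)
    unknown = requested - set(active_hosts)
    if unknown:
        raise CapacityGuardError(f"unknown hosts: {sorted(unknown)}")
    if len(requested) > max_batch:
        raise CapacityGuardError(
            f"refusing to remove {len(requested)} hosts at once "
            f"(limit {max_batch})")
    remaining = len(active_hosts) - len(requested)
    if remaining < min_capacity:
        raise CapacityGuardError(
            f"removal would leave {remaining} hosts, "
            f"below minimum {min_capacity}")
    return sorted(requested)
```

With a guard like this, a mistyped argument that selects a larger set of servers than intended fails loudly instead of executing, which directly limits the blast radius of the human error described below.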
aws  amazon  outage  s3  post-mortem  arbeit  via:popular 
march 2017 by rauschen
S3 2017-02-28 outage post-mortem
The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.  One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.  
s3  postmortem  aws  post-mortem  outages  cms  ops 
march 2017 by jm


