jm + postmortems + reliability   4

'STELLA Report from the SNAFUcatchers Workshop on Coping With Complexity', March 14-16 2017
'A consortium workshop of high end techs reviewed postmortems to better understand how engineers cope with the complexity of anomalies (SNAFU and SNAFU catching episodes) and how to support them. These cases reveal common themes regarding factors that produce resilient performances. The themes that emerge also highlight opportunities to move forward.'

The 'Dark debt' concept is interesting here.
complexity  postmortems  dark-debt  technical-debt  resilience  reliability  systems  snafu  reports  toread  stella  john-allspaw 
november 2017 by jm
A collection of postmortems
A well-maintained list with a potted description of each one (via HN)
postmortems  ops  uptime  reliability 
august 2015 by jm
Outages, PostMortems, and Human Error 101
Good introductory presentation from John Allspaw, covering the fundamentals of tier-one tech incident response -- defining the 5 severity levels; root cause analysis techniques (to Five-Whys or not); and the importance of service metrics
devops  monitoring  ops  five-whys  allspaw  slides  etsy  codeascraft  incident-response  incidents  severity  root-cause  postmortems  outages  reliability  techops  tier-one-support 
april 2015 by jm
Advanced PostMortem Fu and Human Error 101 (Velocity 2011)
John Allspaw's previous slides on Etsy's operations culture -- this'll be old hat to Amazon staff of course ;)
etsy  devops  engineering  operations  reliability  mttd  mttr  postmortems 
march 2012 by jm
