jm + post-mortems   13

Air Canada near-miss: Air traffic controllers make split-second decisions in a culture of "psychological safety" — Quartz
“‘Just culture’ as a term emerged from air traffic control in the late 1990s, as concern was mounting that air traffic controllers were unfairly cited or prosecuted for incidents that happened to them while they were on the job,” Sidney Dekker, a professor, writer, and director of the Safety Science Innovation Lab at Griffith University in Australia, explains to Quartz in an email. Eurocontrol, the intergovernmental organization that focuses on the safety of airspace across Europe, has “adopted a harmonized ‘just culture’ that it encourages all member countries and others to apply to their air traffic control organizations.”

[...] One tragic example of what can happen when companies don’t create a culture where employees feel empowered to raise questions or admit mistakes came to light in 2014, when an investigation into a faulty General Motors ignition switch that caused more than 100 deaths revealed a toxic culture of denying errors and deflecting blame within the firm. The problem was later attributed to one engineer who had not disclosed an obvious issue with the flawed switch, but many employees spoke of extreme pressure to put costs and delivery times before all other considerations, and to hide large and small concerns.

(via JG)
just-culture  atc  air-traffic-control  management  post-mortems  outages  reliability  air-canada  disasters  accidents  learning  psychological-safety  work 
august 2017 by jm
Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region
Painful to read, but worth it. tl;dr: a monitoring oversight, followed by a transient network glitch that triggered IPC timeouts, which in turn increased load (with no circuit breakers to shed it) and created a cascading failure
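The report itself contains no code, but the "lack of circuit breakers" point maps onto a well-known pattern; here is a minimal sketch in Python (the class name and thresholds are mine, purely illustrative) that fails fast once a dependency starts timing out rather than piling more requests onto it:

    import time

    class CircuitBreaker:
        """Minimal circuit breaker: trips open after repeated failures,
        fails fast while open, and half-opens after a cooldown to probe."""

        def __init__(self, failure_threshold=5, reset_timeout=30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout = reset_timeout
            self.failures = 0
            self.opened_at = None  # None means the breaker is closed

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout:
                    # Open: fail fast instead of adding load to a
                    # struggling dependency.
                    raise RuntimeError("circuit open; skipping call")
                # Cooldown elapsed: half-open, let a probe request through.
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()
                raise
            else:
                # Any success closes the breaker and clears the count.
                self.failures = 0
                self.opened_at = None
                return result

Once the internal calls started timing out, callers kept adding load; something along these lines sheds that load until the dependency recovers.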
aws  postmortem  outages  dynamodb  ec2  post-mortems  circuit-breakers  monitoring 
september 2015 by jm
"A Review Of Criticality Accidents, 2000 Revision"
Authoritative report from LANL on accidents involving runaway nuclear reactions around the world between 1945 and 1999. An illuminating example of how incident post-mortems are handled in other industries, and (of course) fascinating in its own right
criticality  nuclear  safety  atomic  lanl  post-mortems  postmortems  fission 
august 2015 by jm
Inside the sad, expensive failure of Google+
"It was clear if you looked at the per user metrics, people weren’t posting, weren't returning and weren’t really engaging with the product," says one former employee. "Six months in, there started to be a feeling that this isn’t really working." Some lay the blame on the top-down structure of the Google+ department and a leadership team that viewed success as the only option for the social network. Failures and disappointing data were not widely discussed. "The belief was that we were always just one weird feature away from the thing taking off," says the same employee.
google  google+  failures  post-mortems  business  facebook  social-media  fail  bureaucracy  vic-gundotra 
august 2015 by jm
Should Airplanes Be Flying Themselves?
Excellent Vanity Fair article on the AF447 disaster, covering pilots' team-leadership skills, Clipper Skippers, Alternate Law, and autopilot design: 'There is an old truth in aviation that the reasons you get into trouble become the reasons you don’t get out of it.'

Also interesting:

'The best pilots discard the [autopilot] automation naturally when it becomes unhelpful, and again there appear to be some cultural traits involved. Simulator studies have shown that Irish pilots, for instance, will gleefully throw away their crutches, while Asian pilots will hang on tightly. It’s obvious that the Irish are right, but in the real world Sarter’s advice is hard to sell. The automation is simply too compelling. The operational benefits outweigh the costs. The trend is toward more of it, not less. And after throwing away their crutches, many pilots today would lack the wherewithal to walk.'

(via Gavin Sheridan)
airlines  automation  flight  flying  accidents  post-mortems  af447  air-france  autopilot  alerts  pilots  team-leaders  clipper-skippers  alternate-law 
november 2014 by jm
Adrian Cockroft's Cloud Outage Reports Collection
The detailed summaries of outages from cloud vendors are comprehensive and the response to each highlights many lessons in how to build robust distributed systems. For outages that significantly affected Netflix, the Netflix techblog report gives insight into how to effectively build reliable services on top of AWS. [....] I plan to collect reports here over time, and welcome links to other write-ups of outages and how to survive them.
outages  post-mortems  documentation  ops  aws  ec2  amazon  google  dropbox  microsoft  azure  incident-response 
march 2014 by jm
Counterfactual Thinking, Rules, and The Knight Capital Accident
John Allspaw with an interesting post on the Knight Capital disaster
john-allspaw  ops  safety  post-mortems  engineering  procedures 
october 2013 by jm
the infamous 2008 S3 single-bit-corruption outage
Neat, I didn't realise this was publicly visible. A single corrupted bit infected the S3 gossip network, taking down the whole S3 service in (iirc) one region:
We've now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers' objects. However, we didn't have the same protection in place to detect whether [gossip state] had been corrupted. As a result, when the corruption occurred, we didn't detect it and it spread throughout the system causing the symptoms described above. We hadn't encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.

During our post-mortem analysis we've spent quite a bit of time evaluating what happened, how quickly we were able to respond and recover, and what we could do to prevent other unusual circumstances like this from having system-wide impacts. Here are the actions that we're taking: (a) we've deployed several changes to Amazon S3 that significantly reduce the amount of time required to completely restore system-wide state and restart customer request processing; (b) we've deployed a change to how Amazon S3 gossips about failed servers that reduces the amount of gossip and helps prevent the behavior we experienced on Sunday; (c) we've added additional monitoring and alarming of gossip rates and failures; and, (d) we're adding checksums to proactively detect corruption of system state messages so we can log any such messages and then reject them.


This is why you checksum all the things ;)
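Not Amazon's implementation, but action (d) above amounts to something like the following: checksum every system-state message on the way out and verify it before acting on it, so a flipped bit gets logged and rejected rather than gossiped onward. A rough Python sketch; the function names and JSON framing are mine:

    import hashlib
    import json

    def wrap_message(payload):
        """Serialise a gossip-style message with an MD5 digest prepended,
        so single-bit corruption in transit or storage is detectable."""
        body = json.dumps(payload, sort_keys=True).encode("utf-8")
        digest = hashlib.md5(body).hexdigest().encode("ascii")
        return digest + b"\n" + body

    def unwrap_message(data):
        """Verify the digest before trusting the state inside; corrupted
        messages are rejected (and can be logged) rather than applied."""
        digest, _, body = data.partition(b"\n")
        if hashlib.md5(body).hexdigest().encode("ascii") != digest:
            raise ValueError("checksum mismatch: rejecting corrupted message")
        return json.loads(body)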
s3  aws  post-mortems  network  outages  failures  corruption  grey-failures  amazon  gossip 
june 2013 by jm
KDE's brush with git repository corruption: post-mortem
a barely-averted disaster... phew.

while we planned for the case of the server losing a disk or entirely biting the dust, or the total loss of the VM’s filesystem, we didn’t plan for the case of filesystem corruption, and the way the corruption affected our mirroring system triggered some very unforeseen and pathological conditions. [...] the corruption was perfectly mirrored... or rather, due to its nature, imperfectly mirrored. And all data on the anongit [mirrors] was lost.

One risk demonstrated: by trusting in mirroring, rather than a schedule of snapshot backups covering a wide time range, they nearly had a major outage. Silent data corruption, and code bugs, happen -- backups protect against this, but RAID, replication, and mirrors do not.

Another risk: they didn't have a rate limit on project-deletion, which resulted in the "anongit" mirrors deleting their (safe) data copies in response to the upstream corruption. Rate limiting to sanity-check automated changes is vital. What they should have had in place was described by the fix: 'If a new projects file is generated and is more than 1% different than the previous file, the previous file is kept intact (at 1500 repositories, that means 15 repositories would have to be created or deleted in the span of three minutes, which is extremely unlikely).'
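That fix boils down to a sanity check along these lines before mirrors act on a freshly generated projects file; a rough Python sketch (the set-based representation and names are mine, not KDE's):

    def safe_to_apply(old_projects, new_projects, max_change_fraction=0.01):
        """Refuse to propagate a new project list if it differs from the
        previous one by more than ~1%: churn that large is more likely to
        be corruption than genuine project creation or deletion."""
        if not old_projects:
            return True  # nothing to compare against on the first run
        changed = len(set(old_projects) ^ set(new_projects))  # symmetric difference
        return changed / len(old_projects) <= max_change_fraction

Used as a gate in the mirror-sync job: if it returns False, keep the previous projects file and alert a human instead of deleting anything.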
rate-limiting  case-studies  post-mortems  kde  git  data-corruption  risks  mirroring  replication  raid  bugs  backups  snapshots  sanity-checks  automation  ops 
march 2013 by jm
Post-mortem for February 24th, 2010 outage - Google App Engine
extremely detailed; power outage in the primary DC resulted in a degraded fleet, and on-calls didn't have up-to-date on-call docs to respond correctly
google  gae  appengine  outages  post-mortems  multi-dc  reliability  distcomp  fleets  on-call
march 2010 by jm
