jm + post-mortems   15

10-hour Microsoft Azure outage in Europe
Service availability issue in North Europe

Summary of impact: From 17:44 on 19 Jun 2018 to 04:30 UTC on 20 Jun 2018 customers using Azure services in North Europe may have experienced connection failures when attempting to access resources hosted in the region. Customers leveraging a subset of Azure services may have experienced residual impact for a sustained period post-mitigation of the underlying issue. We are communicating with these customers directly in their Management Portal.

Preliminary root cause: Engineers identified that an underlying temperature issue in one of the datacenters in the region triggered an infrastructure alert, which in turn caused a structured shutdown of a subset of Storage and Network devices in this location to ensure hardware and data integrity.

Mitigation: Engineers addressed the temperature issue, and performed a structured recovery of the affected devices and the affected downstream services.


The specific services were: 'Virtual Machines, Storage, SQL Database, Key Vault, App Service, Site Recovery, Automation, Service Bus, Event Hubs, Data Factory, Backup, API Management, Log Analytics, Application Insights, Azure Batch, Azure Search, Redis Cache, Media Services, IoT Hub, Stream Analytics, Power BI, Azure Monitor, Azure Cosmos DB or Logic Apps in North Europe'. Holy cow
microsoft  outages  fail  azure  post-mortems  cooling-systems  datacenters 
4 weeks ago by jm
Visa admits 5m payments failed over a broken switch
“We operate two redundant data centres in the UK, meaning that either one can independently handle 100% of the transactions for Visa in Europe. In normal circumstances, the systems are synchronised and either centre can take over from the other immediately … in this instance, a component within a switch in our primary data centre suffered a very rare partial failure which prevented the backup switch from activating.”
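The detail worth dwelling on: the backup only takes over when the primary is seen to have failed, so a partial failure that still passes health checks leaves the standby idle. A rough sketch of that failure mode (hypothetical Python, nothing like Visa's actual systems):

```python
class Switch:
    """Toy model of a network switch that can fail partially."""
    def __init__(self, name, healthy=True, forwarding=True):
        self.name = name
        self.healthy = healthy        # what the health check sees
        self.forwarding = forwarding  # whether traffic actually gets through

    def health_check(self):
        return self.healthy


def pick_active(primary, backup):
    """Activate the backup only if the primary reports itself unhealthy.

    A partial failure -- health check still passes, traffic silently
    dropped -- never triggers the switchover."""
    return backup if not primary.health_check() else primary


# Partial failure: the primary still answers health checks but drops traffic.
primary = Switch("primary-dc", healthy=True, forwarding=False)
backup = Switch("backup-dc")

active = pick_active(primary, backup)
print(active.name)        # primary-dc -- the backup never activates
print(active.forwarding)  # False -- transactions fail
```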
visa  outages  post-mortems  fail  europe  dcs 
4 weeks ago by jm
Air Canada near-miss: Air traffic controllers make split-second decisions in a culture of "psychological safety" — Quartz
“’Just culture’ as a term emerged from air traffic control in the late 1990s, as concern was mounting that air traffic controllers were unfairly cited or prosecuted for incidents that happened to them while they were on the job,” Sidney Dekker, a professor, writer, and director of the Safety Science Innovation Lab at Griffith University in Australia, explains to Quartz in an email. Eurocontrol, the intergovernmental organization that focuses on the safety of airspace across Europe, has “adopted a harmonized ‘just culture’ that it encourages all member countries and others to apply to their air traffic control organizations.”

[...] One tragic example of what can happen when companies don’t create a culture where employees feel empowered to raise questions or admit mistakes came to light in 2014, when an investigation into a faulty ignition switch that caused more than 100 deaths at GM Motors revealed a toxic culture of denying errors and deflecting blame within the firm. The problem was later attributed to one engineer who had not disclosed an obvious issue with the flawed switch, but many employees spoke of extreme pressure to put costs and delivery times before all other considerations, and to hide large and small concerns.

(via JG)
just-culture  atc  air-traffic-control  management  post-mortems  outages  reliability  air-canada  disasters  accidents  learning  psychological-safety  work 
august 2017 by jm
Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region
Painful to read, but: tl;dr: monitoring oversight, followed by a transient network glitch triggering IPC timeouts, which increased load due to lack of circuit breakers, creating a cascading failure
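The circuit-breaker point is worth spelling out: when a dependency starts timing out, blind retries add load exactly when it can least absorb it. A minimal circuit-breaker sketch (illustrative Python only, not AWS's implementation):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after max_failures consecutive failures,
    reject calls outright for reset_timeout seconds instead of piling
    more load onto the struggling dependency."""

    def __init__(self, max_failures=5, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Half-open: allow a trial call through after the cool-off.
            self.opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Failing fast like this sheds load from the struggling service rather than amplifying it into a cascading failure.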
aws  postmortem  outages  dynamodb  ec2  post-mortems  circuit-breakers  monitoring 
september 2015 by jm
"A Review Of Criticality Accidents, 2000 Revision"
Authoritative report from LANL on accidents involving runaway nuclear reactions over the years from 1945 to 1999, around the world. Illuminating example of how incident post-mortems are handled in other industries, and (of course) fascinating in its own right
criticality  nuclear  safety  atomic  lanl  post-mortems  postmortems  fission 
august 2015 by jm
Inside the sad, expensive failure of Google+
"It was clear if you looked at the per user metrics, people weren’t posting, weren't returning and weren’t really engaging with the product," says one former employee. "Six months in, there started to be a feeling that this isn’t really working." Some lay the blame on the top-down structure of the Google+ department and a leadership team that viewed success as the only option for the social network. Failures and disappointing data were not widely discussed. "The belief was that we were always just one weird feature away from the thing taking off," says the same employee.
google  google+  failures  post-mortems  business  facebook  social-media  fail  bureaucracy  vic-gundotra 
august 2015 by jm
Should Airplanes Be Flying Themselves?
Excellent Vanity Fair article on the AF447 disaster, covering pilots' team-leadership skills, Clipper Skippers, Alternate Law, and autopilot design: 'There is an old truth in aviation that the reasons you get into trouble become the reasons you don’t get out of it.'

Also interesting:

'The best pilots discard the [autopilot] automation naturally when it becomes unhelpful, and again there appear to be some cultural traits involved. Simulator studies have shown that Irish pilots, for instance, will gleefully throw away their crutches, while Asian pilots will hang on tightly. It’s obvious that the Irish are right, but in the real world Sarter’s advice is hard to sell. The automation is simply too compelling. The operational benefits outweigh the costs. The trend is toward more of it, not less. And after throwing away their crutches, many pilots today would lack the wherewithal to walk.'

(via Gavin Sheridan)
airlines  automation  flight  flying  accidents  post-mortems  af447  air-france  autopilot  alerts  pilots  team-leaders  clipper-skippers  alternate-law 
november 2014 by jm
Adrian Cockcroft's Cloud Outage Reports Collection
The detailed summaries of outages from cloud vendors are comprehensive and the response to each highlights many lessons in how to build robust distributed systems. For outages that significantly affected Netflix, the Netflix techblog report gives insight into how to effectively build reliable services on top of AWS. [....] I plan to collect reports here over time, and welcome links to other write-ups of outages and how to survive them.
outages  post-mortems  documentation  ops  aws  ec2  amazon  google  dropbox  microsoft  azure  incident-response 
march 2014 by jm
Counterfactual Thinking, Rules, and The Knight Capital Accident
John Allspaw with an interesting post on the Knight Capital disaster
john-allspaw  ops  safety  post-mortems  engineering  procedures 
october 2013 by jm
the infamous 2008 S3 single-bit-corruption outage
Neat, I didn't realise this was publicly visible. A single corrupted bit infected the S3 gossip network, taking down the whole S3 service in (iirc) one region:
We've now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers' objects. However, we didn't have the same protection in place to detect whether [gossip state] had been corrupted. As a result, when the corruption occurred, we didn't detect it and it spread throughout the system causing the symptoms described above. We hadn't encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.

During our post-mortem analysis we've spent quite a bit of time evaluating what happened, how quickly we were able to respond and recover, and what we could do to prevent other unusual circumstances like this from having system-wide impacts. Here are the actions that we're taking: (a) we've deployed several changes to Amazon S3 that significantly reduce the amount of time required to completely restore system-wide state and restart customer request processing; (b) we've deployed a change to how Amazon S3 gossips about failed servers that reduces the amount of gossip and helps prevent the behavior we experienced on Sunday; (c) we've added additional monitoring and alarming of gossip rates and failures; and, (d) we're adding checksums to proactively detect corruption of system state messages so we can log any such messages and then reject them.


This is why you checksum all the things ;)
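Item (d) in their fix list amounts to attaching a digest to every state message and refusing to act on anything that doesn't verify. A toy version of the idea (hypothetical Python; the real S3 gossip protocol is obviously nothing this simple):

```python
import hashlib
import json

def wrap_message(payload: dict) -> bytes:
    """Serialize a gossip-style state message with an embedded checksum."""
    body = json.dumps(payload, sort_keys=True).encode()
    digest = hashlib.md5(body).hexdigest()
    return json.dumps({"digest": digest, "body": body.decode()}).encode()

def unwrap_message(raw: bytes) -> dict:
    """Verify the checksum before trusting the message; reject on mismatch.

    A single flipped bit changes the digest, so the corrupted message is
    logged and dropped instead of spreading through the system."""
    envelope = json.loads(raw)
    body = envelope["body"].encode()
    if hashlib.md5(body).hexdigest() != envelope["digest"]:
        raise ValueError("corrupt state message rejected")
    return json.loads(body)

# A flipped bit is caught (or simply fails to parse) instead of being believed:
msg = bytearray(wrap_message({"server": "s3-node-17", "state": "failed"}))
msg[-10] ^= 0x01  # simulate single-bit corruption in transit
try:
    unwrap_message(bytes(msg))
except (ValueError, json.JSONDecodeError):
    print("rejected corrupted gossip message")
```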
s3  aws  post-mortems  network  outages  failures  corruption  grey-failures  amazon  gossip 
june 2013 by jm
KDE's brush with git repository corruption: post-mortem
a barely-averted disaster... phew.

while we planned for the case of the server losing a disk or entirely biting the dust, or the total loss of the VM’s filesystem, we didn’t plan for the case of filesystem corruption, and the way the corruption affected our mirroring system triggered some very unforeseen and pathological conditions. [...] the corruption was perfectly mirrored... or rather, due to its nature, imperfectly mirrored. And all data on the anongit [mirrors] was lost.

One risk demonstrated: by trusting in mirroring, rather than a schedule of snapshot backups covering a wide time range, they nearly had a major outage. Silent data corruption, and code bugs, happen -- backups protect against this, but RAID, replication, and mirrors do not.

Another risk: they didn't have a rate limit on project-deletion, which resulted in the "anongit" mirrors deleting their (safe) data copies in response to the upstream corruption. Rate limiting to sanity-check automated changes is vital. What they should have had in place was described by the fix: 'If a new projects file is generated and is more than 1% different than the previous file, the previous file is kept intact (at 1500 repositories, that means 15 repositories would have to be created or deleted in the span of three minutes, which is extremely unlikely).'
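That fix is essentially a diff-size guard: compare the new project list with the previous one and refuse to apply it if too much changed at once. A minimal sketch of the check (hypothetical Python, not KDE's actual tooling):

```python
def accept_new_project_list(old: set[str], new: set[str],
                            max_change_fraction: float = 0.01) -> bool:
    """Refuse to apply a new project list if it differs from the previous
    one by more than max_change_fraction (1% by default).

    With ~1500 repositories, a corrupted or empty upstream list would
    delete far more than 1% of projects, so the mirror keeps its old
    (safe) copies instead of propagating the damage."""
    if not old:
        return True  # nothing to compare against yet
    changed = len(old.symmetric_difference(new))
    return changed / len(old) <= max_change_fraction

# Normal churn passes; a corrupted, nearly-empty upstream list is rejected.
previous = {f"repo-{i}" for i in range(1500)}
corrupted = {f"repo-{i}" for i in range(10)}

print(accept_new_project_list(previous, previous | {"new-repo"}))  # True
print(accept_new_project_list(previous, corrupted))                # False
```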
rate-limiting  case-studies  post-mortems  kde  git  data-corruption  risks  mirroring  replication  raid  bugs  backups  snapshots  sanity-checks  automation  ops 
march 2013 by jm
Post-mortem for February 24th, 2010 outage - Google App Engine
extremely detailed; power outage in the primary DC resulted in a degraded fleet, and on-calls didn't have up-to-date on-call docs to respond correctly
google  gae  appengine  outages  post-mortems  multi-dc  reliability  distcomp  fleets  on-call 
march 2010 by jm
