jm + postmortems   16

A SPOF UPS. There was a similar AZ-wide outage in one of the Amazon DUB datacenters with a similar root cause, if I recall correctly -- supposedly redundant dual UPS systems were in fact interdependent, in that case, and power supply switchover wasn't clean enough to avoid affecting the servers.
Minutes later power was restored was resumed in what one source described as “uncontrolled fashion.” Instead of gradual restore, all power was restored at once resulting in a power surge.   BA CEO Cruz told BBC Radio this power surge  caused network hardware to fail. Also server hardware was damaged because of the power surge.

It seems as if the UPS was the single point of failure for power feed of the IT equipment in Boadicea House . The Times is reporting that the same UPS was powering both Heathrow based datacenters. Which could be a double single point of failure if true (I doubt it is)

The broken network  stopped the exchange of messages between different BA systems and application. Without messaging, there is no exchange of information between various applications. BA is using Progress Software’s Sonic [enterprise service bus].

(via Tony Finch)
postmortems  ba  airlines  outages  fail  via:fanf  datacenters  ups  power  progress  esb  j2ee 
26 days ago by jm
Etsy Debriefing Facilitation Guide
by John Allspaw, Morgan Evans and Daniel Schauenberg; the Etsy blameless postmortem style crystallized into a detailed 27-page PDF ebook
etsy  postmortems  blameless  ops  production  debriefing  ebooks 
november 2016 by jm
Google Cloud Status
Ouch, multi-region outage:
At 14:50 Pacific Time on April 11th, our engineers removed an unused GCE IP block from our network configuration, and instructed Google’s automated systems to propagate the new configuration across our network. By itself, this sort of change was harmless and had been performed previously without incident. However, on this occasion our network configuration management software detected an inconsistency in the newly supplied configuration. The inconsistency was triggered by a timing quirk in the IP block removal - the IP block had been removed from one configuration file, but this change had not yet propagated to a second configuration file also used in network configuration management. In attempting to resolve this inconsistency the network management software is designed to ‘fail safe’ and revert to its current configuration rather than proceeding with the new configuration. However, in this instance a previously-unseen software bug was triggered, and instead of retaining the previous known good configuration, the management software instead removed all GCE IP blocks from the new configuration and began to push this new, incomplete configuration to the network.

One of our core principles at Google is ‘defense in depth’, and Google’s networking systems have a number of safeguards to prevent them from propagating incorrect or invalid configurations in the event of an upstream failure or bug. These safeguards include a canary step where the configuration is deployed at a single site and that site is verified to still be working correctly, and a progressive rollout which makes changes to only a fraction of sites at a time, so that a novel failure can be caught at an early stage before it becomes widespread. In this event, the canary step correctly identified that the new configuration was unsafe. Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.
multi-region  outages  google  ops  postmortems  gce  cloud  ip  networking  cascading-failures  bugs 
april 2016 by jm
Google tears Symantec a new one on its CA failure
Symantec are getting a crash course in how to conduct an incident post-mortem to boot:
More immediately, we are requesting of Symantec that they further update their public incident report with:
A post-mortem analysis that details why they did not detect the additional certificates that we found.
Details of each of the failures to uphold the relevant Baseline Requirements and EV Guidelines and what they believe the individual root cause was for each failure.
We are also requesting that Symantec provide us with a detailed set of steps they will take to correct and prevent each of the identified failures, as well as a timeline for when they expect to complete such work. Symantec may consider this latter information to be confidential and so we are not requesting that this be made public.
google  symantec  ev  ssl  certificates  ca  security  postmortems  ops 
october 2015 by jm
"A Review Of Criticality Accidents, 2000 Revision"
Authoritative report from LANL on accidents involving runaway nuclear reactions over the years from 1945 to 1999, around the world. Illuminating example of how incident post-mortems are handled in other industries, and (of course) fascinating in its own right
criticality  nuclear  safety  atomic  lanl  post-mortems  postmortems  fission 
august 2015 by jm
A collection of postmortems
A well-maintained list with a potted description of each one (via HN)
postmortems  ops  uptime  reliability 
august 2015 by jm
Mikhail Panchenko's thoughts on the July 2015 CircleCI outage
an excellent followup operational post on CircleCI's "database is not a queue" outage
database-is-not-a-queue  mysql  sql  databases  ops  outages  postmortems 
july 2015 by jm
Outages, PostMortems, and Human Error 101
Good basic pres from John Allspaw, covering the basics of tier-one tech incident response -- defining the 5 severity levels; root cause analysis techniques (to Five-Whys or not); and the importance of service metrics
devops  monitoring  ops  five-whys  allspaw  slides  etsy  codeascraft  incident-response  incidents  severity  root-cause  postmortems  outages  reliability  techops  tier-one-support 
april 2015 by jm
Stellar/Ripple suffer a failure of their consensus system, resulting in a split-brain failure
Prof. Mazières’s research indicated some risk that consensus could fail, though we were nor certain if the required circumstances for such a failure were realistic. This week, we discovered the first instance of a consensus failure. On Tuesday night, the nodes on the network began to disagree and caused a fork of the ledger. The majority of the network was on ledger chain A. At some point, the network decided to switch to ledger chain B. This caused the roll back of a few hours of transactions that had only been recorded on chain A. We were able to replay most of these rolled back transactions on chain B to minimize the impact. However, in cases where an account had already sent a transaction on chain B the replay wasn’t possible.
consensus  distcomp  stellar  ripple  split-brain  postmortems  outages  ledger-fork  payment 
december 2014 by jm
Update on Azure Storage Service Interruption
As part of a performance update to Azure Storage, an issue was discovered that resulted in reduced capacity across services utilizing Azure Storage, including Virtual Machines, Visual Studio Online, Websites, Search and other Microsoft services. Prior to applying the performance update, it had been tested over several weeks in a subset of our customer-facing storage service for Azure Tables. We typically call this “flighting,” as we work to identify issues before we broadly deploy any updates. The flighting test demonstrated a notable performance improvement and we proceeded to deploy the update across the storage service. During the rollout we discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting. The net result was an inability for the front ends to take on further traffic, which in turn caused other services built on top to experience issues.

I'm really surprised MS deployment procedures allow a change to be rolled out globally across multiple regions on a single day. I suspect they soon won't.
change-management  cm  microsoft  outages  postmortems  azure  deployment  multi-region  flighting  azure-storage 
november 2014 by jm
The Infinite Hows, instead of the Five Whys
John Allspaw with an interesting assertion that we need to ask "how", not "why" in five-whys postmortems:
“Why?” is the wrong question.

In order to learn (which should be the goal of any retrospective or post-hoc investigation) you want multiple and diverse perspectives. You get these by asking people for their own narratives. Effectively, you’re asking “how?“

Asking “why?” too easily gets you to an answer to the question “who?” (which in almost every case is irrelevant) or “takes you to the ‘mysterious’ incentives and motivations people bring into the workplace.”

Asking “how?” gets you to describe (at least some) of the conditions that allowed an event to take place, and provides rich operational data.
ops  five-whys  john-allspaw  questions  postmortems  analysis  root-causes 
november 2014 by jm
Box Tech Blog » A Tale of Postmortems
How Box introduced COE-style dev/ops outage postmortems, and got them working. This PIE metric sounds really useful to head off the dreaded "it'll all have to come out missus" action item:
The picture was getting clearer, and we decided to look into individual postmortems and action items and see what was missing. As it was, action items were wasting away with no owners. Digging deeper, we noticed that many action items entailed massive refactorings or vague requirements like “make system X better” (i.e. tasks that realistically were unlikely to be addressed). At a higher level, postmortem discussions often devolved into theoretical debates without a clear outcome. We needed a way to lower and focus the postmortem bar and a better way to categorize our action items and our technical debt.

Out of this need, PIE (“Probability of recurrence * Impact of recurrence * Ease of addressing”) was born. By ranking each factor from 1 (“low”) to 5 (“high”), PIE provided us with two critical improvements:

1. A way to police our postmortems discussions. I.e. a low probability, low impact, hard to implement solution was unlikely to get prioritized and was better suited to a discussion outside the context of the postmortem. Using this ranking helped deflect almost all theoretical discussions.
2. A straightforward way to prioritize our action items.

What’s better is that once we embraced PIE, we also applied it to existing tech debt work. This was critical because we could now prioritize postmortem action items alongside existing work. Postmortem action items became part of normal operations just like any other high-priority work.
postmortems  action-items  outages  ops  devops  pie  metrics  ranking  refactoring  prioritisation  tech-debt 
august 2014 by jm
DropBox outage post-mortem
A bug in a scheduled OS upgrade script caused live production DB servers to be upgraded while live. Fixes include fixing that script by verifying non-liveness on the host itself, and a faster parallel MySQL binary-log recovery command.
dropbox  outage  postmortems  upgrades  mysql 
january 2014 by jm
Twilio Billing Incident Post-Mortem
At 1:35 AM PDT on July 18, a loss of network connectivity caused all billing redis-slaves to simultaneously disconnect from the master. This caused all redis-slaves to reconnect and request full synchronization with the master at the same time. Receiving full sync requests from each redis-slave caused the master to suffer extreme load, resulting in performance degradation of the master and timeouts from redis-slaves to redis-master.
By 2:39 AM PDT the host’s load became so extreme, services relying on redis-master began to fail. At 2:42 AM PDT, our monitoring system alerted our on-call engineering team of a failure in the Redis cluster. Observing extreme load on the host, the redis process on redis-master was misdiagnosed as requiring a restart to recover. This caused redis-master to read an incorrect configuration file, which in turn caused Redis to attempt to recover from a non-existent AOF file, instead of the binary snapshot. As a result of that failed recovery, redis-master dropped all balance data. In addition to forcing recovery from a non-existent AOF, an incorrect configuration also caused redis-master to boot as a slave of itself, putting it in read-only mode and preventing the billing system from updating account balances.

See also for antirez' response.

Here's the takeaways I'm getting from it:

1. network partitions happen in production, and cause cascading failures. this is a great demo of that.

2. don't store critical data in Redis. this was the case for Twilio -- as far as I can tell they were using Redis as a front-line cache for billing data -- but it's worth saying anyway. ;)

3. Twilio were just using Redis as a cache, but a bug in their code meant that the writes to the backing SQL store were not being *read*, resulting in repeated billing and customer impact. In other words, it turned a (fragile) cache into the authoritative store.

4. they should probably have designed their code so that write failures would not result in repeated billing for customers -- that's a bad failure path.

Good post-mortem anyway, and I'd say their customers are a good deal happier to see this published, even if it contains details of the mistakes they made along the way.
redis  caching  storage  networking  network-partitions  twilio  postmortems  ops  billing  replication 
july 2013 by jm
GMail partial outage - Dec 10 2012 incident report [PDF]
TL;DR: a bad load balancer change was deployed globally, causing the impact. 21 minute time to detection. Single-location rollout is now on the cards
gmail  google  coe  incidents  postmortems  outages 
december 2012 by jm
Advanced PostMortem Fu and Human Error 101 (Velocity 2011)
John Allspaw's previous slides on Etsy's operations culture -- this'll be old hat to Amazon staff of course ;)
etsy  devops  engineering  operations  reliability  mttd  mttr  postmortems 
march 2012 by jm

Copy this bookmark: