postmortem   1956

Google Cloud - issue with Google Cloud Global Loadbalancers returning 502s
On Tuesday, 17 July 2018, from 12:17 to 12:49 PDT, Google Cloud HTTP(S) Load Balancers returned 502s for some requests they received. The proportion of 502 return codes varied from 33% to 87% during the period. Automated monitoring alerted Google’s engineering team to the event at 12:19, and at 12:44 the team had identified the probable root cause and deployed a fix.
Starting Up Security
Collection of links and resources for multiple security topics. Includes threat modeling, risk analysis, postmortems, and more.
Google Cloud Status Dashboard
"Configuration changes being rolled out on the evening of the incident were not applied in the intended order. This resulted in an incomplete configuration change becoming live in some zones, subsequently triggering the failure of customer jobs. During the process of rolling back the configuration, another incorrect configuration change was inadvertently applied, causing the second batch of job failures."
Incident Management at Spotify | Labs
"A few weeks ago Spotify had one of the biggest incidents in the last few years. It caused a major outage for a big chunk of our European users. For a few hours the music playback experience was damaged. Our users would see high latency when playing music and some of them were unable to log in.
Two months before the big outage we had an incident connected with one of our smallest backend services: Popcount. Popcount (this is our internal name) is the service that takes care of storing the list of subscribers for each of our more than 1 billion playlists."
Github/2018-06-28 - Gentoo Wiki

"An unknown entity gained control of an admin account for the Gentoo GitHub Organization and removed all access to the organization (and its repositories) from Gentoo developers. They then proceeded to make various changes to content. Gentoo Developers & Infrastructure escalated to GitHub support and the Gentoo Organization was frozen by GitHub staff. Gentoo has regained control of the Gentoo GitHub Organization and has reverted the bad commits and defaced content."
danluu/post-mortems: A collection of postmortems. Sorry for the delay in merging PRs!
Passengers post-mortem | Ludum Dare
> The less you show, the more you suggest, the more people will fill the gap with their own vision and understanding
database incident | GitLab
Yesterday we had a serious incident with one of our databases. We lost six hours of database data (issues, merge requests, users, comments, snippets, etc.). Git/wiki repositories and self-hosted installations were not affected. Losing production data is unacceptable and in a few days we'll publish a post on why this happened and a list of measures we will implement to prevent it happening again.
Debriefing Facilitation Guide
A great guide for the facilitator of a debriefing / postmortem / retrospective.
Today we mitigated
"Today, in an effort to reclaim some technical debt, we deployed new code that introduced Gatebot to Provision API.

What we did not account for, and what Provision API didn’t know about, was that and are special IP ranges. Frankly speaking, almost every IP range is "special" for one reason or another, since our IP configuration is rather complex. But our recursive DNS resolver ranges are even more special: they are relatively new, and we're using them in a very unique way. Our hardcoded list of Cloudflare addresses contained a manual exception specifically for these ranges.

As you might be able to guess by now, we didn't implement this manual exception while we were doing the integration work. Remember, the whole idea of the fix was to remove the hardcoded gotchas!"
