postmortem   1956

Google Cloud - issue with Google Cloud Global Loadbalancers returning 502s
On Tuesday, 17 July 2018, from 12:17 to 12:49 PDT, Google Cloud HTTP(S) Load Balancers returned 502s for some requests they received. The proportion of 502 return codes varied from 33% to 87% during the period. Automated monitoring alerted Google’s engineering team to the event at 12:19, and at 12:44 the team had identified the probable root cause and deployed a fix.
google  postmortem 
2 days ago by peakscale
Starting Up Security
Collection of links and resources for multiple security topics. Includes threat modeling, risk analysis, postmortems, and more.
security  reference  documentation  policy  postmortem 
9 days ago by jefframnani
Google Cloud Status Dashboard
"Configuration changes being rolled out on the evening of the incident were not applied in the intended order. This resulted in an incomplete configuration change becoming live in some zones, subsequently triggering the failure of customer jobs. During the process of rolling back the configuration, another incorrect configuration change was inadvertently applied, causing the second batch of job failures."
15 days ago by peakscale
Incident Management at Spotify | Labs
"A few weeks ago Spotify had one of the biggest incidents in the last few years. It caused a major outage for a big chunk of our European users. For a few hours the music playback experience was damaged. Our users would see high latency when playing music and some of them were unable to log in.
Two months before the big outage we had an incident connected with one of our smallest backend services: Popcount. Popcount (this is our internal name) is the service that takes care of storing the list of subscribers for each of our more than 1 billion playlists."
15 days ago by peakscale
Github/2018-06-28 - Gentoo Wiki

"An unknown entity gained control of an admin account for the Gentoo GitHub Organization and removed all access to the organization (and its repositories) from Gentoo developers. They then proceeded to make various changes to content. Gentoo Developers & Infrastructure escalated to GitHub support and the Gentoo Organization was frozen by GitHub staff. Gentoo has regained control of the Gentoo GitHub Organization and has reverted the bad commits and defaced content."
postmortem  security 
17 days ago by peakscale
danluu/post-mortems: A collection of postmortems. Sorry for the delay in merging PRs!
GitHub is where people build software. More than 28 million people use GitHub to discover, fork, and contribute to over 85 million projects.
programming  articles  tech  postmortem  devops  list  collection  fail 
21 days ago by e2b
Passengers post-mortem | Ludum Dare
> The less you show, the more you suggest, the more people will fill the gap with their own vision and understanding
PICO8  games  culture  design  gamedesign  postmortem 
25 days ago by nsfmc
database incident | GitLab
Yesterday we had a serious incident with one of our databases. We lost six hours of database data (issues, merge requests, users, comments, snippets, etc.). Git/wiki repositories and self-hosted installations were not affected. Losing production data is unacceptable, and in a few days we'll publish a post on why this happened and a list of measures we will implement to prevent it happening again.
backupandrecovery  backups  sysadmin  dba  database  fail  postmortem 
6 weeks ago by kme
Debriefing Facilitation Guide
A great guide for the facilitator of a debriefing / postmortem / retrospective.
etsy  debriefing  postmortem  retrospective  facilitation 
7 weeks ago by drmeme
Today we mitigated
"Today, in an effort to reclaim some technical debt, we deployed new code that introduced Gatebot to Provision API.

What we did not account for, and what Provision API didn’t know about, was that and are special IP ranges. Frankly speaking, almost every IP range is "special" for one reason or another, since our IP configuration is rather complex. But our recursive DNS resolver ranges are even more special: they are relatively new, and we're using them in a very unique way. Our hardcoded list of Cloudflare addresses contained a manual exception specifically for these ranges.

As you might be able to guess by now, we didn't implement this manual exception while we were doing the integration work. Remember, the whole idea of the fix was to remove the hardcoded gotchas!"
postmortem  security  networks 
7 weeks ago by peakscale

