jm + outage   3

GitLab.com Database Incident - 2017/01/31
Horrible, horrible postmortem doc. This is the kicker:
So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.


Reddit comments: https://www.reddit.com/r/linux/comments/5rd9em/gitlab_is_down_notes_on_the_incident_and_why_you/
devops  backups  cloud  outage  incidents  postmortem  gitlab 
7 weeks ago by jm
DropBox outage post-mortem
A bug in a scheduled OS upgrade script caused live production DB servers to be upgraded while live. Fixes include fixing that script by verifying non-liveness on the host itself, and a faster parallel MySQL binary-log recovery command.
dropbox  outage  postmortems  upgrades  mysql 
january 2014 by jm
Foursquare MongoDB outage post mortem
MongoDB was set up to write to RAM if possible, omitting immediate writes to disk -- but then the db size exceeded RAM size, the disk was hit, imposing a massive slowdown and creating a huge backlog immediately, bringing the site down (via Nelson)
via:nelson  mongodb  sharding  nosql  ouch  outage  foursquare  sysadmin  ops  from delicious
october 2010 by jm

Copy this bookmark:



description:


tags: