jm + rollback   4

Details of the Cloudflare outage on July 2, 2019
Great writeup from jgc. Worth noting some important lessons:

* config changes should be rolled out carefully and gradually, just like code;

* particularly regexps, which are effectively code anyway;

* emergency-use rollback systems need to work, of course;

* having emergency-only systems is a risk, too, since infrequently-used code paths are likely to atrophy and break without anyone noticing (as nsheridan said);

* /.*/ in a regexp is pretty much always bad news, and would have been worth a linter to catch before commit.
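
As a rough idea of what such a linter could look like, here is a minimal Python sketch. The function name find_unbounded_wildcards is made up for illustration, and the scanner is deliberately naive: it skips escaped characters and character classes, but does not handle every regex dialect or corner case (e.g. a literal "]" at the start of a class).

```python
def find_unbounded_wildcards(pattern: str) -> list[int]:
    """Return offsets of unescaped `.*` or `.+` outside character classes.

    Hypothetical pre-commit check; a sketch, not a full regex parser.
    """
    hits = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if in_class:
            if ch == "]":
                in_class = False  # end of [...] class
        elif ch == "[":
            in_class = True  # '.' and '*' are literals inside a class
        elif ch == "." and i + 1 < len(pattern) and pattern[i + 1] in "*+":
            hits.append(i)
        i += 1
    return hits
```

For an illustrative pattern such as r".*(?:.*=.*)", this returns [0, 5, 8], which a commit hook could turn into a warning or a hard failure.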
cloudflare  outages  regex  postmortems  regexps  deployment  rollback  via:jgc 
5 weeks ago by jm
Charity Majors responds to the CleverTap Mongo outage war story
This is a great blog post, spot on:
'You can’t just go “dudes it’s faster” and jump off a cliff. This shit is basic. Test real production workloads. Have a rollback plan. (Not for *10 days* … try a month or two.)'

The only thing I'd nitpick on is that it's all very well to say "buy my book" or "come see me talk at Blahcon", but a good blog post or webpage would be thousands of times more useful.
databases  stateful-services  services  ops  mongodb  charity-majors  rollback  state  storage  testing  dba 
october 2016 by jm
Airflow/AMI/ASG nightly-packaging workflow
Some tantalising discussion on twitter of an Airflow + AMI + ASG workflow for ML packaging:

'We build models using Airflow. We deploy new models as AMIs where each AMI is model + scoring code. The AMI is hence a version of code + model at a point in time : #immutable_infrastructure. It's natural for Airflow to build & deploy the model+code with each Airflow DAG Run corresponding to a versioned AMI. if there's a problem, we can simply roll back to the previous AMI & identify the problematic model building Dag run. Since we use ASGs, Airflow can execute a rolling deploy of new AMIs. We could also have it do a validation & ASG rollback of the AMI if validation fails. Airflow is being used for reliable Model build+validation+deployment.'
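
The "validation & ASG rollback" step described there could be sketched roughly as follows. This is only an outline of the control flow under stated assumptions: deploy and validate are hypothetical callables standing in for whatever actually rolls an AMI through the ASG and checks the result, and nothing here uses the real Airflow or AWS APIs.

```python
from typing import Callable


def deploy_with_rollback(
    new_ami: str,
    previous_ami: str,
    deploy: Callable[[str], None],     # hypothetical: rolls an AMI out to the ASG
    validate: Callable[[str], bool],   # hypothetical: True if the deploy looks healthy
) -> str:
    """Deploy new_ami; if validation fails, redeploy previous_ami.

    Returns the AMI id that is live when the function finishes.
    """
    deploy(new_ami)
    if validate(new_ami):
        return new_ami
    # Validation failed: go back to the last known-good image.
    deploy(previous_ami)
    return previous_ami
```

Because each AMI is an immutable code + model snapshot, rolling back is just another deploy of an older image, which is what keeps this step so small.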
ml  packaging  airflow  asg  ami  deployment  ops  infrastructure  rollback 
september 2016 by jm
Nix: The Purely Functional Package Manager
'a powerful package manager for Linux and other Unix systems that makes package management reliable and reproducible. It provides atomic upgrades and rollbacks, side-by-side installation of multiple versions of a package, multi-user package management and easy setup of build environments.'

Basically, this is a third-party open source reimplementation of Amazon's (excellent) internal packaging system, using symlinks to versioned package directories to ensure atomicity and the ability to roll back. This is definitely the *right* way to build packages -- I know what tool I'll be pushing for, next time this question comes up.
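
The symlink trick is easy to sketch in Python. This is not Nix's actual implementation, just the general technique on a POSIX filesystem: build a new symlink under a temporary name, then rename it over the live one, so readers see either the old version or the new one and never a half-switched state. The names activate, demo, pkg-1.0 and pkg-2.0 are all made up for the example.

```python
import os
import tempfile


def activate(version_dir: str, current_link: str) -> None:
    """Atomically point current_link at version_dir (POSIX rename semantics)."""
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear a leftover from an interrupted switch
    os.symlink(version_dir, tmp_link)
    os.replace(tmp_link, current_link)  # rename(2) swaps the link in one step


def demo() -> list[str]:
    """Upgrade from 1.0 to 2.0, then roll back to 1.0."""
    with tempfile.TemporaryDirectory() as root:
        v1 = os.path.join(root, "pkg-1.0")
        v2 = os.path.join(root, "pkg-2.0")
        os.mkdir(v1)
        os.mkdir(v2)
        current = os.path.join(root, "current")
        seen = []
        for target in (v1, v2, v1):  # install, upgrade, roll back
            activate(target, current)
            seen.append(os.path.basename(os.readlink(current)))
        return seen
```

Rolling back is the same operation as upgrading, just pointed at an older directory, which is why keeping the old versions side by side matters.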

See also NixOS for a Linux distro built on Nix.
ops  linux  devops  unix  packaging  distros  nix  nixos  atomic  upgrades  rollback  versioning 
september 2014 by jm
