jm + cd + ops   3

Global Continuous Delivery with Spinnaker
Netflix's CD platform, the successor to Asgard; looks interesting.
continuous-delivery  aws  netflix  cd  devops  ops  asgard  spinnaker 
november 2015 by jm
Taming Complexity with Reversibility
This is a great post from Kent Beck, putting a lot of recent deployment/rollout patterns in a clear context -- that of supporting "reversibility":
Development servers. Each engineer has their own copy of the entire site. Engineers can make a change, see the consequences, and reverse the change in seconds without affecting anyone else.
Code review. Engineers can propose a change, get feedback, and improve or abandon it in minutes or hours, all before affecting any people using Facebook.
Internal usage. Engineers can make a change, get feedback from thousands of employees using the change, and roll it back in an hour.
Staged rollout. We can begin deploying a change to a billion people and, if the metrics tank, take it back before problems affect most people using Facebook.
Dynamic configuration. If an engineer has planned for it in the code, we can turn off an offending feature in production in seconds. Alternatively, we can dial features up and down in tiny increments (i.e. only 0.1% of people see the feature) to discover and avoid non-linear effects.
Correlation. Our correlation tools let us easily see the unexpected consequences of features so we know to turn them off even when those consequences aren't obvious.
IRC. We can roll out features potentially affecting our ability to communicate internally via Facebook because we have uncorrelated communication channels like IRC and phones.
Right hand side units. We can add a little bit of functionality to the website and turn it on and off in seconds, all without interfering with people's primary interaction with NewsFeed.
Shadow production. We can experiment with new services under real load, from a tiny trickle to the whole flood, without affecting production.
Frequent pushes. Reversing some changes requires a code change. On the website we are never more than eight hours from the next scheduled code push (minutes if a fix is urgent and you are willing to compensate Release Engineering). The time frame for code reversibility on the mobile applications is longer, but the downward trend is clear: from six weeks to four to (currently) two.
Data-informed decisions. (Thanks to Dave Cleal) Data-informed decisions are inherently reversible (with the exceptions noted below). "We expect this feature to affect this metric. If it doesn't, it's gone."
Advance countries. We can roll a feature out to a whole country, generate accurate feedback, and roll it back without affecting most of the people using Facebook.
Soft launches. When we roll out a feature or application with a minimum of fanfare it can be pulled back with a minimum of public attention.
Double write/bulk migrate/double read. Even as fundamental a decision as storage format is reversible if we follow this format: start writing all new data to the new data store, migrate all the old data, then start reading from the new data store in parallel with the old.
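A minimal Python sketch of that double write/bulk migrate/double read sequence -- the old_store/new_store objects and their get/put/keys methods are assumptions for illustration, not anything from the post:

    import logging

    class MigratingStore:
        """Wraps an old and a new store so a storage-format change stays reversible."""

        def __init__(self, old_store, new_store):
            self.old = old_store
            self.new = new_store

        # Phase 1: double write -- all new data goes to both stores.
        def put(self, key, value):
            self.old.put(key, value)
            self.new.put(key, value)

        # Phase 2: bulk migrate -- backfill old data into the new store.
        def bulk_migrate(self):
            for key in self.old.keys():
                if self.new.get(key) is None:
                    self.new.put(key, self.old.get(key))

        # Phase 3: double read -- read both and compare, but keep serving from
        # the old store; cutting over (or backing out) stays a one-line change.
        def get(self, key):
            old_value = self.old.get(key)
            new_value = self.new.get(key)
            if new_value != old_value:
                logging.warning("store mismatch for %s: %r != %r", key, old_value, new_value)
            return old_value

Each phase ships (and can be reverted) independently; reads cut over to the new store only after the double-read comparisons stop turning up mismatches.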


We do a bunch of these in work, and the rest are on the to-do list. +1 to these!
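The "dial features up and down in tiny increments" item above is, mechanically, a percentage-gated flag keyed on a stable user ID. A rough Python sketch, assuming a reloadable flag-to-percentage config (the names here are invented):

    import hashlib

    # Hypothetical dynamic config, reloadable at runtime so a feature can be
    # dialled from 0.1% up to 100% (or back down to 0%) in seconds.
    ROLLOUT_PERCENT = {"new_ranking": 0.1}

    def feature_enabled(flag, user_id):
        """Deterministically bucket a user into 0..9999 and compare against the dial."""
        percent = ROLLOUT_PERCENT.get(flag, 0.0)
        digest = hashlib.sha1(("%s:%s" % (flag, user_id)).encode()).hexdigest()
        bucket = int(digest, 16) % 10000
        return bucket < percent * 100  # 0.1% -> buckets 0..9

Hashing on flag plus user keeps buckets independent across flags, and the same person sees a consistent experience as the dial moves.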
software  deployment  complexity  systems  facebook  reversibility  dark-releases  releases  ops  cd  migration 
july 2015 by jm
'Continuous Deployment: The Dirty Details'
Good slide deck from Etsy's Mike Brittain regarding their CD setup. Some interesting little-known details:

Slide 41: database schema changes are not CD'd -- they go out on "Schema change Thursdays".

Slide 44: only the webapp is CD'd -- PHP, Apache, memcache components (Etsy.com, support and back-office tools, developer API, gearman async worker queues). The external "services" are not -- databases, Solr/JVM search (rolling restarts), photo storage (filters, proxy cache, S3), payments (PCI-DSS, controlled access).

They avoid breaking schema and interface changes using an approach they call "non-breaking expansions" -- expose the new version in a service interface; support multiple versions in the consumer. Example from slides 50-63, based around a database schema migration.
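A rough sketch of a non-breaking expansion applied to a column rename; the table and column names are invented and the db handle is assumed to be a DB-API-style connection, none of this is lifted from the slides:

    # Hypothetical example: renaming users.email_addr to users.email without a
    # breaking change. Each step ships (and can be reverted) independently.

    # Step 1 (expand): writers populate both the old and the new column.
    def save_email(db, user_id, email):
        db.execute(
            "UPDATE users SET email_addr = %s, email = %s WHERE id = %s",
            (email, email, user_id),
        )

    # Step 2 (tolerant consumer): readers accept either version of the row,
    # so they work before, during and after the backfill.
    def read_email(row):
        return row.get("email") or row.get("email_addr")

    # Step 3 (contract): only after the backfill is complete and every consumer
    # uses read_email() does a later deploy drop the old column.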

Slide 66: "dev flags" (rollout-oriented) are promoted to "feature flags" (long-lived degradation control).

Slide 71: some architectural philosophies: deploying is cheap; releasing is cheap; gathering data should be cheap too; treat first iterations as experiments.

Slide 102: "Canary pools". They have multiple pools of users for testing in production -- the staff pool, users who have opted in to see prototypes/beta stuff, and a 0-100% gradual phased rollout.
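Those pools amount to an ordered check per request: staff first, then opted-in beta users, then the percentage ramp. A small Python sketch -- the user fields and pool logic here are assumptions, not Etsy's implementation:

    import hashlib
    from dataclasses import dataclass

    @dataclass
    class User:                        # minimal stand-in for the real user record
        id: str
        is_staff: bool = False
        opted_into_prototypes: bool = False

    def in_canary(user, rollout_percent):
        """Decide whether this request gets the new code path."""
        if user.is_staff:                  # pool 1: staff see it first
            return True
        if user.opted_into_prototypes:     # pool 2: explicit prototype/beta opt-in
            return True
        # pool 3: gradual 0-100% phased rollout for everyone else
        bucket = int(hashlib.sha1(user.id.encode()).hexdigest(), 16) % 100
        return bucket < rollout_percent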
cd  deploy  etsy  slides  migrations  database  schema  ops  ci  version-control  feature-flags 
april 2015 by jm
