jm + etsy   25

The Push Train
Excellent preso from Dan McKinley on the Etsy-based continuous delivery model, and what he learned trying to apply it after Etsy:
It’s notable that almost all of the hard things we dealt with were social problems. Some of these solutions involved writing code, but the hard part was the human organization. The hard parts were maintaining a sense of community ownership over the state of the whole system.
etsy  ci  cd  deployment  devops  deploys  dan-mckinley  mcfunley  presentations 
may 2017 by jm
Etsy Debriefing Facilitation Guide
by John Allspaw, Morgan Evans and Daniel Schauenberg; the Etsy blameless postmortem style crystallized into a detailed 27-page PDF ebook
etsy  postmortems  blameless  ops  production  debriefing  ebooks 
november 2016 by jm
Etsy's Release Management process
Good info on how Etsy use their Deployinator tool, end-to-end.

Slide 11: git SHA is visible for each env, allowing easy verification of what code is deployed.

Slide 14: Code is deployed to "princess" staging env while CI tests are running; no need to wait for unit/CI tests to complete.

Slide 23: smoke tests of pre-prod "princess" (complete after 8 mins elapsed).

Slide 31: dashboard link for deployed code is posted during deploy; post-release prod smoke tests are run by Jenkins. (short ones! they complete in 42 seconds)
deployment  etsy  deploy  deployinator  princess  staging  ops  testing  devops  smoke-tests  production  jenkins 
april 2015 by jm
'Continuous Deployment: The Dirty Details'
Good slide deck from Etsy's Mike Brittain regarding their CD setup. Some interesting little-known details:

Slide 41: database schema changes are not CD'd -- they go out on "Schema change Thursdays".

Slide 44: only the webapp is CD'd -- PHP, Apache, memcache components (Etsy.com, support and back-office tools, developer API, gearman async worker queues). The external "services" are not -- databases, Solr/JVM search (rolling restarts), photo storage (filters, proxy cache, S3), payments (PCI-DSS, controlled access).

They avoid schema changes and breaking changes using an approach they call "non-breaking expansions" -- expose new version in a service interface; support multiple versions in the consumer. Example from slides 50-63, based around a database schema migration.
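
To make the "non-breaking expansion" idea concrete, here is a minimal sketch of a consumer that supports two versions at once. It is not taken from the slides: the record shapes (`name` versus `first_name`/`last_name`) and both function names are hypothetical, chosen only to illustrate the pattern.

```python
# Hypothetical record shapes, for illustration only:
#   v1 stores a single "name" field
#   v2 expands it into "first_name" and "last_name"


def display_name(record: dict) -> str:
    """Consumer that accepts both the old and the expanded record shape."""
    if "first_name" in record or "last_name" in record:
        parts = [record.get("first_name", ""), record.get("last_name", "")]
        return " ".join(p for p in parts if p)
    # Fall back to the old shape so v1 producers keep working.
    return record.get("name", "")


def write_record(first_name: str, last_name: str) -> dict:
    """Producer during the transition: writes both shapes side by side."""
    return {
        "name": f"{first_name} {last_name}".strip(),
        "first_name": first_name,
        "last_name": last_name,
    }
```

Because the reader tolerates both shapes, producer and consumer can be deployed in either order, and the old field can be dropped later once nothing reads it.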

Slide 66: "dev flags" (rollout oriented) are promoted to "feature flags" (long lived degradation control).

Slide 71: some architectural philosophies: deploying is cheap; releasing is cheap; gathering data should be cheap too; treat first iterations as experiments.

Slide 102: "Canary pools". They have multiple pools of users for testing in production -- the staff pool, users who have opted in to see prototypes/beta stuff, 0-100% gradual phased rollout.
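
A rough sketch of how pools like these could be evaluated, assuming a stable hash of the user ID decides the percentage rollout. This is not Etsy's implementation; `rollout_bucket`, `sees_feature` and their parameters are hypothetical names.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given feature."""
    key = f"{feature}:{user_id}".encode("utf-8")
    return int(hashlib.sha256(key).hexdigest(), 16) % 100


def sees_feature(user_id: str, feature: str, *, is_staff: bool = False,
                 opted_in: bool = False, percent: int = 0) -> bool:
    """Staff and opt-in users always see the feature; others by percentage."""
    if is_staff or opted_in:
        return True
    # Raising `percent` from 0 to 100 only ever adds users, never removes them.
    return rollout_bucket(user_id, feature) < percent
```

Hashing on feature plus user ID keeps each user's bucket stable across requests, while giving different features independent rollout populations.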
cd  deploy  etsy  slides  migrations  database  schema  ops  ci  version-control  feature-flags 
april 2015 by jm
Outages, PostMortems, and Human Error 101
Good basic pres from John Allspaw, covering the basics of tier-one tech incident response -- defining the 5 severity levels; root cause analysis techniques (to Five-Whys or not); and the importance of service metrics
devops  monitoring  ops  five-whys  allspaw  slides  etsy  codeascraft  incident-response  incidents  severity  root-cause  postmortems  outages  reliability  techops  tier-one-support 
april 2015 by jm
How Etsy Does Continuous Integration for Mobile Apps
Very impressive. I particularly like the use of Tester Dojos to get through a backlog of unwritten tests -- we had a similar problem recently...
dojos  testing  ci  cd  builds  etsy  mobile  ios  shenzen  trylib  jenkins  tester-dojos 
december 2014 by jm
Calendar Hacks
Some great tips on managing a busy calendar, from Etsy's managers. Block out time; refuse double-booked meetings by default; rely on apps; office hours. Thankfully I have a pretty slim calendar these days, but bookmarking for future use...
calendar  etsy  via:kellan  google  google-calendar  office-hours  life-hacks  hacks  tips  managing  managers  scheduling 
july 2014 by jm
A cautionary tale about building large-scale polyglot systems
'a fucking nightmare':
Cascading requires a compilation step, yet since you're writing Ruby code, you get none of the benefits of static type checking. It was standard to discover a type issue only after kicking off a job on, oh, 10 EC2 machines, only to have it fail because of a type mismatch. And user code embedded in strings would regularly fail to compile – which you again wouldn't discover until after your job was running. Each of these were bad individually, together, they were a fucking nightmare. The interaction between the code in strings and the type system was the worst of all possible worlds. No type checking, yet incredibly brittle, finicky and incomprehensible type errors at run time. I will never forget when one of my friends at Etsy was learning Cascading.JRuby and he couldn't get a type cast to work. I happened to know what would work: a triple cast. You had to cast the value to the type you wanted, not once, not twice, but THREE times.
etsy  scalding  cascading  adtuitive  war-stories  languages  polyglot  ruby  java  strong-typing  jruby  types  hadoop 
march 2014 by jm
Big, Small, Hot or Cold - Your Data Needs a Robust Pipeline
'(Examples [of big-data B-I crunching pipelines] from Stripe, Tapad, Etsy & Square)'
stripe  tapad  etsy  square  big-data  analytics  kafka  impala  hadoop  hdfs  parquet  thrift 
february 2014 by jm
Don’t get stuck
Good description from Rafe Colburn of Etsy's take on continuous deployment: committing directly to trunk, with unfinished work hidden behind feature flags
continuous-deployment  coding  agile  deployment  devops  etsy  rafe-colburn 
january 2014 by jm
Introducing Kale « Code as Craft
Etsy have implemented a tool to perform auto-correlation of service metrics, and detection of deviation from historic norms:
at Etsy, we really love to make graphs. We graph everything! Anywhere we can slap a StatsD call, we do. As a result, we’ve found ourselves with over a quarter million distinct metrics. That’s far too many graphs for a team of 150 engineers to watch all day long! And even if you group metrics into dashboards, that’s still an awful lot of dashboards if you want complete coverage. Of course, if a graph isn’t being watched, it might misbehave and no one would know about it. And even if someone caught it, lots of other graphs might be misbehaving in similar ways, and chances are low that folks would make the connection.

We’d like to introduce you to the Kale stack, which is our attempt to fix both of these problems. It consists of two parts: Skyline and Oculus. We first use Skyline to detect anomalous metrics. Then, we search for that metric in Oculus, to see if any other metrics look similar. At that point, we can make an informed diagnosis and hopefully fix the problem.


It'll be interesting to see if they can get this working well. I've found it can be tricky to get working with low false positives, without massive volume to "smooth out" spikes caused by normal activity. Amazon had one particularly successful version driving severity-1 order drop alarms, but it used massive event volumes and still had periodic false positives. Skyline looks like it will alarm on a single anomalous data point, and in the comments Abe notes "our algorithms err on the side of noise and so alerting would be very noisy."
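
To illustrate why alarming on a single data point gets noisy, here is a minimal sketch of a generic "deviation from historic norms" check: a plain three-sigma test on the latest value. This is an assumption-laden illustration, not Skyline's actual code; `is_anomalous` and its `sigmas` parameter are hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(series: list[float], sigmas: float = 3.0) -> bool:
    """Flag the latest point if it sits more than `sigmas` standard
    deviations away from the mean of the preceding points."""
    if len(series) < 3:
        return False  # not enough history to estimate a spread
    history, latest = series[:-1], series[-1]
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all counts as a deviation.
        return latest != history[0]
    return abs(latest - mean(history)) > sigmas * spread
```

On a low-volume metric the historic spread is small, so one ordinary spike trips a check like this, which is the false-positive problem described above.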
etsy  monitoring  service-metrics  alarming  deviation  correlation  data  search  graphs  oculus  skyline  kale  false-positives 
june 2013 by jm
Measure Anything, Measure Everything « Code as Craft
the classic Etsy pro-metrics "measure everything" post. Some good basic rules and mindset
etsy  monitoring  metrics  stats  ops  devops 
april 2013 by jm
Why did infinite scroll fail at Etsy?
'A/B testing must be done in a modularized fashion. The “fail” case he gave was when Etsy spent months developing and testing infinite scroll to their search listings, only to find that it had a negative impact on engagement.' [...] 'instead of having the goal of “test infinite scroll,” Etsy realized it needed to test each assumption separately, and this going forward is their game plan.'
usability  testing  design  etsy  ab-testing  test  modularization  via:hn 
january 2013 by jm
Dan McKinley :: Effective Web Experimentation as a Homo Narrans
Good demo from Etsy's A/B testing, of how the human brain can retrofit a story onto statistically-insignificant results. To fix: 'avoid building tooling that enables fishing expeditions; limit our post-hoc rationalization by explicitly constraining it before the experiment. Whenever we test a feature on Etsy, we begin the process by identifying metrics that we believe will change if we 1) understand what is happening and 2) get the effect we desire.'
testing  etsy  statistics  a-b-testing  fishing  ulysses-contract  brain  experiments 
january 2013 by jm
Two Sides For Salvation « Code as Craft
Etsy's MySQL master-master pair configuration, and how it allows no-downtime schema changes
database  etsy  mysql  replication  schema  availability  downtime 
december 2012 by jm
Weathering the Unexpected - ACM Queue
Failures happen, and resilience drills help organizations prepare for them.


Good write-up on Google's DiRT (Disaster Recovery Test) procedures, clearly based on Amazon's Gameday exercises. ;) See also http://queue.acm.org/detail.cfm?id=2371297 for a moderated discussion including Jesse Robbins and John Allspaw
game-day  tests  disaster-recovery  dirt  exercises  history  amazon  google  etsy  resilience  acm 
september 2012 by jm
Advanced PostMortem Fu and Human Error 101 (Velocity 2011)
John Allspaw's previous slides on Etsy's operations culture -- this'll be old hat to Amazon staff of course ;)
etsy  devops  engineering  operations  reliability  mttd  mttr  postmortems 
march 2012 by jm
Zombie Gnomes Bye Bye Birdie by ChrisandJanesPlace on Etsy
'This is a sorry sight indeed. A poor helpless Lawn Flamingo has been taken down by zombie gnomes: Nose-less Ned, Greedy Gary, and Bartolomeu. It seems like an unlikely kill until Bartolomeu broke the elegant beasts leg and brought it crashing to the ground. Where they pounced upon their helpless victim and began their feast. So we say "Bye Bye Birdie, I'm going to miss you so, Bye Bye Birdie, Why'd you have to go?"' -- bloody hell
etsy  regretsy  funny  odd  flamingo  zombies  gnomes 
february 2012 by jm
Divide and Concur « Code as Craft
Etsy's interesting approach to managing a large test suite, annotations marking potentially troublesome integration tests: "flaky", "database", "network", "sleep" and "slow".
testing  etsy  php  test-suites  annotations  integration-testing 
february 2012 by jm
Turbocharging Solr Index Replication with BitTorrent
Etsy now replicating their multi-GB search index across the search farm using BitTorrent. Why not Multicast? 'multicast rsync caused an epic failure for our network, killing the entire site for several minutes. The multicast traffic saturated the CPU on our core switches causing all of Etsy to be unreachable.' fun!
etsy  multicast  sev1  bittorrent  search  solr  rsync  scaling  outages 
february 2012 by jm
Etsy's metrics infrastructure
I never really understood how useful a good metrics infrastructure could be for operational visibility until I joined Amazon. Here's a good demo of Etsy's metrics system (via Nelson)
via:nelson  metrics  deployment  change-monitoring  etsy  software  monitoring  ops  from delicious
december 2010 by jm

