jm + devops   64

Kubernetes Best Practices // Speaker Deck
A lot of these are general Docker/containerisation best practices, too.

(via Devops Weekly)
k8s  kubernetes  devops  ops  containers  docker  best-practices  tips  packaging 
12 weeks ago by jm
Enough with the microservices
Good post!
Much has been written on the pros and cons of microservices, but unfortunately I’m still seeing them as something being pursued in a cargo cult fashion in the growth-stage startup world. At the risk of rewriting Martin Fowler’s Microservice Premium article, I thought it would be good to write up some thoughts so that I can send them to clients when the topic arises, and hopefully help people avoid some of the mistakes I’ve seen. The mistake of choosing a path towards a given architecture or technology on the basis of so-called best practices articles found online is a costly one, and if I can help a single company avoid it then writing this will have been worth it.
architecture  design  microservices  coding  devops  ops  monolith 
may 2017 by jm
The Push Train
Excellent preso from Dan McKinley on the Etsy-based continuous delivery model, and what he learned trying to apply it after Etsy:
It’s notable that almost all of the hard things we dealt with were social problems. Some of these solutions involved writing code, but the hard part was the human organization. The hard parts were maintaining a sense of community ownership over the state of the whole system.
etsy  ci  cd  deployment  devops  deploys  dan-mckinley  mcfunley  presentations 
may 2017 by jm
Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites
Solid article proselytising runbooks/playbooks (or in this article's parlance, "Incident Models") for dev/ops handover and operational knowledge
ops  process  sre  devops  runbooks  playbooks  incident-models 
april 2017 by jm
GitLab.com Database Incident - 2017/01/31
Horrible, horrible postmortem doc. This is the kicker:
So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.


Reddit comments: https://www.reddit.com/r/linux/comments/5rd9em/gitlab_is_down_notes_on_the_incident_and_why_you/
devops  backups  cloud  outage  incidents  postmortem  gitlab 
february 2017 by jm
Skyliner
Coda Hale's new gig on how they're using Docker, AWS, etc. I like this: "Use containers. Not too much. Mostly for packaging."
docker  aws  packaging  ops  devops  containers  skyliner 
september 2016 by jm
Introducing Winston
'Event driven Diagnostic and Remediation Platform' -- aka 'runbooks as code'
runbooks  winston  netflix  remediation  outages  mttr  ops  devops 
august 2016 by jm
The Challenges of Container Configuration // Speaker Deck
Some good advice on Docker metadata/config from Gareth Rushgrove
docker  metadata  configuration  build  devops  dev  containers  slides 
may 2016 by jm
Dan Luu reviews the Site Reliability Engineering book
voluminous! still looks great, looking forward to reading our copy (via Tony Finch)
via:fanf  books  reading  devops  ops  google  sre  dan-luu 
april 2016 by jm
Wired on the new O'Reilly SRE book
"Site Reliability Engineering: How Google Runs Production Systems", by Chris Jones, Betsy Beyer, Niall Richard Murphy, Jennifer Petoff. Go Niall!
google  sre  niall-murphy  ops  devops  oreilly  books  toread  reviews 
april 2016 by jm
Global Continuous Delivery with Spinnaker
Netflix' CD platform, post-Atlas. looks interesting
continuous-delivery  aws  netflix  cd  devops  ops  atlas  spinnaker 
november 2015 by jm
Why Docker is Not Yet Succeeding Widely in Production
Spot-on points which Docker needs to address. It's still production-ready, and _should_ be used there, it just has significant rough edges...
docker  containers  devops  deployment  releases  linux  ops 
july 2015 by jm
Patterns for building a resilient and scalable microservices platform on AWS
Some good details from Boyan Dimitrov at Hailo, on their orchestration, deployment, provisioning infra they've built
deployment  ops  devops  hailo  microservices  platform  patterns  slides 
may 2015 by jm
Etsy's Release Management process
Good info on how Etsy use their Deployinator tool, end-to-end.

Slide 11: git SHA is visible for each env, allowing easy verification of what code is deployed.

Slide 14: Code is deployed to "princess" staging env while CI tests are running; no need to wait for unit/CI tests to complete.

Slide 23: smoke tests of pre-prod "princess" (complete after 8 mins elapsed).

Slide 31: dashboard link for deployed code is posted during deploy; post-release prod smoke tests are run by Jenkins. (short ones! they complete in 42 seconds)
deployment  etsy  deploy  deployinator  princess  staging  ops  testing  devops  smoke-tests  production  jenkins 
april 2015 by jm
Outages, PostMortems, and Human Error 101
Good basic pres from John Allspaw, covering the basics of tier-one tech incident response -- defining the 5 severity levels; root cause analysis techniques (to Five-Whys or not); and the importance of service metrics
devops  monitoring  ops  five-whys  allspaw  slides  etsy  codeascraft  incident-response  incidents  severity  root-cause  postmortems  outages  reliability  techops  tier-one-support 
april 2015 by jm
Concourse
Concourse is a CI system composed of simple tools and ideas. It can express entire pipelines, integrating with arbitrary resources, or it can be used to execute one-off builds, either locally or in another CI system.
ci  concourse-ci  build  deployment  continuous-integration  continuous-deployment  devops 
march 2015 by jm
HP is trying to patent Continuous Delivery
This is appalling bollocks from HP:
On 1st March 2015 I discovered that in 2012 HP had filed a patent (WO2014027990) with the USPTO for ‘Performance tests in a continuous deployment pipeline‘ (the patent was granted in 2014). [....] HP has filed several patents covering standard Continuous Delivery (CD) practices. You can help to have these patents revoked by providing ‘prior art’ examples on Stack Exchange.

In fairness, though, this kind of shit happens in most big tech companies. This is what happens when you have a broken software patenting system, with big rewards for companies who obtain shitty troll patents like these, and in turn have companies who reward the engineers who sell themselves out to write up concepts which they know have prior art. Software patents are broken by design!
cd  devops  hp  continuous-deployment  testing  deployment  performance  patents  swpats  prior-art 
march 2015 by jm
Solving the Mystery of Link Imbalance: A Metastable Failure State at Scale | Engineering Blog | Facebook Code
Excellent real-world war story from Facebook -- a long-running mystery bug was eventually revealed to be a combination of edge-case behaviours across all the layers of the networking stack, from L2 link aggregation at the agg-router level, up to the L7 behaviour of the MySQL client connection pool.
Facebook collocates many of a user’s nodes and edges in the social graph. That means that when somebody logs in after a while and their data isn’t in the cache, we might suddenly perform 50 or 100 database queries to a single database to load their data. This starts a race among those queries. The queries that go over a congested link will lose the race reliably, even if only by a few milliseconds. That loss makes them the most recently used when they are put back in the pool. The effect is that during a query burst we stack the deck against ourselves, putting all of the congested connections at the top of the deck.
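The pool dynamics described above can be sketched in a few lines. This is an illustrative simulation, not Facebook's actual MySQL client code: queries over the congested link reliably finish last, and an MRU pool returns connections to the top as queries complete, so the congested connections end up stacked on top for the next burst.

```python
import random

def burst(pool, n_queries):
    """Simulate one query burst against an MRU connection pool.

    Connections are checked out from the top of the pool (end of list).
    Queries over the congested link lose the race, so their connections
    are returned last -- landing on top of the MRU pool.
    """
    checked_out = [pool.pop() for _ in range(min(n_queries, len(pool)))]
    # Sort by finish time: uncongested queries complete first, congested
    # ones lose "reliably, even if only by a few milliseconds".
    finish_order = sorted(checked_out, key=lambda c: (c["congested"], random.random()))
    for conn in finish_order:   # fastest queries return to the pool first...
        pool.append(conn)       # ...so congested connections stack on top
    return pool

pool = [{"id": i, "congested": i % 2 == 0} for i in range(10)]
pool = burst(pool, 10)
# After the burst, the top of the pool is all congested connections --
# the "stacked deck" the post describes.
print([c["congested"] for c in reversed(pool)])
```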
architecture  debugging  devops  facebook  layer-7  mysql  connection-pooling  aggregation  networking  tcp-stack 
november 2014 by jm
veggiemonk/awesome-docker
A curated list of Docker resources.
linux  sysadmin  docker  ops  devops  containers  hosting 
november 2014 by jm
Is Docker ready for production? Feedbacks of a 2 weeks hands on
I have to agree with this assessment -- there are a lot of loose ends still for production use of Docker in a SOA stack environment:
From my point of view, Docker is probably the best thing I’ve seen in ages to automate a build. It allows you to pre-build and reuse shared dependencies, ensuring they’re up to date and reducing your build time. It saves you from either polluting your Jenkins environment or booting a costly and slow VirtualBox virtual machine using Vagrant. But I don’t feel like it’s production ready in a complex environment, because it adds too much complexity. And I’m not even sure that’s what it was designed for.
docker  complexity  devops  ops  production  deployment  soa  web-services  provisioning  networking  logging 
october 2014 by jm
Netflix release new code to production before completing tests
Interesting -- I hadn't heard of this being an official practice anywhere before (although we actually did it ourselves this week)...
If a build has made it [past the 'integration test' phase], it is ready to be deployed to one or more internal environments for user-acceptance testing. Users could be UI developers implementing a new feature using the API, UI Testers performing end-to-end testing or automated UI regression tests. As far as possible, we strive to not have user-acceptance tests be a gating factor for our deployments. We do this by wrapping functionality in Feature Flags so that it is turned off in Production while testing is happening in other environments. 
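The feature-flag wrapping described in the quote can be sketched roughly as follows. The flag store, flag names, and functions here are invented for illustration, not Netflix's actual API -- the point is just that new code ships to production dark while UAT happens elsewhere:

```python
# Hypothetical flag store: same build everywhere, flag off in prod.
FLAGS = {
    "new-recommendation-api": {"prod": False, "test": True},
}

def flag_enabled(name, env):
    # Unknown flags/envs default to off -- fail safe in production.
    return FLAGS.get(name, {}).get(env, False)

def new_recommendations(user):
    return ["shiny"]      # still under user-acceptance testing

def legacy_recommendations(user):
    return ["stable"]     # what production users see meanwhile

def recommendations(user, env):
    if flag_enabled("new-recommendation-api", env):
        return new_recommendations(user)
    return legacy_recommendations(user)

print(recommendations("alice", "prod"))   # -> ['stable']
print(recommendations("alice", "test"))   # -> ['shiny']
```

Because the flag, not the deploy, gates the feature, user-acceptance tests stop being a gating factor for the deployment itself.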
devops  deployment  feature-flags  release  testing  integration-tests  uat  qa  production  ops  gating  netflix 
october 2014 by jm
Nix: The Purely Functional Package Manager
'a powerful package manager for Linux and other Unix systems that makes package management reliable and reproducible. It provides atomic upgrades and rollbacks, side-by-side installation of multiple versions of a package, multi-user package management and easy setup of build environments. '

Basically, this is a third-party open source reimplementation of Amazon's (excellent) internal packaging system, using symlinks to versioned package directories to ensure atomicity and the ability to roll back. This is definitely the *right* way to build packages -- I know what tool I'll be pushing for, next time this question comes up.

See also nixos.org for a Linux distro built on Nix.
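The symlinks-to-versioned-directories scheme is simple enough to sketch. This is a minimal illustration of the general technique (not Nix's implementation): each version lives in its own directory, and activation or rollback is a single atomic rename of a symlink, so readers see either the old tree or the new one, never a half-upgraded mix.

```python
import os
import tempfile

def activate(version_dir, current_link):
    """Atomically point `current_link` at a versioned package directory."""
    tmp = current_link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(version_dir, tmp)
    os.replace(tmp, current_link)   # rename(2) is atomic on POSIX

# Usage: lay down releases side by side, then flip the link.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "releases", "v1"))
os.makedirs(os.path.join(base, "releases", "v2"))
link = os.path.join(base, "current")

activate(os.path.join(base, "releases", "v1"), link)
activate(os.path.join(base, "releases", "v2"), link)   # upgrade
print(os.readlink(link))                               # now points at v2
activate(os.path.join(base, "releases", "v1"), link)   # rollback: same op
```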
ops  linux  devops  unix  packaging  distros  nix  nixos  atomic  upgrades  rollback  versioning 
september 2014 by jm
Box Tech Blog » A Tale of Postmortems
How Box introduced COE-style dev/ops outage postmortems, and got them working. This PIE metric sounds really useful to head off the dreaded "it'll all have to come out missus" action item:
The picture was getting clearer, and we decided to look into individual postmortems and action items and see what was missing. As it was, action items were wasting away with no owners. Digging deeper, we noticed that many action items entailed massive refactorings or vague requirements like “make system X better” (i.e. tasks that realistically were unlikely to be addressed). At a higher level, postmortem discussions often devolved into theoretical debates without a clear outcome. We needed a way to lower and focus the postmortem bar and a better way to categorize our action items and our technical debt.

Out of this need, PIE (“Probability of recurrence * Impact of recurrence * Ease of addressing”) was born. By ranking each factor from 1 (“low”) to 5 (“high”), PIE provided us with two critical improvements:

1. A way to police our postmortems discussions. I.e. a low probability, low impact, hard to implement solution was unlikely to get prioritized and was better suited to a discussion outside the context of the postmortem. Using this ranking helped deflect almost all theoretical discussions.
2. A straightforward way to prioritize our action items.

What’s better is that once we embraced PIE, we also applied it to existing tech debt work. This was critical because we could now prioritize postmortem action items alongside existing work. Postmortem action items became part of normal operations just like any other high-priority work.
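The PIE score as described is just the product of three 1–5 rankings, which makes the prioritisation mechanical. A small sketch (the action items here are invented examples):

```python
def pie(probability, impact, ease):
    """Probability of recurrence * Impact of recurrence * Ease of addressing,
    each ranked from 1 ("low") to 5 ("high")."""
    for factor in (probability, impact, ease):
        assert 1 <= factor <= 5, "each factor is ranked 1-5"
    return probability * impact * ease

action_items = [
    ("add replication lag alert", {"probability": 4, "impact": 4, "ease": 5}),
    ("make system X better",      {"probability": 2, "impact": 2, "ease": 1}),
    ("rotate stale TLS certs",    {"probability": 3, "impact": 5, "ease": 4}),
]
ranked = sorted(action_items, key=lambda item: pie(**item[1]), reverse=True)
for name, factors in ranked:
    print(f"{pie(**factors):3d}  {name}")
```

The vague "make system X better" item scores 4 and sinks to the bottom, which is exactly the postmortem-policing effect the post describes.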
postmortems  action-items  outages  ops  devops  pie  metrics  ranking  refactoring  prioritisation  tech-debt 
august 2014 by jm
Richard Clayton - Failing at Microservices
Solid warts-and-all confessional blogpost about a team failing to implement a microservices architecture. I'd put most of the blame on insufficient infrastructure to support them (at a code level), inter-personal team problems, and inexperience with large-scale complex multi-service production deployment and the work it was going to require
microservices  devops  collaboration  architecture  fail  team  deployment  soa 
august 2014 by jm
Microservices - Not a free lunch! - High Scalability
Some good reasons not to adopt microservices blindly. Testability and distributed-systems complexity are my biggest fears
microservices  soa  devops  architecture  testing  distcomp 
august 2014 by jm
interview with Google VP of SRE Ben Treynor
interviewed by Niall Murphy, no less ;). Some good info on what Google deems important from an ops/SRE perspective
sre  ops  devops  google  monitoring  interviews  ben-treynor 
may 2014 by jm
"Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" [PDF]
Google paper describing the infrastructure they've built for cross-service request tracing (ie. "tracer requests"). Features: low code changes required (since they've built it into the internal protobuf libs), low performance impact, sampling, deployment across the ~entire production fleet, output visibility in minutes, and has been live in production for over 2 years. Excellent read
dapper  tracing  http  services  soa  google  papers  request-tracing  tracers  protobuf  devops 
march 2014 by jm
Trousseau
'an interesting approach to a common problem, that of securely passing secrets around an infrastructure. It uses GPG signed files under the hood and nicely integrates with both version control systems and S3.'

I like this as an approach to securely distributing secrets across a stack of services during deployment. Check in the file of keys, gpg keygen on the server, and add it to the keyfile's ACL during deployment. To simplify, shared or pre-generated GPG keys could also be used.

(via the Devops Weekly newsletter)
gpg  encryption  crypto  secrets  key-distribution  pki  devops  deployment 
february 2014 by jm
deploy_to_runit
A nice node.js app to perform continuous deployment from a GitHub repo via its webhook support, from Matt Sergeant
github  node.js  runit  deployment  git  continuous-deployment  devops  ops 
january 2014 by jm
Don’t get stuck
Good description of Etsy's take on continuous deployment, committing directly to trunk, hidden with feature-flags, from Rafe Colburn
continuous-deployment  coding  agile  deployment  devops  etsy  rafe-colburn 
january 2014 by jm
Moving to Multiple Deployments Per Week at thetrainline.com
continuous delivery using ThoughtWorks Go. Notably, the UI is quite similar to the internal Amazon CD system
thoughtworks  continuous-delivery  continuous-deployment  devops  deployment 
december 2013 by jm
Load Balancer Testing with a Honeypot Daemon
nice post on writing BDD unit tests for infrastructure, in this case specifically a load balancer (via Devops Weekly)
load-balancers  ops  devops  sysadmin  testing  unit-tests  networking  honeypot  infrastructure  bdd 
december 2013 by jm
Chef Testing at PagerDuty
Good article on how PagerDuty test their chef changes -- lint, unit tests using ChefSpec, integ tests and their "Failure Friday" game days
testing  chef  ops  devops  chefspec  game-days  pagerduty 
december 2013 by jm
Failure Friday: How We Ensure PagerDuty is Always Reliable
Basically, they run the kind of exercise which Jesse Robbins invented at Amazon -- "Game Days". Scarily, they do these on a Friday -- living dangerously!
game-days  testing  failure  devops  chaos-monkey  ops  exercises 
november 2013 by jm
Why Every Company Needs A DevOps Team Now - Feld Thoughts
Bookmarking particularly for the 3 "favourite DevOps patterns":

"Make sure we have environments available early in the Development process"; enforce a policy that the code and environment are tested together, even at the earliest stages of the project; “Wake up developers up at 2 a.m. when they break things"; and "Create reusable deployment procedures".
devops  work  ops  deployment  testing  pager-duty 
november 2013 by jm
"What Should I Monitor?"
slides (lots of slides) from Baron Schwartz' talk at Velocity in NYC.
slides  monitoring  metrics  ops  devops  baron-schwartz  pdf  capacity 
october 2013 by jm
Docker: Git for deployment
Docker is to deployment as Git is to development.

Developers are able to leverage Git's performance and flexibility when building applications. Git encourages experiments and doesn't punish you when things go wrong: start your experiments in a branch, if things fall down, just git rebase or git reset. It's easy to start a branch and fast to push it.

Docker encourages experimentation for operations. Containers start quickly. Building images is a snap. Using another images as a base image is easy. Deploying whole images is fast, and last but not least, it's not painful to rollback.

Fast + flexible = deployments are about to become a lot more enjoyable.
docker  deployment  sysadmin  ops  devops  vms  vagrant  virtualization  containers  linux  git 
august 2013 by jm
Next Generation Continuous Integration & Deployment with dotCloud’s Docker and Strider
Since Docker treats its images as a tree of derivations from a source image, you have the ability to store an image at each stage of a build. This means we can provide full binary images of the environment in which the tests failed. This allows you to run locally bit-for-bit the same container as the CI server ran. Due to the magic of Docker and AUFS Copy-On-Write filesystems, we can store this cheaply.

Often tests pass when built in a CI environment, but when built in another (e.g. production) environment break due to subtle differences. Docker makes it trivial to take exactly the binary environment in which the tests pass, and ship that to production to run it.
docker  strider  continuous-integration  continuous-deployment  deployment  devops  ops  dotcloud  lxc  virtualisation  copy-on-write  images 
july 2013 by jm
Care and Feeding of Large Scale Graphite Installations [slides]
good docs for large-scale graphite use: 'Tip and tricks of using and scaling graphite. First presented at DevOpsDays Austin Texas 2013-05-01'
graphite  devops  ops  metrics  dashboards  sysadmin 
june 2013 by jm
My Philosophy on Alerting
'based on my observations while I was a Site Reliability Engineer at Google.' - by Rob Ewaschuk; very good, and matches the similar recommendations and best practices at Amazon, for that matter
monitoring  ops  devops  alerting  alerts  pager-duty  via:jk 
may 2013 by jm
Operations is Dead, but Please Don’t Replace it with DevOps
This is so damn spot on.
Functional silos (and a standalone DevOps team is a great example of one) decouple actions from responsibility. Functional silos allow people to ignore, or at least feel disconnected from, the consequences of their actions. DevOps is a cultural change that encourages, rewards and exposes people taking responsibility for what they do, and what is expected from them. As Werner Vogels from Amazon Web Services says, “you build it, you run it”. So a “DevOps team” is a risky and ultimately doomed strategy. Sure there are some technical roles, specifically related to the enablement of DevOps as an approach and these roles and tools need to be filled and built. Self service platforms, collaboration and communication systems, tool chains for testing, deployment and operations are all necessary. Sure someone needs to deliver on that stuff. But those are specific technical deliverables and not DevOps. DevOps is about people, communication and collaboration. Organizations ignore that at their peril.
devops  teams  work  ops  silos  collaboration  organisations 
may 2013 by jm
Testing Your Automation [slides]
Test-driven infrastructure, using Chef -- slides from Big Ruby 2013. Tools used: foodcritic (lol), Chefspec, minitest-chef-handler, fauxhai, cucumber chef. This is really good to see -- TDD applied to ops. Video at: http://confreaks.com/videos/2309-bigruby2013-testing-your-automation-ttd-for-chef-cookbooks
devops  ops  chef  automation  testing  tdd  infrastructure  provisioning  deployment 
april 2013 by jm
Measure Anything, Measure Everything « Code as Craft
the classic Etsy pro-metrics "measure everything" post. Some good basic rules and mindset
etsy  monitoring  metrics  stats  ops  devops 
april 2013 by jm
12 DevOps anti-patterns
my favourite:

3. Rebrand your ops/dev/any team as the DevOps team

CIO: “I want to embrace DevOps over the coming year.”

MGR: “Already done, we changed the department signage this morning. We are so awesome we now have 2 DevOps teams.”

Yeah great. And I bet you now have lots of “DevOps” engineers walking round too. If you’re lucky they may sit next to each other at lunch.
devops  ops  dev  company  culture  work  antipatterns  engineering 
april 2013 by jm
The first pillar of agile sysadmin: We alert on what we draw
'One of [the] purposes of monitoring systems was to provide data to allow us, as engineers, to detect patterns, and predict issues before they become production impacting. In order to do this, we need to be capturing data and storing it somewhere which allows us to analyse it. If we care about it - if the data could provide the kind of engineering insight which helps us to understand our systems and give early warning - we should be capturing it.'

'There are a couple of weaknesses in [Nagios' design]. Assuming we’ve agreed that if we care about a metric enough to want to alert on it then we should be gathering that data for analysis, and graphing it, then we already have the data upon which to base our check. Furthermore, this data is not on the machine we’re monitoring, so our checks don’t in any way add further stress to that machine.'

I would add that if we are alerting on a different set of data from what we collect for graphing, then using the graphs to investigate an alarm may run into problems if they don't sync up.
devops  monitoring  deployment  production  sysadmin  ops  alerting  metrics 
march 2013 by jm
Shell Scripts Are Like Gremlins
Shell Scripts are like Gremlins. You start out with one adorably cute shell script. You commented it and it does one thing really well. It’s easy to read, everyone can use it. It’s awesome! Then you accidentally spill some water on it, or feed it late one night and omgwtf is happening!?


+1. I have to wean myself off the habit of automating with shell scripts where a clean, well-unit-tested piece of code would work better.
shell-scripts  scripting  coding  automation  sysadmin  devops  chef  deployment 
december 2012 by jm
What can data scientists learn from DevOps?
Interesting.

'Rather than continuing to pretend analysis is a one-time, ad hoc action, automate it. [...] you need to maintain the automation machinery, but a cost-benefit analysis will show that the effort rapidly pays off — particularly for complex actions such as analysis that are nontrivial to get right.' (via @fintanr)
via:fintanr  data-science  data  automation  devops  analytics  analysis 
november 2012 by jm
WebTechStacks by martharotter - Kippt
A good set of infrastructure/devops tech blogs, collected by Martha Rotter
via:martharotter  blogs  infrastructure  devops  ops  web  links 
november 2012 by jm
Ansible
'SSH-Based Configuration Management & Deployment'. deploy via SSH; no target-side daemons required. GPLv3 licensed, unfortunately :(
ansible  devops  configuration  deployment  sysadmin  python  ssh 
july 2012 by jm
[tahoe-dev] erasure coding makes files more fragile, not less
Zooko says: "This monitoring and operations engineering is a lot of work!" amen to that
erasure-coding  replicas  fs  tahoe-lafs  zooko  monitoring  devops  ops 
march 2012 by jm
Advanced PostMortem Fu and Human Error 101 (Velocity 2011)
John Allspaw's previous slides on Etsy's operations culture -- this'll be old hat to Amazon staff of course ;)
etsy  devops  engineering  operations  reliability  mttd  mttr  postmortems 
march 2012 by jm
Building with Legos
Netflix tech blog on how they deploy their services. Notably, they avoid the Puppet/Chef approach, citing these reasons: 'One is that it eliminates a number of dependencies in the production environment: a master control server, package repository and client scripts on the servers, network permissions to talk to all of these. Another is that it guarantees that what we test in the test environment is the EXACT same thing that is deployed in production; there is very little chance of configuration or other creep/bit rot. Finally, it means that there is no way for people to change or install things in the production environment (this may seem like a really harsh restriction, but if you can build a new AMI fast enough it doesn't really make a difference).'
devops  cloud  aws  netflix  puppet  chef  deployment 
august 2011 by jm
Deployment is just a part of dev/ops cooperation, not the whole thing
metrics, monitoring, instrumentation, fault tolerance, load mitigation called out as other factors by Allspaw
ops  deployment  operations  engineering  metrics  devops  monitoring  fault-tolerance  load  from delicious
december 2009 by jm
