jm + ops   462

Learning to operate Kubernetes reliably
A very solid writeup from Julia "b0rk" Evans at Stripe
stripe  kubernetes  cron  distributed-cron  jobs  docker  containers  ops  julia-evans 
4 weeks ago by jm
'Simple uptime monitoring: distributed, self-hosted health checks and status pages' -- stores in S3
go  ops  monitoring  uptime  health-checks  status-pages  status  golang  s3 
4 weeks ago by jm
'The missing link between AWS AutoScaling Groups and Route53 [...] solves the issue of keeping a route53 zone up to date with the changes that an autoscaling group might face.'
auto53  route-53  dns  aws  amazon  ops  hostnames  asg  autoscaling 
5 weeks ago by jm
AWS re:Invent 2017: Container Networking Deep Dive with Amazon ECS (CON401) // Practical Applications
Another re:Invent highlight to watch -- ECS' new native container networking model explained
reinvent  aws  containers  docker  ecs  networking  sdn  ops 
6 weeks ago by jm
Introducing the Amazon Time Sync Service
Well overdue; includes Google-style leap smearing
time-sync  time  aws  services  ntp  ops 
7 weeks ago by jm
Introducing AWS Fargate – Run Containers without Managing Infrastructure
Now that's a good announcement. Available right away atop ECS; EKS support to follow in 2018
eks  ecs  fargate  aws  services  ops  containers  docker 
7 weeks ago by jm
'A cure for Cron's chronic email problem'
cron  linux  unix  ops  sysadmin  mail 
october 2017 by jm
IBM broke its cloud by letting three domain names expire - The Register
“multiple domain names were mistakenly allowed to expire and were in hold status.”
outages  fail  ibm  the-register  ops  dns  domains  cloud 
october 2017 by jm
'AWS Lambda cheatsheet' -- a quick ref card for Lambda users
aws  lambda  ops  serverless  reference  quick-references 
october 2017 by jm
How to operate reliable AWS Lambda applications in production
running a reliable Lambda application in production requires you to still follow operational best practices. In this article I am including some recommendations, based on my experience with operations in general as well as working with AWS Lambda.
aws  cloud  lambda  ops  amazon 
october 2017 by jm
S3 Point In Time Restore
restore a versioned S3 bucket to its state at a specific point in time
ops  s3  restore  backups  versioning  history  tools  scripts  unix 
october 2017 by jm
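The core of a point-in-time restore like this is, for each key, choosing the newest object version at or before the target time and skipping keys whose newest such version is a delete marker. A minimal sketch of that selection step, using plain dicts standing in for S3's ListObjectVersions output (no real AWS calls; field names mirror the API):

```python
# Sketch: pick each key's version as of a given time from a version history.
# Records are hypothetical stand-ins for S3 ListObjectVersions output.
from datetime import datetime

def versions_at(versions, when):
    """Map key -> version id as of `when`, omitting deleted keys."""
    chosen = {}
    for v in sorted(versions, key=lambda v: v["LastModified"]):
        if v["LastModified"] <= when:
            chosen[v["Key"]] = v  # later records overwrite earlier ones
    return {k: v["VersionId"] for k, v in chosen.items()
            if not v.get("IsDeleteMarker", False)}

history = [
    {"Key": "a", "VersionId": "v1", "LastModified": datetime(2017, 1, 1)},
    {"Key": "a", "VersionId": "v2", "LastModified": datetime(2017, 6, 1)},
    {"Key": "b", "VersionId": "v1", "LastModified": datetime(2017, 2, 1)},
    {"Key": "b", "VersionId": "v2", "LastModified": datetime(2017, 7, 1),
     "IsDeleteMarker": True},
]
snapshot = versions_at(history, datetime(2017, 6, 15))
# snapshot == {"a": "v2", "b": "v1"}
```

A real restore would then copy each chosen version id back over the live key.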
Share scripts that have dependencies with Nix
Nice approach to one-liner packaging invocations using nix-shell
nix  packaging  unix  linux  ops  shebang  #! 
october 2017 by jm
HN thread on the new Network Load Balancer AWS product
looks like @colmmacc works on it. Lots and lots of good details here
nlb  aws  load-balancing  ops  architecture  lbs  tcp  ip 
september 2017 by jm
Going Multi-Cloud with AWS and GCP: Lessons Learned at Scale
Metamarkets splits across AWS and GCP, going into heavy detail here
aws  gcp  google  ops  hosting  multi-cloud 
august 2017 by jm
Linux Load Averages: Solving the Mystery
Nice bit of OS archaeology by Brendan Gregg.
In 1993, a Linux engineer found a nonintuitive case with load averages, and with a three-line patch changed them forever from "CPU load averages" to what one might call "system load averages." His change included tasks in the uninterruptible state, so that load averages reflected demand for disk resources and not just CPUs. These system load averages count the number of threads working and waiting to work, and are summarized as a triplet of exponentially-damped moving sum averages that use 1, 5, and 15 minutes as constants in an equation. This triplet of numbers lets you see if load is increasing or decreasing, and their greatest value may be for relative comparisons with themselves.
load  monitoring  linux  unix  performance  ops  brendan-gregg  history  cpu 
august 2017 by jm
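The 1/5/15-minute triplet Gregg describes is directly observable: `os.getloadavg()` in the Python standard library reads the same numbers as `/proc/loadavg` or `uptime(1)`, and on Linux these are the post-1993 *system* load averages (runnable threads plus uninterruptible D-state sleepers):

```python
# Read the exponentially-damped 1/5/15-minute system load averages.
# On Linux these include uninterruptible (D-state) tasks, per the quote above.
import os

one, five, fifteen = os.getloadavg()

# The triplet is mostly useful for relative comparison, e.g. a crude
# "is load trending upward?" signal:
trending_up = one > five > fifteen
```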
Arq Backs Up To B2!
Arq backup for OSX now supports B2 (as well as S3) as a storage backend.
"it’s a super-cheap option ($.005/GB per month) for storing your backups." (that is less than half the price of $0.0125/GB for S3's Infrequent Access class)
s3  storage  b2  backblaze  backups  arq  macosx  ops 
august 2017 by jm
Working with multiple AWS accounts at Ticketea
AWS STS/multiple account best practice described
sts  aws  authz  ops  ticketea  dev 
august 2017 by jm
AWS Lambda Deployment using Terraform – Build ACL – Medium
Fairly persuasive that production usage of Lambda is much easier if you go full Terraform to manage and deploy.
A complete picture of what it takes to deploy your Lambda function to production with the same diligence you apply to any other codebase using Terraform. [...] There are many cases where frameworks such as SAM or Serverless are not enough. You need more than that for a highly integrated Lambda function. In such cases, it’s easier to simply use Terraform.
infrastructure  aws  lambda  serverless  ops  terraform  sam 
august 2017 by jm
Nextflow - A DSL for parallel and scalable computational pipelines
Data-driven computational pipelines

Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages.

Its fluent DSL simplifies the implementation and the deployment of complex parallel and reactive workflows on clouds and clusters.

GPLv3 licensed, open source
computation  workflows  pipelines  batch  docker  ops  open-source 
august 2017 by jm
EBS gp2 I/O BurstBalance exhaustion
when EBS volumes in EC2 exhaust their "burst" allocation, things go awry very quickly
performance  aws  ebs  ec2  burst-balance  ops  debugging 
july 2017 by jm
Kubernetes Best Practices // Speaker Deck
A lot of these are general Docker/containerisation best practices, too.

(via Devops Weekly)
k8s  kubernetes  devops  ops  containers  docker  best-practices  tips  packaging 
july 2017 by jm
EC2Rescue for Linux
Amazon Web Services EC2 Rescue for Linux is a python-based tool that allows for the automatic diagnosis of common problems found on EC2 Linux instances.

Most of the modules appear to be log-greppers looking for common kernel issues.
ec2  aws  kernel  linux  ec2rl  ops 
july 2017 by jm
Wifi AP Placement [video]
'AP Placement - A Job For the Work Experience Kid? | Scott Stapleton | WLPC EU Budapest 2016'
ap  wifi  placement  layout  ops  wireless  home  presos 
july 2017 by jm
OVH suffer 24-hour outage (The Register)
Choice quotes:

‘At 6:48pm, Thursday, June 29, in Room 3 of the P19 datacenter, due to a crack on a soft plastic pipe in our water-cooling system, a coolant leak causes fluid to enter the system’;
‘This process had been tested in principle but not at a 50,000-website scale’
postmortems  ovh  outages  liquid-cooling  datacenters  dr  disaster-recovery  ops 
july 2017 by jm
Fastest syncing of S3 buckets
good tip for "aws s3 sync" performance
performance  aws  s3  copy  ops  tips 
july 2017 by jm
Scheduled Tasks (cron) - Amazon EC2 Container Service
ECS now does cron jobs. But where does AWS Batch fit in? confusing
aws  batch  ecs  cron  scheduling  recurrence  ops 
july 2017 by jm
Top 5 ways to improve your AWS EC2 performance
A couple of bits of excellent advice from Datadog (although this may be a slightly old post, from Oct 2016):

1. Unpredictable EBS disk I/O performance. Note that gp2 volumes do not appear to need as much warmup or priming as before.

2. EC2 Instance ECU Mismatch and Stolen CPU. Advice: use bigger instances.

The other 3 ways are a little obvious by comparison, but worth bookmarking for those two anyway.
ops  ec2  performance  datadog  aws  ebs  stolen-cpu  virtualization  metrics  tips 
july 2017 by jm
How Did I “Hack” AWS Lambda to Run Docker Containers?
Running Docker containers in Lambda using a usermode-docker hack -- hacky as hell but fun ;) Lambda should really support native Docker though
docker  lambda  aws  serverless  ops  hacks  udocker 
june 2017 by jm
Open Guide to Amazon Web Services
'A lot of information on AWS is already written. Most people learn AWS by reading a blog or a “getting started guide” and referring to the standard AWS references. Nonetheless, trustworthy and practical information and recommendations aren’t easy to come by. AWS’s own documentation is a great but sprawling resource few have time to read fully, and it doesn’t include anything but official facts, so omits experiences of engineers. The information in blogs or Stack Overflow is also not consistently up to date. This guide is by and for engineers who use AWS. It aims to be a useful, living reference that consolidates links, tips, gotchas, and best practices. It arose from discussion and editing over beers by several engineers who have used AWS extensively.'
amazon  aws  guides  documentation  ops  architecture 
june 2017 by jm
usl4j And You
Coda Hale wrote a handy Java library implementing a USL (Universal Scalability Law) solver
usl  scalability  java  performance  optimization  benchmarking  measurement  ops  coda-hale 
june 2017 by jm
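For reference, the model a USL solver fits is Gunther's Universal Scalability Law: X(N) = λN / (1 + σ(N−1) + κN(N−1)), where λ is single-worker throughput, σ is contention, and κ is crosstalk/coherency cost. A small sketch with made-up coefficients (not from any real benchmark) showing the characteristic peak-then-degrade curve:

```python
# Universal Scalability Law: X(N) = lam*N / (1 + sigma*(N-1) + kappa*N*(N-1))
# lam: one-worker throughput; sigma: contention; kappa: coherency crosstalk.
# The coefficients below are illustrative only.

def usl_throughput(n, lam, sigma, kappa):
    """Predicted throughput at concurrency n."""
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

lam, sigma, kappa = 1000.0, 0.03, 0.0005
xs = [usl_throughput(n, lam, sigma, kappa) for n in range(1, 101)]

# With nonzero kappa, throughput peaks and then falls as concurrency grows;
# the peak lands near sqrt((1 - sigma) / kappa).
peak = max(range(100), key=lambda i: xs[i]) + 1
```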
Scaling Amazon Aurora at ticketea
Ticketing is a business in which extreme traffic spikes are the norm, rather than the exception. For Ticketea, this means that our traffic can increase by a factor of 60x in a matter of seconds. This usually happens when big events (which have a fixed, pre-announced 'sale start time') go on sale.
scaling  scalability  ops  aws  aurora  autoscaling  asg 
may 2017 by jm
Enough with the microservices
Good post!
Much has been written on the pros and cons of microservices, but unfortunately I’m still seeing them as something being pursued in a cargo cult fashion in the growth-stage startup world. At the risk of rewriting Martin Fowler’s Microservice Premium article, I thought it would be good to write up some thoughts so that I can send them to clients when the topic arises, and hopefully help people avoid some of the mistakes I’ve seen. The mistake of choosing a path towards a given architecture or technology on the basis of so-called best practices articles found online is a costly one, and if I can help a single company avoid it then writing this will have been worth it.
architecture  design  microservices  coding  devops  ops  monolith 
may 2017 by jm
hosted status page / downtime banner service
banners  web  status  uptime  downtime  ops  reliability 
may 2017 by jm
Spotting a million dollars in your AWS account · Segment Blog
You can easily split your spend by AWS service per month and call it a day. Ten thousand dollars of EC2, one thousand to S3, five hundred dollars to network traffic, etc. But what’s still missing is a synthesis of which products and engineering teams are dominating your costs. 

Then, add in the fact that you may have hundreds of instances and millions of containers that come and go. Soon, what started as a simple analysis problem has quickly become unimaginably complex. 

In this follow-up post, we’d like to share details on the toolkit we used. Our hope is to offer up a few ideas to help you analyze your AWS spend, no matter whether you’re running only a handful of instances, or tens of thousands.

segment  money  costs  billing  aws  ec2  ecs  ops 
may 2017 by jm

awslimitchecker
A script and Python module to check your AWS service limits and usage, and warn when usage approaches limits.

Users building out scalable services in Amazon AWS often run into AWS' service limits - often at the least convenient time (i.e. mid-deploy or when autoscaling fails). Amazon's Trusted Advisor can help this, but even the version that comes with Business and Enterprise support only monitors a small subset of AWS limits and only alerts weekly. awslimitchecker provides a command line script and reusable package that queries your current usage of AWS resources and compares it to limits (hard-coded AWS defaults that you can override, API-based limits where available, or data from Trusted Advisor where available), notifying you when you are approaching or at your limits.

(via This Week in AWS)
aws  amazon  limits  scripts  ops 
may 2017 by jm
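The comparison the tool performs reduces to: divide current usage by the effective limit and alert past a threshold (the real tool's defaults are 80% warning / 99% critical). A sketch of that check with hypothetical service names and counts:

```python
# Sketch of a limits check: flag services at or past warn/crit thresholds.
# Service names and numbers below are made up for illustration.

def check_limits(usage, limits, warn=0.80, crit=0.99):
    """Return {service: 'warning'|'critical'} for services near their limits."""
    alerts = {}
    for svc, used in usage.items():
        limit = limits.get(svc)
        if not limit:
            continue  # unknown limit: nothing to compare against
        frac = used / limit
        if frac >= crit:
            alerts[svc] = "critical"
        elif frac >= warn:
            alerts[svc] = "warning"
    return alerts

alerts = check_limits(
    usage={"ec2-instances": 17, "eips": 5, "asgs": 198},
    limits={"ec2-instances": 20, "eips": 5, "asgs": 200},
)
# alerts == {"ec2-instances": "warning", "eips": "critical", "asgs": "critical"}
```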
cristim/autospotting: Pay up to 10 times less on EC2 by automatically replacing on-demand AutoScaling group members with similar or larger identically configured spot instances.
A simple and easy to use tool designed to significantly lower your Amazon AWS costs by automating the use of the spot market.

Once enabled on an existing on-demand AutoScaling group, it launches an EC2 spot instance that is cheaper, at least as large and configured identically to your current on-demand instances. As soon as the new instance is ready, it is added to the group and an on-demand instance is detached from the group and terminated.

It continuously applies this process, gradually replacing any on-demand instances with spot instances until the group only consists of spot instances, but it can also be configured to keep some on-demand instances running.
aws  golang  ec2  autoscaling  asg  spot-instances  ops 
may 2017 by jm
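The replacement loop described above — attach a spot instance, then detach and terminate one on-demand instance, repeating until only a configured floor of on-demand capacity remains — can be modeled without any AWS calls. Instance records here are plain strings purely for illustration:

```python
# Toy model of autospotting's gradual on-demand -> spot replacement.
# Real code would call EC2/AutoScaling APIs; strings stand in for instances.

def replace_with_spot(group, keep_on_demand=0):
    """Return the group after gradual replacement, keeping a minimum
    number of on-demand instances running."""
    group = list(group)
    while group.count("on-demand") > keep_on_demand:
        group.append("spot")        # attach the replacement first...
        group.remove("on-demand")   # ...then detach one on-demand instance
    return group

result = replace_with_spot(["on-demand"] * 4, keep_on_demand=1)
# result == ["on-demand", "spot", "spot", "spot"]
```

Group size never drops below its starting point mid-swap, which is the point of attaching before detaching.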
acksin/seespot: AWS Spot instance health check with termination and clean up support
When a Spot Instance is about to terminate there is a 2 minute window before the termination actually happens. SeeSpot is a utility for AWS Spot instances that handles the health check. If used with an AWS ELB it also handles cleanup of the instance when a Spot Termination notice is sent.
aws  elb  spot-instances  health-checks  golang  lifecycle  ops 
may 2017 by jm
'FREE WiFi Site Survey Software for MAC OS X & Windows'.
Sadly reviews from pals are that it is 'shite' :(
osx  wifi  network  survey  netspot  networking  ops  dataviz  wireless 
april 2017 by jm
Julia Evans on Twitter: "notes on this great "When the pager goes off" article"
'notes on this great "When the pager goes off" article from @incrementmag ' -- cartoon summarising a much longer article of common modern ops on-call response techniques. Still pretty consistent with the systems we used in Amazon
on-call  ops  incident-response  julia-evans  pager  increment-mag 
april 2017 by jm
Ubuntu on AWS gets serious performance boost with AWS-tuned kernel
interesting -- faster boots, CPU throttling resolved on t2.micros, other nice stuff
aws  ubuntu  ec2  kernel  linux  ops 
april 2017 by jm
Spotify’s Love/Hate Relationship with DNS
omg somebody at Spotify really really loves DNS. They even store a DHT hash ring in it. whyyyyyyyyyyy
spotify  networking  architecture  dht  insane  scary  dns  unbound  ops 
april 2017 by jm
Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites
Solid article proselytising runbooks/playbooks (or in this article's parlance, "Incident Models") for dev/ops handover and operational knowledge
ops  process  sre  devops  runbooks  playbooks  incident-models 
april 2017 by jm
Deep Dive on Amazon EBS Elastic Volumes
'March 2017 AWS Online Tech Talks' -- lots about the new volume types
aws  ebs  storage  architecture  ops  slides 
march 2017 by jm
Learn redis the hard way (in production) · trivago techblog
oh god this is pretty awful. this just reads like "don't try to use Redis at scale" to me
redis  scalability  ops  architecture  horror  trivago  php 
march 2017 by jm
Top for containers (i.e. Docker)
docker  containers  top  ops  go  monitoring  cpu 
march 2017 by jm
How to stop Ubuntu Xenial (16.04) from randomly killing your big processes
Unfortunately, a bug was recently introduced into the allocator which made it sometimes not try hard enough to free kernel cache memory before giving up and invoking the OOM killer. In practice, this means that at random times, the OOM killer would strike at big processes when the kernel tries to allocate, say, 16 kilobytes of memory for a new process’s thread stack — even when there are many gigabytes of memory in reclaimable kernel caches!
oom-killer  ooms  linux  ops  16.04 
march 2017 by jm
Annotated tenets of SRE
A Google SRE annotates the Google SRE book with his own thoughts. The source material is great, but the commentary improves on it.

Particularly good for the error budget concept.

Also: when did "runbooks" become "playbooks"? Don't particularly care either way, but needless renaming is annoying.
runbooks  playbooks  ops  google  sre  error-budget 
march 2017 by jm
The Occasional Chaos of AWS Lambda Runtime Performance
If our code has modest resource requirements, and can tolerate large changes in performance, then it makes sense to start with the least amount of memory necessary. On the other hand, if consistency is important, the best way to achieve that is by cranking the memory setting all the way up to 1536MB.
It’s also worth noting here that CPU-bound Lambdas may be cheaper to run over time with a higher memory setting, as Jim Conning describes in his article, “AWS Lambda: Faster is Cheaper”. In our tests, we haven’t seen conclusive evidence of that behavior, but much more data is required to draw any strong conclusions.
The other lesson learned is that Lambda benchmarks should be gathered over the course of days, not hours or minutes, in order to provide actionable information. Otherwise, it’s possible to see very impressive performance from a Lambda that might later dramatically change for the worse, and any decisions made based on that information will be rendered useless.
aws  lambda  amazon  performance  architecture  ops  benchmarks 
march 2017 by jm
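The "faster is cheaper" point follows from how Lambda bills: you pay per GB-second, but CPU is allocated proportionally to memory, so if raising the memory setting shrinks a CPU-bound duration more than proportionally, the bill drops. A sketch of the arithmetic — the per-GB-second price is Lambda's published 2017 rate, the durations are hypothetical, and real billing rounds duration up to 100 ms increments (ignored here):

```python
# GB-second cost arithmetic for Lambda; durations below are hypothetical.
PRICE_PER_GB_SECOND = 0.00001667  # Lambda's published 2017 rate

def lambda_cost(memory_mb, duration_seconds):
    """Compute cost of one invocation, ignoring the per-request fee and
    the 100 ms billing granularity."""
    return (memory_mb / 1024.0) * duration_seconds * PRICE_PER_GB_SECOND

# A CPU-bound job whose runtime shrinks faster than memory grows:
cost_small = lambda_cost(128, 12.0)   # 12 s at 128 MB
cost_large = lambda_cost(1536, 0.8)   # 0.8 s at 1536 MB
cheaper_big = cost_large < cost_small
```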
S3 2017-02-28 outage post-mortem
The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.  One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.  
s3  postmortem  aws  post-mortem  outages  cms  ops 
march 2017 by jm
"I caused an outage" thread on twitter
Anil Dash: "What was the first time you took the website down or broke the build? I’m thinking of all the inadvertent downtime that comes with shipping."

Sample response: 'Pushed a fatal error in lib/display.php to all of FB’s production servers one Friday night in late 2005. Site loaded blank pages for 20min.'
outages  reliability  twitter  downtime  fail  ops  post-mortem 
march 2017 by jm
Gravitational Teleport
Teleport enables teams to easily adopt the best SSH practices like:

Integrated SSH credentials with your organization's Google Apps identities or other OAuth identity providers.
No need to distribute keys: Teleport uses certificate-based access with automatic expiration time.
Enforcement of 2nd factor authentication.
Cluster introspection: every Teleport node becomes a part of a cluster and is visible on the Web UI.
Record and replay SSH sessions for knowledge sharing and auditing purposes.
Collaboratively troubleshoot issues through session sharing.
Connect to clusters located behind firewalls without direct Internet access via SSH bastions.
ssh  teleport  ops  bastions  security  auditing  oauth  2fa 
february 2017 by jm
How-to Debug a Running Docker Container from a Separate Container
arguably this shouldn't be required -- building containers without /bin/sh, strace, gdb etc. is just silly
strace  docker  ops  debugging  containers 
february 2017 by jm
10 Most Common Reasons Kubernetes Deployments Fail
some real-world failure cases and how to fix them
kubernetes  docker  ops 
february 2017 by jm
Instapaper Outage Cause & Recovery
Hard to see this as anything other than a pretty awful documentation fail by the AWS RDS service:
Without knowledge of the pre-April 2014 file size limit, it was difficult to foresee and prevent this issue. As far as we can tell, there’s no information in the RDS console in the form of monitoring, alerts or logging that would have let us know we were approaching the 2TB file size limit, or that we were subject to it in the first place. Even now, there’s nothing to indicate that our hosted database has a critical issue.
limits  aws  rds  databases  mysql  filesystems  ops  instapaper  risks 
february 2017 by jm
Spotify's read-only k/v store
spotify  sparkey  read-only  key-value  storage  ops  architecture 
february 2017 by jm
'shift is a [web] application that helps you run schema migrations on MySQL databases'
databases  mysql  sql  migrations  ops  square  ddl  percona 
february 2017 by jm
A server with 24 years of uptime
wow. Stratus fault-tolerant systems ftw.

'This is a fault tolerant server, which means that hardware components are redundant. Over the years, disk drives, power supplies and some other components have been replaced but Hogan estimates that close to 80% of the system is original.'

(via internetofshit, which this isn't)
stratus  fault-tolerance  hardware  uptime  records  ops 
january 2017 by jm
Google - Site Reliability Engineering
The Google SRE book is now online, for free
sre  google  ops  books  reading 
january 2017 by jm
PagerDuty Incident Response Documentation
This documentation covers parts of the PagerDuty Incident Response process. It is a cut-down version of our internal documentation, used at PagerDuty for any major incidents, and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident, but also what to do during and after. It is intended to be used by on-call practitioners and those involved in an operational incident response process (or those wishing to enact a formal incident response process).

This is a really good set of processes -- quite similar to what we used in Amazon for high-severity outage response.
ops  process  outages  pagerduty  incident-response  incidents  on-call 
january 2017 by jm
Leap Smear  |  Public NTP  |  Google Developers
Google offers public NTP service with leap smearing -- I didn't realise! (thanks Keith)
google  clocks  time  ntp  leap-smearing  leap-second  ops 
january 2017 by jm
AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205) - YouTube
Yelp talk about their Spot Fleet price optimization autoscaler app, FleetMiser
yelp  scaling  aws  spot-fleet  ops  spot-instances  money 
december 2016 by jm
Auto scaling Pinterest
notes on a second-system take on autoscaling -- Pinterest tried it once, it didn't take, and this is the rerun. I like the tandem ASG approach (spots and nonspots)
spot-instances  scaling  aws  scalability  ops  architecture  pinterest  via:highscalability 
december 2016 by jm
Syscall Auditing at Scale
auditd -> go-audit -> elasticsearch at Slack
elasticsearch  auditd  syscalls  auditing  ops  slack 
november 2016 by jm
Etsy Debriefing Facilitation Guide
by John Allspaw, Morgan Evans and Daniel Schauenberg; the Etsy blameless postmortem style crystallized into a detailed 27-page PDF ebook
etsy  postmortems  blameless  ops  production  debriefing  ebooks 
november 2016 by jm