glibc changed their UTF-8 character collation ordering across versions, breaking postgres
This is terrifying:
Streaming replicas—and by extension, base backups—can become dangerously broken when the source and target machines run slightly different versions of glibc. In particular, differences in strcoll and strcoll_l leave "corrupt" indexes on the slave. These indexes are sorted out of order with respect to the strcoll running on the slave. Because postgres is unaware of the discrepancy, it uses these "corrupt" indexes to perform merge joins; merges rely heavily on the assumption that the indexes are sorted, and this causes all the results of the join past the first poison-pill entry to not be returned. Additionally, if the slave becomes master, the "corrupt" indexes will in some cases be unable to enforce uniqueness, but quietly allow duplicate values.

Moral of the story -- keep your libc versions in sync across storage replication sets!
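A toy illustration of why this breaks lookups and merge joins — two invented "collations" stand in for differing glibc versions; the index is sorted under the primary's ordering, but the replica binary-searches it under its own:

```python
import bisect

# Hypothetical "collations": the primary sorts case-insensitively, the
# replica compares raw byte order (as if glibc's strcoll had changed).
def primary_key(s):
    return s.lower()

values = ["apple", "Banana", "cherry"]
index = sorted(values, key=primary_key)   # order under the primary's collation
# index == ["apple", "Banana", "cherry"] — mis-sorted bytewise ("B" < "a")

def replica_lookup(idx, needle):
    # Binary search assumes idx is sorted under *this* machine's ordering.
    pos = bisect.bisect_left(idx, needle)
    return pos < len(idx) and idx[pos] == needle

print(replica_lookup(index, "Banana"))  # False: missed despite being present
```

The same silent miss is what poisons a merge join: once the cursor walks past where a key "should" be, every later match is dropped.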
postgresql  scary  ops  glibc  collation  utf-8  characters  indexing  sorting  replicas  postgres 
4 days ago by jm
FFmpeg, SOX, Pandoc and RSVG for AWS Lambda
OK-ish way to add dependencies to your Lambda containers:
The basic AWS Lambda container is quite constrained, and until recently it was relatively difficult to include additional binaries into Lambda functions. Lambda Layers make that easy. A Layer is a common piece of code that is attached to your Lambda runtime in the /opt directory. You can reuse it in many functions, and deploy it only once. Individual functions do not need to include the layer code in their deployment packages, which means that the resulting functions are smaller and deploy faster. For example, at MindMup, we use Pandoc to convert markdown files into Word documents. The actual lambda function code is only a few dozen lines of JavaScript, but before layers, each deployment of the function had to include the whole Pandoc binary, larger than 100 MB. With a layer, we can publish Pandoc only once, so we use significantly less overall space for Lambda function versions. Each code change now requires just a quick redeployment.
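Layers unpack under /opt in the function's filesystem, so a function using a Pandoc layer just shells out to the bundled binary. A minimal sketch of such a handler — the /opt/bin/pandoc path, file names, and helper are illustrative assumptions, not MindMup's actual code:

```python
import subprocess

# Assumed location of the binary shipped in the layer (layers mount at /opt).
PANDOC = "/opt/bin/pandoc"

def build_cmd(src_md, dst_docx):
    # pandoc -f markdown -t docx -o out.docx in.md
    return [PANDOC, "-f", "markdown", "-t", "docx", "-o", dst_docx, src_md]

def handler(event, context):
    # In a real function the markdown would come from the event / S3;
    # /tmp is the only writable path in the Lambda sandbox.
    subprocess.run(build_cmd("/tmp/in.md", "/tmp/out.docx"), check=True)
    return {"output": "/tmp/out.docx"}
```

The function's own deployment package stays a few KB of code; the 100 MB binary lives in the layer, published once.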
serverless  lambda  dependencies  deployment  packaging  ops 
7 days ago by jm
AWS Service SLAs
The goal of this page is to highlight the lack of coverage AWS provides for its services across different security factors. These limitations are not well understood by many. Further, the "Y" fields are meant to indicate that a service has any capability for the relevant factor; in many cases this is not full coverage for the service, or there are exceptions or special cases.
amazon  aws  services  slas  ops  reliability 
7 days ago by jm
The Kinesis Scaling Utility is designed to give you the ability to scale Amazon Kinesis Streams in the same way that you scale EC2 Auto Scaling groups – up or down by a count or as a percentage of the total fleet. You can also simply scale to an exact number of Shards. There is no requirement for you to manage the allocation of the keyspace to Shards when using this API, as it is done automatically.

You can also deploy the Web Archive to a Java Application Server, and allow Scaling Utils to automatically manage the number of Shards in the Stream based on the observed PUT or GET rate of the stream.
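The three scaling modes above (by count, by percentage, to an exact number) reduce to simple shard arithmetic. A sketch of that math — function and parameter names are mine, not the utility's API:

```python
import math

def target_shards(current, *, by_count=0, by_pct=0, exact=None):
    """Illustrative shard-count math: scale up/down by a count, by a
    percentage of the fleet, or to an exact number (min 1 shard)."""
    if exact is not None:
        return max(1, exact)
    if by_pct:
        return max(1, math.ceil(current * (1 + by_pct / 100)))
    return max(1, current + by_count)

print(target_shards(10, by_pct=25))    # 13
print(target_shards(10, by_count=-4))  # 6
print(target_shards(10, exact=1))      # 1
```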
kinesis  scaling  scalability  shards  sharding  ops 
28 days ago by jm
Applied machine learning at Facebook: a datacenter infrastructure perspective
Lots of cool details into how they've productized and scaled up their prod ML infrastructure.
As we looked at last month with Continuum, the latency of incorporating the latest data into the models is also really important. There’s a nice section of this paper where the authors study the impact of losing the ability to train models for a period of time and have to serve requests from stale models. The Community Integrity team for example rely on frequently trained models to keep up with the ever changing ways adversaries try to bypass Facebook’s protections and show objectionable content to users. Here training iterations take on the order of days. Even more dependent on the incorporation of recent data into models is the news feed ranking. “Stale News Feed models have a measurable impact on quality.” And if we look at the very core of the business, the Ads Ranking models, “we learned that the impact of leveraging a stale ML model is measured in hours. In other words, using a one-day-old model is measurably worse than using a one-hour old model.” One of the conclusions in this section of the paper is that disaster recovery / high availability for training workloads is key importance.
machine-learning  facebook  ml  training  ops  models  infrastructure  prod  production 
29 days ago by jm
Uber’s Fast, Reliable Docker Image Builder for Apache Mesos and Kubernetes.
we built our own image building tool, Makisu, a solution that allows for more flexible, faster container image building at scale. Specifically, Makisu:

requires no elevated privileges, making the build process portable.

uses a distributed layer cache to improve performance across a build cluster.

provides flexible layer generation, preventing unnecessary files in images.

is Docker-compatible, supporting multi-stage builds and common build commands.
makisu  docker  containers  ops  build  mesos  kubernetes  building 
5 weeks ago by jm
PRDD - Performance-Review Driven Development
'If the way to get promoted is to launch a shiny new product, then your most senior people will be the best at finding shiny new products to launch, even if that's not the right technical decision to make.' (from a newsy thread about Twitter's latest messaging system switch)
newsy  messaging  infrastructure  twitter  kafka  pubsub  ops  architecture  prdd  performance-reviews 
6 weeks ago by jm
The JVM in Docker 2018

Later JDK versions have made it far easier to run a JVM application in a Linux container. The memory support means that if you relied on JVM ergonomics before, you can do the same inside a container, whereas previously you had to override all memory-related settings. The CPU support for containers needs to be carefully evaluated for your application and environment. If you've previously set low cpu_shares in environments like Kubernetes to increase utilisation while relying on using up unused cycles, then you might get a shock.
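The cpu_shares shock comes from how container-aware JDKs derive the CPU count. A rough model of that ergonomic (an approximation for illustration, not the JDK's exact algorithm):

```python
import math

def jvm_available_cpus(cpu_shares=None, cpu_quota=None, cpu_period=100_000):
    """Rough model of containerized JDK CPU ergonomics: quota/period wins
    if set, else shares/1024, else fall back to the host CPU count."""
    if cpu_quota and cpu_quota > 0:
        return max(1, math.ceil(cpu_quota / cpu_period))
    if cpu_shares:
        return max(1, round(cpu_shares / 1024))
    return None  # use the host's CPU count

# A "low shares to soak up idle cycles" pod suddenly looks single-core,
# shrinking GC and fork-join thread pools along with it:
print(jvm_available_cpus(cpu_shares=512))     # 1
print(jvm_available_cpus(cpu_quota=250_000))  # ceil(2.5) = 3
```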
jvm  docker  kubernetes  linux  containers  ops 
8 weeks ago by jm
'a next-generation, no-compromise automation system'.

Web-scale configuration management of all Linux/Unix systems;
Application deployment;
Immutable systems build definition;
Maintaining stateful services such as database and messaging platforms;
Automating one-off tasks & processes;
Deployment and management of the undercloud.


Python 3 DSL;
Declarative resource model with imperative capabilities;
Type / Provider plugin separation;
Implicit ordering (with handler notification);
Formalized “Plan” vs “Apply” evaluation stages;
Early validation prior to runtime;
Programmatically scoped variables;
Strong object-orientation
opsmop  ops  configuration-management  deployment  build 
8 weeks ago by jm
Some notes on running new software in production
This is really good -- how to approach new infrastructure/software dependencies in production with reliability and uptime in mind.

(via Tony Finch)
reliability  uptime  slas  kubernetes  envoy  outages  runbooks  ops 
9 weeks ago by jm
Productionproofing EKS
'We recently migrated SaleMove infrastructure from self-managed Kubernetes clusters running on AWS to using Amazon Elastic Container Service for Kubernetes (EKS). There were many surprises along the way to getting our EKS setup ready for production. This post covers some of these gotchas (others may already be fixed or are not likely to be relevant for a larger crowd) and is meant to be used as a reference when thinking of running EKS in production.'
eks  aws  docker  kubernetes  k8s  ops  prod 
10 weeks ago by jm
Block Advertising on your Network with Pi-hole and Raspberry Pi
A good walkthrough of the Pi-Hole network-wide adblocker install and operation
pi-hole  ads  blocking  ops  home  raspberry-pi 
10 weeks ago by jm
October 21 post-incident analysis | The GitHub Blog
A network outage caused a split-brain scenario, and their failover system allowed writes to occur in both regional databases. Once the outage was repaired it was impossible to reconcile the writes in an automated fashion.

Embarrassingly, this exact scenario was called out in their previous blog post about their Raft-based failover system at --

"In a data center isolation scenario, and assuming a master is in the isolated DC, apps in that DC are still able to write to the master. This may result in state inconsistency once network is brought back up. We are working to mitigate this split-brain by implementing a reliable STONITH from within the very isolated DC. As before, some time will pass before bringing down the master, and there could be a short period of split-brain. The operational cost of avoiding split-brains altogether is very high."

Failover is hard.
github  fail  outages  failover  replication  consensus  ops 
10 weeks ago by jm
The Yelp Production Engineering Documentation Style Guide
This is great! Also they correctly use the term "runbook" instead of "playbook" :)
Documentation is something that many of us in software and site reliability engineering struggle with – even if we recognize its importance, it can still be a struggle to write it consistently and to write it well. While we in Yelp’s Production Engineering group are no different, over the last few quarters we’ve engaged in a concerted effort to do something about it.

One of the first steps towards changing this process was developing our documentation style guide, something that started out as a Hackathon project late last year. I spoke about it when I was giving my talk on documentation at SRECon EMEA in August, and afterwards, a number of people reached out to ask if they could have a copy.

While what we’re sharing today isn’t our exact style guide – we’ve trimmed out some of the specifics that aren’t really relevant, done a bit of rewording for a more general audience, and added some annotations – it’s essentially the one we’ve been using since the start of this year, with the caveat that it’s a living document and continues to be refined. While this may not be perfect for every team (both at Yelp and elsewhere), it’s helped us raise the bar on our own documentation and provides an example for others to follow.
yelp  pe  sre  ops  engineering  documentation  srecon  chastity-blackwell  processes 
11 weeks ago by jm
'Tries to move K8s Pods from on-demand to spot instances':

K8s Spot rescheduler is a tool that tries to reduce load on a set of Kubernetes nodes. It was designed with the purpose of moving Pods scheduled on AWS on-demand instances to AWS spot instances to allow the on-demand instances to be safely scaled down (By the Cluster Autoscaler).

In reality the rescheduler can be used to remove load from any group of nodes onto a different group of nodes. They just need to be labelled appropriately.

For example, it could also be used to allow controller nodes to take up slack while new nodes are being scaled up, and then rescheduling those pods when the new capacity becomes available, thus reducing the load on the controllers once again.
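The selection logic described above boils down to: find pods on the "drain from" label, cap the move by spare capacity in the "move to" group. A minimal sketch under assumed labels and field names (all illustrative, not the rescheduler's real API):

```python
def pods_to_move(nodes, pods, source_label="on-demand", target_label="spot"):
    """Pick pods running on source-labelled nodes, up to the spare pod
    capacity available on target-labelled nodes."""
    source = {n["name"] for n in nodes if n["group"] == source_label}
    spare = sum(n["free_pods"] for n in nodes if n["group"] == target_label)
    candidates = [p for p in pods if p["node"] in source]
    return candidates[:spare]

nodes = [{"name": "od-1", "group": "on-demand", "free_pods": 0},
         {"name": "spot-1", "group": "spot", "free_pods": 2}]
pods = [{"name": "web-a", "node": "od-1"},
        {"name": "web-b", "node": "od-1"},
        {"name": "web-c", "node": "od-1"}]
print([p["name"] for p in pods_to_move(nodes, pods)])  # ['web-a', 'web-b']
```

The third pod stays put until the spot group gains capacity, which is what lets the autoscaler then drain the emptied on-demand node.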
k8s  kubernetes  aws  scaling  spot-instances  ops 
12 weeks ago by jm
Running high-scale web applications on Amazon EC2 Spot Instances
AppNext's setup looks like quite good practice for a CPU-bound fleet
appnext  spot-instances  ec2  scalability  aws  ops  architecture 
october 2018 by jm
Kubernetes: The Surprisingly Affordable Platform for Personal Projects
At the beginning of the year I spent several months deep diving on Kubernetes for a project at work. As an all-inclusive, batteries-included technology for infrastructure management, Kubernetes solves many of the problems you're bound to run into at scale. However, popular wisdom would suggest that Kubernetes is an overly complex piece of technology only really suitable for very large clusters of machines; that it carries a large operational burden and that therefore using it for anything less than dozens of machines is overkill.

I think that's probably wrong. Kubernetes makes sense for small projects and you can have your own Kubernetes cluster today for as little as $5 a month.

(via Tony Finch)
via:fanf  deployment  howto  kubernetes  ops  projects  hacks  clustering 
october 2018 by jm
How Triplebyte solved its office Wi-Fi problems
This is good general wi-fi infrastructure advice for home use too
internet  networking  wifi  ethernet  routers  ops 
september 2018 by jm
Cindy Sridharan on Twitter: NanoLog by Ousterhout et al.

- just formatting a log typically takes on the order of 1µs!

- nanolog achieves high throughput by shifting work out of runtime hot path into compilation + post-execution phases

Basically records symbolic form of logs, and uses a post-processor after the fact to generate readable text.
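A toy version of that idea: the hot path appends only a format-string id plus raw arguments (no formatting work), and a post-processor expands them into readable text after the fact:

```python
FORMATS = {}   # id -> format string (NanoLog extracts these at compile time)

def register(fmt):
    fid = len(FORMATS)
    FORMATS[fid] = fmt
    return fid

LOG = []       # the cheap runtime path: append a tuple, format nothing

def nanolog(fid, *args):
    LOG.append((fid, args))

def postprocess(log):
    # After-the-fact expansion into human-readable log lines.
    return [FORMATS[fid] % args for fid, args in log]

REQ = register("handled request %s in %dus")
nanolog(REQ, "/index", 42)
print(postprocess(LOG))  # ['handled request /index in 42us']
```

This is just the shape of the trick; the real NanoLog also does compile-time extraction and a compact binary on-disk format to hit its throughput numbers.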
logging  ops  coding  performance 
september 2018 by jm
15 Key Takeaways from the Serverless Talk at AWS Startup Day
Best current practices for AWS Lambda usage. (still pretty messy/hacky/Rube-Goldberg-y from the looks of it tbh)
aws  lambda  serverless  ops  hacks  amazon 
july 2018 by jm
The problems with DynamoDB Auto Scaling and how it might be improved
'Based on these observations, we hypothesize that you can make two modifications to the system to improve its effectiveness:

trigger scaling up after 1 threshold breach instead of 5, which is in-line with the mantra of “scale up early, scale down slowly”;
trigger scaling activity based on actual request count instead of consumed capacity units, and calculate the new provisioned capacity units using actual request count as well.

As part of this experiment, we also prototyped these changes (by hijacking the CloudWatch alarms) to demonstrate their improvement.'
dynamodb  autoscaling  ops  scalability  aws  scaling  capacity 
july 2018 by jm
What I’ve learned from nearly three years of enterprise Wi-Fi at home
I am happy to note that I've grown out of this kind of pain (I think)....
Do you just want better Wi-Fi in every room? Consider buying a Plume or Amplifi or other similar plug-n-go mesh system. On the other hand, are you a technically proficient network kind of person who wants to build an enterprise-lite configuration at home? Do you dream of VLANs and port profiles and lovingly tweaked firewall rules? Does the idea of crawling around in your attic to ceiling-mount some access points sound like a fun way to kill a weekend? Is your office just too quiet for your liking? Buy some Ubiquiti Unifi gear and enter network nerd nirvana.
networking  wifi  wireless  ubiquiti  sdn  vlans  home  ops 
july 2018 by jm
Wifi Design Tips
PDF with a few good tips on wifi layout, AP placement etc. Also recommended: (via irldexter)
via:irldexter  wifi  802.11  wireless  ops  networking 
july 2018 by jm
Nginx tuning tips: TLS/SSL HTTPS – Improved TTFB/latency
Must do these soon on / et al.
nginx  http  https  http2  ops  tls  security  linux 
july 2018 by jm
a simple JVMTI agent that forcibly terminates the JVM when it is unable to allocate memory or create a thread. This is important for reliability purposes: an OutOfMemoryError will often leave the JVM in an inconsistent state. Terminating the JVM will allow it to be restarted by an external process manager.

This is apparently still useful despite the existence of '-XX:ExitOnOutOfMemoryError' as of java 8, since that may somehow still fail occasionally.
oom  java  reliability  uptime  memory  ops 
july 2018 by jm
Save on your AWS bill with Kubernetes Ingress
decent intro to Kubernetes Ingress and the Ambassador microservices API gateway built on Envoy Proxy
envoy  proxying  kubernetes  aws  elb  load-balancing  ingress  ambassador  ops 
june 2018 by jm
Taming the Beast: How Scylla Leverages Control Theory to Keep Compactions Under Control - ScyllaDB
This is a really nice illustration of the use of control theory to set tunable thresholds automatically in a complex storage system. Nice work Scylla:

At any given moment, a database like ScyllaDB has to juggle the admission of foreground requests with background processes like compactions, making sure that the incoming workload is not severely disrupted by compactions, nor that the compaction backlog is so big that reads are later penalized.

In this article, we showed that isolation among incoming writes and compactions can be achieved by the Schedulers, yet the database is still left with the task of determining the amount of shares of the resources incoming writes and compactions will use.

Scylla steers away from user-defined tunables in this task, as they shift the burden of operation to the user, complicating operations and being fragile against changing workloads. By borrowing from the strong theoretical background of industrial controllers, we can provide an Autonomous Database that adapts to changing workloads without operator intervention.
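In the spirit of the article, a toy proportional controller: compaction's share of resources grows with the backlog and is bounded so foreground writes stay isolated. All constants here are invented for illustration (Scylla's actual controllers are more sophisticated):

```python
def compaction_shares(backlog, setpoint=100.0, k=2.0, lo=10.0, hi=1000.0):
    """Proportional control: deviation of the backlog from its setpoint
    drives the shares given to compaction, clamped to [lo, hi]."""
    error = backlog - setpoint
    return min(hi, max(lo, setpoint + k * error))

print(compaction_shares(50))   # small backlog -> floor of 10.0 shares
print(compaction_shares(100))  # at setpoint -> 100.0
print(compaction_shares(400))  # large backlog -> 700.0, still under the cap
```

The point of the control-theory framing is that the operator never tunes this: the loop converges on whatever share keeps the backlog near the setpoint as the workload shifts.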
scylladb  storage  settings  compaction  automation  thresholds  control-theory  ops  cassandra  feedback 
june 2018 by jm
How to change JVM arguments at runtime to avoid application restart
This is a super nifty feature of the JVM: turn on and off heap class histogram dumps at runtime, for instance.
java -XX:+PrintFlagsFinal -version | grep manageable
jvm  ops  switches  cli  java  heap-dumps  memory  debugging  memory-leaks 
june 2018 by jm
AWS Region Table
what products are available where
amazon  aws  regions  azs  services  architecture  ops 
june 2018 by jm
'a bash script that works like tee command. Instead of writing the standard input to files, slacktee posts it to Slack.'

(via Ardi)
via:ardi  shell  slack  ops  hacks  notification 
may 2018 by jm
schibsted/strongbox: A secret manager for AWS
Strongbox is a CLI/GUI and SDK to manage, store, and retrieve secrets (access tokens, encryption keys, private certificates, etc). Strongbox is a client-side convenience layer on top of AWS KMS, DynamoDB and IAM. It manages the AWS resources for you and configure them in a secure way. Strongbox has been used in production since mid-2016 and is now used extensively within Schibsted.
schibsted  strongbox  kms  aws  dynamodb  storage  secrets  credentials  passwords  ops 
may 2018 by jm
EC2 Instance Update – C5 Instances with Local NVMe Storage (C5d)
With a 25% to 50% improvement in price-performance over the C4 instances, the C5 instances are designed for applications like batch and log processing, distributed and/or real-time analytics, high-performance computing (HPC), ad serving, highly scalable multiplayer gaming, and video encoding. Some of these applications can benefit from access to high-speed, ultra-low latency local storage. For example, video encoding, image manipulation, and other forms of media processing often necessitate large amounts of I/O to temporary storage. While the input and output files are valuable assets and are typically stored as Amazon Simple Storage Service (S3) objects, the intermediate files are expendable. Similarly, batch and log processing runs in a race-to-idle model, flushing volatile data to disk as fast as possible in order to make full use of compute resources.

Very nice!
ec2  instance-types  ops  storage  hardware  aws 
may 2018 by jm
Docker is the dangerous gamble which we will regret : devops
The article this Reddit thread links to is garbage clickbait, but the responses are insightful and much better
reddit  ops  containerization  docker  contrarians  rkt 
may 2018 by jm
Attacks against GPG signed APT repositories - Packagecloud Blog

It is a common misconception that simply signing your packages and repository metadata with GPG is enough to create a secure APT repository. This is false. Many of the attacks outlined in the paper and this blog post are effective against GPG-signed APT repositories. GPG signing Debian packages themselves does nothing, as explained below. The easiest way to prevent the attacks covered below is to always serve your APT repository over TLS; no exceptions.

This is excellent research. My faith in GPG sigs on packages is well shaken.
apt  security  debian  packaging  gpg  pgp  packages  dpkg  apt-get  ops 
may 2018 by jm
Debugging Stuck Ruby Processes — What to do Before You Kill -9
good tips on using gdb to gather backtraces (via Louise)
debugging  gdb  ruby  linux  unix  threads  ops 
april 2018 by jm
"Tweeps! What’s the craziest infra incident you worked on at Twitter"
great thread of Twitter outages and production incidents. I would love to hear more details about these, I love hearing about other people's outages ;) Even reading "over a month of cleanup and some permanent data loss" has me sweating....
infrastructure  engineering  twitter  ops  outages  production 
april 2018 by jm
Another reason why your Docker containers may be slow
TL;DR: fadvise() is a bottleneck on Linux machines running many containers
linux  fadvise  filesystems  performance  docker  containers  ops 
april 2018 by jm
Generate Mozilla Security Recommended Web Server Configuration Files
this is quite cool -- generate web server configs to activate current best-practice TLS settings
web  openssl  nginx  lighttpd  apache  haproxy  hsts  security  ssl  tls  ops 
february 2018 by jm
'Simple uptime monitoring: distributed, self-hosted health checks and status pages' -- stores in S3
go  ops  monitoring  uptime  health-checks  status-pages  status  golang  s3 
december 2017 by jm
'The missing link between AWS AutoScaling Groups and Route53 [...] solves the issue of keeping a route53 zone up to date with the changes that an autoscaling group might face.'
auto53  route-53  dns  aws  amazon  ops  hostnames  asg  autoscaling 
december 2017 by jm
AWS re:invent 2017: Container Networking Deep Dive with Amazon ECS (CON401) // Practical Applications
Another re:Invent highlight to watch -- ECS' new native container networking model explained
reinvent  aws  containers  docker  ecs  networking  sdn  ops 
december 2017 by jm
Introducing the Amazon Time Sync Service
Well overdue; includes Google-style leap smearing
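Leap smearing replaces the midnight one-second step with a gradual clock offset. A sketch of the linear, 24-hour flavour Google described (the exact window and shape Amazon uses isn't spelled out here):

```python
def smeared_offset(seconds_into_window, window=86_400, leap=1.0):
    """Linear leap smear: spread one leap second evenly across a
    24-hour window instead of stepping the clock at midnight."""
    clamped = min(max(seconds_into_window, 0), window)
    return leap * clamped / window

print(smeared_offset(0))       # 0.0  — smear not started
print(smeared_offset(43_200))  # 0.5  — half the leap absorbed at the midpoint
print(smeared_offset(86_400))  # 1.0  — full second absorbed, clocks agree again
```

Every clock in the fleet stays monotonic and mutually consistent; nothing ever sees 23:59:60.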
time-sync  time  aws  services  ntp  ops 
november 2017 by jm
Introducing AWS Fargate – Run Containers without Managing Infrastructure
now that's a good announcement. Available right away running atop ECS; EKS in 2018
eks  ecs  fargate  aws  services  ops  containers  docker 
november 2017 by jm
'A cure for Cron's chronic email problem'
cron  linux  unix  ops  sysadmin  mail 
october 2017 by jm
IBM broke its cloud by letting three domain names expire - The Register
“multiple domain names were mistakenly allowed to expire and were in hold status.”
outages  fail  ibm  the-register  ops  dns  domains  cloud 
october 2017 by jm
'AWS Lambda cheatsheet' -- a quick ref card for Lambda users
aws  lambda  ops  serverless  reference  quick-references 
october 2017 by jm
How to operate reliable AWS Lambda applications in production
running a reliable Lambda application in production requires you to still follow operational best practices. In this article I am including some recommendations, based on my experience with operations in general as well as working with AWS Lambda.
aws  cloud  lambda  ops  amazon 
october 2017 by jm
S3 Point In Time Restore
restore a versioned S3 bucket to the state it was at at a specific point in time
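The core of any point-in-time restore over S3 versioning: for each key, take the newest version at or before the target time, treating a delete marker as "the key was absent then". A minimal sketch with invented field names:

```python
from datetime import datetime

def version_at(versions, target):
    """Return the version id live at `target`, or None if the key
    didn't exist (never written yet, or covered by a delete marker)."""
    eligible = [v for v in versions if v["ts"] <= target]
    if not eligible:
        return None
    latest = max(eligible, key=lambda v: v["ts"])
    return None if latest.get("delete_marker") else latest["id"]

versions = [
    {"id": "v1", "ts": datetime(2017, 1, 1)},
    {"id": "v2", "ts": datetime(2017, 6, 1)},
    {"id": "v3", "ts": datetime(2017, 9, 1), "delete_marker": True},
]
print(version_at(versions, datetime(2017, 7, 1)))   # 'v2'
print(version_at(versions, datetime(2017, 10, 1)))  # None — deleted by then
```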
ops  s3  restore  backups  versioning  history  tools  scripts  unix 
october 2017 by jm
Share scripts that have dependencies with Nix
Nice approach to one-liner packaging invocations using nix-shell
nix  packaging  unix  linux  ops  shebang  #! 
october 2017 by jm
HN thread on the new Network Load Balancer AWS product
looks like @colmmacc works on it. Lots and lots of good details here
nlb  aws  load-balancing  ops  architecture  lbs  tcp  ip 
september 2017 by jm
Going Multi-Cloud with AWS and GCP: Lessons Learned at Scale
Metamarkets splits across AWS and GCP, going into heavy detail here
aws  gcp  google  ops  hosting  multi-cloud 
august 2017 by jm
Linux Load Averages: Solving the Mystery
Nice bit of OS archaeology by Brendan Gregg.
In 1993, a Linux engineer found a nonintuitive case with load averages, and with a three-line patch changed them forever from "CPU load averages" to what one might call "system load averages." His change included tasks in the uninterruptible state, so that load averages reflected demand for disk resources and not just CPUs. These system load averages count the number of threads working and waiting to work, and are summarized as a triplet of exponentially-damped moving sum averages that use 1, 5, and 15 minutes as constants in an equation. This triplet of numbers lets you see if load is increasing or decreasing, and their greatest value may be for relative comparisons with themselves.
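The "exponentially-damped moving sum" in the quote is compact enough to write out. Every 5 seconds the kernel decays the old average and blends in the current count of running plus uninterruptible tasks; a sketch for the 1-minute constant:

```python
import math

def load_avg(samples, period=60.0, interval=5.0):
    """Exponentially-damped moving sum, as in the kernel's loadavg:
    each sample n (tasks running + uninterruptible) updates
    load = load * e + n * (1 - e), with e = exp(-interval/period)."""
    e = math.exp(-interval / period)
    load = 0.0
    for n in samples:
        load = load * e + n * (1 - e)
    return load

# Sustained demand of 2 tasks for 5 minutes converges toward 2.0
# on the 1-minute average:
print(round(load_avg([2] * 60), 2))  # ≈ 1.99
```

Swapping period for 300 or 900 gives the 5- and 15-minute numbers in the familiar triplet, which is why they lag demand by different amounts.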
load  monitoring  linux  unix  performance  ops  brendan-gregg  history  cpu 
august 2017 by jm
Arq Backs Up To B2!
Arq backup for OSX now supports B2 (as well as S3) as a storage backend.
"it’s a super-cheap option ($.005/GB per month) for storing your backups." (that is less than half the price of $0.0125/GB for S3's Infrequent Access class)
s3  storage  b2  backblaze  backups  arq  macosx  ops 
august 2017 by jm
Working with multiple AWS accounts at Ticketea
AWS STS/multiple account best practice described
sts  aws  authz  ops  ticketea  dev 
august 2017 by jm
AWS Lambda Deployment using Terraform – Build ACL – Medium
Fairly persuasive that production usage of Lambda is much easier if you go full Terraform to manage and deploy.
A complete picture of what it takes to deploy your Lambda function to production with the same diligence you apply to any other codebase using Terraform. [...] There are many cases where frameworks such as SAM or Serverless are not enough. You need more than that for a highly integrated Lambda function. In such cases, it’s easier to simply use Terraform.
infrastructure  aws  lambda  serverless  ops  terraform  sam 
august 2017 by jm
Nextflow - A DSL for parallel and scalable computational pipelines
Data-driven computational pipelines

Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages.

Its fluent DSL simplifies the implementation and the deployment of complex parallel and reactive workflows on clouds and clusters.

GPLv3 licensed, open source
computation  workflows  pipelines  batch  docker  ops  open-source 
august 2017 by jm
EBS gp2 I/O BurstBalance exhaustion
when EBS volumes in EC2 exhaust their "burst" allocation, things go awry very quickly
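The "awry very quickly" part falls out of the gp2 credit arithmetic: the bucket refills at the baseline rate (3 IOPS per GiB, minimum 100) and drains at the delivered rate, so a full bucket only sustains a burst for bucket ÷ (burst − baseline) seconds. A sketch using the published gp2 figures:

```python
def seconds_of_burst(volume_gib, burst_iops=3000, bucket=5_400_000):
    """How long a full gp2 credit bucket sustains a 3000 IOPS burst.
    Baseline is 3 IOPS/GiB (min 100); at >= 1000 GiB the baseline
    already meets the burst rate, so credits never run out."""
    baseline = max(100, 3 * volume_gib)
    if baseline >= burst_iops:
        return float("inf")
    return bucket / (burst_iops - baseline)

# A 100 GiB volume bursting flat-out exhausts its credits in ~33 minutes,
# then falls off a cliff to its 300 IOPS baseline:
print(round(seconds_of_burst(100) / 60))  # ≈ 33
```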
performance  aws  ebs  ec2  burst-balance  ops  debugging 
july 2017 by jm
Kubernetes Best Practices // Speaker Deck
A lot of these are general Docker/containerisation best practices, too.

(via Devops Weekly)
k8s  kubernetes  devops  ops  containers  docker  best-practices  tips  packaging 
july 2017 by jm
Amazon Web Services Elastic Compute Cloud (EC2) Rescue for Linux is a python-based tool that allows for the automatic diagnosis of common problems found on EC2 Linux instances.

Most of the modules appear to be log-greppers looking for common kernel issues.
ec2  aws  kernel  linux  ec2rl  ops 
july 2017 by jm
survey  svctm  swarm  switches  switching  syadmin  symantec  synapse  sysadmin  sysadvent  syscalls  sysdig  syslog  sysstat  system  system-testing  system-v  systemd  systems  tahoe-lafs  talks  tc  tcp  tcpcopy  tcpdump  tdd  teams  tech  tech-debt  technical-debt  techops  tee  telefonica  telemetry  teleport  terraform  testing  tests  the-register  thp  threadpools  threads  three-mile-island  thresholds  throughput  thundering-herd  ticketea  tier-one-support  tildeslash  time  time-machine  time-series  time-sync  time-synchronization  timeouts  tips  tls  tools  top  toread  tos  trace  tracer-requests  tracing  trading  traefik  training  transactional-updates  transparent-huge-pages  trivago  troubleshooting  tsd  tuning  turing-complete  twilio  twisted  twitter  two-factor-authentication  uat  ubiquiti  ubuntu  ubuntu-core  udocker  udp  ui  ulster-bank  ultradns  unbound  unicorn  unikernels  unit-testing  unit-tests  unix  upgrades  upstart  uptime  urls  use  uselessd  usenix  user-submitted-code  usl  utf-8  ux  vagrant  varnish  vector  version-control  versioning  via:aphyr  via:ardi  via:bill-dehora  via:chughes  via:codeslinger  via:dave-doran  via:dehora  via:eoinbrazil  via:fanf  via:feylya  via:filippo  via:highscalability  via:irldexter  via:jgilbert  via:jk  via:kragen  via:lusis  via:marc  via:markkenny  via:martharotter  via:nelson  via:pdolan  via:pixelbeat  vips  virtualisation  virtualization  visualisation  vividcortex  vlans  vm  vms  voldemort  vpc  weave  web  web-services  webmail  weighting  whats-my-ip  wifi  wiki  winston  wipac  wireless  wishlist  wlan  work  workflow  workflows  workplaces  x86_64  xen  xfs  xooglers  yahoo  yammer  yelp  zfs  zipkin  zonify  zookeeper  zooko 

Copy this bookmark: