jm + ops   188

(SDD416) Amazon EBS Deep Dive | AWS re:Invent 2014
Excellent data on current EBS performance characteristics
ebs  ops  aws  reinvent  slides 
4 days ago by jm
AWS re:Invent 2014 Video & Slide Presentation Links
Nice work by Andrew Spyker -- this should be an official feature of the re:Invent website, really
reinvent  aws  conferences  talks  slides  ec2  s3  ops  presentations 
4 days ago by jm
Microsoft Azure 9-hour outage
'From 19 Nov, 2014 00:52 to 05:50 UTC a subset of customers using Storage, Virtual Machines, SQL Geo-Restore, SQL Import/export, Websites, Azure Search, Azure Cache, Management Portal, Service Bus, Event Hubs, Visual Studio, Machine Learning, HDInsights, Automation, Virtual Network, Stream Analytics, Active Directory, StorSimple and Azure Backup Services in West US and West Europe experienced connectivity issues. This incident has now been mitigated.'

There was knock-on impact until 11:00 UTC (storage in N Europe), 11:45 UTC (websites, West Europe), and 09:15 UTC (storage, West Europe), from the looks of things. Should be an interesting postmortem.
outages  azure  microsoft  ops 
9 days ago by jm
The Infinite Hows, instead of the Five Whys
John Allspaw with an interesting assertion that we need to ask "how", not "why" in five-whys postmortems:
“Why?” is the wrong question.

In order to learn (which should be the goal of any retrospective or post-hoc investigation) you want multiple and diverse perspectives. You get these by asking people for their own narratives. Effectively, you’re asking “how?“

Asking “why?” too easily gets you to an answer to the question “who?” (which in almost every case is irrelevant) or “takes you to the ‘mysterious’ incentives and motivations people bring into the workplace.”

Asking “how?” gets you to describe (at least some) of the conditions that allowed an event to take place, and provides rich operational data.
ops  five-whys  john-allspaw  questions  postmortems  analysis  root-causes 
9 days ago by jm
A curated list of Docker resources.
linux  sysadmin  docker  ops  devops  containers  hosting 
22 days ago by jm
Zookeeper: not so great as a highly-available service registry
Turns out ZK isn't a good choice as a service discovery system, if you want to be able to use that service discovery system while partitioned from the rest of the ZK cluster:
I went into one of the instances and quickly did an iptables DROP on all packets coming from the other two instances.  This would simulate an availability zone continuing to function, but that zone losing network connectivity to the other availability zones.  What I saw was that the two other instances noticed the first server “going away”, but they continued to function as they still saw a majority (66%).  More interestingly the first instance noticed the other two servers “going away”, dropping the ensemble availability to 33%.  This caused the first server to stop serving requests to clients (not only writes, but also reads).

So: within that offline AZ, service discovery *reads* (as well as writes) stopped working due to a lack of ZK quorum. This is quite a feasible outage scenario for EC2, by the way, since (at least when I was working there) the network links between AZs, and the links with the external internet, were not 100% overlapping.

In other words, if you want a highly-available service discovery system in the fact of network partitions, you want an AP service discovery system, rather than a CP one -- and ZK is a CP system.

Another risk, noted on the Netflix Eureka mailing list at :

ZooKeeper, while tolerant against single node failures, doesn't react well to long partitioning events. For us, it's vastly more important that we maintain an available registry than a necessarily consistent registry. If us-east-1d sees 23 nodes, and us-east-1c sees 22 nodes for a little bit, that's OK with us.

I guess this means that a long partition can trigger SESSION_EXPIRED state, resulting in ZK client libraries requiring a restart/reconnect to fix. I'm not entirely clear what happens to the ZK cluster itself in this scenario though.

Finally, Pinterest ran into other issues relying on ZK for service discovery and registration, described at ; sounds like this was mainly around load and the "thundering herd" overload problem. Their workaround was to decouple ZK availability from their services' availability, by building a Smartstack-style sidecar daemon on each host which tracked/cached ZK data.
zookeeper  service-discovery  ops  ha  cap  ap  cp  service-registry  availability  ec2  aws  network  partitions  eureka  smartstack  pinterest 
24 days ago by jm
curl | sh
'People telling people to execute arbitrary code over the network. Run code from our servers as root. But HTTPS, so it’s no biggie.'

humor  sysadmin  ops  security  curl  bash  npm  rvm  chef 
25 days ago by jm
Elastic MapReduce vs S3
Turns out there are a few bugs in EMR's S3 support, believe it or not.

1. 'Consider disabling Hadoop's speculative execution feature if your cluster is experiencing Amazon S3 concurrency issues. You do this through the and mapred.reduce.tasks.speculative.execution configuration settings. This is also useful when you are troubleshooting a slow cluster.'

2. Upgrade to AMI 3.1.0 or later, otherwise retries of S3 ops don't work.
s3  emr  hadoop  aws  bugs  speculative-execution  ops 
28 days ago by jm
Stephanie Dean on event management and incident response
I asked around my ex-Amazon mates on twitter about good docs on incident response practices outside the "iron curtain", and they pointed me at this blog (which I didn't realise existed).

Stephanie Dean was the front-line ops manager for Amazon for many years, over the time where they basically *fixed* their availability problems. She since moved on to Facebook, Demonware, and Twitter. She really knows her stuff and this blog is FULL of great details of how they ran (and still run) front-line ops teams in Amazon.
ops  incident-response  outages  event-management  amazon  stephanie-dean  techops  tos  sev1 
29 days ago by jm
IT Change Management
Stephanie Dean on Amazon's approach to CMs. This is solid gold advice for any company planning to institute a sensible technical change management process
ops  tech  process  changes  change-management  bureaucracy  amazon  stephanie-dean  infrastructure 
29 days ago by jm
Carbon vs Megacarbon and Roadmap ? · Issue #235 · graphite-project/carbon
Carbon is a great idea, but fundamentally, twisted doesn't do what carbon-relay or carbon-aggregator were built to do when hit with sustained and heavy throughput. Much to my chagrin, concurrency isn't one of python's core competencies.

+1, sadly. We are patching around the edges with half-released third-party C rewrites in our graphite setup, as we exceed the scale Carbon can support.
carbon  graphite  metrics  ops  python  twisted  scalability 
4 weeks ago by jm
Game Day Exercises at Stripe: Learning from `kill -9`
We’ve started running game day exercises at Stripe. During a recent game day, we tested failing over a Redis cluster by running kill -9 on its primary node, and ended up losing all data in the cluster. We were very surprised by this, but grateful to have found the problem in testing. This result and others from this exercise convinced us that game days like these are quite valuable, and we would highly recommend them for others.

Excellent post. Game days are a great idea. Also: massive Redis clustering fail
game-days  redis  testing  stripe  outages  ops  kill-9  failover 
4 weeks ago by jm
Is Docker ready for production? Feedbacks of a 2 weeks hands on
I have to agree with this assessment -- there are a lot of loose ends still for production use of Docker in a SOA stack environment:
From my point of view, Docker is probably the best thing I’ve seen in ages to automate a build. It allows to pre build and reuse shared dependencies, ensuring they’re up to date and reducing your build time. It avoids you to either pollute your Jenkins environment or boot a costly and slow Virtualbox virtual machine using Vagrant. But I don’t feel like it’s production ready in a complex environment, because it adds too much complexity. And I’m not even sure that’s what it was designed for.
docker  complexity  devops  ops  production  deployment  soa  web-services  provisioning  networking  logging 
5 weeks ago by jm
Linus Torvalds and others on Linux's systemd
ZDNet's Steven J. Vaughan-Nichols on the systemd mess (via Kragen)
via:kragen  systemd  linux  ubuntu  gnome  init  ops 
6 weeks ago by jm
a simple, lightweight HTTP server for storing and distributing custom Debian packages around your organisation. It is designed to make it as easy as possible to use Debian packages for code deployments and to ease other system administration tasks.
debian  apt  sysadmin  linux  ops  packaging 
6 weeks ago by jm
Netflix release new code to production before completing tests
Interesting -- I hadn't heard of this being an official practise anywhere before (although we actually did it ourselves this week)...
If a build has made it [past the 'integration test' phase], it is ready to be deployed to one or more internal environments for user-acceptance testing. Users could be UI developers implementing a new feature using the API, UI Testers performing end-to-end testing or automated UI regression tests. As far as possible, we strive to not have user-acceptance tests be a gating factor for our deployments. We do this by wrapping functionality in Feature Flags so that it is turned off in Production while testing is happening in other environments. 
devops  deployment  feature-flags  release  testing  integration-tests  uat  qa  production  ops  gating  netflix 
7 weeks ago by jm
"Linux Containers And The Future Cloud" [slides]
by Rami Rosen -- extremely detailed presentation into the state of Linux containers, LXC, Docker, namespaces, cgroups, and checkpoint/restore in userspace (via lusis)
lsx  docker  criu  namespaces  cgroups  linux  via:lusis  ops  containers  rami-rosen  presentations 
7 weeks ago by jm
Mike Perham on Twitter: "Sweet, monit just sent a DMCA takedown notice to @github to remove Inspeqtor."
'The work, Inspeqtor which is hosted at GitHub, is far from a “clean-room” implementation. This is basically a rewrite of Monit in Go, even using the same configuration language that is used in Monit, verbatim.

a. [private] himself admits that Inspeqtor is "heavily influenced“ by Monit

b. This tweet by [private] demonstrate intent. "OSS nerds: redesign and build monit in Go. Sell it commercially. Make $$$$. I will be your first customer.”'

IANAL, but using the same config language does not demonstrate copyright infringement...
copyright  dmca  tildeslash  monit  inspeqtor  github  ops  oss  agpl 
7 weeks ago by jm
'a set of command line tools for managing Route53 DNS for an AWS infrastructure. It intelligently uses tags and other metadata to automatically create the associated DNS records.'
zonify  aws  dns  ec2  route53  ops 
7 weeks ago by jm
'a system for allowing servers with encrypted root file systems to reboot unattended and/or remotely.' (via Tony Finch)
via:fanf  mandos  encryption  security  server  ops  sysadmin  linux 
7 weeks ago by jm
The End of Linux
'Linux is becoming the thing that we adopted Linux to get away from.'

Great post on the horrible complexity of systemd. It reminds me of nothing more than mid-90s AIX, which I had the displeasure of opsing for a while -- the Linux distros have taken a very wrong turn here.
linux  unix  complexity  compatibility  ops  rant  systemd  bloat  aix 
8 weeks ago by jm
Inviso: Visualizing Hadoop Performance
With the increasing size and complexity of Hadoop deployments, being able to locate and understand performance is key to running an efficient platform.  Inviso provides a convenient view of the inner workings of jobs and platform.  By simply overlaying a new view on existing infrastructure, Inviso can operate inside any Hadoop environment with a small footprint and provide easy access and insight.  

This sounds pretty useful.
inviso  netflix  hadoop  emr  performance  ops  tools 
8 weeks ago by jm
Avoiding Chef-Suck with Auto Scaling Groups - forty9ten
Some common problems which arise using Chef with ASGs in EC2, and how these guys avoided it -- they stopped using Chef for service provisioning, and instead baked AMIs when a new version was released. ASGs using pre-baked AMIs definitely works well so this makes good sense IMO.
infrastructure  chef  ops  asg  auto-scaling  ec2  provisioning  deployment 
9 weeks ago by jm
get page cache statistics for files.
A common question when tuning databases and other IO-intensive applications is, "is Linux caching my data or not?" pcstat gets that information for you using the mincore(2) syscall. I wrote this is so that Apache Cassandra users can see if ssTables are being cached.
linux  page-cache  caching  go  performance  cassandra  ops  mincore  fincore 
10 weeks ago by jm
Troubleshooting Production JVMs with jcmd
remotely trigger GCs, finalization, heap dumps etc. Handy
jvm  jcmd  debugging  ops  java  gc  heap  troubleshooting 
10 weeks ago by jm
The State of ZFS on Linux
Linux users familiar with other filesystems or ZFS users from other platforms will often ask whether ZFS on Linux (ZoL) is “stable”. The short answer is yes, depending on your definition of stable. The term stable itself is somewhat ambiguous.

Oh dear. that's not a good start. Good reference page, though
zfs  linux  filesystems  ops  solaris 
10 weeks ago by jm
'turns a fresh cloud computer into a working mail server. You get contact synchronization, spam filtering, and so on. On your phone, you can use apps like K-9 Mail and CardDAV-Sync free beta to sync your email and contacts between your phone and your box.'

(via Tony Finch)
via:fanf  mail  diy  hosting  webmail  ops 
11 weeks ago by jm
Using spot instances
Excellent post on all of the ins and outs of EC2 spot instance usage
ec2  aws  spot-instances  pricing  cloud  auto-scaling  ops 
12 weeks ago by jm
Nix: The Purely Functional Package Manager
'a powerful package manager for Linux and other Unix systems that makes package management reliable and reproducible. It provides atomic upgrades and rollbacks, side-by-side installation of multiple versions of a package, multi-user package management and easy setup of build environments. '

Basically, this is a third-party open source reimplementation of Amazon's (excellent) internal packaging system, using symlinks to versioned package directories to ensure atomicity and the ability to roll back. This is definitely the *right* way to build packages -- I know what tool I'll be pushing for, next time this question comes up.

See also for a Linux distro built on Nix.
ops  linux  devops  unix  packaging  distros  nix  nixos  atomic  upgrades  rollback  versioning 
12 weeks ago by jm
Applying cardiac alarm management techniques to your on-call
An ops-focused take on a recent story about alarm fatigue, and how a Boston hospital dealt with it. When I was in Amazon, many of the teams in our division had a target to reduce false positive pages, with a definite monetary value attached to it, since many teams had "time off in lieu" payments for out-of-hours pages to the on-call staff. As a result, reducing false-positive pages was reasonably high priority and we dealt with this problem very proactively, with a well-developed sense of how to do so. It's interesting to see how the outside world is only just starting to look into its amelioration. (Another benefit of a TOIL policy ;)
ops  monitoring  sysadmin  alerts  alarms  nagios  alarm-fatigue  false-positives  pages 
12 weeks ago by jm
On-Demand Jenkins Slaves With Amazon EC2
This is very likely where we'll be going for our acceptance tests in Swrve
testing  jenkins  ec2  spot-instances  scalability  auto-scaling  ops  build 
august 2014 by jm
Apache Kafka 0.8 basic training
This is a pretty voluminous and authoritative presentation about getting started with Kafka; wish this was around when we started using it for 0.7. (We use our own homegrown realtime system nowadays, due to better partitioning, monitoring and operability.)
storm  kafka  presentations  documentation  ops 
august 2014 by jm
Logentries Announces Machine Learning Analytics for IT Ops Monitoring and Real-time Alerting
This sounds pretty neat:
With Logentries Anomaly Detection, users can:

Set-up real-time alerting based on deviations from important patterns and log events.
Easily customize Anomaly thresholds and compare different time periods.

With Logentries Inactivity Alerting, users can:

Monitor standard, incoming events such as an application heart beat.
Receive real-time alerts based on log inactivity (i.e. receive alerts when something does not occur).
logging  syslog  logentries  anomaly-detection  ops  machine-learning  inactivity  alarms  alerting  heartbeats 
august 2014 by jm
Box Tech Blog » A Tale of Postmortems
How Box introduced COE-style dev/ops outage postmortems, and got them working. This PIE metric sounds really useful to head off the dreaded "it'll all have to come out missus" action item:
The picture was getting clearer, and we decided to look into individual postmortems and action items and see what was missing. As it was, action items were wasting away with no owners. Digging deeper, we noticed that many action items entailed massive refactorings or vague requirements like “make system X better” (i.e. tasks that realistically were unlikely to be addressed). At a higher level, postmortem discussions often devolved into theoretical debates without a clear outcome. We needed a way to lower and focus the postmortem bar and a better way to categorize our action items and our technical debt.

Out of this need, PIE (“Probability of recurrence * Impact of recurrence * Ease of addressing”) was born. By ranking each factor from 1 (“low”) to 5 (“high”), PIE provided us with two critical improvements:

1. A way to police our postmortems discussions. I.e. a low probability, low impact, hard to implement solution was unlikely to get prioritized and was better suited to a discussion outside the context of the postmortem. Using this ranking helped deflect almost all theoretical discussions.
2. A straightforward way to prioritize our action items.

What’s better is that once we embraced PIE, we also applied it to existing tech debt work. This was critical because we could now prioritize postmortem action items alongside existing work. Postmortem action items became part of normal operations just like any other high-priority work.
postmortems  action-items  outages  ops  devops  pie  metrics  ranking  refactoring  prioritisation  tech-debt 
august 2014 by jm
The Network is Reliable - ACM Queue
Peter Bailis and Kyle Kingsbury accumulate a comprehensive, informal survey of real-world network failures observed in production. I remember that April 2011 EBS outage...
ec2  aws  networking  outages  partitions  jepsen  pbailis  aphyr  acm-queue  acm  survey  ops 
july 2014 by jm
iosnoop For Linux
it's a shell script! ftrace-based tool to snoop on Linux disk I/O and trace system-wide activity, more-or-less attributing it to the correct process
linux  disk  io  tracing  trace  ops  ftrace 
july 2014 by jm
Boundary's new server monitoring free offering
'High resolution, 1 second intervals for all metrics; Fluid analytics, drag any graph to any point in time; Smart alarms to cut down on false positives; Embedded graphs and customizable dashboards; Up to 10 servers for free'

Pre-registration is open now. Could be interesting, although the limit of 10 machines is pretty small for any production usage
boundary  monitoring  network  ops  metrics  alarms  tcp  ip  netstat 
july 2014 by jm
'This tool can be described as a Tiny Dirty Linux Only C command that looks for coreutils basic commands (cp, mv, dd, tar, gzip/gunzip, cat, ...) currently running on your system and displays the percentage of copied data. It can now also display an estimated throughput (using -w flag).'
coreutils  via:pixelbeat  linux  ops  hacks  procfs  dataviz  unix 
july 2014 by jm
Latest EBS tuning tips
from yesterday's AWS Summit in NYC:

Cheat sheet of EBS-optimized instances.
Optimize your queue depth to achieve lower latency & highest IOPS.
When configuring your RAID, use a stripe size of 128KB or 256KB.
Use larger block size to speed up the pre-warming process.
ebs  aws  amazon  iops  raid  ops  tuning 
july 2014 by jm
Two traps in iostat: %util and svctm
Marc Brooker:
As a measure of general IO busyness %util is fairly handy, but as an indication of how much the system is doing compared to what it can do, it's terrible. Iostat's svctm has even fewer redeeming strengths. It's just extremely misleading for most modern storage systems and workloads. Both of these fields are likely to mislead more than inform on modern SSD-based storage systems, and their use should be treated with extreme care.
ioutil  iostat  svctm  ops  ssd  disks  hardware  metrics  stats  linux 
july 2014 by jm
Urban Airship with a new open-source Graphite front-end UI; similar enough to Grafana at a glance, no releases yet, ASL2-licensed
graphite  metrics  ui  front-ends  open-source  ops 
july 2014 by jm
Delivery Notifications for Simple Email Service
Today we are enhancing SES with the addition of delivery notifications. You can now elect to receive an Amazon SNS notification each time SES successfully delivers a message to a recipient's email server. These notifications give you increased visibility into the mail delivery process. With today's release, you can now track deliveries, bounces, and complaints, all via notification to the SNS topic or topics of your choice.
delivery  email  smtp  ses  aws  sns  notifications  ops 
june 2014 by jm
Amazon EC2 Service Limits Report Now Available
'designed to make it easier for you to view and manage your limits for Amazon EC2 by providing the latest information on service limits and links to quickly request limit increases. EC2 Service Limits Report displays all your service limit information in one place to help you avoid encountering limits on future EC2, EBS, Auto Scaling, and VPC usage.'
aws  ec2  vpc  ebs  autoscaling  limits  ops 
june 2014 by jm
Code Spaces data and backups deleted by hackers
Rather scary story of an extortionist wiping out a company's AWS-based infrastructure. Turns out S3 supports MFA-required deletion as a feature, though, which would help against that.
ops  security  extortion  aws  ec2  s3  code-spaces  delete  mfa  two-factor-authentication  authentication  infrastructure 
june 2014 by jm
Call me maybe: Elasticsearch
Wow, these are terrible results. From the sounds of it, ES just cannot deal with realistic outage scenarios and is liable to suffer catastrophic damage in reasonably-common partitions.
If you are an Elasticsearch user (as I am): good luck. Some people actually advocate using Elasticsearch as a primary data store; I think this is somewhat less than advisable at present. If you can, store your data in a safer database, and feed it into Elasticsearch gradually. Have processes in place that continually traverse the system of record, so you can recover from ES data loss automatically.
elasticsearch  ops  storage  databases  jepsen  partition  network  outages  reliability 
june 2014 by jm
Call me maybe: RabbitMQ
We used Knossos and Jepsen to prove the obvious: RabbitMQ is not a lock service. That investigation led to a discovery hinted at by the documentation: in the presence of partitions, RabbitMQ clustering will not only deliver duplicate messages, but will also drop huge volumes of acknowledged messages on the floor. This is not a new result, but it may be surprising if you haven’t read the docs closely–especially if you interpreted the phrase “chooses Consistency and Partition Tolerance” to mean, well, either of those things.
rabbitmq  network  partitions  failure  cap-theorem  consistency  ops  reliability  distcomp  jepsen 
june 2014 by jm
Manages migrations for your Cassandra data stores. Pillar grew from a desire to automatically manage Cassandra schema as code. Managing schema as code enables automated build and deployment, a foundational practice for an organization striving to achieve Continuous Delivery.

Pillar is to Cassandra what Rails ActiveRecord migrations or Play Evolutions are to relational databases with one key difference: Pillar is completely independent from any application development framework.
migrations  database  ops  pillar  cassandra  activerecord  scala  continuous-delivery  automation  build 
june 2014 by jm's reference page for java.lang.OutOfMemoryError
With examples of each possible cause of a Java OOM, and suggested workarounds. succinct
reference  oom  java  memory  heap  ops 
june 2014 by jm
Database Migrations Done Right
The rule is simple. You should never tie database migrations to application deploys or vice versa. By minimising dependencies you enable faster, easier and cleaner deployments.

A solid description of why this is a good idea, from an ex-Guardian dev.
migrations  database  sql  mysql  postgres  deployment  ops  dependencies  loose-coupling 
may 2014 by jm
interview with Google VP of SRE Ben Treynor
interviewed by Niall Murphy, no less ;). Some good info on what Google deems important from an ops/SRE perspective
sre  ops  devops  google  monitoring  interviews  ben-treynor 
may 2014 by jm
Docker Plugin for Jenkins
The aim of the docker plugin is to be able to use a docker host to dynamically provision a slave, run a single build, then tear-down that slave. Optionally, the container can be committed, so that (for example) manual QA could be performed by the container being imported into a local docker provider, and run from there.

The holy grail of Jenkins/Docker integration. How cool is that...
jenkins  docker  ops  testing  ec2  hosting  scaling  elastic-scaling  system-testing 
may 2014 by jm
SmartStack vs. Consul
One of the SmartStack developers at AirBNB responds to's comments. FWIW, we use SmartStack in Swrve and it works pretty well...
smartstack  airbnb  ops  consul  serf  load-balancing  availability  resiliency  network-partitions  outages 
may 2014 by jm
Go: Best Practices for Production Environments
how Soundcloud deploy their Go services, after 2.5 years of Go in production
go  tips  deployment  best-practices  soundcloud  ops 
april 2014 by jm
AWS Elastic Beanstalk for Docker
This is pretty amazing. nice work, Beanstalk team. not sure how well it integrates with the rest of AWS though
aws  amazon  docker  ec2  beanstalk  ops  containers  linux 
april 2014 by jm
Fcron is a scheduler. It aims at replacing Vixie Cron, so it implements most of its functionalities. But contrary to Vixie Cron, fcron does not need your system to be up 7 days a week, 24 hours a day : it also works well with systems which are running only occasionnally (contrary to anacrontab). In other words, fcron does both the job of Vixie Cron and anacron, but does even more and better :)) ...

Thanks Craig!
via:chughes  cron  fcron  unix  linux  ops  scheduler  automation  scripts 
april 2014 by jm
Linode announces new instance specs
'TL;DR: SSDs + Insane network + Faster processors + Double the RAM + Hourly Billing'
hosting  linode  ssd  performance  linux  ops  datacenters 
april 2014 by jm
'a command line tool for Amazon's Simple Storage Service (S3). Written in Python, easy_install the package to install as an egg. Supports multithreaded operations for large volumes. Put, get, or delete many items concurrently, using a fixed-size pool of threads. Built on workerpool for multithreading and boto for access to the Amazon S3 API. Unix-friendly input and output. Pipe things in, out, and all around.'

MIT-licensed open source. (via Paul Dolan)
via:pdolan  s3  s3funnel  tools  ops  aws  python  mit  open-source 
april 2014 by jm
"H" in cron syntax
This is something Jenkins have come up to randomize and distribute load, in order to avoid the "thundering-herd" bug. Good call
jenkins  randomization  load-balancing  load  thundering-herd  ops  capacity  sleep 
april 2014 by jm
Dead Man's Snitch
a cron job monitoring tool that keeps an eye on your periodic processes and notifies you when something doesn't happen. Daily backups, monthly emails, or cron jobs you need to monitor? Dead Man's Snitch has you covered. Know immediately when one of these processes doesn't work.

via Marc.
alerts  cron  monitoring  sysadmin  ops  backups  alarms 
april 2014 by jm
open source, system-level exploration: capture system state and activity from a running Linux instance, then save, filter and analyze.
Think of it as strace + tcpdump + lsof + awesome sauce.
With a little Lua cherry on top.

This sounds excellent. Linux-based, GPLv2.
debugging  tools  linux  ops  tracing  strace  open-source  sysdig  cli  tcpdump  lsof 
april 2014 by jm
Adrian Cockroft's Cloud Outage Reports Collection
The detailed summaries of outages from cloud vendors are comprehensive and the response to each highlights many lessons in how to build robust distributed systems. For outages that significantly affected Netflix, the Netflix techblog report gives insight into how to effectively build reliable services on top of AWS. [....] I plan to collect reports here over time, and welcome links to other write-ups of outages and how to survive them.
outages  post-mortems  documentation  ops  aws  ec2  amazon  google  dropbox  microsoft  azure  incident-response 
march 2014 by jm
a utility to perform parallel, pipelined execution of a single HTTP GET. htcat is intended for the purpose of incantations like: htcat | tar -zx

It is tuned (and only really useful) for faster interconnects: [....] 109MB/s on a gigabit network, between an AWS EC2 instance and S3. This represents 91% use of the theoretical maximum of gigabit (119.2 MiB/s).
go  cli  http  file-transfer  ops  tools 
march 2014 by jm
a file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack. S3QL effectively provides a hard disk of dynamic, infinite capacity that can be accessed from any computer with internet access running Linux, FreeBSD or OS-X.

S3QL is a standard conforming, full featured UNIX file system that is conceptually indistinguishable from any local file system. Furthermore, S3QL has additional features like compression, encryption, data de-duplication, immutable trees and snapshotting which make it especially suitable for online backup and archival.
s3  s3ql  backup  aws  filesystems  linux  freebsd  osx  ops 
march 2014 by jm
ZooKeeper Resilience at Pinterest
essentially decoupling the client services from ZK using a local daemon on each client host; very similar to Airbnb's Smartstack. This is a bit of an indictment of ZK's usability though
ops  architecture  clustering  network  partitions  cap  reliability  smartstack  airbnb  pinterest  zookeeper 
march 2014 by jm
Migrating from MongoDB to Cassandra
Interesting side-effect of using LUKS for full-disk encryption: 'For every disk read, we were pulling in 3MB of data (RA is sectors, SSZ is sector size, 6144*512=3145728 bytes) into cache. Oops. Not only were we doing tons of extra work, but we were trashing our page cache too. The default for the device-mapper used by LUKS under Ubuntu 12.04LTS is incredibly sub-optimal for database usage, especially our usage of Cassandra (more small random reads vs. large rows). We turned this down to 128 sectors — 64KB.'
cassandra  luks  raid  linux  tuning  ops  blockdev  disks  sdd 
february 2014 by jm
Yammer Engineering - Resiliency at Yammer
Not content with adding Hystrix (circuit breakers, threadpooling, request time limiting, metrics, etc.) to their entire SOA stack, they've made it incredibly configurable by hooking in a web-based configuration UI, allowing dynamic on-the-fly reconfiguration by their ops guys of the circuit breakers and threadpools in production. Mad stuff
hystrix  circuit-breakers  resiliency  yammer  ops  threadpools  soa  dynamic-configuration  archaius  netflix 
january 2014 by jm
10 Things We Forgot to Monitor
a list of not-so-common outage causes which are easy to overlook; swap rate, NTP drift, SSL expiration, fork rate, etc.
nagios  metrics  ops  monitoring  systems  ntp  bitly 
january 2014 by jm
Hero Culture
Good description of the "hero coder" organisational antipattern.
Now imagine that most of the team is involved in fire-fighting. New recruits see the older recruits getting praised for their brave work in the line-of-fire and they want that kind of praise and reward too. Before long everyone is focused on putting out fires and it is no ones interest to step back and take on the risks that long-term DevOps-focused goals entail.
coding  ops  admin  hero-coder  hero-culture  firefighting  organisations  teams  culture 
january 2014 by jm
Cassandra: tuning the JVM for read heavy workloads
The cluster we tuned is hosted on AWS and is comprised of 6 hi1.4xlarge EC2 instances, with 2 1TB SSDs raided together in a raid 0 configuration. The cluster’s dataset is growing steadily. At the time of this writing, our dataset is 341GB, up from less than 200GB a few months ago, and is growing by 2-3GB per day. The workload on this cluster is very read heavy, with quorum reads making up 99% of all operations.

Some careful GC tuning here. Probably not applicable to anyone else, but good approach in general.
java  performance  jvm  scaling  gc  tuning  cassandra  ops 
january 2014 by jm
Backblaze Blog » What Hard Drive Should I Buy?
Because Backblaze has a history of openness, many readers expected more details in my previous posts. They asked what drive models work best and which last the longest. Given our experience with over 25,000 drives, they asked which ones are good enough that we would buy them again. In this post, I’ll answer those questions.
backblaze  backup  hardware  hdds  storage  disks  ops  via:fanf 
january 2014 by jm
« earlier      
per page:    204080120160

related tags

accidents  acm  acm-queue  action-items  activemq  activerecord  admin  adrian-cockcroft  advent  agpl  airbnb  aix  alarm-fatigue  alarming  alarms  alert-logic  alerting  alerts  alestic  algorithms  alter-table  ama  amazon  analysis  analytics  anomaly-detection  antarctica  anti-spam  antipatterns  ap  apache  aphyr  app-engine  apt  archaius  architecture  asg  asgard  atomic  authentication  auto-scaling  automation  autoremediation  autoscaling  availability  aws  az  azure  backblaze  backup  backups  banking  baron-schwartz  bash  basho  bdb  bdb-je  bdd  beanstalk  ben-treynor  benchmarks  best-practices  big-data  billing  bit-errors  bitcoin  bitly  bitrot  bloat  blockdev  blogs  boundary  broadcast  bugs  build  build-out  bureaucracy  ca-7  caching  campaigns  canary-requests  cap  cap-theorem  capacity  carbon  case-studies  cassandra  censum  cfengine  cgroups  change-management  change-monitoring  changes  chaos-monkey  checklists  chef  chefspec  circuit-breakers  circus  cisco  classification  classifiers  cleaner  cli  cloud  cloudera  cloudwatch  cluster  clustering  clusters  cms  code-spaces  coding  coinbase  cold  collaboration  command-line  commercial  company  compatibility  complexity  compression  concurrency  conferences  confidence-bands  configuration  consistency  consul  containerization  containers  continuous-delivery  continuous-deployment  continuous-integration  continuousintegration  copy-on-write  copyright  coreutils  corruption  cp  crash-only-software  criu  cron  culture  curl  daemon  daemons  dashboards  data  data-corruption  database  databases  datacenters  dataviz  debian  debug  debugging  decay  defrag  delete  delivery  demo  dependencies  deploy  deployinator  deployment  desktops  dev  developers  devops  diagnosis  digital-ocean  disk  disks  distcomp  distributed  distributed-systems  distros  diy  dmca  dns  docker  documentation  dotcloud  drivers  dropbox  dstat  duplicity  duply  dynamic-configuration  dynect  ebs  ec2  elastic-scaling  elasticsearch  email  emr  encryption  engineering  ensemble  erasure-coding  etsy  eureka  event-management  eventual-consistency  exercises  exponential-decay  extortion  fabric  facebook  fail  failover  failure  false-positives  fault-tolerance  fcron  feature-flags  file-transfer  filesystems  fincore  firefighting  five-whys  flapjack  flock  forecasting  foursquare  freebsd  front-ends  fs  fsync  ftrace  g1  g1gc  gae  game-days  gating  gc  gilt-groupe  git  github  gnome  go  god  google  gossip  graphing  graphite  graphs  gzip  ha  hacks  hadoop  haproxy  hardware  hdds  heap  heartbeats  hero-coder  hero-culture  hn  holt-winters  home  honeypot  hosting  hotspot  hrd  http  huge-pages  humor  hystrix  ian-wilkes  ibm  icecube  images  inactivity  incident-response  inept  infrastructure  init  inspeqtor  instrumentation  integration-tests  internet  interviews  inviso  io  iops  iostat  ioutil  ip  iptables  ironfan  java  jay-kreps  jcmd  jdk  jenkins  jepsen  jmx  jmxtrans  john-allspaw  juniper  jvm  kafka  kdd  kde  kellabyte  kernel  kill-9  knife  laptops  latency  legacy  leveldb  lifespan  limits  linden  linkedin  links  linode  linux  live  load  load-balancers  load-balancing  locking  logentries  logging  loose-coupling  lsb  lsof  lsx  luks  lxc  mac  machine-learning  macosx  mail  maintainance  mandos  map-reduce  measurements  memory  mesos  metrics  mfa  microsoft  migrations  mincore  mirroring  mit  mongodb  monit  monitorama  monitoring  movies  mtbf  mysql  nagios  namespaces  nannies  nas  natwest  nerve  netflix  netstat  network  network-monitoring  network-partitions  networking  networks  nginx  nix  nixos  node.js  nosql  notification  notifications  npm  ntp  ntpd  obama  omniti  oom  open-source  openjdk  operations  ops  optimization  organisations  os  oss  osx  ouch  out-of-band  outage  outages  outsourcing  packaging  page-cache  pager-duty  pagerduty  pages  paging  papers  partition  partitions  passenger  paxos  pbailis  pdf  percona  performance  phusion  pie  pillar  pinterest  piops  pixar  post-mortems  postgres  postmortems  presentations  pricing  prioritisation  procedures  process  processes  procfs  production  profiling  programming  provisioning  pty  puppet  python  qa  questions  queueing  rabbitmq  raid  rails  rami-rosen  randomization  ranking  rant  rate-limiting  rbs  rdbms  rds  recovery  reddit  redis  redshift  refactoring  reference  reinvent  release  reliability  remediation  replicas  replication  resiliency  restoring  riak  risks  rm-rf  rollback  root-cause  root-causes  route53  routing  rspec  ruby  runbooks  runit  rvm  s3  s3funnel  s3ql  safety  sanity-checks  scala  scalability  scaling  scheduler  schema  scripts  sdd  seagate  search  security  sensu  serf  serialization  server  servers  serverspec  service-discovery  service-metrics  service-registry  services  ses  sev1  sharding  shopify  shorn-writes  silos  sleep  slew  slides  smartstack  smtp  snappy  snapshots  sns  soa  software  solaris  soundcloud  south-pole  space  speculative-execution  split-brain  spot-instances  sql  sre  ssd  ssh  stack  statistics  stats  statsd  statsite  stephanie-dean  stepping  storage  storm  strace  streaming  strider  stripe  supervision  supervisord  support  survey  svctm  syadmin  synapse  sysadmin  sysdig  syslog  system-testing  systemd  systems  tahoe-lafs  talks  tcp  tcpdump  tdd  teams  tech  tech-debt  techops  testing  threadpools  throughput  thundering-herd  tildeslash  time  time-machine  time-synchronization  tips  tools  tos  trace  tracer-requests  tracing  trading  training  transparent-huge-pages  troubleshooting  tuning  turing-complete  twilio  twisted  twitter  two-factor-authentication  uat  ubuntu  ui  ulster-bank  ultradns  unicorn  unit-testing  unit-tests  unix  upgrades  upstart  usenix  vagrant  versioning  via:aphyr  via:bill-dehora  via:chughes  via:codeslinger  via:dave-doran  via:dehora  via:fanf  via:filippo  via:jk  via:kragen  via:lusis  via:martharotter  via:nelson  via:pdolan  via:pixelbeat  virtualisation  virtualization  vm  vms  voldemort  vpc  web  web-services  webmail  weighting  wiki  wipac  work  yammer  zfs  zipkin  zonify  zookeeper  zooko 

Copy this bookmark: