jm + ops   149

Box Tech Blog » A Tale of Postmortems
How Box introduced COE-style dev/ops outage postmortems, and got them working. This PIE metric sounds really useful to head off the dreaded "it'll all have to come out missus" action item:
The picture was getting clearer, and we decided to look into individual postmortems and action items and see what was missing. As it was, action items were wasting away with no owners. Digging deeper, we noticed that many action items entailed massive refactorings or vague requirements like “make system X better” (i.e. tasks that realistically were unlikely to be addressed). At a higher level, postmortem discussions often devolved into theoretical debates without a clear outcome. We needed a way to lower and focus the postmortem bar and a better way to categorize our action items and our technical debt.

Out of this need, PIE (“Probability of recurrence * Impact of recurrence * Ease of addressing”) was born. By ranking each factor from 1 (“low”) to 5 (“high”), PIE provided us with two critical improvements:

1. A way to police our postmortems discussions. I.e. a low probability, low impact, hard to implement solution was unlikely to get prioritized and was better suited to a discussion outside the context of the postmortem. Using this ranking helped deflect almost all theoretical discussions.
2. A straightforward way to prioritize our action items.

What’s better is that once we embraced PIE, we also applied it to existing tech debt work. This was critical because we could now prioritize postmortem action items alongside existing work. Postmortem action items became part of normal operations just like any other high-priority work.
postmortems  action-items  outages  ops  devops  pie  metrics  ranking  refactoring  prioritisation  tech-debt 
3 days ago by jm
The Network is Reliable - ACM Queue
Peter Bailis and Kyle Kingsbury accumulate a comprehensive, informal survey of real-world network failures observed in production. I remember that April 2011 EBS outage...
ec2  aws  networking  outages  partitions  jepsen  pbailis  aphyr  acm-queue  acm  survey  ops 
27 days ago by jm
iosnoop For Linux
it's a shell script! ftrace-based tool to snoop on Linux disk I/O and trace system-wide activity, more-or-less attributing it to the correct process
linux  disk  io  tracing  trace  ops  ftrace 
5 weeks ago by jm
Boundary's new server monitoring free offering
'High resolution, 1 second intervals for all metrics; Fluid analytics, drag any graph to any point in time; Smart alarms to cut down on false positives; Embedded graphs and customizable dashboards; Up to 10 servers for free'

Pre-registration is open now. Could be interesting, although the limit of 10 machines is pretty small for any production usage
boundary  monitoring  network  ops  metrics  alarms  tcp  ip  netstat 
5 weeks ago by jm
Xfennec/cv
'This tool can be described as a Tiny Dirty Linux Only C command that looks for coreutils basic commands (cp, mv, dd, tar, gzip/gunzip, cat, ...) currently running on your system and displays the percentage of copied data. It can now also display an estimated throughput (using -w flag).'
coreutils  via:pixelbeat  linux  ops  hacks  procfs  dataviz  unix 
5 weeks ago by jm
Latest EBS tuning tips
from yesterday's AWS Summit in NYC:

Cheat sheet of EBS-optimized instances. http://t.co/vmTlhUtpWk
Optimize your queue depth to achieve lower latency & highest IOPS. http://t.co/EO48oa0D6X
When configuring your RAID, use a stripe size of 128KB or 256KB. http://t.co/N0ldtFJ4t6
Use larger block size to speed up the pre-warming process. http://t.co/8UoIeWE2px
ebs  aws  amazon  iops  raid  ops  tuning 
5 weeks ago by jm
Two traps in iostat: %util and svctm
Marc Brooker:
As a measure of general IO busyness %util is fairly handy, but as an indication of how much the system is doing compared to what it can do, it's terrible. Iostat's svctm has even fewer redeeming strengths. It's just extremely misleading for most modern storage systems and workloads. Both of these fields are likely to mislead more than inform on modern SSD-based storage systems, and their use should be treated with extreme care.
ioutil  iostat  svctm  ops  ssd  disks  hardware  metrics  stats  linux 
6 weeks ago by jm
Tessera
Urban Airship with a new open-source Graphite front-end UI; similar enough to Grafana at a glance, no releases yet, ASL2-licensed
graphite  metrics  ui  front-ends  open-source  ops 
7 weeks ago by jm
Delivery Notifications for Simple Email Service
Today we are enhancing SES with the addition of delivery notifications. You can now elect to receive an Amazon SNS notification each time SES successfully delivers a message to a recipient's email server. These notifications give you increased visibility into the mail delivery process. With today's release, you can now track deliveries, bounces, and complaints, all via notification to the SNS topic or topics of your choice.
delivery  email  smtp  ses  aws  sns  notifications  ops 
7 weeks ago by jm
Amazon EC2 Service Limits Report Now Available
'designed to make it easier for you to view and manage your limits for Amazon EC2 by providing the latest information on service limits and links to quickly request limit increases. EC2 Service Limits Report displays all your service limit information in one place to help you avoid encountering limits on future EC2, EBS, Auto Scaling, and VPC usage.'
aws  ec2  vpc  ebs  autoscaling  limits  ops 
7 weeks ago by jm
Code Spaces data and backups deleted by hackers
Rather scary story of an extortionist wiping out a company's AWS-based infrastructure. Turns out S3 supports MFA-required deletion as a feature, though, which would help against that.
ops  security  extortion  aws  ec2  s3  code-spaces  delete  mfa  two-factor-authentication  authentication  infrastructure 
8 weeks ago by jm
Call me maybe: Elasticsearch
Wow, these are terrible results. From the sounds of it, ES just cannot deal with realistic outage scenarios and is liable to suffer catastrophic damage in reasonably-common partitions.
If you are an Elasticsearch user (as I am): good luck. Some people actually advocate using Elasticsearch as a primary data store; I think this is somewhat less than advisable at present. If you can, store your data in a safer database, and feed it into Elasticsearch gradually. Have processes in place that continually traverse the system of record, so you can recover from ES data loss automatically.
elasticsearch  ops  storage  databases  jepsen  partition  network  outages  reliability 
8 weeks ago by jm
Call me maybe: RabbitMQ
We used Knossos and Jepsen to prove the obvious: RabbitMQ is not a lock service. That investigation led to a discovery hinted at by the documentation: in the presence of partitions, RabbitMQ clustering will not only deliver duplicate messages, but will also drop huge volumes of acknowledged messages on the floor. This is not a new result, but it may be surprising if you haven’t read the docs closely–especially if you interpreted the phrase “chooses Consistency and Partition Tolerance” to mean, well, either of those things.
rabbitmq  network  partitions  failure  cap-theorem  consistency  ops  reliability  distcomp  jepsen 
9 weeks ago by jm
Pillar
Manages migrations for your Cassandra data stores. Pillar grew from a desire to automatically manage Cassandra schema as code. Managing schema as code enables automated build and deployment, a foundational practice for an organization striving to achieve Continuous Delivery.

Pillar is to Cassandra what Rails ActiveRecord migrations or Play Evolutions are to relational databases with one key difference: Pillar is completely independent from any application development framework.
migrations  database  ops  pillar  cassandra  activerecord  scala  continuous-delivery  automation  build 
9 weeks ago by jm
Plumbr.eu's reference page for java.lang.OutOfMemoryError
With examples of each possible cause of a Java OOM, and suggested workarounds. succinct
reference  plumbr.eu  oom  java  memory  heap  ops 
9 weeks ago by jm
Database Migrations Done Right
The rule is simple. You should never tie database migrations to application deploys or vice versa. By minimising dependencies you enable faster, easier and cleaner deployments.


A solid description of why this is a good idea, from an ex-Guardian dev.
migrations  database  sql  mysql  postgres  deployment  ops  dependencies  loose-coupling 
may 2014 by jm
interview with Google VP of SRE Ben Treynor
interviewed by Niall Murphy, no less ;). Some good info on what Google deems important from an ops/SRE perspective
sre  ops  devops  google  monitoring  interviews  ben-treynor 
may 2014 by jm
Docker Plugin for Jenkins
The aim of the docker plugin is to be able to use a docker host to dynamically provision a slave, run a single build, then tear-down that slave. Optionally, the container can be committed, so that (for example) manual QA could be performed by the container being imported into a local docker provider, and run from there.


The holy grail of Jenkins/Docker integration. How cool is that...
jenkins  docker  ops  testing  ec2  hosting  scaling  elastic-scaling  system-testing 
may 2014 by jm
SmartStack vs. Consul
One of the SmartStack developers at AirBNB responds to Consul.io's comments. FWIW, we use SmartStack in Swrve and it works pretty well...
smartstack  airbnb  ops  consul  serf  load-balancing  availability  resiliency  network-partitions  outages 
may 2014 by jm
Go: Best Practices for Production Environments
how Soundcloud deploy their Go services, after 2.5 years of Go in production
go  tips  deployment  best-practices  soundcloud  ops 
april 2014 by jm
AWS Elastic Beanstalk for Docker
This is pretty amazing. nice work, Beanstalk team. not sure how well it integrates with the rest of AWS though
aws  amazon  docker  ec2  beanstalk  ops  containers  linux 
april 2014 by jm
fcron
Fcron is a scheduler. It aims at replacing Vixie Cron, so it implements most of its functionalities. But contrary to Vixie Cron, fcron does not need your system to be up 7 days a week, 24 hours a day : it also works well with systems which are running only occasionnally (contrary to anacrontab). In other words, fcron does both the job of Vixie Cron and anacron, but does even more and better :)) ...


Thanks Craig!
via:chughes  cron  fcron  unix  linux  ops  scheduler  automation  scripts 
april 2014 by jm
Linode announces new instance specs
'TL;DR: SSDs + Insane network + Faster processors + Double the RAM + Hourly Billing'
hosting  linode  ssd  performance  linux  ops  datacenters 
april 2014 by jm
s3funnel
'a command line tool for Amazon's Simple Storage Service (S3). Written in Python, easy_install the package to install as an egg. Supports multithreaded operations for large volumes. Put, get, or delete many items concurrently, using a fixed-size pool of threads. Built on workerpool for multithreading and boto for access to the Amazon S3 API. Unix-friendly input and output. Pipe things in, out, and all around.'

MIT-licensed open source. (via Paul Dolan)
via:pdolan  s3  s3funnel  tools  ops  aws  python  mit  open-source 
april 2014 by jm
"H" in cron syntax
This is something Jenkins have come up to randomize and distribute load, in order to avoid the "thundering-herd" bug. Good call
jenkins  randomization  load-balancing  load  thundering-herd  ops  capacity  sleep 
april 2014 by jm
Dead Man's Snitch
a cron job monitoring tool that keeps an eye on your periodic processes and notifies you when something doesn't happen. Daily backups, monthly emails, or cron jobs you need to monitor? Dead Man's Snitch has you covered. Know immediately when one of these processes doesn't work.


via Marc.
alerts  cron  monitoring  sysadmin  ops  backups  alarms 
april 2014 by jm
sysdig
open source, system-level exploration: capture system state and activity from a running Linux instance, then save, filter and analyze.
Think of it as strace + tcpdump + lsof + awesome sauce.
With a little Lua cherry on top.


This sounds excellent. Linux-based, GPLv2.
debugging  tools  linux  ops  tracing  strace  open-source  sysdig  cli  tcpdump  lsof 
april 2014 by jm
Adrian Cockroft's Cloud Outage Reports Collection
The detailed summaries of outages from cloud vendors are comprehensive and the response to each highlights many lessons in how to build robust distributed systems. For outages that significantly affected Netflix, the Netflix techblog report gives insight into how to effectively build reliable services on top of AWS. [....] I plan to collect reports here over time, and welcome links to other write-ups of outages and how to survive them.
outages  post-mortems  documentation  ops  aws  ec2  amazon  google  dropbox  microsoft  azure  incident-response 
march 2014 by jm
htcat/htcat
a utility to perform parallel, pipelined execution of a single HTTP GET. htcat is intended for the purpose of incantations like: htcat https://host.net/file.tar.gz | tar -zx

It is tuned (and only really useful) for faster interconnects: [....] 109MB/s on a gigabit network, between an AWS EC2 instance and S3. This represents 91% use of the theoretical maximum of gigabit (119.2 MiB/s).
go  cli  http  file-transfer  ops  tools 
march 2014 by jm
S3QL
a file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack. S3QL effectively provides a hard disk of dynamic, infinite capacity that can be accessed from any computer with internet access running Linux, FreeBSD or OS-X.

S3QL is a standard conforming, full featured UNIX file system that is conceptually indistinguishable from any local file system. Furthermore, S3QL has additional features like compression, encryption, data de-duplication, immutable trees and snapshotting which make it especially suitable for online backup and archival.
s3  s3ql  backup  aws  filesystems  linux  freebsd  osx  ops 
march 2014 by jm
ZooKeeper Resilience at Pinterest
essentially decoupling the client services from ZK using a local daemon on each client host; very similar to Airbnb's Smartstack. This is a bit of an indictment of ZK's usability though
ops  architecture  clustering  network  partitions  cap  reliability  smartstack  airbnb  pinterest  zookeeper 
march 2014 by jm
Migrating from MongoDB to Cassandra
Interesting side-effect of using LUKS for full-disk encryption: 'For every disk read, we were pulling in 3MB of data (RA is sectors, SSZ is sector size, 6144*512=3145728 bytes) into cache. Oops. Not only were we doing tons of extra work, but we were trashing our page cache too. The default for the device-mapper used by LUKS under Ubuntu 12.04LTS is incredibly sub-optimal for database usage, especially our usage of Cassandra (more small random reads vs. large rows). We turned this down to 128 sectors — 64KB.'
cassandra  luks  raid  linux  tuning  ops  blockdev  disks  sdd 
february 2014 by jm
Yammer Engineering - Resiliency at Yammer
Not content with adding Hystrix (circuit breakers, threadpooling, request time limiting, metrics, etc.) to their entire SOA stack, they've made it incredibly configurable by hooking in a web-based configuration UI, allowing dynamic on-the-fly reconfiguration by their ops guys of the circuit breakers and threadpools in production. Mad stuff
hystrix  circuit-breakers  resiliency  yammer  ops  threadpools  soa  dynamic-configuration  archaius  netflix 
january 2014 by jm
10 Things We Forgot to Monitor
a list of not-so-common outage causes which are easy to overlook; swap rate, NTP drift, SSL expiration, fork rate, etc.
nagios  metrics  ops  monitoring  systems  ntp  bitly 
january 2014 by jm
Hero Culture
Good description of the "hero coder" organisational antipattern.
Now imagine that most of the team is involved in fire-fighting. New recruits see the older recruits getting praised for their brave work in the line-of-fire and they want that kind of praise and reward too. Before long everyone is focused on putting out fires and it is no ones interest to step back and take on the risks that long-term DevOps-focused goals entail.
coding  ops  admin  hero-coder  hero-culture  firefighting  organisations  teams  culture 
january 2014 by jm
Cassandra: tuning the JVM for read heavy workloads
The cluster we tuned is hosted on AWS and is comprised of 6 hi1.4xlarge EC2 instances, with 2 1TB SSDs raided together in a raid 0 configuration. The cluster’s dataset is growing steadily. At the time of this writing, our dataset is 341GB, up from less than 200GB a few months ago, and is growing by 2-3GB per day. The workload on this cluster is very read heavy, with quorum reads making up 99% of all operations.


Some careful GC tuning here. Probably not applicable to anyone else, but good approach in general.
java  performance  jvm  scaling  gc  tuning  cassandra  ops 
january 2014 by jm
Backblaze Blog » What Hard Drive Should I Buy?
Because Backblaze has a history of openness, many readers expected more details in my previous posts. They asked what drive models work best and which last the longest. Given our experience with over 25,000 drives, they asked which ones are good enough that we would buy them again. In this post, I’ll answer those questions.
backblaze  backup  hardware  hdds  storage  disks  ops  via:fanf 
january 2014 by jm
deploy_to_runit
A nice node.js app to perform continuous deployment from a GitHub repo via its webhook support, from Matt Sergeant
github  node.js  runit  deployment  git  continuous-deployment  devops  ops 
january 2014 by jm
Dr. Bunsen / Time Warp
I use it to modify Time Machine’s backup behavior using weighted reservoir sampling. I built Time Warp to preserve important backup snapshots and prevent Time Machine from deleting them.


via Aman. Nifty!
backup  python  time-machine  decay  exponential-decay  weighting  algorithms  snapshots  ops 
january 2014 by jm
"Understanding the Robustness of SSDs under Power Fault", FAST '13 [paper]
Horrific. SSDs (including "enterprise-class storage") storing sync'd writes in volatile RAM while claiming they were synced; one device losing 72.6GB, 30% of its data, after 8 injected power faults; and all SSDs tested displayed serious errors including random bit errors, metadata corruption, serialization errors and shorn writes. Don't trust lone unreplicated, unbacked-up SSDs!
pdf  papers  ssd  storage  reliability  safety  hardware  ops  usenix  serialization  shorn-writes  bit-errors  corruption  fsync 
january 2014 by jm
The How and Why of Flapjack
Flapjack aims to be a flexible notification system that handles:

Alert routing (determining who should receive alerts based on interest, time of day, scheduled maintenance, etc);
Alert summarisation (with per-user, per media summary thresholds);
Your standard operational tasks (setting scheduled maintenance, acknowledgements, etc).

Flapjack sits downstream of your check execution engine (like Nagios, Sensu, Icinga, or cron), processing events to determine if a problem has been detected, who should know about the problem, and how they should be told.
flapjack  notification  alerts  ops  nagios  paging  sensu 
january 2014 by jm
BitCoin exchange CoinBase uses MongoDB as their 'primary datastore'
'Coinbase uses MongoDB for their primary datastore for their web app, api requests, etc.'
coinbase  mongodb  reliability  hn  via:aphyr  ops  banking  bitcoin 
december 2013 by jm
Load Balancer Testing with a Honeypot Daemon
nice post on writing BDD unit tests for infrastructure, in this case specifically a load balancer (via Devops Weekly)
load-balancers  ops  devops  sysadmin  testing  unit-tests  networking  honeypot  infrastructure  bdd 
december 2013 by jm
Cyanite
a metric storage daemon, exposing both a carbon listener and a simple web service. Its aim is to become a simple, scalable and drop-in replacement for graphite's backend.


Pretty alpha for now, but definitely worth keeping an eye on to potentially replace our burgeoning Carbon fleet...
graphite  carbon  cassandra  storage  metrics  ops  graphs  service-metrics 
december 2013 by jm
Kelly "kellabyte" Sommers on Redis' "relaxed CP" approach to the CAP theorem

Similar to ACID properties, if you partially provide properties it means the user has to _still_ consider in their application that the property doesn't exist, because sometimes it doesn't. In you're fsync example, if fsync is relaxed and there are no replicas, you cannot consider the database durable, just like you can't consider Redis a CP system. It can't be counted on for guarantees to be delivered. This is why I say these systems are hard for users to reason about. Systems that partially offer guarantees require in-depth knowledge of the nuances to properly use the tool. Systems that explicitly make the trade-offs in the designs are easier to reason about because it is more obvious and _predictable_.
kellabyte  redis  cp  ap  cap-theorem  consistency  outages  reliability  ops  database  storage  distcomp 
december 2013 by jm
Chef Testing at PagerDuty
Good article on how PagerDuty test their chef changes -- lint, unit tests using ChefSpec, integ tests and their "Failure Friday" game days
testing  chef  ops  devops  chefspec  game-days  pagerduty 
december 2013 by jm
Flock for Cron jobs
good blog post writing up the 'flock -n -c' trick to ensure single-concurrent-process locking for cron jobs
cron  concurrency  unix  linux  flock  locking  ops 
december 2013 by jm
Failure Friday: How We Ensure PagerDuty is Always Reliable
Basically, they run the kind of exercise which Jesse Robbins invented at Amazon -- "Game Days". Scarily, they do these on a Friday -- living dangerously!
game-days  testing  failure  devops  chaos-monkey  ops  exercises 
november 2013 by jm
Rasmus' home NAS design
I'm trying to avoid doing this in order to avoid more power consumption and unpopular hardware in the house -- but if necessary, this is a good up-to-date homebuild design
nas  hardware  home  storage  ops  disks 
november 2013 by jm
Backblaze Blog » How long do disk drives last?
According to Backblaze's data, 80% of drives last 4 years, and the median lifespan is projected to be 6 years
backblaze  storage  disk  ops  mtbf  hardware  failure  lifespan 
november 2013 by jm
10 Things You Should Know About AWS
Some decent tips in here, mainly EC2-focussed
amazon  ec2  aws  ops  rds 
november 2013 by jm
Why Every Company Needs A DevOps Team Now - Feld Thoughts
Bookmarking particularly for the 3 "favourite DevOps patterns":

"Make sure we have environments available early in the Development process"; enforce a policy that the code and environment are tested together, even at the earliest stages of the project; “Wake up developers up at 2 a.m. when they break things"; and "Create reusable deployment procedures".
devops  work  ops  deployment  testing  pager-duty 
november 2013 by jm
Statsite
A C reimplementation of Etsy's statsd, with some interesting memory optimizations.
Statsite is designed to be both highly performant, and very flexible. To achieve this, it implements the stats collection and aggregation in pure C, using libev to be extremely fast. This allows it to handle hundreds of connections, and millions of metrics. After each flush interval expires, statsite performs a fork/exec to start a new stream handler invoking a specified application. Statsite then streams the aggregated metrics over stdin to the application, which is free to handle the metrics as it sees fit. This allows statsite to aggregate metrics and then ship metrics to any number of sinks (Graphite, SQL databases, etc). There is an included Python script that ships metrics to graphite.
statsd  graphite  statsite  performance  statistics  service-metrics  metrics  ops 
november 2013 by jm
Serf
'a service discovery and orchestration tool that is decentralized, highly available, and fault tolerant. Serf runs on every major platform: Linux, Mac OS X, and Windows. It is extremely lightweight: it uses 5 to 10 MB of resident memory and primarily communicates using infrequent UDP messages [and an] efficient gossip protocol.'
clustering  service-discovery  ops  linux  gossip  broadcast  clusters 
november 2013 by jm
Counterfactual Thinking, Rules, and The Knight Capital Accident
John Allspaw with an interesting post on the Knight Capital disaster
john-allspaw  ops  safety  post-mortems  engineering  procedures 
october 2013 by jm
Airbnb's Smartstack
Service discovery a la Airbnb -- Nerve and Synapse: two external daemons that run on each node, Nerve to manage registration in Zookeeper, and Synapse to generate a haproxy configuration file from that, running on each host, allowing connections to all other hosts.
haproxy  services  ops  load-balancing  service-discovery  nerve  synapse  airbnb 
october 2013 by jm
Basho and Seagate partner to deliver scale-out cloud storage breakthrough
Ha, cool. Skip the OS, write the Riak store natively to the drive. This sounds frankly terrifying ;)
The Seagate Kinetic Open Storage platform eliminates the storage server tier of traditional data center architectures by enabling applications to speak directly to the storage system, thereby reducing expenses associated with the acquisition, deployment, and support of hyperscale storage infrastructures. The platform leverages Seagate’s expertise in hardware and software storage systems integrating an open source API and Ethernet connectivity with Seagate hard drive technology.
seagate  basho  riak  storage  hardware  drivers  os  ops 
october 2013 by jm
How to lose $172,222 a second for 45 minutes
Major outage and $465m of trading loss, caused by staggeringly inept software management: 8 years of incremental bitrot, technical debt, and failure to have correct processes to engage an ops team in incident response. Hopefully this will serve as a lesson that software is more than just coding, at least to one industry
trading  programming  coding  software  inept  fail  bitrot  tech-debt  ops  incident-response 
october 2013 by jm
"Toy Story 2" was almost entirely deleted by accident at one point
A stray "rm -rf" on the main network share managed to wipe out 90% of the movie's assets, and the backups were corrupt. Horrific backups war story
movies  ops  backups  pixar  recovery  accidents  rm-rf  delete 
october 2013 by jm
Introducing Chaos to C*
Autoremediation, ie. auto-replacement, of Cassandra nodes in production at Netflix
ops  autoremediation  outages  remediation  cassandra  storage  netflix  chaos-monkey 
october 2013 by jm
"What Should I Monitor?"
slides (lots of slides) from Baron Schwartz' talk at Velocity in NYC.
slides  monitoring  metrics  ops  devops  baron-schwartz  pdf  capacity 
october 2013 by jm
pt-summary
from the Percona toolkit. 'Conveniently summarizes the status and configuration of a server. It is not a tuning tool or diagnosis tool. It produces a report that is easy to diff and can be pasted into emails without losing the formatting. This tool works well on many types of Unix systems.' --- summarises OOM history, top, netstat connection table, interface stats, network config, RAID, LVM, disks, inodes, disk scheduling, mounts, memory, processors, and CPU.
percona  tools  cli  unix  ops  linux  diagnosis  raid  netstat  oom 
october 2013 by jm
Mesosphere · Docker on Mesos
This is cool. Deploy Docker container images onto a Mesos cluster: key point, in the description of the Redis example: 'there’s no need to install Redis or its supporting libraries on your Mesos hosts.'
mesos  docker  deployment  ops  images  virtualization  containers  linux 
september 2013 by jm
Getting Real About Distributed System Reliability
I have come around to the view that the real core difficulty of [distributed] systems is operations, not architecture or design. Both are important but good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations. This is quite different from the view of unbreakable, self-healing, self-operating systems that I see being pitched by the more enthusiastic NoSQL hypesters. Worse yet, you can’t easily buy good operations in the same way you can buy good software—you might be able to hire good people (if you can find them) but this is more than just people; it is practices, monitoring systems, configuration management, etc.
reliability  nosql  distributed-systems  jay-kreps  ops 
september 2013 by jm
Creating Flight Recordings
lots more detail on the new "Java Mission Control" feature in Hotspot 7u40 JVMs, and how to use it to start and stop profiling in a live, production JVM from a separate "jcmd" command-line client. If the overhead is small, this could be really neat -- turn on profiling for 1 minute every hour on a single instance, and collect realtime production profile data on an automated basis for post-facto analysis if required
instrumentation  logging  profiling  java  jvm  ops 
september 2013 by jm
Voldemort on Solid State Drives [paper]
'This paper and talk was given by the LinkedIn Voldemort Team at the Workshop on Big Data Benchmarking (WBDB May 2012).'

With SSD, we find that garbage collection will become a very significant bottleneck, especially for systems which have little control over the storage layer and rely on Java memory management. Big heapsizes make the cost of garbage collection expensive, especially the single threaded CMS Initial mark. We believe that data systems must revisit their caching strategies with SSDs. In this regard, SSD has provided an efficient solution for handling fragmentation and moving towards predictable multitenancy.
voldemort  storage  ssd  disk  linkedin  big-data  jvm  tuning  ops  gc 
september 2013 by jm
Interview with the Github Elasticsearch Team
good background on Github's Elasticsearch scaling efforts. Some rather horrific split-brain problems under load, and crashes due to OpenJDK bugs (sounds like OpenJDK *still* isn't ready for production). painful
elasticsearch  github  search  ops  scaling  split-brain  outages  openjdk  java  jdk  jvm 
september 2013 by jm
Docker: Git for deployment
Docker is to deployment as Git is to development.

Developers are able to leverage Git's performance and flexibility when building applications. Git encourages experiments and doesn't punish you when things go wrong: start your experiments in a branch, if things fall down, just git rebase or git reset. It's easy to start a branch and fast to push it.

Docker encourages experimentation for operations. Containers start quickly. Building images is a snap. Using another images as a base image is easy. Deploying whole images is fast, and last but not least, it's not painful to rollback.

Fast + flexible = deployments are about to become a lot more enjoyable.
docker  deployment  sysadmin  ops  devops  vms  vagrant  virtualization  containers  linux  git 
august 2013 by jm
Juniper Adds Puppet support
This is super-cool.

'Network engineering no longer should be mundane tasks like conf, set interfaces fe-0/0/0 unit o family inet address 10.1.1.1/24. How does mindless CLI work translate to efficiently spent time ? What if you need to change 300 devices? What if you are writing it by hand? An error-prone waste of time. Juniper today announced Puppet support for their 12.2R3,5 JUNOS code. This is compatible with EX4200, EX4550, and QFX3500 switches. These are top end switches, but this start is directly aimed at their DC and enterprise devices. Initially, the manifest interactions offered are interface, layer 2 interface, vlan, port aggregation groups, and device names.'

Based on what I saw in the Network Automation team in Amazon, this is an amazing leap forward; it'd instantly render obsolete a bunch of horrific SSH-CLI automation cruft.
ssh  cli  automation  networking  networks  puppet  ops  juniper  cisco 
august 2013 by jm
Information on Google App Engine's recent US datacenter relocations - Google Groups
or, really, 'why we had some glitches and outages recently'. A few interesting tidbits about GAE innards though (via Bill De hOra)
gae  google  app-engine  outages  ops  paxos  eventual-consistency  replication  storage  hrd 
august 2013 by jm
« earlier      
per page:    204080120160

related tags

accidents  acm  acm-queue  action-items  activemq  activerecord  admin  adrian-cockcroft  advent  airbnb  alarming  alarms  alert-logic  alerting  alerts  alestic  algorithms  alter-table  ama  amazon  analytics  anomaly-detection  anti-spam  antipatterns  ap  apache  aphyr  app-engine  archaius  architecture  asgard  authentication  automation  autoremediation  autoscaling  availability  aws  az  azure  backblaze  backup  backups  banking  baron-schwartz  basho  bdb  bdb-je  bdd  beanstalk  ben-treynor  benchmarks  best-practices  big-data  billing  bit-errors  bitcoin  bitly  bitrot  blockdev  blogs  boundary  broadcast  bugs  build  build-out  ca-7  caching  campaigns  canary-requests  cap  cap-theorem  capacity  carbon  case-studies  cassandra  censum  cfengine  change-monitoring  chaos-monkey  checklists  chef  chefspec  circuit-breakers  circus  cisco  classification  classifiers  cleaner  cli  cloud  cloudwatch  cluster  clustering  clusters  cms  code-spaces  coding  coinbase  collaboration  command-line  commercial  company  compression  concurrency  confidence-bands  configuration  consistency  consul  containerization  containers  continuous-delivery  continuous-deployment  continuous-integration  continuousintegration  copy-on-write  coreutils  corruption  cp  crash-only-software  cron  culture  daemon  daemons  dashboards  data  data-corruption  database  databases  datacenters  dataviz  debug  debugging  decay  delete  delivery  demo  dependencies  deploy  deployinator  deployment  desktops  dev  developers  devops  diagnosis  digital-ocean  disk  disks  distcomp  distributed  distributed-systems  dns  docker  documentation  dotcloud  drivers  dropbox  dstat  duplicity  duply  dynamic-configuration  dynect  ebs  ec2  elastic-scaling  elasticsearch  email  engineering  ensemble  erasure-coding  etsy  eventual-consistency  exercises  exponential-decay  extortion  fabric  facebook  fail  failure  false-positives  fault-tolerance  fcron  file-transfer  filesystems  firefighting  flapjack  flock  forecasting  foursquare  freebsd  front-ends  fs  fsync  ftrace  g1  g1gc  gae  game-days  gc  gilt-groupe  git  github  go  god  google  gossip  graphing  graphite  graphs  gzip  hacks  hadoop  haproxy  hardware  hdds  heap  hero-coder  hero-culture  hn  holt-winters  home  honeypot  hosting  hotspot  hrd  http  hystrix  ian-wilkes  ibm  images  incident-response  inept  infrastructure  instrumentation  internet  interviews  io  iops  iostat  ioutil  ip  iptables  ironfan  java  jay-kreps  jdk  jenkins  jepsen  jmx  jmxtrans  john-allspaw  juniper  jvm  kafka  kdd  kde  kellabyte  knife  laptops  latency  legacy  leveldb  lifespan  limits  linden  linkedin  links  linode  linux  live  load  load-balancers  load-balancing  locking  logging  loose-coupling  lsb  lsof  luks  lxc  mac  machine-learning  macosx  maintainance  map-reduce  measurements  memory  mesos  metrics  mfa  microsoft  migrations  mirroring  mit  mongodb  monitorama  monitoring  movies  mtbf  mysql  nagios  nannies  nas  natwest  nerve  netflix  netstat  network  network-monitoring  network-partitions  networking  networks  nginx  node.js  nosql  notification  notifications  ntp  ntpd  obama  omniti  oom  open-source  openjdk  operations  ops  optimization  organisations  os  osx  ouch  out-of-band  outage  outages  outsourcing  pager-duty  pagerduty  paging  papers  partition  partitions  passenger  paxos  pbailis  pdf  percona  performance  phusion  pie  pillar  pinterest  piops  pixar  plumbr.eu  post-mortems  postgres  postmortems  presentations  prioritisation  procedures  processes  procfs  production  profiling  programming  provisioning  pty  puppet  python  queueing  rabbitmq  raid  rails  randomization  ranking  rate-limiting  rbs  rdbms  rds  recovery  reddit  redis  redshift  refactoring  reference  reliability  remediation  replicas  replication  resiliency  restoring  riak  risks  rm-rf  root-cause  route53  routing  rspec  ruby  runbooks  runit  s3  s3funnel  s3ql  safety  sanity-checks  scala  scalability  scaling  scheduler  schema  scripts  sdd  seagate  search  security  sensu  serf  serialization  server  servers  serverspec  service-discovery  service-metrics  services  ses  sharding  shorn-writes  silos  sleep  slew  slides  smartstack  smtp  snappy  snapshots  sns  soa  software  soundcloud  space  split-brain  sql  sre  ssd  ssh  stack  statistics  stats  statsd  statsite  stepping  storage  strace  streaming  strider  supervision  supervisord  support  survey  svctm  syadmin  synapse  sysadmin  sysdig  system-testing  systemd  systems  tahoe-lafs  tcp  tcpdump  tdd  teams  tech-debt  testing  threadpools  throughput  thundering-herd  time  time-machine  time-synchronization  tips  tools  trace  tracer-requests  tracing  trading  training  troubleshooting  tuning  turing-complete  twilio  twitter  two-factor-authentication  ubuntu  ui  ulster-bank  ultradns  unicorn  unit-testing  unit-tests  unix  upstart  usenix  vagrant  via:aphyr  via:bill-dehora  via:chughes  via:codeslinger  via:dave-doran  via:dehora  via:fanf  via:filippo  via:jk  via:martharotter  via:nelson  via:pdolan  via:pixelbeat  virtualisation  virtualization  vm  vms  voldemort  vpc  web  web-services  weighting  wiki  work  yammer  zipkin  zookeeper  zooko 

Copy this bookmark:



description:


tags: