jm + ops   171

Is Docker ready for production? Feedbacks of a 2 weeks hands on
I have to agree with this assessment -- there are a lot of loose ends still for production use of Docker in a SOA stack environment:
From my point of view, Docker is probably the best thing I’ve seen in ages to automate a build. It allows to pre build and reuse shared dependencies, ensuring they’re up to date and reducing your build time. It avoids you to either pollute your Jenkins environment or boot a costly and slow Virtualbox virtual machine using Vagrant. But I don’t feel like it’s production ready in a complex environment, because it adds too much complexity. And I’m not even sure that’s what it was designed for.
docker  complexity  devops  ops  production  deployment  soa  web-services  provisioning  networking  logging 
18 hours ago by jm
Linus Torvalds and others on Linux's systemd
ZDNet's Steven J. Vaughan-Nichols on the systemd mess (via Kragen)
via:kragen  systemd  linux  ubuntu  gnome  init  ops 
8 days ago by jm
cAPTain
a simple, lightweight HTTP server for storing and distributing custom Debian packages around your organisation. It is designed to make it as easy as possible to use Debian packages for code deployments and to ease other system administration tasks.
debian  apt  sysadmin  linux  ops  packaging 
8 days ago by jm
Netflix release new code to production before completing tests
Interesting -- I hadn't heard of this being an official practice anywhere before (although we actually did it ourselves this week)...
If a build has made it [past the 'integration test' phase], it is ready to be deployed to one or more internal environments for user-acceptance testing. Users could be UI developers implementing a new feature using the API, UI Testers performing end-to-end testing or automated UI regression tests. As far as possible, we strive to not have user-acceptance tests be a gating factor for our deployments. We do this by wrapping functionality in Feature Flags so that it is turned off in Production while testing is happening in other environments. 
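
A minimal sketch of the feature-flag gating described in the quote above. The flag table, flag name and function names are hypothetical illustrations, not Netflix's actual API; a real system would read flags from a configuration service.

```python
# Hypothetical per-environment flag table (an assumption for illustration).
FLAGS = {
    "new_recommendations": {"test": True, "production": False},
}


def is_enabled(flag: str, environment: str) -> bool:
    """Return True only if the flag is explicitly switched on for this environment."""
    return FLAGS.get(flag, {}).get(environment, False)


def render_homepage(environment: str) -> str:
    # The new code path is deployed everywhere, but stays dark in production
    # until the flag is flipped, so UAT does not have to gate the deployment.
    if is_enabled("new_recommendations", environment):
        return "homepage with new recommendations"
    return "homepage"
```

Unknown flags default to off, which is the safe choice when a deploy lands before its flag has been configured.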
devops  deployment  feature-flags  release  testing  integration-tests  uat  qa  production  ops  gating  netflix 
11 days ago by jm
"Linux Containers And The Future Cloud" [slides]
by Rami Rosen -- extremely detailed presentation into the state of Linux containers, LXC, Docker, namespaces, cgroups, and checkpoint/restore in userspace (via lusis)
lsx  docker  criu  namespaces  cgroups  linux  via:lusis  ops  containers  rami-rosen  presentations 
14 days ago by jm
Mike Perham on Twitter: "Sweet, monit just sent a DMCA takedown notice to @github to remove Inspeqtor."
'The work, Inspeqtor which is hosted at GitHub, is far from a “clean-room” implementation. This is basically a rewrite of Monit in Go, even using the same configuration language that is used in Monit, verbatim.

a. [private] himself admits that Inspeqtor is “heavily influenced” by Monit https://github.com/mperham/inspeqtor/wiki/Other-Solutions.

b. This tweet by [private] demonstrate intent. https://twitter.com/mperham/status/452160352940064768 “OSS nerds: redesign and build monit in Go. Sell it commercially. Make $$$$. I will be your first customer.”'

IANAL, but using the same config language does not demonstrate copyright infringement...
copyright  dmca  tildeslash  monit  inspeqtor  github  ops  oss  agpl 
15 days ago by jm
Zonify
'a set of command line tools for managing Route53 DNS for an AWS infrastructure. It intelligently uses tags and other metadata to automatically create the associated DNS records.'
zonify  aws  dns  ec2  route53  ops 
15 days ago by jm
Mandos
'a system for allowing servers with encrypted root file systems to reboot unattended and/or remotely.' (via Tony Finch)
via:fanf  mandos  encryption  security  server  ops  sysadmin  linux 
15 days ago by jm
The End of Linux
'Linux is becoming the thing that we adopted Linux to get away from.'

Great post on the horrible complexity of systemd. It reminds me of nothing more than mid-90s AIX, which I had the displeasure of opsing for a while -- the Linux distros have taken a very wrong turn here.
linux  unix  complexity  compatibility  ops  rant  systemd  bloat  aix 
24 days ago by jm
Inviso: Visualizing Hadoop Performance
With the increasing size and complexity of Hadoop deployments, being able to locate and understand performance is key to running an efficient platform.  Inviso provides a convenient view of the inner workings of jobs and platform.  By simply overlaying a new view on existing infrastructure, Inviso can operate inside any Hadoop environment with a small footprint and provide easy access and insight.  


This sounds pretty useful.
inviso  netflix  hadoop  emr  performance  ops  tools 
24 days ago by jm
Avoiding Chef-Suck with Auto Scaling Groups - forty9ten
Some common problems which arise using Chef with ASGs in EC2, and how these guys avoided them -- they stopped using Chef for service provisioning, and instead baked AMIs when a new version was released. ASGs using pre-baked AMIs definitely work well, so this makes good sense IMO.
infrastructure  chef  ops  asg  auto-scaling  ec2  provisioning  deployment 
28 days ago by jm
pcstat
get page cache statistics for files.
A common question when tuning databases and other IO-intensive applications is, "is Linux caching my data or not?" pcstat gets that information for you using the mincore(2) syscall. I wrote this so that Apache Cassandra users can see if ssTables are being cached.
linux  page-cache  caching  go  performance  cassandra  ops  mincore  fincore 
4 weeks ago by jm
Troubleshooting Production JVMs with jcmd
remotely trigger GCs, finalization, heap dumps etc. Handy
jvm  jcmd  debugging  ops  java  gc  heap  troubleshooting 
4 weeks ago by jm
The State of ZFS on Linux
Linux users familiar with other filesystems or ZFS users from other platforms will often ask whether ZFS on Linux (ZoL) is “stable”. The short answer is yes, depending on your definition of stable. The term stable itself is somewhat ambiguous.


Oh dear, that's not a good start. Good reference page, though.
zfs  linux  filesystems  ops  solaris 
5 weeks ago by jm
Mail-in-a-Box
'turns a fresh cloud computer into a working mail server. You get contact synchronization, spam filtering, and so on. On your phone, you can use apps like K-9 Mail and CardDAV-Sync free beta to sync your email and contacts between your phone and your box.'

(via Tony Finch)
via:fanf  mail  diy  hosting  webmail  ops 
6 weeks ago by jm
Using spot instances
Excellent post on all of the ins and outs of EC2 spot instance usage
ec2  aws  spot-instances  pricing  cloud  auto-scaling  ops 
6 weeks ago by jm
Nix: The Purely Functional Package Manager
'a powerful package manager for Linux and other Unix systems that makes package management reliable and reproducible. It provides atomic upgrades and rollbacks, side-by-side installation of multiple versions of a package, multi-user package management and easy setup of build environments. '

Basically, this is a third-party open source reimplementation of Amazon's (excellent) internal packaging system, using symlinks to versioned package directories to ensure atomicity and the ability to roll back. This is definitely the *right* way to build packages -- I know what tool I'll be pushing for, next time this question comes up.
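
A rough sketch of the symlink-swap idea mentioned above, assuming a POSIX filesystem. This is not Nix's (or Amazon's) implementation; the `activate` helper and the directory names in the usage note are hypothetical.

```python
import os


def activate(version_dir: str, current_link: str) -> None:
    """Atomically repoint the symlink `current_link` at `version_dir`."""
    # Create the new symlink under a temporary name next to the real one...
    tmp_link = f"{current_link}.tmp-{os.getpid()}"
    os.symlink(version_dir, tmp_link)
    # ...then rename it over the old link. rename(2) replaces the link itself
    # in one step, so readers see either the old version or the new one,
    # never a missing or half-written link.
    os.replace(tmp_link, current_link)
```

Upgrading is `activate("/pkgs/foo-1.1", "/pkgs/foo-current")`; rolling back is the same call with the previous version's directory, which still exists because versions are installed side by side.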

See also nixos.org for a Linux distro built on Nix.
ops  linux  devops  unix  packaging  distros  nix  nixos  atomic  upgrades  rollback  versioning 
7 weeks ago by jm
Applying cardiac alarm management techniques to your on-call
An ops-focused take on a recent story about alarm fatigue, and how a Boston hospital dealt with it. When I was in Amazon, many of the teams in our division had a target to reduce false positive pages, with a definite monetary value attached to it, since many teams had "time off in lieu" payments for out-of-hours pages to the on-call staff. As a result, reducing false-positive pages was reasonably high priority and we dealt with this problem very proactively, with a well-developed sense of how to do so. It's interesting to see how the outside world is only just starting to look into its amelioration. (Another benefit of a TOIL policy ;)
ops  monitoring  sysadmin  alerts  alarms  nagios  alarm-fatigue  false-positives  pages 
7 weeks ago by jm
On-Demand Jenkins Slaves With Amazon EC2
This is very likely where we'll be going for our acceptance tests in Swrve
testing  jenkins  ec2  spot-instances  scalability  auto-scaling  ops  build 
7 weeks ago by jm
Apache Kafka 0.8 basic training
This is a pretty voluminous and authoritative presentation about getting started with Kafka; wish this was around when we started using it for 0.7. (We use our own homegrown realtime system nowadays, due to better partitioning, monitoring and operability.)
storm  kafka  presentations  documentation  ops 
7 weeks ago by jm
Logentries Announces Machine Learning Analytics for IT Ops Monitoring and Real-time Alerting
This sounds pretty neat:
With Logentries Anomaly Detection, users can:

Set-up real-time alerting based on deviations from important patterns and log events.
Easily customize Anomaly thresholds and compare different time periods.

With Logentries Inactivity Alerting, users can:

Monitor standard, incoming events such as an application heart beat.
Receive real-time alerts based on log inactivity (i.e. receive alerts when something does not occur).
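
The inactivity alerting listed above (alert when something does *not* occur) can be sketched as a last-seen timestamp check. This is only an illustration of the idea; the class and parameter names are hypothetical and have nothing to do with the Logentries API.

```python
import time
from typing import Optional


class InactivityAlert:
    """Reports silence when no event has been recorded within max_silence seconds."""

    def __init__(self, max_silence: float, now: Optional[float] = None) -> None:
        self.max_silence = max_silence
        self.last_seen = time.time() if now is None else now

    def record_event(self, now: Optional[float] = None) -> None:
        # Call this whenever a heartbeat line shows up in the log stream.
        self.last_seen = time.time() if now is None else now

    def is_silent(self, now: Optional[float] = None) -> bool:
        # True once the gap since the last heartbeat exceeds the threshold.
        current = time.time() if now is None else now
        return current - self.last_seen > self.max_silence
```

The optional `now` argument exists only so the check can be exercised without waiting in real time.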
logging  syslog  logentries  anomaly-detection  ops  machine-learning  inactivity  alarms  alerting  heartbeats 
8 weeks ago by jm
Box Tech Blog » A Tale of Postmortems
How Box introduced COE-style dev/ops outage postmortems, and got them working. This PIE metric sounds really useful to head off the dreaded "it'll all have to come out missus" action item:
The picture was getting clearer, and we decided to look into individual postmortems and action items and see what was missing. As it was, action items were wasting away with no owners. Digging deeper, we noticed that many action items entailed massive refactorings or vague requirements like “make system X better” (i.e. tasks that realistically were unlikely to be addressed). At a higher level, postmortem discussions often devolved into theoretical debates without a clear outcome. We needed a way to lower and focus the postmortem bar and a better way to categorize our action items and our technical debt.

Out of this need, PIE (“Probability of recurrence * Impact of recurrence * Ease of addressing”) was born. By ranking each factor from 1 (“low”) to 5 (“high”), PIE provided us with two critical improvements:

1. A way to police our postmortems discussions. I.e. a low probability, low impact, hard to implement solution was unlikely to get prioritized and was better suited to a discussion outside the context of the postmortem. Using this ranking helped deflect almost all theoretical discussions.
2. A straightforward way to prioritize our action items.

What’s better is that once we embraced PIE, we also applied it to existing tech debt work. This was critical because we could now prioritize postmortem action items alongside existing work. Postmortem action items became part of normal operations just like any other high-priority work.
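
The PIE formula quoted above is simple enough to sketch directly: multiply the three 1-to-5 rankings and sort action items by the product. The `ActionItem` class and `prioritise` helper are hypothetical names for illustration, not Box's tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActionItem:
    title: str
    probability: int  # probability of recurrence, 1 (low) to 5 (high)
    impact: int       # impact of recurrence, 1 (low) to 5 (high)
    ease: int         # ease of addressing, 1 (low) to 5 (high)

    def pie(self) -> int:
        """Probability * Impact * Ease, as defined in the quoted post."""
        for factor in (self.probability, self.impact, self.ease):
            if not 1 <= factor <= 5:
                raise ValueError("each PIE factor must be between 1 and 5")
        return self.probability * self.impact * self.ease


def prioritise(items: List[ActionItem]) -> List[ActionItem]:
    """Return the items with the highest PIE score first."""
    return sorted(items, key=lambda item: item.pie(), reverse=True)
```

A vague "make system X better" item ranked (2, 2, 1) scores 4, while a likely, painful, easy fix ranked (4, 4, 5) scores 80, so the sort pushes the sweeping refactoring to the bottom.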
postmortems  action-items  outages  ops  devops  pie  metrics  ranking  refactoring  prioritisation  tech-debt 
9 weeks ago by jm
The Network is Reliable - ACM Queue
Peter Bailis and Kyle Kingsbury accumulate a comprehensive, informal survey of real-world network failures observed in production. I remember that April 2011 EBS outage...
ec2  aws  networking  outages  partitions  jepsen  pbailis  aphyr  acm-queue  acm  survey  ops 
12 weeks ago by jm
iosnoop For Linux
it's a shell script! ftrace-based tool to snoop on Linux disk I/O and trace system-wide activity, more-or-less attributing it to the correct process
linux  disk  io  tracing  trace  ops  ftrace 
july 2014 by jm
Boundary's new server monitoring free offering
'High resolution, 1 second intervals for all metrics; Fluid analytics, drag any graph to any point in time; Smart alarms to cut down on false positives; Embedded graphs and customizable dashboards; Up to 10 servers for free'

Pre-registration is open now. Could be interesting, although the limit of 10 machines is pretty small for any production usage
boundary  monitoring  network  ops  metrics  alarms  tcp  ip  netstat 
july 2014 by jm
Xfennec/cv
'This tool can be described as a Tiny Dirty Linux Only C command that looks for coreutils basic commands (cp, mv, dd, tar, gzip/gunzip, cat, ...) currently running on your system and displays the percentage of copied data. It can now also display an estimated throughput (using -w flag).'
coreutils  via:pixelbeat  linux  ops  hacks  procfs  dataviz  unix 
july 2014 by jm
Latest EBS tuning tips
from yesterday's AWS Summit in NYC:

Cheat sheet of EBS-optimized instances. http://t.co/vmTlhUtpWk
Optimize your queue depth to achieve lower latency & highest IOPS. http://t.co/EO48oa0D6X
When configuring your RAID, use a stripe size of 128KB or 256KB. http://t.co/N0ldtFJ4t6
Use larger block size to speed up the pre-warming process. http://t.co/8UoIeWE2px
ebs  aws  amazon  iops  raid  ops  tuning 
july 2014 by jm
Two traps in iostat: %util and svctm
Marc Brooker:
As a measure of general IO busyness %util is fairly handy, but as an indication of how much the system is doing compared to what it can do, it's terrible. Iostat's svctm has even fewer redeeming strengths. It's just extremely misleading for most modern storage systems and workloads. Both of these fields are likely to mislead more than inform on modern SSD-based storage systems, and their use should be treated with extreme care.
ioutil  iostat  svctm  ops  ssd  disks  hardware  metrics  stats  linux 
july 2014 by jm
Tessera
Urban Airship with a new open-source Graphite front-end UI; similar enough to Grafana at a glance, no releases yet, ASL2-licensed
graphite  metrics  ui  front-ends  open-source  ops 
july 2014 by jm
Delivery Notifications for Simple Email Service
Today we are enhancing SES with the addition of delivery notifications. You can now elect to receive an Amazon SNS notification each time SES successfully delivers a message to a recipient's email server. These notifications give you increased visibility into the mail delivery process. With today's release, you can now track deliveries, bounces, and complaints, all via notification to the SNS topic or topics of your choice.
delivery  email  smtp  ses  aws  sns  notifications  ops 
june 2014 by jm
Amazon EC2 Service Limits Report Now Available
'designed to make it easier for you to view and manage your limits for Amazon EC2 by providing the latest information on service limits and links to quickly request limit increases. EC2 Service Limits Report displays all your service limit information in one place to help you avoid encountering limits on future EC2, EBS, Auto Scaling, and VPC usage.'
aws  ec2  vpc  ebs  autoscaling  limits  ops 
june 2014 by jm
Code Spaces data and backups deleted by hackers
Rather scary story of an extortionist wiping out a company's AWS-based infrastructure. Turns out S3 supports MFA-required deletion as a feature, though, which would help against that.
ops  security  extortion  aws  ec2  s3  code-spaces  delete  mfa  two-factor-authentication  authentication  infrastructure 
june 2014 by jm
Call me maybe: Elasticsearch
Wow, these are terrible results. From the sounds of it, ES just cannot deal with realistic outage scenarios and is liable to suffer catastrophic damage in reasonably-common partitions.
If you are an Elasticsearch user (as I am): good luck. Some people actually advocate using Elasticsearch as a primary data store; I think this is somewhat less than advisable at present. If you can, store your data in a safer database, and feed it into Elasticsearch gradually. Have processes in place that continually traverse the system of record, so you can recover from ES data loss automatically.
elasticsearch  ops  storage  databases  jepsen  partition  network  outages  reliability 
june 2014 by jm
Call me maybe: RabbitMQ
We used Knossos and Jepsen to prove the obvious: RabbitMQ is not a lock service. That investigation led to a discovery hinted at by the documentation: in the presence of partitions, RabbitMQ clustering will not only deliver duplicate messages, but will also drop huge volumes of acknowledged messages on the floor. This is not a new result, but it may be surprising if you haven’t read the docs closely–especially if you interpreted the phrase “chooses Consistency and Partition Tolerance” to mean, well, either of those things.
rabbitmq  network  partitions  failure  cap-theorem  consistency  ops  reliability  distcomp  jepsen 
june 2014 by jm
Pillar
Manages migrations for your Cassandra data stores. Pillar grew from a desire to automatically manage Cassandra schema as code. Managing schema as code enables automated build and deployment, a foundational practice for an organization striving to achieve Continuous Delivery.

Pillar is to Cassandra what Rails ActiveRecord migrations or Play Evolutions are to relational databases with one key difference: Pillar is completely independent from any application development framework.
migrations  database  ops  pillar  cassandra  activerecord  scala  continuous-delivery  automation  build 
june 2014 by jm
Plumbr.eu's reference page for java.lang.OutOfMemoryError
With examples of each possible cause of a Java OOM, and suggested workarounds. Succinct.
reference  plumbr.eu  oom  java  memory  heap  ops 
june 2014 by jm
Database Migrations Done Right
The rule is simple. You should never tie database migrations to application deploys or vice versa. By minimising dependencies you enable faster, easier and cleaner deployments.


A solid description of why this is a good idea, from an ex-Guardian dev.
migrations  database  sql  mysql  postgres  deployment  ops  dependencies  loose-coupling 
may 2014 by jm
interview with Google VP of SRE Ben Treynor
interviewed by Niall Murphy, no less ;). Some good info on what Google deems important from an ops/SRE perspective
sre  ops  devops  google  monitoring  interviews  ben-treynor 
may 2014 by jm
Docker Plugin for Jenkins
The aim of the docker plugin is to be able to use a docker host to dynamically provision a slave, run a single build, then tear-down that slave. Optionally, the container can be committed, so that (for example) manual QA could be performed by the container being imported into a local docker provider, and run from there.


The holy grail of Jenkins/Docker integration. How cool is that...
jenkins  docker  ops  testing  ec2  hosting  scaling  elastic-scaling  system-testing 
may 2014 by jm
SmartStack vs. Consul
One of the SmartStack developers at AirBNB responds to Consul.io's comments. FWIW, we use SmartStack in Swrve and it works pretty well...
smartstack  airbnb  ops  consul  serf  load-balancing  availability  resiliency  network-partitions  outages 
may 2014 by jm
Go: Best Practices for Production Environments
how Soundcloud deploy their Go services, after 2.5 years of Go in production
go  tips  deployment  best-practices  soundcloud  ops 
april 2014 by jm
AWS Elastic Beanstalk for Docker
This is pretty amazing. Nice work, Beanstalk team. Not sure how well it integrates with the rest of AWS, though.
aws  amazon  docker  ec2  beanstalk  ops  containers  linux 
april 2014 by jm
fcron
Fcron is a scheduler. It aims at replacing Vixie Cron, so it implements most of its functionalities. But contrary to Vixie Cron, fcron does not need your system to be up 7 days a week, 24 hours a day : it also works well with systems which are running only occasionally (contrary to anacrontab). In other words, fcron does both the job of Vixie Cron and anacron, but does even more and better :)) ...


Thanks Craig!
via:chughes  cron  fcron  unix  linux  ops  scheduler  automation  scripts 
april 2014 by jm
Linode announces new instance specs
'TL;DR: SSDs + Insane network + Faster processors + Double the RAM + Hourly Billing'
hosting  linode  ssd  performance  linux  ops  datacenters 
april 2014 by jm
s3funnel
'a command line tool for Amazon's Simple Storage Service (S3). Written in Python, easy_install the package to install as an egg. Supports multithreaded operations for large volumes. Put, get, or delete many items concurrently, using a fixed-size pool of threads. Built on workerpool for multithreading and boto for access to the Amazon S3 API. Unix-friendly input and output. Pipe things in, out, and all around.'

MIT-licensed open source. (via Paul Dolan)
via:pdolan  s3  s3funnel  tools  ops  aws  python  mit  open-source 
april 2014 by jm
"H" in cron syntax
This is something Jenkins have come up with to randomize and distribute load, in order to avoid the "thundering-herd" bug. Good call
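
A sketch of the general idea behind "H": derive a stable pseudo-random slot from the job name, so jobs spread across the hour instead of all firing at minute 0. The hash function and helper names here are assumptions for illustration; Jenkins's actual hashing is not shown in the linked post.

```python
import hashlib


def hashed_slot(job_name: str, slots: int = 60) -> int:
    """Map a job name to a stable slot in the range [0, slots)."""
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % slots


def hourly_cron_line(job_name: str) -> str:
    """Build an hourly cron schedule whose minute field depends on the job name."""
    return f"{hashed_slot(job_name)} * * * *"
```

Because the slot depends only on the name, each job keeps the same minute from run to run, while different jobs tend to land on different minutes.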
jenkins  randomization  load-balancing  load  thundering-herd  ops  capacity  sleep 
april 2014 by jm
Dead Man's Snitch
a cron job monitoring tool that keeps an eye on your periodic processes and notifies you when something doesn't happen. Daily backups, monthly emails, or cron jobs you need to monitor? Dead Man's Snitch has you covered. Know immediately when one of these processes doesn't work.


via Marc.
alerts  cron  monitoring  sysadmin  ops  backups  alarms 
april 2014 by jm
sysdig
open source, system-level exploration: capture system state and activity from a running Linux instance, then save, filter and analyze.
Think of it as strace + tcpdump + lsof + awesome sauce.
With a little Lua cherry on top.


This sounds excellent. Linux-based, GPLv2.
debugging  tools  linux  ops  tracing  strace  open-source  sysdig  cli  tcpdump  lsof 
april 2014 by jm
Adrian Cockcroft's Cloud Outage Reports Collection
The detailed summaries of outages from cloud vendors are comprehensive and the response to each highlights many lessons in how to build robust distributed systems. For outages that significantly affected Netflix, the Netflix techblog report gives insight into how to effectively build reliable services on top of AWS. [....] I plan to collect reports here over time, and welcome links to other write-ups of outages and how to survive them.
outages  post-mortems  documentation  ops  aws  ec2  amazon  google  dropbox  microsoft  azure  incident-response 
march 2014 by jm
htcat/htcat
a utility to perform parallel, pipelined execution of a single HTTP GET. htcat is intended for the purpose of incantations like: htcat https://host.net/file.tar.gz | tar -zx

It is tuned (and only really useful) for faster interconnects: [....] 109MB/s on a gigabit network, between an AWS EC2 instance and S3. This represents 91% use of the theoretical maximum of gigabit (119.2 MiB/s).
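
The arithmetic behind those figures, as a quick check. One assumption: the quoted 109 is treated as MiB/s, since that is the reading that reproduces the stated 91%.

```python
GIGABIT_BITS_PER_SECOND = 1_000_000_000
MIB = 1024 * 1024

# 1 Gbit/s expressed in MiB/s: divide by 8 bits per byte, then by 2**20.
theoretical_mib_per_second = GIGABIT_BITS_PER_SECOND / 8 / MIB

# Fraction of the theoretical maximum achieved at the quoted 109 (MiB/s assumed).
utilisation = 109 / theoretical_mib_per_second

print(f"{theoretical_mib_per_second:.1f} MiB/s theoretical, {utilisation:.0%} used")
```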
go  cli  http  file-transfer  ops  tools 
march 2014 by jm
S3QL
a file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack. S3QL effectively provides a hard disk of dynamic, infinite capacity that can be accessed from any computer with internet access running Linux, FreeBSD or OS-X.

S3QL is a standard conforming, full featured UNIX file system that is conceptually indistinguishable from any local file system. Furthermore, S3QL has additional features like compression, encryption, data de-duplication, immutable trees and snapshotting which make it especially suitable for online backup and archival.
s3  s3ql  backup  aws  filesystems  linux  freebsd  osx  ops 
march 2014 by jm
ZooKeeper Resilience at Pinterest
essentially decoupling the client services from ZK using a local daemon on each client host; very similar to Airbnb's Smartstack. This is a bit of an indictment of ZK's usability though
ops  architecture  clustering  network  partitions  cap  reliability  smartstack  airbnb  pinterest  zookeeper 
march 2014 by jm
Migrating from MongoDB to Cassandra
Interesting side-effect of using LUKS for full-disk encryption: 'For every disk read, we were pulling in 3MB of data (RA is sectors, SSZ is sector size, 6144*512=3145728 bytes) into cache. Oops. Not only were we doing tons of extra work, but we were trashing our page cache too. The default for the device-mapper used by LUKS under Ubuntu 12.04LTS is incredibly sub-optimal for database usage, especially our usage of Cassandra (more small random reads vs. large rows). We turned this down to 128 sectors — 64KB.'
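
The readahead sums from the quote, spelled out. The helper name is made up for illustration; the sector counts and the 512-byte sector size come from the quoted text.

```python
def readahead_bytes(sectors: int, sector_size: int = 512) -> int:
    """Bytes pulled into the page cache per read for a given readahead setting."""
    return sectors * sector_size


default_ra = readahead_bytes(6144)  # the device-mapper default described in the quote
tuned_ra = readahead_bytes(128)     # the value they turned it down to

print(default_ra, default_ra // (1024 * 1024), "MiB")
print(tuned_ra, tuned_ra // 1024, "KiB")
```

That is a 48-fold cut in the data read per request, which matters for a workload dominated by small random reads.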
cassandra  luks  raid  linux  tuning  ops  blockdev  disks  sdd 
february 2014 by jm
Yammer Engineering - Resiliency at Yammer
Not content with adding Hystrix (circuit breakers, threadpooling, request time limiting, metrics, etc.) to their entire SOA stack, they've made it incredibly configurable by hooking in a web-based configuration UI, allowing dynamic on-the-fly reconfiguration by their ops guys of the circuit breakers and threadpools in production. Mad stuff
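
For anyone unfamiliar with the pattern, here is a toy circuit breaker. It is a sketch of the general technique only, not Hystrix's API (Hystrix is a Java library); all names are hypothetical. The threshold and timeout are plain attributes, so they could be changed on a live object, which is the kind of knob a reconfiguration UI would turn.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```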
hystrix  circuit-breakers  resiliency  yammer  ops  threadpools  soa  dynamic-configuration  archaius  netflix 
january 2014 by jm
10 Things We Forgot to Monitor
a list of not-so-common outage causes which are easy to overlook; swap rate, NTP drift, SSL expiration, fork rate, etc.
nagios  metrics  ops  monitoring  systems  ntp  bitly 
january 2014 by jm
Hero Culture
Good description of the "hero coder" organisational antipattern.
Now imagine that most of the team is involved in fire-fighting. New recruits see the older recruits getting praised for their brave work in the line-of-fire and they want that kind of praise and reward too. Before long everyone is focused on putting out fires and it is no ones interest to step back and take on the risks that long-term DevOps-focused goals entail.
coding  ops  admin  hero-coder  hero-culture  firefighting  organisations  teams  culture 
january 2014 by jm
Cassandra: tuning the JVM for read heavy workloads
The cluster we tuned is hosted on AWS and is comprised of 6 hi1.4xlarge EC2 instances, with 2 1TB SSDs raided together in a raid 0 configuration. The cluster’s dataset is growing steadily. At the time of this writing, our dataset is 341GB, up from less than 200GB a few months ago, and is growing by 2-3GB per day. The workload on this cluster is very read heavy, with quorum reads making up 99% of all operations.


Some careful GC tuning here. Probably not applicable to anyone else, but good approach in general.
java  performance  jvm  scaling  gc  tuning  cassandra  ops 
january 2014 by jm
Backblaze Blog » What Hard Drive Should I Buy?
Because Backblaze has a history of openness, many readers expected more details in my previous posts. They asked what drive models work best and which last the longest. Given our experience with over 25,000 drives, they asked which ones are good enough that we would buy them again. In this post, I’ll answer those questions.
backblaze  backup  hardware  hdds  storage  disks  ops  via:fanf 
january 2014 by jm
deploy_to_runit
A nice node.js app to perform continuous deployment from a GitHub repo via its webhook support, from Matt Sergeant
github  node.js  runit  deployment  git  continuous-deployment  devops  ops 
january 2014 by jm
Dr. Bunsen / Time Warp
I use it to modify Time Machine’s backup behavior using weighted reservoir sampling. I built Time Warp to preserve important backup snapshots and prevent Time Machine from deleting them.
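
A sketch of weighted reservoir sampling, using the Efraimidis-Spirakis "A-Res" scheme: give each item the key u^(1/w) for a uniform random u, and keep the k largest keys. Whether Time Warp uses this exact variant is an assumption; the function below is illustrative and is not taken from that project.

```python
import heapq
import random


def weighted_reservoir_sample(items, k, rng=None):
    """Pick up to k items from an iterable of (item, weight) pairs in one pass.

    Items with larger weights are more likely to be kept.
    """
    if k <= 0:
        return []
    rng = rng or random.Random()
    heap = []  # min-heap of (key, index, item); the smallest key is evicted first
    for index, (item, weight) in enumerate(items):
        if weight <= 0:
            continue  # zero-weight items can never be selected
        key = rng.random() ** (1.0 / weight)
        entry = (key, index, item)  # index breaks ties without comparing items
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]
```

For backups, the weight could decay with snapshot age, so recent snapshots are usually kept while a thinning set of old ones survives.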


via Aman. Nifty!
backup  python  time-machine  decay  exponential-decay  weighting  algorithms  snapshots  ops 
january 2014 by jm
"Understanding the Robustness of SSDs under Power Fault", FAST '13 [paper]
Horrific. SSDs (including "enterprise-class storage") storing sync'd writes in volatile RAM while claiming they were synced; one device losing 72.6GB, 30% of its data, after 8 injected power faults; and all SSDs tested displayed serious errors including random bit errors, metadata corruption, serialization errors and shorn writes. Don't trust lone unreplicated, unbacked-up SSDs!
pdf  papers  ssd  storage  reliability  safety  hardware  ops  usenix  serialization  shorn-writes  bit-errors  corruption  fsync 
january 2014 by jm
The How and Why of Flapjack
Flapjack aims to be a flexible notification system that handles:

Alert routing (determining who should receive alerts based on interest, time of day, scheduled maintenance, etc);
Alert summarisation (with per-user, per media summary thresholds);
Your standard operational tasks (setting scheduled maintenance, acknowledgements, etc).

Flapjack sits downstream of your check execution engine (like Nagios, Sensu, Icinga, or cron), processing events to determine if a problem has been detected, who should know about the problem, and how they should be told.
flapjack  notification  alerts  ops  nagios  paging  sensu 
january 2014 by jm
BitCoin exchange CoinBase uses MongoDB as their 'primary datastore'
'Coinbase uses MongoDB for their primary datastore for their web app, api requests, etc.'
coinbase  mongodb  reliability  hn  via:aphyr  ops  banking  bitcoin 
december 2013 by jm
Load Balancer Testing with a Honeypot Daemon
nice post on writing BDD unit tests for infrastructure, in this case specifically a load balancer (via Devops Weekly)
load-balancers  ops  devops  sysadmin  testing  unit-tests  networking  honeypot  infrastructure  bdd 
december 2013 by jm
Cyanite
a metric storage daemon, exposing both a carbon listener and a simple web service. Its aim is to become a simple, scalable and drop-in replacement for graphite's backend.


Pretty alpha for now, but definitely worth keeping an eye on to potentially replace our burgeoning Carbon fleet...
graphite  carbon  cassandra  storage  metrics  ops  graphs  service-metrics 
december 2013 by jm
Kelly "kellabyte" Sommers on Redis' "relaxed CP" approach to the CAP theorem

Similar to ACID properties, if you partially provide properties it means the user has to _still_ consider in their application that the property doesn't exist, because sometimes it doesn't. In your fsync example, if fsync is relaxed and there are no replicas, you cannot consider the database durable, just like you can't consider Redis a CP system. It can't be counted on for guarantees to be delivered. This is why I say these systems are hard for users to reason about. Systems that partially offer guarantees require in-depth knowledge of the nuances to properly use the tool. Systems that explicitly make the trade-offs in the designs are easier to reason about because it is more obvious and _predictable_.
kellabyte  redis  cp  ap  cap-theorem  consistency  outages  reliability  ops  database  storage  distcomp 
december 2013 by jm
Chef Testing at PagerDuty
Good article on how PagerDuty test their chef changes -- lint, unit tests using ChefSpec, integ tests and their "Failure Friday" game days
testing  chef  ops  devops  chefspec  game-days  pagerduty 
december 2013 by jm
Flock for Cron jobs
good blog post writing up the 'flock -n -c' trick to ensure single-concurrent-process locking for cron jobs
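
The same non-blocking exclusive lock can be taken from inside a script rather than via the flock(1) wrapper. A minimal sketch for Linux/Unix; the `try_lock` helper and the lock path in the usage note are hypothetical.

```python
import fcntl
import os
from typing import Optional


def try_lock(path: str) -> Optional[int]:
    """Try to take an exclusive lock on `path` without blocking.

    Returns the open file descriptor on success (keep it open for as long as
    the job runs), or None if another process already holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes this fail immediately instead of waiting, like `flock -n`.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd
```

A cron job would call `try_lock("/tmp/myjob.lock")` at startup and exit quietly on None. The kernel drops the lock when the process exits, so a crashed run cannot leave a stale lock behind.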
cron  concurrency  unix  linux  flock  locking  ops 
december 2013 by jm
Failure Friday: How We Ensure PagerDuty is Always Reliable
Basically, they run the kind of exercise which Jesse Robbins invented at Amazon -- "Game Days". Scarily, they do these on a Friday -- living dangerously!
game-days  testing  failure  devops  chaos-monkey  ops  exercises 
november 2013 by jm
Rasmus' home NAS design
I'm trying to avoid doing this in order to avoid more power consumption and unpopular hardware in the house -- but if necessary, this is a good up-to-date homebuild design
nas  hardware  home  storage  ops  disks 
november 2013 by jm
Backblaze Blog » How long do disk drives last?
According to Backblaze's data, 80% of drives last 4 years, and the median lifespan is projected to be 6 years
backblaze  storage  disk  ops  mtbf  hardware  failure  lifespan 
november 2013 by jm
