jm + ops   122

Dead Man's Snitch
a cron job monitoring tool that keeps an eye on your periodic processes and notifies you when something doesn't happen. Daily backups, monthly emails, or cron jobs you need to monitor? Dead Man's Snitch has you covered. Know immediately when one of these processes doesn't work.


via Marc.
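The integration pattern, as I understand it, is just chaining a ping onto the end of the cron job so the snitch is only hit on success -- something like this (the snitch token URL is a made-up example):

    # ping the snitch only if the nightly backup exits 0
    0 2 * * * /usr/local/bin/nightly-backup.sh && curl -s https://nosnch.in/c2354d53d2 >/dev/null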
alerts  cron  monitoring  sysadmin  ops  backups  alarms 
7 days ago by jm
sysdig
open source, system-level exploration: capture system state and activity from a running Linux instance, then save, filter and analyze.
Think of it as strace + tcpdump + lsof + awesome sauce.
With a little Lua cherry on top.


This sounds excellent. Linux-based, GPLv2.
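A couple of examples to give the flavour -- a filter expression, plus one of the bundled Lua chisels:

    # trace file opens by processes named httpd
    sysdig evt.type=open and proc.name=httpd
    # run the bundled "top processes by network I/O" chisel
    sysdig -c topprocs_net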
debugging  tools  linux  ops  tracing  strace  open-source  sysdig  cli  tcpdump  lsof 
11 days ago by jm
Adrian Cockcroft's Cloud Outage Reports Collection
The detailed summaries of outages from cloud vendors are comprehensive and the response to each highlights many lessons in how to build robust distributed systems. For outages that significantly affected Netflix, the Netflix techblog report gives insight into how to effectively build reliable services on top of AWS. [....] I plan to collect reports here over time, and welcome links to other write-ups of outages and how to survive them.
outages  post-mortems  documentation  ops  aws  ec2  amazon  google  dropbox  microsoft  azure  incident-response 
22 days ago by jm
htcat/htcat
a utility to perform parallel, pipelined execution of a single HTTP GET. htcat is intended for the purpose of incantations like: htcat https://host.net/file.tar.gz | tar -zx

It is tuned (and only really useful) for faster interconnects: [....] 109MB/s on a gigabit network, between an AWS EC2 instance and S3. This represents 91% use of the theoretical maximum of gigabit (119.2 MiB/s).
go  cli  http  file-transfer  ops  tools 
27 days ago by jm
S3QL
a file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack. S3QL effectively provides a hard disk of dynamic, infinite capacity that can be accessed from any computer with internet access running Linux, FreeBSD or OS-X.

S3QL is a standard conforming, full featured UNIX file system that is conceptually indistinguishable from any local file system. Furthermore, S3QL has additional features like compression, encryption, data de-duplication, immutable trees and snapshotting which make it especially suitable for online backup and archival.
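Assuming it behaves as the docs describe, usage is the familiar mkfs/mount/umount triple (bucket name is an example):

    # create, mount and unmount an S3QL filesystem backed by an S3 bucket
    mkfs.s3ql s3://my-backup-bucket
    mount.s3ql s3://my-backup-bucket /mnt/s3ql
    umount.s3ql /mnt/s3ql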
s3  s3ql  backup  aws  filesystems  linux  freebsd  osx  ops 
28 days ago by jm
ZooKeeper Resilience at Pinterest
essentially decoupling the client services from ZK using a local daemon on each client host; very similar to Airbnb's Smartstack. This is a bit of an indictment of ZK's usability, though.
ops  architecture  clustering  network  partitions  cap  reliability  smartstack  airbnb  pinterest  zookeeper 
6 weeks ago by jm
Migrating from MongoDB to Cassandra
Interesting side-effect of using LUKS for full-disk encryption: 'For every disk read, we were pulling in 3MB of data (RA is sectors, SSZ is sector size, 6144*512=3145728 bytes) into cache. Oops. Not only were we doing tons of extra work, but we were trashing our page cache too. The default for the device-mapper used by LUKS under Ubuntu 12.04LTS is incredibly sub-optimal for database usage, especially our usage of Cassandra (more small random reads vs. large rows). We turned this down to 128 sectors — 64KB.'
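The knob in question is the device-mapper node's read-ahead, which you can inspect and set with blockdev (device path is an example):

    # read-ahead in 512-byte sectors on the dm-crypt device
    blockdev --getra /dev/mapper/luks-data      # 6144 sectors = 3MB per read, the bad default
    blockdev --setra 128 /dev/mapper/luks-data  # 128 sectors = 64KB, as per the article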
cassandra  luks  raid  linux  tuning  ops  blockdev  disks  sdd 
8 weeks ago by jm
Yammer Engineering - Resiliency at Yammer
Not content with adding Hystrix (circuit breakers, threadpooling, request time limiting, metrics, etc.) to their entire SOA stack, they've made it incredibly configurable by hooking in a web-based configuration UI, allowing their ops guys to dynamically reconfigure the circuit breakers and threadpools on the fly in production. Mad stuff.
hystrix  circuit-breakers  resiliency  yammer  ops  threadpools  soa  dynamic-configuration  archaius  netflix 
10 weeks ago by jm
10 Things We Forgot to Monitor
a list of not-so-common outage causes which are easy to overlook; swap rate, NTP drift, SSL expiration, fork rate, etc.
nagios  metrics  ops  monitoring  systems  ntp  bitly 
10 weeks ago by jm
Hero Culture
Good description of the "hero coder" organisational antipattern.
Now imagine that most of the team is involved in fire-fighting. New recruits see the older recruits getting praised for their brave work in the line of fire and they want that kind of praise and reward too. Before long everyone is focused on putting out fires, and it is in no one's interest to step back and take on the risks that long-term DevOps-focused goals entail.
coding  ops  admin  hero-coder  hero-culture  firefighting  organisations  teams  culture 
11 weeks ago by jm
Cassandra: tuning the JVM for read heavy workloads
The cluster we tuned is hosted on AWS and is comprised of 6 hi1.4xlarge EC2 instances, with 2 1TB SSDs raided together in a raid 0 configuration. The cluster’s dataset is growing steadily. At the time of this writing, our dataset is 341GB, up from less than 200GB a few months ago, and is growing by 2-3GB per day. The workload on this cluster is very read heavy, with quorum reads making up 99% of all operations.


Some careful GC tuning here. Probably not directly applicable to anyone else's workload, but a good approach in general.
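For flavour, the general shape of CMS tuning for a read-heavy node looks like this (cassandra-env.sh style; sizes and thresholds are illustrative examples, not the article's values):

    # illustrative CMS settings for a read-heavy node; tune per workload
    JVM_OPTS="$JVM_OPTS -Xms8g -Xmx8g -Xmn1g"
    JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC -XX:+UseParNewGC"
    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly"
    JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=2"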
java  performance  jvm  scaling  gc  tuning  cassandra  ops 
11 weeks ago by jm
Backblaze Blog » What Hard Drive Should I Buy?
Because Backblaze has a history of openness, many readers expected more details in my previous posts. They asked what drive models work best and which last the longest. Given our experience with over 25,000 drives, they asked which ones are good enough that we would buy them again. In this post, I’ll answer those questions.
backblaze  backup  hardware  hdds  storage  disks  ops  via:fanf 
12 weeks ago by jm
deploy_to_runit
A nice node.js app to perform continuous deployment from a GitHub repo via its webhook support, from Matt Sergeant
github  node.js  runit  deployment  git  continuous-deployment  devops  ops 
12 weeks ago by jm
Dr. Bunsen / Time Warp
I use it to modify Time Machine’s backup behavior using weighted reservoir sampling. I built Time Warp to preserve important backup snapshots and prevent Time Machine from deleting them.


via Aman. Nifty!
backup  python  time-machine  decay  exponential-decay  weighting  algorithms  snapshots  ops 
12 weeks ago by jm
"Understanding the Robustness of SSDs under Power Fault", FAST '13 [paper]
Horrific. SSDs (including "enterprise-class storage") storing sync'd writes in volatile RAM while claiming they were synced; one device losing 72.6GB, 30% of its data, after 8 injected power faults; and all SSDs tested displayed serious errors including random bit errors, metadata corruption, serialization errors and shorn writes. Don't trust lone unreplicated, unbacked-up SSDs!
pdf  papers  ssd  storage  reliability  safety  hardware  ops  usenix  serialization  shorn-writes  bit-errors  corruption  fsync 
january 2014 by jm
The How and Why of Flapjack
Flapjack aims to be a flexible notification system that handles:

Alert routing (determining who should receive alerts based on interest, time of day, scheduled maintenance, etc);
Alert summarisation (with per-user, per media summary thresholds);
Your standard operational tasks (setting scheduled maintenance, acknowledgements, etc).

Flapjack sits downstream of your check execution engine (like Nagios, Sensu, Icinga, or cron), processing events to determine if a problem has been detected, who should know about the problem, and how they should be told.
flapjack  notification  alerts  ops  nagios  paging  sensu 
january 2014 by jm
BitCoin exchange CoinBase uses MongoDB as their 'primary datastore'
'Coinbase uses MongoDB for their primary datastore for their web app, api requests, etc.'
coinbase  mongodb  reliability  hn  via:aphyr  ops  banking  bitcoin 
december 2013 by jm
Load Balancer Testing with a Honeypot Daemon
nice post on writing BDD unit tests for infrastructure, in this case specifically a load balancer (via Devops Weekly)
load-balancers  ops  devops  sysadmin  testing  unit-tests  networking  honeypot  infrastructure  bdd 
december 2013 by jm
Cyanite
a metric storage daemon, exposing both a carbon listener and a simple web service. Its aim is to become a simple, scalable and drop-in replacement for graphite's backend.


Pretty alpha for now, but definitely worth keeping an eye on to potentially replace our burgeoning Carbon fleet...
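Since it exposes a carbon listener, anything speaking the carbon plaintext protocol should be able to feed it (hostname is an example; 2003 is carbon's usual plaintext port):

    # send a single datapoint: "metric-path value unix-timestamp"
    echo "servers.web1.load.shortterm 0.45 $(date +%s)" | nc -w1 cyanite-host 2003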
graphite  carbon  cassandra  storage  metrics  ops  graphs  service-metrics 
december 2013 by jm
Kelly "kellabyte" Sommers on Redis' "relaxed CP" approach to the CAP theorem

Similar to ACID properties, if you partially provide properties it means the user has to _still_ consider in their application that the property doesn't exist, because sometimes it doesn't. In your fsync example, if fsync is relaxed and there are no replicas, you cannot consider the database durable, just like you can't consider Redis a CP system. It can't be counted on for guarantees to be delivered. This is why I say these systems are hard for users to reason about. Systems that partially offer guarantees require in-depth knowledge of the nuances to properly use the tool. Systems that explicitly make the trade-offs in the designs are easier to reason about because it is more obvious and _predictable_.
kellabyte  redis  cp  ap  cap-theorem  consistency  outages  reliability  ops  database  storage  distcomp 
december 2013 by jm
Chef Testing at PagerDuty
Good article on how PagerDuty test their Chef changes -- lint, unit tests using ChefSpec, integration tests, and their "Failure Friday" game days.
testing  chef  ops  devops  chefspec  game-days  pagerduty 
december 2013 by jm
Flock for Cron jobs
good blog post writing up the 'flock -n -c' trick to ensure single-concurrent-process locking for cron jobs
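i.e. a crontab entry along these lines (paths are examples):

    # skip this run silently if a previous one still holds the lock
    */5 * * * * flock -n /var/lock/report.lock -c /usr/local/bin/generate-report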
cron  concurrency  unix  linux  flock  locking  ops 
december 2013 by jm
Failure Friday: How We Ensure PagerDuty is Always Reliable
Basically, they run the kind of exercise which Jesse Robbins invented at Amazon -- "Game Days". Scarily, they do these on a Friday -- living dangerously!
game-days  testing  failure  devops  chaos-monkey  ops  exercises 
november 2013 by jm
Rasmus' home NAS design
I'm trying to avoid doing this, so as not to add more power consumption and unpopular hardware to the house -- but if it becomes necessary, this is a good, up-to-date homebuild design.
nas  hardware  home  storage  ops  disks 
november 2013 by jm
Backblaze Blog » How long do disk drives last?
According to Backblaze's data, 80% of drives last 4 years, and the median lifespan is projected to be 6 years
backblaze  storage  disk  ops  mtbf  hardware  failure  lifespan 
november 2013 by jm
10 Things You Should Know About AWS
Some decent tips in here, mainly EC2-focussed
amazon  ec2  aws  ops  rds 
november 2013 by jm
Why Every Company Needs A DevOps Team Now - Feld Thoughts
Bookmarking particularly for the 3 "favourite DevOps patterns":

"Make sure we have environments available early in the Development process"; enforce a policy that the code and environment are tested together, even at the earliest stages of the project; “Wake up developers up at 2 a.m. when they break things"; and "Create reusable deployment procedures".
devops  work  ops  deployment  testing  pager-duty 
november 2013 by jm
Statsite
A C reimplementation of Etsy's statsd, with some interesting memory optimizations.
Statsite is designed to be both highly performant, and very flexible. To achieve this, it implements the stats collection and aggregation in pure C, using libev to be extremely fast. This allows it to handle hundreds of connections, and millions of metrics. After each flush interval expires, statsite performs a fork/exec to start a new stream handler invoking a specified application. Statsite then streams the aggregated metrics over stdin to the application, which is free to handle the metrics as it sees fit. This allows statsite to aggregate metrics and then ship metrics to any number of sinks (Graphite, SQL databases, etc). There is an included Python script that ships metrics to graphite.
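Since it speaks the statsd wire protocol, you can poke a running instance from the shell (assuming the usual statsd defaults of UDP port 8125):

    # emit a counter increment and a timer sample
    echo "requests:1|c"   | nc -u -w1 localhost 8125
    echo "db.query:32|ms" | nc -u -w1 localhost 8125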
statsd  graphite  statsite  performance  statistics  service-metrics  metrics  ops 
november 2013 by jm
Serf
'a service discovery and orchestration tool that is decentralized, highly available, and fault tolerant. Serf runs on every major platform: Linux, Mac OS X, and Windows. It is extremely lightweight: it uses 5 to 10 MB of resident memory and primarily communicates using infrequent UDP messages [and an] efficient gossip protocol.'
clustering  service-discovery  ops  linux  gossip  broadcast  clusters 
november 2013 by jm
Counterfactual Thinking, Rules, and The Knight Capital Accident
John Allspaw with an interesting post on the Knight Capital disaster
john-allspaw  ops  safety  post-mortems  engineering  procedures 
october 2013 by jm
Airbnb's Smartstack
Service discovery a la Airbnb -- Nerve and Synapse: two external daemons that run on each host, Nerve managing registration in Zookeeper, and Synapse generating a haproxy configuration file from that, allowing connections to all the other hosts.
haproxy  services  ops  load-balancing  service-discovery  nerve  synapse  airbnb 
october 2013 by jm
Basho and Seagate partner to deliver scale-out cloud storage breakthrough
Ha, cool. Skip the OS, write the Riak store natively to the drive. This sounds frankly terrifying ;)
The Seagate Kinetic Open Storage platform eliminates the storage server tier of traditional data center architectures by enabling applications to speak directly to the storage system, thereby reducing expenses associated with the acquisition, deployment, and support of hyperscale storage infrastructures. The platform leverages Seagate’s expertise in hardware and software storage systems integrating an open source API and Ethernet connectivity with Seagate hard drive technology.
seagate  basho  riak  storage  hardware  drivers  os  ops 
october 2013 by jm
How to lose $172,222 a second for 45 minutes
Major outage and $465m of trading loss, caused by staggeringly inept software management: 8 years of incremental bitrot, technical debt, and a failure to have correct processes to engage an ops team in incident response. Hopefully this will serve as a lesson, at least to one industry, that software is more than just coding.
trading  programming  coding  software  inept  fail  bitrot  tech-debt  ops  incident-response 
october 2013 by jm
"Toy Story 2" was almost entirely deleted by accident at one point
A stray "rm -rf" on the main network share managed to wipe out 90% of the movie's assets, and the backups were corrupt. Horrific backups war story
movies  ops  backups  pixar  recovery  accidents  rm-rf  delete 
october 2013 by jm
Introducing Chaos to C*
Autoremediation, i.e. auto-replacement, of Cassandra nodes in production at Netflix
ops  autoremediation  outages  remediation  cassandra  storage  netflix  chaos-monkey 
october 2013 by jm
"What Should I Monitor?"
slides (lots of slides) from Baron Schwartz' talk at Velocity in NYC.
slides  monitoring  metrics  ops  devops  baron-schwartz  pdf  capacity 
october 2013 by jm
pt-summary
from the Percona toolkit. 'Conveniently summarizes the status and configuration of a server. It is not a tuning tool or diagnosis tool. It produces a report that is easy to diff and can be pasted into emails without losing the formatting. This tool works well on many types of Unix systems.' --- summarises OOM history, top, netstat connection table, interface stats, network config, RAID, LVM, disks, inodes, disk scheduling, mounts, memory, processors, and CPU.
percona  tools  cli  unix  ops  linux  diagnosis  raid  netstat  oom 
october 2013 by jm
Mesosphere · Docker on Mesos
This is cool. Deploy Docker container images onto a Mesos cluster: key point, in the description of the Redis example: 'there’s no need to install Redis or its supporting libraries on your Mesos hosts.'
mesos  docker  deployment  ops  images  virtualization  containers  linux 
september 2013 by jm
Getting Real About Distributed System Reliability
I have come around to the view that the real core difficulty of [distributed] systems is operations, not architecture or design. Both are important but good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations. This is quite different from the view of unbreakable, self-healing, self-operating systems that I see being pitched by the more enthusiastic NoSQL hypesters. Worse yet, you can’t easily buy good operations in the same way you can buy good software—you might be able to hire good people (if you can find them) but this is more than just people; it is practices, monitoring systems, configuration management, etc.
reliability  nosql  distributed-systems  jay-kreps  ops 
september 2013 by jm
Creating Flight Recordings
lots more detail on the new "Java Mission Control" feature in Hotspot 7u40 JVMs, and how to use it to start and stop profiling in a live, production JVM from a separate "jcmd" command-line client. If the overhead is small, this could be really neat -- turn on profiling for 1 minute every hour on a single instance, and collect realtime production profile data on an automated basis for post-facto analysis if required
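Something like the following, if I'm reading the docs right (pid and paths are examples; the target JVM needs -XX:+UnlockCommercialFeatures -XX:+FlightRecorder at startup):

    # start, check and stop a named recording on a running JVM
    jcmd 12345 JFR.start name=prod duration=60s filename=/tmp/prod.jfr
    jcmd 12345 JFR.check
    jcmd 12345 JFR.stop name=prod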
instrumentation  logging  profiling  java  jvm  ops 
september 2013 by jm
Voldemort on Solid State Drives [paper]
'This paper and talk was given by the LinkedIn Voldemort Team at the Workshop on Big Data Benchmarking (WBDB May 2012).'

With SSD, we find that garbage collection will become a very significant bottleneck, especially for systems which have little control over the storage layer and rely on Java memory management. Big heapsizes make the cost of garbage collection expensive, especially the single threaded CMS Initial mark. We believe that data systems must revisit their caching strategies with SSDs. In this regard, SSD has provided an efficient solution for handling fragmentation and moving towards predictable multitenancy.
voldemort  storage  ssd  disk  linkedin  big-data  jvm  tuning  ops  gc 
september 2013 by jm
Interview with the Github Elasticsearch Team
good background on Github's Elasticsearch scaling efforts. Some rather horrific split-brain problems under load, and crashes due to OpenJDK bugs (sounds like OpenJDK *still* isn't ready for production). Painful.
elasticsearch  github  search  ops  scaling  split-brain  outages  openjdk  java  jdk  jvm 
september 2013 by jm
Docker: Git for deployment
Docker is to deployment as Git is to development.

Developers are able to leverage Git's performance and flexibility when building applications. Git encourages experiments and doesn't punish you when things go wrong: start your experiments in a branch, if things fall down, just git rebase or git reset. It's easy to start a branch and fast to push it.

Docker encourages experimentation for operations. Containers start quickly. Building images is a snap. Using another image as a base image is easy. Deploying whole images is fast, and last but not least, it's not painful to roll back.

Fast + flexible = deployments are about to become a lot more enjoyable.
docker  deployment  sysadmin  ops  devops  vms  vagrant  virtualization  containers  linux  git 
august 2013 by jm
Juniper Adds Puppet support
This is super-cool.

'Network engineering no longer should be mundane tasks like conf, set interfaces fe-0/0/0 unit 0 family inet address 10.1.1.1/24. How does mindless CLI work translate to efficiently spent time? What if you need to change 300 devices? What if you are writing it by hand? An error-prone waste of time. Juniper today announced Puppet support for their 12.2R3,5 JUNOS code. This is compatible with EX4200, EX4550, and QFX3500 switches. These are top end switches, but this start is directly aimed at their DC and enterprise devices. Initially, the manifest interactions offered are interface, layer 2 interface, vlan, port aggregation groups, and device names.'

Based on what I saw in the Network Automation team in Amazon, this is an amazing leap forward; it'd instantly render obsolete a bunch of horrific SSH-CLI automation cruft.
ssh  cli  automation  networking  networks  puppet  ops  juniper  cisco 
august 2013 by jm
Information on Google App Engine's recent US datacenter relocations - Google Groups
or, really, 'why we had some glitches and outages recently'. A few interesting tidbits about GAE innards though (via Bill De hOra)
gae  google  app-engine  outages  ops  paxos  eventual-consistency  replication  storage  hrd 
august 2013 by jm
How to configure ntpd so it will not move time backwards
The "-x" switch will expand the step/slew boundary from 128ms to 600 seconds, ensuring the time is slewed (drifted slowly towards the correct time at a max of 5ms per second) rather than "stepped" (a sudden jump, potentially backwards). Since slewing has a max of 5ms per second, time can never "jump backwards", which is important to avoid some major application bugs (particularly in Java timers).
ntpd  time  ntp  ops  sysadmin  slew  stepping  time-synchronization  linux  unix  java  bugs 
august 2013 by jm
Censum
[JVM] GC is a difficult, specialised area that can be very frustrating for busy developers or devops folks to deal with. The JVM has a number of Garbage Collectors and a bewildering array of switches that can alter the behaviour of each collector. Censum does all of the parsing, number crunching and statistical analysis for you, so you don't have to go and get that PhD in Computer Science in order to solve your GC performance problem. Censum gives you straight answers as opposed to a ton of raw data, can eat any GC log you care to throw at it, and is easy to install and use.


Commercial software, UKP 495 per license.
censum  gc  tuning  ops  java  jvm  commercial 
july 2013 by jm
Tuning and benchmarking Java 7's Garbage Collectors: Default, CMS and G1
Rudiger Moller runs through a typical GC-tuning session, in exhaustive detail
java  gc  tuning  jvm  cms  g1  ops 
july 2013 by jm
Twilio Billing Incident Post-Mortem
At 1:35 AM PDT on July 18, a loss of network connectivity caused all billing redis-slaves to simultaneously disconnect from the master. This caused all redis-slaves to reconnect and request full synchronization with the master at the same time. Receiving full sync requests from each redis-slave caused the master to suffer extreme load, resulting in performance degradation of the master and timeouts from redis-slaves to redis-master.
By 2:39 AM PDT the host’s load became so extreme, services relying on redis-master began to fail. At 2:42 AM PDT, our monitoring system alerted our on-call engineering team of a failure in the Redis cluster. Observing extreme load on the host, the redis process on redis-master was misdiagnosed as requiring a restart to recover. This caused redis-master to read an incorrect configuration file, which in turn caused Redis to attempt to recover from a non-existent AOF file, instead of the binary snapshot. As a result of that failed recovery, redis-master dropped all balance data. In addition to forcing recovery from a non-existent AOF, an incorrect configuration also caused redis-master to boot as a slave of itself, putting it in read-only mode and preventing the billing system from updating account balances.

See also http://antirez.com/news/60 for antirez' response.

Here's the takeaways I'm getting from it:

1. network partitions happen in production, and cause cascading failures. this is a great demo of that.

2. don't store critical data in Redis. this was the case for Twilio -- as far as I can tell they were using Redis as a front-line cache for billing data -- but it's worth saying anyway. ;)

3. Twilio were just using Redis as a cache, but a bug in their code meant that the writes to the backing SQL store were not being *read*, resulting in repeated billing and customer impact. In other words, it turned a (fragile) cache into the authoritative store.

4. they should probably have designed their code so that write failures would not result in repeated billing for customers -- that's a bad failure path.

Good post-mortem anyway, and I'd say their customers are a good deal happier to see this published, even if it contains details of the mistakes they made along the way.
redis  caching  storage  networking  network-partitions  twilio  postmortems  ops  billing  replication 
july 2013 by jm
Next Generation Continuous Integration & Deployment with dotCloud’s Docker and Strider
Since Docker treats its images as a tree of derivations from a source image, you have the ability to store an image at each stage of a build. This means we can provide full binary images of the environment in which the tests failed. This allows you to run locally, bit-for-bit, the same container as the CI server ran. Due to the magic of Docker and AUFS Copy-On-Write filesystems, we can store this cheaply.

Often tests pass when built in a CI environment, but when built in another (e.g. production) environment break due to subtle differences. Docker makes it trivial to take exactly the binary environment in which the tests pass, and ship that to production to run it.
docker  strider  continuous-integration  continuous-deployment  deployment  devops  ops  dotcloud  lxc  virtualisation  copy-on-write  images 
july 2013 by jm
Docker
'the Linux container engine'. I totally misunderstood what Docker was -- this is cool.
Heterogeneous payloads: Any combination of binaries, libraries, configuration files, scripts, virtualenvs, jars, gems, tarballs, you name it. No more juggling between domain-specific tools. Docker can deploy and run them all.

Any server: Docker can run on any x64 machine with a modern linux kernel - whether it's a laptop, a bare metal server or a VM. This makes it perfect for multi-cloud deployments.

Isolation: Docker isolates processes from each other and from the underlying host, using lightweight containers.

Repeatability: Because each container is isolated in its own filesystem, they behave the same regardless of where, when, and alongside what they run.
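The 2013-era hello-world is about as short as isolation demos get:

    # fetch the ubuntu base image and start an isolated shell in a fresh container
    docker pull ubuntu
    docker run -i -t ubuntu /bin/bash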
lxc  containers  virtualization  cloud  ops  linux  docker  deployment 
july 2013 by jm
Boundary's Early Warnings alarm
Anomaly detection on network throughput metrics, alarming if throughputs on selected flows deviate by 1, 2, or 3 standard deviations from a historical baseline.
network-monitoring  throughput  boundary  service-metrics  alarming  ops  statistics 
june 2013 by jm
how RAID fits in with Riak
Write heavy, high performance applications should probably use RAID 0 or avoid RAID altogether and consider using a larger n_val and cluster size. Read heavy applications have more options, and generally demand more fault tolerance with the added benefit of easier hardware replacement procedures.


Good to see official guidance on this (via Bill de hOra)
via:dehora  riak  cluster  fault-tolerance  raid  ops 
june 2013 by jm
Project Voldemort: measuring BDB space consumption
HOWTO measure this using the BDB-JE command line tools. this is exposed through JMX as the CleanerBacklog metric, too, I think, but good to bookmark just in case
voldemort  cleaner  bdb  ops  space  storage  monitoring  debug 
june 2013 by jm
graphite-metrics
metric collectors for various stuff not (or poorly) handled by other monitoring daemons

Core of the project is a simple daemon (harvestd), which collects metric values and sends them to graphite carbon daemon (and/or other configured destinations) once per interval. Includes separate data collection components ("collectors") for processing of:

/proc/slabinfo for useful-to-watch values, not everything (configurable).
/proc/vmstat and /proc/meminfo in a consistent way.
/proc/stat for irq, softirq, forks.
/proc/buddyinfo and /proc/pagetypeinfo (memory fragmentation).
/proc/interrupts and /proc/softirqs.
Cron log to produce start/finish events and duration for each job into a separate metrics, adapts jobs to metric names with regexes.
Per-system-service accounting using systemd and its cgroups.
sysstat data from sadc logs (use something like sadc -F -L -S DISK -S XDISK -S POWER 60 to have more stuff logged there) via the sadf binary and its JSON export (sadf -j, supported since sysstat-10.0.something, iirc).
iptables rule "hits" packet and byte counters, taken from ip{,6}tables-save, mapped via separate "table chain_name rule_no metric_name" file, which should be generated along with firewall rules (I use this script to do that).


Pretty exhaustive list of system metrics -- could have some interesting ideas for Linux OS-level metrics to monitor in future.
graphite  monitoring  metrics  unix  linux  ops  vm  iptables  sysadmin 
june 2013 by jm
Care and Feeding of Large Scale Graphite Installations [slides]
good docs for large-scale graphite use: 'Tip and tricks of using and scaling graphite. First presented at DevOpsDays Austin Texas 2013-05-01'
graphite  devops  ops  metrics  dashboards  sysadmin 
june 2013 by jm
Communication costs in real-world networks
Peter Bailis has generated some good real-world data about network performance and latency, measured using EC2 instances, between ec2 regions, between zones, and between hosts in a single AZ. good data (particularly as I was looking for this data in a public source not too long ago).

I wasn’t aware of any datasets describing network behavior both within and across datacenters, so we launched m1.small Amazon EC2 instances in each of the eight geo-distributed “Regions,” across the three us-east “Availability Zones” (three co-located datacenters in Virginia), and within one datacenter (us-east-b). We measured RTTs between hosts for a week at a granularity of one ping per second.


Some of the high-percentile measurements are undoubtedly the impact of host and VM behaviour, but this is still good data for a typical service built in EC2.
networks  performance  measurements  benchmarks  ops  ec2  networking  internet  az  latency 
may 2013 by jm
My Philosophy on Alerting
'based on my observations while I was a Site Reliability Engineer at Google.' - by Rob Ewaschuk; very good, and matches similar recommendations and best practices at Amazon, for that matter
monitoring  ops  devops  alerting  alerts  pager-duty  via:jk 
may 2013 by jm
Monitoring the Status of Your EBS Volumes
Page in the AWS docs which describes their derived metrics and how they are computed -- these are visible in the AWS Management Console, and alarmable, but not viewable in the Cloudwatch UI. grr. (page-joshea!)
ebs  aws  monitoring  metrics  ops  documentation  cloudwatch 
may 2013 by jm
Operations is Dead, but Please Don’t Replace it with DevOps
This is so damn spot on.
Functional silos (and a standalone DevOps team is a great example of one) decouple actions from responsibility. Functional silos allow people to ignore, or at least feel disconnected from, the consequences of their actions. DevOps is a cultural change that encourages, rewards and exposes people taking responsibility for what they do, and what is expected from them. As Werner Vogels from Amazon Web Services says, “you build it, you run it”. So a “DevOps team” is a risky and ultimately doomed strategy. Sure there are some technical roles, specifically related to the enablement of DevOps as an approach and these roles and tools need to be filled and built. Self service platforms, collaboration and communication systems, tool chains for testing, deployment and operations are all necessary. Sure someone needs to deliver on that stuff. But those are specific technical deliverables and not DevOps. DevOps is about people, communication and collaboration. Organizations ignore that at their peril.
devops  teams  work  ops  silos  collaboration  organisations 
may 2013 by jm
AWS forum post on interpreting iostat output for EBS
Great post from AndrewC@EBS on interpreting iostat output on EBS volumes -- from 2009, but still looks reasonable enough
iostat  ebs  disks  hardware  aws  ops 
may 2013 by jm
Measuring & Optimizing I/O Performance
Another good writeup on iostat and EBS, from Ilya Grigorik
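For anyone following along, the invocation in question is the extended-stats mode:

    # extended per-device stats every 5 seconds, in MB; watch await, svctm and %util
    iostat -xm 5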
io  optimization  sysadmin  performance  iostat  ebs  aws  ops 
may 2013 by jm
ec2-consistent-snapshot
This program creates an EBS snapshot for an Amazon EC2 EBS volume. To
help ensure consistent data in the snapshot, it tries to flush and
freeze the filesystem(s) first as well as flushing and locking the
database, if applicable.

Filesystems can be frozen during the snapshot. Prior to Linux kernel
2.6.29, XFS must be used for freezing support. While frozen, a
filesystem will be consistent on disk and all writes will block.

There are a number of timeouts to reduce the risk of interfering with
the normal database operation while improving the chances of getting a
consistent snapshot.

If you have multiple EBS volumes in a RAID configuration, you can
specify all of the volume ids on the command line and it will create
snapshots for each while the filesystem and database are locked. Note
that it is your responsibility to keep track of the resulting snapshot
ids and to figure out how to put these back together when you need to
restore the RAID setup.


Handy!
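A typical invocation looks something like this -- flag names from memory, so check the tool's docs; region, volume id and mount point are examples:

    # freeze the filesystem, snapshot the backing EBS volume, unfreeze
    ec2-consistent-snapshot --region us-east-1 \
        --freeze-filesystem /data vol-12345678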
ubuntu  ec2  aws  linux  ebs  snapshots  ops  tools  alestic 
may 2013 by jm
Making sense out of BDB-JE fast stats
good info on the system metrics recorded by BDB-JE's EnvironmentStats code, particularly where cache and cleaner activity are concerned. Particularly useful for Voldemort
voldemort  caching  bdb  bdb-je  storage  tuning  ops  metrics  reference 
may 2013 by jm
Understanding Elastic Block Store Availability and Performance [slides]
fantastic in-depth presentation on EBS usage; lots of good advice here if you're using EBS volumes with/without PIOPS
piops  ebs  performance  aws  ec2  ops  storage  amazon  presentations 
may 2013 by jm
The useful JVM options
a good reference, with lots of sample output. Not clear if it takes 1.6/1.7 differences into account, though
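Relatedly, the JVM will dump every flag and its effective value itself:

    # list all -XX flags with their current values
    java -XX:+PrintFlagsFinal -version | less
    # and show a running JVM's non-default flags (pid is an example)
    jcmd 12345 VM.flags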
jvm  reference  java  ops  hotspot  command-line 
april 2013 by jm
Project Voldemort at Gilt Groupe: When Failure Isn't an Option [slides]
Geir Magnusson explains how Gilt Groupe is using Project Voldemort to scale out their e-commerce transactional system. The initial SQL solution had to be replaced because it could not handle the transactional spikes the site experiences daily, due to its particular way of selling its inventory: each day at noon. Magnusson explains why they chose Voldemort and talks about the architecture.

via Filippo
via:filippo  database  architecture  nosql  data  voldemort  gilt-groupe  ops  storage  presentations 
april 2013 by jm
Running a Multi-Broker Apache Kafka 0.8 Cluster on a Single Node
an excellent writeup on Kafka 0.8's use and operation, including details of the new replication features
kafka  replication  queueing  distributed  ops 
april 2013 by jm