jm + ops   230

The Four Month Bug: JVM statistics cause garbage collection pauses (
Ugh, tying GC safepoints to disk I/O? bad idea:
The JVM by default exports statistics by mmap-ing a file in /tmp (hsperfdata). On Linux, modifying a mmap-ed file can block until disk I/O completes, which can be hundreds of milliseconds. Since the JVM modifies these statistics during garbage collection and safepoints, this causes pauses that are hundreds of milliseconds long. To reduce worst-case pause latencies, add the -XX:+PerfDisableSharedMem JVM flag to disable this feature. This will break tools that read this file, like jstat.
bugs  gc  java  jvm  disk  mmap  latency  ops  jstat 
yesterday by jm
Transparent huge pages implicated in Redis OOM
A nasty real-world prod error scenario worsened by THPs:
jemalloc(3) extensively uses madvise(2) to notify the operating system that it's done with a range of memory which it had previously malloc'ed. The page size on this machine is 2MB because transparent huge pages are in use. As such, a lot of the memory which is being marked with madvise(..., MADV_DONTNEED) is within substantially smaller ranges than 2MB. This means that the operating system never was able to evict pages which had ranges marked as MADV_DONTNEED because the entire page has to be unneeded to allow a page to be reused. Despite initially looking like a leak, the operating system itself was unable to free memory because of madvise(2) and transparent huge pages. This led to sustained memory pressure on the machine and redis-server eventually getting OOM killed.
oom-killer  oom  linux  ops  thp  jemalloc  huge-pages  madvise  redis  memory 
4 days ago by jm
"tees" all TCP traffic from one server to another. "widely used by companies in China"!
testing  benchmarking  performance  tcp  ip  tcpcopy  tee  china  regression-testing  stress-testing  ops 
4 days ago by jm
an open source stream processing software system developed by Mozilla. Heka is a “Swiss Army Knife” type tool for data processing, useful for a wide variety of different tasks, such as:

Loading and parsing log files from a file system.
Accepting statsd type metrics data for aggregation and forwarding to upstream time series data stores such as graphite or InfluxDB.
Launching external processes to gather operational data from the local system.
Performing real time analysis, graphing, and anomaly detection on any data flowing through the Heka pipeline.
Shipping data from one location to another via the use of an external transport (such as AMQP) or directly (via TCP).
Delivering processed data to one or more persistent data stores.

Via feylya on twitter. Looks potentially nifty
heka  mozilla  monitoring  metrics  via:feylya  ops  statsd  graphite  stream-processing 
14 days ago by jm
The Large Hadron Migrator is a tool to perform live database migrations in a Rails app without locking.

The basic idea is to perform the migration online while the system is live, without locking the table. In contrast to OAK and the facebook tool, we only use a copy table and triggers. The Large Hadron is a test driven Ruby solution which can easily be dropped into an ActiveRecord or DataMapper migration. It presumes a single auto incremented numerical primary key called id as per the Rails convention. Unlike the twitter solution, it does not require the presence of an indexed updated_at column.
migrations  database  sql  ops  mysql  rails  ruby  lhm  soundcloud  activerecord 
18 days ago by jm
A project to reduce systemd to a base initd, process supervisor and transactional dependency system, while minimizing intrusiveness and isolationism. Basically, it’s systemd with the superfluous stuff cut out, a (relatively) coherent idea of what it wants to be, support for non-glibc platforms and an approach that aims to minimize complicated design. uselessd is still in its early stages and it is not recommended for regular use or system integration.

This may be the best option to evade the horrors of systemd.
init  linux  systemd  unix  ops  uselessd 
18 days ago by jm
Ubuntu To Officially Switch To systemd Next Monday - Slashdot
Jesus. This is going to be the biggest shitfest in the history of Linux...
linux  slashdot  ubuntu  systemd  init  unix  ops 
18 days ago by jm
What Color Is Your Xen?
What a mess.
What's faster: PV, HVM, HVM with PV drivers, PVHVM, or PVH? Cloud computing providers using Xen can offer different virtualization "modes", based on paravirtualization (PV), hardware virtual machine (HVM), or a hybrid of them. As a customer, you may be required to choose one of these. So, which one?
ec2  linux  performance  aws  ops  pv  hvm  xen  virtualization 
4 weeks ago by jm
"Cheap SSL certs from $4.99/yr" -- apparently recommended for cheap, low-end SSL certs
ssl  certs  security  https  ops 
4 weeks ago by jm
Performance Co-Pilot
System performance metrics framework, plugged by Netflix, open-source for ages
open-source  pcp  performance  system  metrics  ops  red-hat  netflix 
5 weeks ago by jm
A gateway script, now included in PCP
pcp2graphite  pcp  graphite  ops  metrics  system 
5 weeks ago by jm
Duplicate SSH Keys Everywhere
Poor hardware imaging practices, basically:
It looks like all devices with the fingerprint are Dropbear SSH instances that have been deployed by Telefonica de Espana. It appears that some of their networking equipment comes setup with SSH by default, and the manufacturer decided to re-use the same operating system image across all devices.
crypto  ssh  security  telefonica  imaging  ops  shodan 
5 weeks ago by jm
A tool for managing Apache Kafka. It supports the following :

Manage multiple clusters;
Easy inspection of cluster state (topics, brokers, replica distribution, partition distribution);
Run preferred replica election;
Generate partition assignments (based on current state of cluster);
Run reassignment of partition (based on generated assignments)
yahoo  kafka  ops  tools 
6 weeks ago by jm
0x74696d | Falling In And Out Of Love with DynamoDB, Part II
Good DynamoDB real-world experience post, via Mitch Garnaat. We should write up ours, although it's pretty scary-stuff-free by comparison
aws  dynamodb  storage  databases  architecture  ops 
6 weeks ago by jm
The DOs and DON'Ts of Blue/Green Deployment - CloudNative
Excellent post -- Delta sounds like a very well-designed product
blue-green-deployments  delta  cloudnative  ops  deploy  ec2  elb 
7 weeks ago by jm
TL;DR: Cassandra Java Huge Pages
Al Tobey does some trial runs of -XX:+AlwaysPreTouch and -XX:+UseHugePages
jvm  performance  tuning  huge-pages  vm  ops  cassandra  java 
7 weeks ago by jm
NA Server Roadmap Update: PoPs, Peering, and the North Bridge
League of Legends has set up private network links to a variety of major US ISPs to avoid internet weather (via Nelson)
via:nelson  peering  games  networks  internet  ops  networking 
7 weeks ago by jm
How TCP backlog works in Linux
good description of the process
ip  linux  tcp  networking  backlog  ops 
8 weeks ago by jm
Nice trick -- wrap servers with a libc wrapper to intercept bind(2) and accept(2) calls, so that transparent restarts becode possible
linux  ops  servers  uptime  restarting  libc  bind  accept  sockets 
8 weeks ago by jm
Maintaining performance in distributed systems [slides]
Great slide deck from Elasticsearch on JVM/dist-sys performance optimization
performance  elasticsearch  java  jvm  ops  tuning 
8 weeks ago by jm
A much better carbon-relay, written in C rather than Python. Linking as we've been using it in production for quite a while with no problems.
The main reason to build a replacement is performance and configurability. Carbon is single threaded, and sending metrics to multiple consistent-hash clusters requires chaining of relays. This project provides a multithreaded relay which can address multiple targets and clusters for each and every metric based on pattern matches.
graphite  carbon  c  python  ops  metrics 
9 weeks ago by jm
Really nice time series dashboarding app. Might consider replacing graphitus with this...
time-series  data  visualisation  graphs  ops  dashboards  facette 
10 weeks ago by jm
AWS Tips I Wish I'd Known Before I Started
Some good advice and guidelines (although some are just silly).
aws  ops  tips  advice  ec2  s3 
10 weeks ago by jm
Personalization at Spotify using Cassandra
Lots and lots of good detail into the Spotify C* setup (via Bill de hOra)
via:dehora  spotify  cassandra  replication  storage  ops 
10 weeks ago by jm
Why we don't use a CDN: A story about SPDY and SSL
All of our assets loaded via the CDN [to our client in Australia] in just under 5 seconds. It only took ~2.7s to get those same assets to our friends down under with SPDY. The performance with no CDN blew the CDN performance out of the water. It is just no comparison. In our case, it really seems that the advantages of SPDY greatly outweigh that of a CDN when it comes to speed.
cdn  spdy  nginx  performance  web  ssl  tls  optimization  multiplexing  tcp  ops 
10 weeks ago by jm
Secure Secure Shell
How to secure SSH, disabling insecure ciphers etc. (via Padraig)
via:pixelbeat  crypto  security  ssh  ops 
11 weeks ago by jm
EC2 Container Service Hands On
Sounds like a good start, but this isn't great:
There is no native integration with Autoscaling or ELBs.
ec2  containers  docker  ecs  ops 
12 weeks ago by jm
'Machine Learning: The High-Interest Credit Card of Technical Debt' [PDF]
Oh god yes. This is absolutely spot on, as you would expect from a Google paper -- at this stage they probably have accumulated more real-world ML-at-scale experience than anywhere else.

'Machine learning offers a fantastically powerful toolkit for building complex systems
quickly. This paper argues that it is dangerous to think of these quick wins
as coming for free. Using the framework of technical debt, we note that it is remarkably
easy to incur massive ongoing maintenance costs at the system level
when applying machine learning. The goal of this paper is highlight several machine
learning specific risk factors and design patterns to be avoided or refactored
where possible. These include boundary erosion, entanglement, hidden feedback
loops, undeclared consumers, data dependencies, changes in the external world,
and a variety of system-level anti-patterns.


'In this paper, we focus on the system-level interaction between machine learning code and larger systems
as an area where hidden technical debt may rapidly accumulate. At a system-level, a machine
learning model may subtly erode abstraction boundaries. It may be tempting to re-use input signals
in ways that create unintended tight coupling of otherwise disjoint systems. Machine learning
packages may often be treated as black boxes, resulting in large masses of “glue code” or calibration
layers that can lock in assumptions. Changes in the external world may make models or input
signals change behavior in unintended ways, ratcheting up maintenance cost and the burden of any
debt. Even monitoring that the system as a whole is operating as intended may be difficult without
careful design.

Indeed, a remarkable portion of real-world “machine learning” work is devoted to tackling issues
of this form. Paying down technical debt may initially appear less glamorous than research results
usually reported in academic ML conferences. But it is critical for long-term system health and
enables algorithmic advances and other cutting-edge improvements.'
machine-learning  ml  systems  ops  tech-debt  maintainance  google  papers  hidden-costs  development 
december 2014 by jm
Two recent systemd crashes
Hey look, PID 1 segfaulting! I haven't seen that happen since we managed to corrupt /bin/sh on Ultrix in 1992. Nice work Fedora
fedora  reliability  unix  linux  systemd  ops  bugs 
december 2014 by jm
Introducing Atlas: Netflix's Primary Telemetry Platform
This sounds really excellent -- the dimensionality problem it deals with is a familiar one, particularly with red/black deployments, autoscaling, and so on creating trees of metrics when new transient servers appear and disappear. Looking forward to Netflix open sourcing enough to make it usable for outsiders
netflix  metrics  service-metrics  atlas  telemetry  ops 
december 2014 by jm
Announcing Snappy Ubuntu
Awesome! I was completely unaware this was coming down the pipeline.
A new, transactionally updated Ubuntu for the cloud. Ubuntu Core is a new rendition of Ubuntu for the cloud with transactional updates. Ubuntu Core is a minimal server image with the same libraries as today’s Ubuntu, but applications are provided through a simpler mechanism. The snappy approach is faster, more reliable, and lets us provide stronger security guarantees for apps and users — that’s why we call them “snappy” applications.

Snappy apps and Ubuntu Core itself can be upgraded atomically and rolled back if needed — a bulletproof approach to systems management that is perfect for container deployments. It’s called “transactional” or “image-based” systems management, and we’re delighted to make it available on every Ubuntu certified cloud.
ubuntu  linux  packaging  snappy  ubuntu-core  transactional-updates  apt  docker  ops 
december 2014 by jm
PDX DevOps Graphite replacement
Replacing graphite with InfluxDB, Riemann and Grafana. Not quite there yet, looks like
influxdb  graphite  ops  metrics  riemann  grafana  slides 
december 2014 by jm
Day 1 - Docker in Production: Reality, Not Hype
Good Docker info from Bridget Kromhout, on their production and dev usage of Docker at DramaFever. lots of good real-world tips
docker  ops  boot2docker  tips  sysadvent 
december 2014 by jm
(SDD416) Amazon EBS Deep Dive | AWS re:Invent 2014
Excellent data on current EBS performance characteristics
ebs  ops  aws  reinvent  slides 
november 2014 by jm
AWS re:Invent 2014 Video & Slide Presentation Links
Nice work by Andrew Spyker -- this should be an official feature of the re:Invent website, really
reinvent  aws  conferences  talks  slides  ec2  s3  ops  presentations 
november 2014 by jm
Microsoft Azure 9-hour outage
'From 19 Nov, 2014 00:52 to 05:50 UTC a subset of customers using Storage, Virtual Machines, SQL Geo-Restore, SQL Import/export, Websites, Azure Search, Azure Cache, Management Portal, Service Bus, Event Hubs, Visual Studio, Machine Learning, HDInsights, Automation, Virtual Network, Stream Analytics, Active Directory, StorSimple and Azure Backup Services in West US and West Europe experienced connectivity issues. This incident has now been mitigated.'

There was knock-on impact until 11:00 UTC (storage in N Europe), 11:45 UTC (websites, West Europe), and 09:15 UTC (storage, West Europe), from the looks of things. Should be an interesting postmortem.
outages  azure  microsoft  ops 
november 2014 by jm
The Infinite Hows, instead of the Five Whys
John Allspaw with an interesting assertion that we need to ask "how", not "why" in five-whys postmortems:
“Why?” is the wrong question.

In order to learn (which should be the goal of any retrospective or post-hoc investigation) you want multiple and diverse perspectives. You get these by asking people for their own narratives. Effectively, you’re asking “how?“

Asking “why?” too easily gets you to an answer to the question “who?” (which in almost every case is irrelevant) or “takes you to the ‘mysterious’ incentives and motivations people bring into the workplace.”

Asking “how?” gets you to describe (at least some) of the conditions that allowed an event to take place, and provides rich operational data.
ops  five-whys  john-allspaw  questions  postmortems  analysis  root-causes 
november 2014 by jm
A curated list of Docker resources.
linux  sysadmin  docker  ops  devops  containers  hosting 
november 2014 by jm
Zookeeper: not so great as a highly-available service registry
Turns out ZK isn't a good choice as a service discovery system, if you want to be able to use that service discovery system while partitioned from the rest of the ZK cluster:
I went into one of the instances and quickly did an iptables DROP on all packets coming from the other two instances.  This would simulate an availability zone continuing to function, but that zone losing network connectivity to the other availability zones.  What I saw was that the two other instances noticed the first server “going away”, but they continued to function as they still saw a majority (66%).  More interestingly the first instance noticed the other two servers “going away”, dropping the ensemble availability to 33%.  This caused the first server to stop serving requests to clients (not only writes, but also reads).

So: within that offline AZ, service discovery *reads* (as well as writes) stopped working due to a lack of ZK quorum. This is quite a feasible outage scenario for EC2, by the way, since (at least when I was working there) the network links between AZs, and the links with the external internet, were not 100% overlapping.

In other words, if you want a highly-available service discovery system in the fact of network partitions, you want an AP service discovery system, rather than a CP one -- and ZK is a CP system.

Another risk, noted on the Netflix Eureka mailing list at :

ZooKeeper, while tolerant against single node failures, doesn't react well to long partitioning events. For us, it's vastly more important that we maintain an available registry than a necessarily consistent registry. If us-east-1d sees 23 nodes, and us-east-1c sees 22 nodes for a little bit, that's OK with us.

I guess this means that a long partition can trigger SESSION_EXPIRED state, resulting in ZK client libraries requiring a restart/reconnect to fix. I'm not entirely clear what happens to the ZK cluster itself in this scenario though.

Finally, Pinterest ran into other issues relying on ZK for service discovery and registration, described at ; sounds like this was mainly around load and the "thundering herd" overload problem. Their workaround was to decouple ZK availability from their services' availability, by building a Smartstack-style sidecar daemon on each host which tracked/cached ZK data.
zookeeper  service-discovery  ops  ha  cap  ap  cp  service-registry  availability  ec2  aws  network  partitions  eureka  smartstack  pinterest 
november 2014 by jm
curl | sh
'People telling people to execute arbitrary code over the network. Run code from our servers as root. But HTTPS, so it’s no biggie.'

humor  sysadmin  ops  security  curl  bash  npm  rvm  chef 
november 2014 by jm
Elastic MapReduce vs S3
Turns out there are a few bugs in EMR's S3 support, believe it or not.

1. 'Consider disabling Hadoop's speculative execution feature if your cluster is experiencing Amazon S3 concurrency issues. You do this through the and mapred.reduce.tasks.speculative.execution configuration settings. This is also useful when you are troubleshooting a slow cluster.'

2. Upgrade to AMI 3.1.0 or later, otherwise retries of S3 ops don't work.
s3  emr  hadoop  aws  bugs  speculative-execution  ops 
october 2014 by jm
Stephanie Dean on event management and incident response
I asked around my ex-Amazon mates on twitter about good docs on incident response practices outside the "iron curtain", and they pointed me at this blog (which I didn't realise existed).

Stephanie Dean was the front-line ops manager for Amazon for many years, over the time where they basically *fixed* their availability problems. She since moved on to Facebook, Demonware, and Twitter. She really knows her stuff and this blog is FULL of great details of how they ran (and still run) front-line ops teams in Amazon.
ops  incident-response  outages  event-management  amazon  stephanie-dean  techops  tos  sev1 
october 2014 by jm
IT Change Management
Stephanie Dean on Amazon's approach to CMs. This is solid gold advice for any company planning to institute a sensible technical change management process
ops  tech  process  changes  change-management  bureaucracy  amazon  stephanie-dean  infrastructure 
october 2014 by jm
Carbon vs Megacarbon and Roadmap ? · Issue #235 · graphite-project/carbon
Carbon is a great idea, but fundamentally, twisted doesn't do what carbon-relay or carbon-aggregator were built to do when hit with sustained and heavy throughput. Much to my chagrin, concurrency isn't one of python's core competencies.

+1, sadly. We are patching around the edges with half-released third-party C rewrites in our graphite setup, as we exceed the scale Carbon can support.
carbon  graphite  metrics  ops  python  twisted  scalability 
october 2014 by jm
Game Day Exercises at Stripe: Learning from `kill -9`
We’ve started running game day exercises at Stripe. During a recent game day, we tested failing over a Redis cluster by running kill -9 on its primary node, and ended up losing all data in the cluster. We were very surprised by this, but grateful to have found the problem in testing. This result and others from this exercise convinced us that game days like these are quite valuable, and we would highly recommend them for others.

Excellent post. Game days are a great idea. Also: massive Redis clustering fail
game-days  redis  testing  stripe  outages  ops  kill-9  failover 
october 2014 by jm
Is Docker ready for production? Feedbacks of a 2 weeks hands on
I have to agree with this assessment -- there are a lot of loose ends still for production use of Docker in a SOA stack environment:
From my point of view, Docker is probably the best thing I’ve seen in ages to automate a build. It allows to pre build and reuse shared dependencies, ensuring they’re up to date and reducing your build time. It avoids you to either pollute your Jenkins environment or boot a costly and slow Virtualbox virtual machine using Vagrant. But I don’t feel like it’s production ready in a complex environment, because it adds too much complexity. And I’m not even sure that’s what it was designed for.
docker  complexity  devops  ops  production  deployment  soa  web-services  provisioning  networking  logging 
october 2014 by jm
Linus Torvalds and others on Linux's systemd
ZDNet's Steven J. Vaughan-Nichols on the systemd mess (via Kragen)
via:kragen  systemd  linux  ubuntu  gnome  init  ops 
october 2014 by jm
a simple, lightweight HTTP server for storing and distributing custom Debian packages around your organisation. It is designed to make it as easy as possible to use Debian packages for code deployments and to ease other system administration tasks.
debian  apt  sysadmin  linux  ops  packaging 
october 2014 by jm
Netflix release new code to production before completing tests
Interesting -- I hadn't heard of this being an official practise anywhere before (although we actually did it ourselves this week)...
If a build has made it [past the 'integration test' phase], it is ready to be deployed to one or more internal environments for user-acceptance testing. Users could be UI developers implementing a new feature using the API, UI Testers performing end-to-end testing or automated UI regression tests. As far as possible, we strive to not have user-acceptance tests be a gating factor for our deployments. We do this by wrapping functionality in Feature Flags so that it is turned off in Production while testing is happening in other environments. 
devops  deployment  feature-flags  release  testing  integration-tests  uat  qa  production  ops  gating  netflix 
october 2014 by jm
"Linux Containers And The Future Cloud" [slides]
by Rami Rosen -- extremely detailed presentation into the state of Linux containers, LXC, Docker, namespaces, cgroups, and checkpoint/restore in userspace (via lusis)
lsx  docker  criu  namespaces  cgroups  linux  via:lusis  ops  containers  rami-rosen  presentations 
october 2014 by jm
Mike Perham on Twitter: "Sweet, monit just sent a DMCA takedown notice to @github to remove Inspeqtor."
'The work, Inspeqtor which is hosted at GitHub, is far from a “clean-room” implementation. This is basically a rewrite of Monit in Go, even using the same configuration language that is used in Monit, verbatim.

a. [private] himself admits that Inspeqtor is "heavily influenced“ by Monit

b. This tweet by [private] demonstrate intent. "OSS nerds: redesign and build monit in Go. Sell it commercially. Make $$$$. I will be your first customer.”'

IANAL, but using the same config language does not demonstrate copyright infringement...
copyright  dmca  tildeslash  monit  inspeqtor  github  ops  oss  agpl 
october 2014 by jm
'a set of command line tools for managing Route53 DNS for an AWS infrastructure. It intelligently uses tags and other metadata to automatically create the associated DNS records.'
zonify  aws  dns  ec2  route53  ops 
october 2014 by jm
'a system for allowing servers with encrypted root file systems to reboot unattended and/or remotely.' (via Tony Finch)
via:fanf  mandos  encryption  security  server  ops  sysadmin  linux 
october 2014 by jm
The End of Linux
'Linux is becoming the thing that we adopted Linux to get away from.'

Great post on the horrible complexity of systemd. It reminds me of nothing more than mid-90s AIX, which I had the displeasure of opsing for a while -- the Linux distros have taken a very wrong turn here.
linux  unix  complexity  compatibility  ops  rant  systemd  bloat  aix 
september 2014 by jm
Inviso: Visualizing Hadoop Performance
With the increasing size and complexity of Hadoop deployments, being able to locate and understand performance is key to running an efficient platform.  Inviso provides a convenient view of the inner workings of jobs and platform.  By simply overlaying a new view on existing infrastructure, Inviso can operate inside any Hadoop environment with a small footprint and provide easy access and insight.  

This sounds pretty useful.
inviso  netflix  hadoop  emr  performance  ops  tools 
september 2014 by jm
Avoiding Chef-Suck with Auto Scaling Groups - forty9ten
Some common problems which arise using Chef with ASGs in EC2, and how these guys avoided it -- they stopped using Chef for service provisioning, and instead baked AMIs when a new version was released. ASGs using pre-baked AMIs definitely works well so this makes good sense IMO.
infrastructure  chef  ops  asg  auto-scaling  ec2  provisioning  deployment 
september 2014 by jm
get page cache statistics for files.
A common question when tuning databases and other IO-intensive applications is, "is Linux caching my data or not?" pcstat gets that information for you using the mincore(2) syscall. I wrote this is so that Apache Cassandra users can see if ssTables are being cached.
linux  page-cache  caching  go  performance  cassandra  ops  mincore  fincore 
september 2014 by jm
Troubleshooting Production JVMs with jcmd
remotely trigger GCs, finalization, heap dumps etc. Handy
jvm  jcmd  debugging  ops  java  gc  heap  troubleshooting 
september 2014 by jm
The State of ZFS on Linux
Linux users familiar with other filesystems or ZFS users from other platforms will often ask whether ZFS on Linux (ZoL) is “stable”. The short answer is yes, depending on your definition of stable. The term stable itself is somewhat ambiguous.

Oh dear. that's not a good start. Good reference page, though
zfs  linux  filesystems  ops  solaris 
september 2014 by jm
'turns a fresh cloud computer into a working mail server. You get contact synchronization, spam filtering, and so on. On your phone, you can use apps like K-9 Mail and CardDAV-Sync free beta to sync your email and contacts between your phone and your box.'

(via Tony Finch)
via:fanf  mail  diy  hosting  webmail  ops 
september 2014 by jm
Using spot instances
Excellent post on all of the ins and outs of EC2 spot instance usage
ec2  aws  spot-instances  pricing  cloud  auto-scaling  ops 
september 2014 by jm
Nix: The Purely Functional Package Manager
'a powerful package manager for Linux and other Unix systems that makes package management reliable and reproducible. It provides atomic upgrades and rollbacks, side-by-side installation of multiple versions of a package, multi-user package management and easy setup of build environments. '

Basically, this is a third-party open source reimplementation of Amazon's (excellent) internal packaging system, using symlinks to versioned package directories to ensure atomicity and the ability to roll back. This is definitely the *right* way to build packages -- I know what tool I'll be pushing for, next time this question comes up.

See also for a Linux distro built on Nix.
ops  linux  devops  unix  packaging  distros  nix  nixos  atomic  upgrades  rollback  versioning 
september 2014 by jm
Applying cardiac alarm management techniques to your on-call
An ops-focused take on a recent story about alarm fatigue, and how a Boston hospital dealt with it. When I was in Amazon, many of the teams in our division had a target to reduce false positive pages, with a definite monetary value attached to it, since many teams had "time off in lieu" payments for out-of-hours pages to the on-call staff. As a result, reducing false-positive pages was reasonably high priority and we dealt with this problem very proactively, with a well-developed sense of how to do so. It's interesting to see how the outside world is only just starting to look into its amelioration. (Another benefit of a TOIL policy ;)
ops  monitoring  sysadmin  alerts  alarms  nagios  alarm-fatigue  false-positives  pages 
september 2014 by jm
On-Demand Jenkins Slaves With Amazon EC2
This is very likely where we'll be going for our acceptance tests in Swrve
testing  jenkins  ec2  spot-instances  scalability  auto-scaling  ops  build 
august 2014 by jm
Apache Kafka 0.8 basic training
This is a pretty voluminous and authoritative presentation about getting started with Kafka; wish this was around when we started using it for 0.7. (We use our own homegrown realtime system nowadays, due to better partitioning, monitoring and operability.)
storm  kafka  presentations  documentation  ops 
august 2014 by jm
« earlier      
per page:    204080120160

related tags

accept  accidents  acm  acm-queue  action-items  activemq  activerecord  admin  adrian-cockcroft  advent  advice  agpl  airbnb  aix  alarm-fatigue  alarming  alarms  alert-logic  alerting  alerts  alestic  algorithms  alter-table  ama  amazon  analysis  analytics  anomaly-detection  antarctica  anti-spam  antipatterns  ap  apache  aphyr  app-engine  apt  archaius  architecture  asg  asgard  atlas  atomic  authentication  auto-scaling  automation  autoremediation  autoscaling  availability  aws  az  azure  backblaze  backlog  backup  backups  banking  baron-schwartz  bash  basho  batch  bdb  bdb-je  bdd  beanstalk  ben-treynor  benchmarking  benchmarks  best-practices  big-data  billing  bind  bit-errors  bitcoin  bitly  bitrot  bloat  blockdev  blogs  blue-green-deployments  boot2docker  boundary  broadcast  bugs  build  build-out  bureaucracy  c  ca-7  caching  campaigns  canary-requests  cap  cap-theorem  capacity  carbon  case-studies  cassandra  cdn  censum  certs  cfengine  cgroups  change-management  change-monitoring  changes  chaos-monkey  charts  checklists  chef  chefspec  china  ci  circuit-breakers  circus  cisco  classification  classifiers  cleaner  cli  cloud  cloudera  cloudnative  cloudwatch  cluster  clustering  clusters  cms  code-spaces  codedeploy  coding  coinbase  cold  collaboration  command-line  commandline  commercial  company  compatibility  complexity  compression  concurrency  conferences  confidence-bands  configuration  consistency  consul  containerization  containers  continuous-delivery  continuous-deployment  continuous-integration  continuousintegration  copy-on-write  copyright  coreutils  corruption  cp  crash-only-software  criu  cron  crypto  culture  curl  daemon  daemons  dashboards  data  data-corruption  database  databases  datacenters  dataviz  debian  debug  debugging  decay  defrag  delete  delivery  delta  demo  dependencies  deploy  deployinator  deployment  desktops  dev  developers  development  devops  diagnosis  digital-ocean  disk  disks  distcomp  distributed  distributed-systems  distros  diy  dmca  dns  docker  documentation  dotcloud  drivers  dropbox  dstat  duplicity  duply  dynamic-configuration  dynamodb  dynect  ebs  ec2  ecs  elastic-scaling  elasticsearch  elb  email  emr  encryption  engineering  ensemble  erasure-coding  etsy  eureka  event-management  eventual-consistency  exception-handling  exercises  exponential-decay  extortion  fabric  facebook  facette  fail  failover  failure  false-positives  fault-tolerance  fcron  feature-flags  fedora  file-transfer  filesystems  fincore  firefighting  five-whys  flapjack  flock  forecasting  foursquare  freebsd  front-ends  fs  fsync  ftrace  g1  g1gc  gae  game-days  games  gating  gc  gilt-groupe  git  github  gnome  go  god  google  gossip  grafana  graphing  graphite  graphs  gzip  ha  hacks  hadoop  haproxy  hardware  hbase  hdds  hdfs  heap  heartbeats  heka  hero-coder  hero-culture  hidden-costs  hn  holt-winters  home  honeypot  hosting  hotspot  hrd  http  https  huge-pages  humor  hvm  hystrix  iam  ian-wilkes  ibm  icecube  images  imaging  inactivity  incident-response  inept  influxdb  infrastructure  init  inspeqtor  instrumentation  integration-tests  internet  interviews  inviso  io  iops  iostat  ioutil  ip  iptables  ironfan  java  jay-kreps  jcmd  jdk  jemalloc  jenkins  jepsen  jmx  jmxtrans  john-allspaw  jstat  juniper  jvm  kafka  kdd  kde  kellabyte  kernel  key-rotation  kill-9  knife  laptops  latency  legacy  leveldb  lhm  libc  lifespan  limits  linden  linkedin  links  linode  linux  live  load  load-balancers  load-balancing  locking  logentries  logging  loose-coupling  lsb  lsof  lsx  luks  lxc  mac  machine-learning  macosx  madvise  mail  maintainance  mandos  map-reduce  mapreduce  measurements  memory  mesos  metrics  mfa  microsoft  migrations  mincore  mirroring  mit  ml  mmap  mongodb  monit  monitorama  monitoring  movies  mozilla  mtbf  multiplexing  mysql  nagios  namespaces  nannies  nas  natwest  nerve  netflix  netstat  network  network-monitoring  network-partitions  networking  networks  nginx  nix  nixos  nixpkgs  node.js  nosql  notification  notifications  npm  ntp  ntpd  obama  omniti  oom  oom-killer  open-source  openjdk  operations  ops  optimization  organisations  os  oss  osx  ouch  out-of-band  outage  outages  outsourcing  packaging  page-cache  pager-duty  pagerduty  pages  paging  papers  partition  partitions  passenger  paxos  pbailis  pcp  pcp2graphite  pdf  peering  percona  performance  phusion  pie  pillar  pinterest  piops  pixar  post-mortems  postgres  postmortems  presentations  pricing  prioritisation  procedures  process  processes  procfs  production  profiling  programming  provisioning  pty  puppet  pv  python  qa  questions  queueing  rabbitmq  race-conditions  raid  rails  rami-rosen  randomization  ranking  rant  rate-limiting  rbs  rdbms  rds  recovery  red-hat  reddit  redis  redshift  refactoring  reference  regression-testing  reinvent  release  reliability  remediation  replicas  replication  resiliency  restarting  restoring  reviews  riak  riemann  risks  rm-rf  rollback  root-cause  root-causes  route53  routing  rspec  ruby  runbooks  runit  rvm  s3  s3funnel  s3ql  safety  sanity-checks  scala  scalability  scaling  scheduler  schema  scripts  sdd  seagate  search  security  sensu  serf  serialization  server  servers  serverspec  service-discovery  service-metrics  service-registry  services  ses  sev1  sharding  shodan  shopify  shorn-writes  silos  slashdot  sleep  slew  slides  smartstack  smtp  snappy  snapshots  sns  soa  sockets  software  solaris  soundcloud  south-pole  space  spark  spdy  speculative-execution  split-brain  spot-instances  spotify  sql  sre  ssd  ssh  ssl  stack  stack-size  startup  statistics  stats  statsd  statsite  stephanie-dean  stepping  storage  storm  strace  stream-processing  streaming  stress-testing  strider  stripe  supervision  supervisord  support  survey  svctm  syadmin  synapse  sysadmin  sysadvent  sysdig  syslog  system  system-testing  systemd  systems  tahoe-lafs  talks  tcp  tcpcopy  tcpdump  tdd  teams  tech  tech-debt  techops  tee  telefonica  telemetry  testing  thp  threadpools  threads  throughput  thundering-herd  tildeslash  time  time-machine  time-series  time-synchronization  tips  tls  tools  tos  trace  tracer-requests  tracing  trading  training  transactional-updates  transparent-huge-pages  troubleshooting  tuning  turing-complete  twilio  twisted  twitter  two-factor-authentication  uat  ubuntu  ubuntu-core  ui  ulster-bank  ultradns  unicorn  unit-testing  unit-tests  unix  upgrades  upstart  uptime  uselessd  usenix  vagrant  versioning  via:aphyr  via:bill-dehora  via:chughes  via:codeslinger  via:dave-doran  via:dehora  via:fanf  via:feylya  via:filippo  via:jk  via:kragen  via:lusis  via:martharotter  via:nelson  via:pdolan  via:pixelbeat  virtualisation  virtualization  visualisation  vm  vms  voldemort  vpc  web  web-services  webmail  weighting  wiki  wipac  work  xen  yahoo  yammer  zfs  zipkin  zonify  zookeeper  zooko 

Copy this bookmark: