jm + ops   318

Global Continuous Delivery with Spinnaker
Netflix's CD platform, post-Atlas. Looks interesting.
continuous-delivery  aws  netflix  cd  devops  ops  atlas  spinnaker 
8 days ago by jm
The impact of Docker containers on the performance of genomic pipelines [PeerJ]
In this paper, we have assessed the impact of Docker containers technology on the performance of genomic pipelines, showing that container “virtualization” has a negligible overhead on pipeline performance when it is composed of medium/long running tasks, which is the most common scenario in computational genomic pipelines.

Interestingly for these tasks the observed standard deviation is smaller when running with Docker. This suggests that the execution with containers is more “homogeneous,” presumably due to the isolation provided by the container environment.

The performance degradation is more significant for pipelines where most of the tasks have a fine or very fine granularity (a few seconds or milliseconds). In this case, the container instantiation time, though small, cannot be ignored and produces a perceptible loss of performance.
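
A rough way to see the effect they describe -- the fixed per-container start-up cost only matters when the task itself is short. This isn't the paper's benchmark (they ran real genomics pipelines); it's just an illustrative timing harness, and it assumes a local Docker daemon with the ubuntu:14.04 image already pulled:

```python
import subprocess
import time

def timed(cmd, runs=5):
    """Mean wall-clock seconds for a command over several runs."""
    samples = []
    for _ in range(runs):
        start = time.time()
        subprocess.check_call(cmd, stdout=subprocess.DEVNULL)
        samples.append(time.time() - start)
    return sum(samples) / len(samples)

# A deliberately tiny "task": the fixed docker-run overhead (namespace/cgroup
# setup, container teardown) dwarfs the work itself. For a medium/long-running
# task the same overhead fades into the noise.
native = timed(["sleep", "0.1"])
docker = timed(["docker", "run", "--rm", "ubuntu:14.04", "sleep", "0.1"])

print("native: %.3fs" % native)
print("docker: %.3fs (overhead %.3fs per task)" % (docker, docker - native))
```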
performance  docker  ops  genomics  papers 
12 days ago by jm
Alarm design: From nuclear power to WebOps
Imagine you are an operator in a nuclear power control room. An accident has started to unfold. During the first few minutes, more than 100 alarms go off, and there is no system for suppressing the unimportant signals so that you can concentrate on the significant alarms. Information is not presented clearly; for example, although the pressure and temperature within the reactor coolant system are shown, there is no direct indication that the combination of pressure and temperature mean that the cooling water is turning into steam. There are over 50 alarms lit in the control room, and the computer printer registering alarms is running more than 2 hours behind the events.

This was the basic scenario facing the control room operators during the Three Mile Island (TMI) partial nuclear meltdown in 1979. The Report of the President’s Commission stated that, “Overall, little attention had been paid to the interaction between human beings and machines under the rapidly changing and confusing circumstances of an accident” (p. 11). The TMI control room operator on the day, Craig Faust, recalled for the Commission his reaction to the incessant alarms: “I would have liked to have thrown away the alarm panel. It wasn’t giving us any useful information”. It was the first major illustration of the alarm problem, and the accident triggered a flurry of human factors/ergonomics (HF/E) activity.

A familiar topic for this ex-member of the Amazon network monitoring team...
ergonomics  human-factors  ui  ux  alarms  alerts  alerting  three-mile-island  nuclear-power  safety  outages  ops 
14 days ago by jm
Awesome new mock DynamoDB implementation:
An implementation of Amazon's DynamoDB, focussed on correctness and performance, and built on LevelDB (well, @rvagg's awesome LevelUP to be precise). This project aims to match the live DynamoDB instances as closely as possible (and is tested against them in various regions), including all limits and error messages.

Why not Amazon's DynamoDB Local? Because it's too buggy! And it differs too much from the live instances in a number of key areas.

We use DynamoDBLocal in our tests -- the availability of that tool is one of the key reasons we have adopted Dynamo so heavily, since we can safely test our code properly with it. This looks even better.
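
For the record, pointing test code at a local mock is just a matter of overriding the endpoint. A minimal sketch with boto3, assuming you've started dynalite listening on port 4567 (the table name and key schema are my own):

```python
import boto3

# Point the SDK at the local mock rather than the real service; the region and
# credentials only need to be syntactically valid, since nothing leaves the box.
dynamodb = boto3.resource(
    "dynamodb",
    endpoint_url="http://localhost:4567",   # wherever dynalite is listening
    region_name="us-east-1",
    aws_access_key_id="fake",
    aws_secret_access_key="fake",
)

table = dynamodb.create_table(
    TableName="test-items",
    KeySchema=[{"AttributeName": "id", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "id", "AttributeType": "S"}],
    ProvisionedThroughput={"ReadCapacityUnits": 1, "WriteCapacityUnits": 1},
)
table.wait_until_exists()

table.put_item(Item={"id": "42", "payload": "hello"})
assert table.get_item(Key={"id": "42"})["Item"]["payload"] == "hello"
```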
dynamodb  testing  unit-tests  integration-testing  tests  ops  dynalite  aws  leveldb 
14 days ago by jm
Nobody Loves Graphite Anymore - VividCortex
Graphite has a place in our current monitoring stack, and together with StatsD will always have a special place in the hearts of DevOps practitioners everywhere, but it’s not representative of state-of-the-art in the last few years. Graphite is where the puck was in 2010. If you’re skating there, you’re missing the benefits of modern monitoring infrastructure.

The future I foresee is one where time series capabilities (the raw power needed, which I described in my time series requirements blog post, for example) are within everyone’s reach. That will be considered table stakes, whereas now it’s pretty revolutionary.

Like I've been saying -- we need Time Series As A Service! This should be undifferentiated heavy lifting.
graphite  tsd  time-series  vividcortex  statsd  ops  monitoring  metrics 
19 days ago by jm
How Facebook avoids failures
Great paper from Ben Maurer of Facebook in ACM Queue.
A "move-fast" mentality does not have to be at odds with reliability. To make these philosophies compatible, Facebook's infrastructure provides safety valves.

This is full of interesting techniques.

* Rapidly deployed configuration changes: Make everybody use a common configuration system; Statically validate configuration changes; Run a canary; Hold on to good configurations; Make it easy to revert.

* Hard dependencies on core services: Cache data from core services. Provide hardened APIs. Run fire drills.

* Increased latency and resource exhaustion: Controlled Delay (based on the anti-bufferbloat CoDel algorithm -- this is really cool); Adaptive LIFO (last-in, first-out) for queue busting; Concurrency Control (essentially a form of circuit breaker).

* Tools that Help Diagnose Failures: High-Density Dashboards with Cubism (horizon charts); What just changed?

* Learning from Failure: the DERP (!) methodology.
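
The CoDel-based queue control is the bit I find really cool: give queued requests a generous time budget while the server has been draining its queue regularly, and a very aggressive one once the queue has been persistently backed up. A loose sketch of the idea -- the 5ms/100ms constants are illustrative, and a real implementation lives inside the server's request queue, not application code:

```python
import collections
import time

class CoDelQueue:
    """Controlled-delay request queue, loosely after the technique described
    in the article: items get a generous budget while the queue has been
    emptying regularly, and a much smaller one once it has been persistently
    non-empty. Constants are illustrative."""

    def __init__(self, target_ms=5, interval_ms=100):
        self.target = target_ms / 1000.0      # budget when overloaded
        self.interval = interval_ms / 1000.0  # budget when healthy
        self.queue = collections.deque()
        self.last_empty = time.time()

    def put(self, request):
        if not self.queue:
            self.last_empty = time.time()
        self.queue.append((time.time(), request))

    def get(self):
        """Pop the next request to serve, shedding any that waited too long."""
        overloaded = (time.time() - self.last_empty) > self.interval
        budget = self.target if overloaded else self.interval
        while self.queue:
            enqueued, request = self.queue.popleft()
            if time.time() - enqueued <= budget:
                return request
            # else: drop it -- the client has probably timed out already
        self.last_empty = time.time()
        return None
```

Adaptive LIFO is the complementary trick: under the same overload condition, serve the newest request first (swap the popleft() for a pop()), since the oldest waiters are the ones whose clients have most likely already given up.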
ben-maurer  facebook  reliability  algorithms  codel  circuit-breakers  derp  failure  ops  cubism  horizon-charts  charts  dependencies  soa  microservices  uptime  deployment  configuration  change-management 
23 days ago by jm
Structural and semantic deficiencies in the systemd architecture for real-world service management, a technical treatise
Despite its overarching abstractions, it is semantically non-uniform and its complicated transaction and job scheduling heuristics ordered around a dependently networked object system create pathological failure cases with little debugging context that would otherwise not necessarily occur on systems with less layers of indirection. The use of bus APIs complicate communication with the service manager and lead to duplication of the object model for little gain. Further, the unit file options often carry implicit state or are not sufficiently expressive. There is an imbalance with regards to features of an eager service manager and that of a lazy loading service manager, having rusty edge cases of both with non-generic, manager-specific facilities. The approach to logging and the circularly dependent architecture seem to imply that lots of prior art has been ignored or understudied.
analysis  systemd  linux  unix  ops  init  critiques  software  logging 
23 days ago by jm
Google Cloud Platform HTTP/HTTPS Load Balancing
GCE's LB product is pretty nice -- HTTP/2 support, and a built-in URL mapping feature (presumably based on how Google approach that problem internally). I'm hoping AWS are taking notes for the next generation of ELB, if that ever happens.
elb  gce  google  load-balancing  http  https  spdy  http2  urls  request-routing  ops  architecture  cloud 
26 days ago by jm
Google tears Symantec a new one on its CA failure
Symantec are getting a crash course in how to conduct an incident post-mortem to boot:
More immediately, we are requesting of Symantec that they further update their public incident report with:
* A post-mortem analysis that details why they did not detect the additional certificates that we found.
* Details of each of the failures to uphold the relevant Baseline Requirements and EV Guidelines and what they believe the individual root cause was for each failure.
We are also requesting that Symantec provide us with a detailed set of steps they will take to correct and prevent each of the identified failures, as well as a timeline for when they expect to complete such work. Symantec may consider this latter information to be confidential and so we are not requesting that this be made public.
google  symantec  ev  ssl  certificates  ca  security  postmortems  ops 
27 days ago by jm
Holistic Configuration Management at Facebook
How FB push config changes from Git (where it is code reviewed, version controlled, and history tracked with strong auth) to Zeus (their Zookeeper fork) and from there to live production servers.
facebook  configuration  zookeeper  git  ops  architecture 
5 weeks ago by jm
(ARC308) The Serverless Company: Using AWS Lambda
Describing PlayOn! Sports' Lambda setup. Sounds pretty productionizable
ops  lambda  aws  reinvent  slides  architecture 
5 weeks ago by jm
Designing the Spotify perimeter
How Spotify use nginx as a frontline for their sites and services
scaling  spotify  nginx  ops  architecture  ssl  tls  http  frontline  security 
6 weeks ago by jm
AWS re:Invent 2015 | (CMP406) Amazon ECS at Coursera - YouTube
Coursera are running user-submitted code in ECS! Interesting stuff about how they use Docker security/resource-limiting features, forking the ecs-agent code, to run that code safely. :O
coursera  user-submitted-code  sandboxing  docker  security  ecs  aws  resource-limits  ops 
6 weeks ago by jm
SuperChief: From Apache Storm to In-House Distributed Stream Processing
Another sorry tale of Storm issues:
Storm has been successful at Librato, but we experienced many of the limitations cited in the Twitter Heron: Stream Processing at Scale paper and outlined here by Adrian Colyer, including:
* Inability to isolate, reason about, or debug performance issues due to the worker/executor/task paradigm. This led to building and configuring clusters specifically designed to attempt to mitigate these problems (i.e., separate clusters per topology, only running a worker per server), which added additional complexity to development and operations and also led to over-provisioning.
* Ability of tasks to move around led to difficult-to-trace performance problems.
* Storm’s work provisioning logic led to some tasks serving more Kafka partitions than others. This in turn created latency and performance issues that were difficult to reason about. The initial solution was to over-provision in an attempt to get a better hashing/balancing of work, but eventually we just replaced the work allocation logic.
* Due to Storm’s architecture, it was very difficult to get a stack trace or heap dump because the processes that managed workers (Storm supervisor) would often forcefully kill a Java process while it was being investigated in this way.
* The propensity for unexpected and subsequently unhandled exceptions to take down an entire worker led to additional defensive verbose error handling everywhere.
* This nasty bug STORM-404, coupled with the aforementioned fact that a single exception can take down a worker, led to several cascading failures in production, taking down entire topologies until we upgraded to 0.9.4.
Additionally, we found the performance we were getting from Storm for the amount of money we were spending on infrastructure was not in line with our expectations. Much of this is due to the fact that, depending upon how your topology is designed, a single tuple may make multiple hops across JVMs, and this is very expensive. For example, in our time series aggregation topologies a single tuple may be serialized/deserialized and shipped across the wire 3-4 times as it progresses through the processing pipeline.
scalability  storm  kafka  librato  architecture  heron  ops 
6 weeks ago by jm
Outage postmortem (2015-10-08 UTC) : Stripe: Help & Support
There was a breakdown in communication between the developer who requested the index migration and the database operator who deleted the old index. Instead of working on the migration together, they communicated in an implicit way through flawed tooling. The dashboard that surfaced the migration request was missing important context: the reason for the requested deletion, the dependency on another index’s creation, and the criticality of the index for API traffic. Indeed, the database operator didn’t have a way to check whether the index had recently been used for a query.

Good demo of how the Etsy-style chatops deployment approach would have helped avoid this risk.
stripe  postmortem  outages  databases  indexes  deployment  chatops  deploy  ops 
6 weeks ago by jm
How IFTTT develop with Docker
ugh, quite a bit of complexity here
docker  osx  dev  ops  building  coding  ifttt  dns  dnsmasq 
6 weeks ago by jm
The Totally Managed Analytics Pipeline: Segment, Lambda, and Dynamo
notable mainly for the details of Terraform support for Lambda: that's a significant improvement to Lambda's production-readiness
aws  pipelines  data  streaming  lambda  dynamodb  analytics  terraform  ops 
6 weeks ago by jm
Rebuilding Our Infrastructure with Docker, ECS, and Terraform
Good writeup of current best practices for a production AWS architecture
aws  ops  docker  ecs  ec2  prod  terraform  segment  via:marc 
6 weeks ago by jm
Elasticsearch and data loss
"@alexbfree @ThijsFeryn [ElasticSearch is] fine as long as data loss is acceptable. . We lose ~1% of all writes on average."
elasticsearch  data-loss  reliability  data  search  aphyr  jepsen  testing  distributed-systems  ops 
7 weeks ago by jm
Træfɪk is a modern HTTP reverse proxy and load balancer made to deploy microservices with ease. It supports several backends (Docker, Mesos/Marathon, Consul, Etcd, Rest API, file...) to manage its configuration automatically and dynamically.

Hot-reloading is notably much easier than with nginx/haproxy.
proxy  http  proxying  reverse-proxy  traefik  go  ops 
8 weeks ago by jm
Chaos Engineering Upgraded
some details on Netflix's Chaos Monkey, Chaos Kong and other aspects of their availability/failover testing
architecture  aws  netflix  ops  chaos-monkey  chaos-kong  testing  availability  failover  ha 
8 weeks ago by jm
a tool which simplifies tracing and testing of Java programs. Byteman allows you to insert extra Java code into your application, either as it is loaded during JVM startup or even after it has already started running. The injected code is allowed to access any of your data and call any application methods, including where they are private. You can inject code almost anywhere you want and there is no need to prepare the original source code in advance nor do you have to recompile, repackage or redeploy your application. In fact you can remove injected code and reinstall different code while the application continues to execute. The simplest use of Byteman is to install code which traces what your application is doing. This can be used for monitoring or debugging live deployments as well as for instrumenting code under test so that you can be sure it has operated correctly. By injecting code at very specific locations you can avoid the overheads which often arise when you switch on debug or product trace. Also, you decide what to trace when you run your application rather than when you write it so you don't need 100% hindsight to be able to obtain the information you need.
tracing  java  byteman  injection  jvm  ops  debugging  testing 
8 weeks ago by jm
a specialized packet sniffer designed for displaying and logging HTTP traffic. It is not intended to perform analysis itself, but to capture, parse, and log the traffic for later analysis. It can be run in real-time displaying the traffic as it is parsed, or as a daemon process that logs to an output file. It is written to be as lightweight and flexible as possible, so that it can be easily adaptable to different applications.

via Eoin Brazil
via:eoinbrazil  httpry  http  networking  tools  ops  testing  tcpdump  tracing 
9 weeks ago by jm
Anatomy of a Modern Production Stack
Interesting post, but I think it falls into a common trap for the xoogler or ex-Amazonian -- assuming that all the BigCo mod cons are required to operate, when some are luxuries that can be skipped for a few years to get some real products built.
architecture  ops  stack  docker  containerization  deployment  containers  rkt  coreos  prod  monitoring  xooglers 
9 weeks ago by jm
You're probably wrong about caching
Excellent cut-out-and-keep guide to why you should think twice before adding a caching layer. I've been following this practice for the past few years, after I realised that #6 (recovering from a failed cache is hard) is a killer -- I've seen a few large-scale outages where a production system had gained enough scale that it required a cache to operate, and once that cache was damaged, bringing the system back online required a painful rewarming protocol. Better to design for the non-cached case if possible.
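
The concrete version of "design for the non-cached case": treat the cache strictly as an optimisation, so a cold or broken cache degrades to slow rather than down. A hedged sketch -- cache/db here stand in for whatever client objects you already have, and the jittered TTL is just to avoid a synchronised stampede when rewarming:

```python
import random

def cached_read(key, cache, db, ttl=300):
    """Cache-aside read that treats the cache as an optimisation, not a
    dependency: any cache failure degrades to a database read rather than
    an error, so a cold or damaged cache doesn't take the service down."""
    try:
        value = cache.get(key)
        if value is not None:
            return value
    except Exception:
        pass  # cache down: fall through to the authoritative store

    value = db.get(key)  # must be provisioned to survive a 0% hit rate

    try:
        # Jitter the TTL so a mass rewarm doesn't expire everything at once.
        cache.set(key, value, ttl + random.randint(0, ttl // 10))
    except Exception:
        pass  # best-effort write-back
    return value
```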
architecture  caching  coding  design  caches  ops  production  scalability 
11 weeks ago by jm
Docker image creation, tagging and traceability in Shippable
this is starting to look quite impressive as a well-integrated Docker-meets-CI model; Shippable is basing its builds off Docker baselines and is automatically cutting Docker images of the post-CI stage. Must take another look
shippable  docker  ci  ops  dev  continuous-integration 
august 2015 by jm
Call me Maybe: Chronos
Chronos (the Mesos distributed scheduler) comes out looking pretty crappy here
aphyr  mesos  chronos  cron  scheduling  outages  ops  jepsen  testing  partitions  cap 
august 2015 by jm
Cleanup old/obsolete Docker images in a repo.
disk-space  ops  docker  cleanup  cron 
august 2015 by jm
our full-featured, high performance, scalable web server designed to compete with the likes of nginx. It has been built from the ground-up with no external library dependencies entirely in x86_64 assembly language, and is the result of many years' experience with high volume web environments. In addition to all of the common things you'd expect a modern web server to do, we also include assembly language function hooks ready-made to facilitate Rapid Web Application Server (in Assembler) development.
assembly  http  performance  https  ssl  x86_64  web  ops  rwasa  tls 
august 2015 by jm
A collection of postmortems
A well-maintained list with a potted description of each one (via HN)
postmortems  ops  uptime  reliability 
august 2015 by jm
Introducing Nurse: Auto-Remediation at LinkedIn
Interesting to hear about auto-remediation in prod -- we built a (very targeted) auto-remediation system in Amazon on the Network Monitoring team, but this is much bigger in focus
nurse  auto-remediation  outages  linkedin  ops  monitoring 
august 2015 by jm
danilop/runjop · GitHub
RunJOP (Run Just Once Please) is a distributed execution framework to run a command (i.e. a job) only once in a group of servers [built using AWS DynamoDB and S3].

nifty! Distributed cron is pretty easy when you've got Dynamo doing the heavy lifting.
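
The heavy lifting really is just a conditional write. A minimal sketch of the run-once check with boto3 (not RunJOP's actual code; the job-locks table name and its job_id hash key are my own assumptions):

```python
import boto3
from botocore.exceptions import ClientError

def try_acquire(job_id, host, table_name="job-locks"):
    """Attempt to claim a job exactly once across a fleet: the conditional
    put succeeds on exactly one host; everyone else gets a
    ConditionalCheckFailedException and skips the run."""
    table = boto3.resource("dynamodb").Table(table_name)
    try:
        table.put_item(
            Item={"job_id": job_id, "owner": host},
            ConditionExpression="attribute_not_exists(job_id)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # someone else got there first
        raise

if try_acquire("daily-report-2015-07-20", "host-17"):
    print("running the job on this host")
else:
    print("another host already claimed this run")
```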
dynamodb  cron  distributed-cron  scheduling  runjop  danilop  hacks  aws  ops 
july 2015 by jm
Why Docker is Not Yet Succeeding Widely in Production
Spot-on points which Docker needs to address. It's still production-ready, and _should_ be used there; it just has significant rough edges...
docker  containers  devops  deployment  releases  linux  ops 
july 2015 by jm
Taming Complexity with Reversibility
This is a great post from Kent Beck, putting a lot of recent deployment/rollout patterns in a clear context -- that of supporting "reversibility":
* Development servers. Each engineer has their own copy of the entire site. Engineers can make a change, see the consequences, and reverse the change in seconds without affecting anyone else.
* Code review. Engineers can propose a change, get feedback, and improve or abandon it in minutes or hours, all before affecting any people using Facebook.
* Internal usage. Engineers can make a change, get feedback from thousands of employees using the change, and roll it back in an hour.
* Staged rollout. We can begin deploying a change to a billion people and, if the metrics tank, take it back before problems affect most people using Facebook.
* Dynamic configuration. If an engineer has planned for it in the code, we can turn off an offending feature in production in seconds. Alternatively, we can dial features up and down in tiny increments (i.e. only 0.1% of people see the feature) to discover and avoid non-linear effects.
* Correlation. Our correlation tools let us easily see the unexpected consequences of features so we know to turn them off even when those consequences aren't obvious.
* IRC. We can roll out features potentially affecting our ability to communicate internally via Facebook because we have uncorrelated communication channels like IRC and phones.
* Right hand side units. We can add a little bit of functionality to the website and turn it on and off in seconds, all without interfering with people's primary interaction with NewsFeed.
* Shadow production. We can experiment with new services under real load, from a tiny trickle to the whole flood, without affecting production.
* Frequent pushes. Reversing some changes requires a code change. On the website we are never more than eight hours from the next scheduled code push (minutes if a fix is urgent and you are willing to compensate Release Engineering). The time frame for code reversibility on the mobile applications is longer, but the downward trend is clear from six weeks to four to (currently) two.
* Data-informed decisions. (Thanks to Dave Cleal) Data-informed decisions are inherently reversible (with the exceptions noted below). "We expect this feature to affect this metric. If it doesn't, it's gone."
* Advance countries. We can roll a feature out to a whole country, generate accurate feedback, and roll it back without affecting most of the people using Facebook.
* Soft launches. When we roll out a feature or application with a minimum of fanfare it can be pulled back with a minimum of public attention.
* Double write/bulk migrate/double read. Even as fundamental a decision as storage format is reversible if we follow this format: start writing all new data to the new data store, migrate all the old data, then start reading from the new data store in parallel with the old.

We do a bunch of these in work, and the rest are on the to-do list. +1 to these!
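
The "double write/bulk migrate/double read" item is the one I keep coming back to; it boils down to a wrapper like this (old_store/new_store/mismatch_metric are hypothetical interfaces, not Facebook code):

```python
class MigratingStore:
    """Reversible storage migration: write to both stores, read from the new
    one with the old as a cross-check and fallback. Either store can be
    dropped at any point without data loss, which is what makes the change
    reversible."""

    def __init__(self, old_store, new_store, mismatch_metric):
        self.old = old_store
        self.new = new_store
        self.mismatch = mismatch_metric  # e.g. a statsd counter

    def write(self, key, value):
        self.old.put(key, value)   # old store stays authoritative...
        self.new.put(key, value)   # ...while the new one is double-written

    def read(self, key):
        new_value = self.new.get(key)
        old_value = self.old.get(key)
        if new_value != old_value:
            self.mismatch.increment()   # gaps left by the bulk migration
            return old_value            # trust the incumbent until parity
        return new_value
```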
software  deployment  complexity  systems  facebook  reversibility  dark-releases  releases  ops  cd  migration 
july 2015 by jm
Benchmarking GitHub Enterprise - GitHub Engineering
Walkthrough of debugging connection timeouts in a load test. Nice graphs (using matplotlib)
github  listen-backlog  tcp  debugging  timeouts  load-testing  benchmarking  testing  ops  linux 
july 2015 by jm
Mikhail Panchenko's thoughts on the July 2015 CircleCI outage
an excellent followup operational post on CircleCI's "database is not a queue" outage
database-is-not-a-queue  mysql  sql  databases  ops  outages  postmortems 
july 2015 by jm
From Zero to Docker: Migrating to the Whale
nicely detailed writeup of how New Relic are dockerizing
docker  ops  deployment  packaging  new-relic 
july 2015 by jm
Outlier Detection at Netflix | Hacker News
Excellent HN thread re automated anomaly detection in production, Q&A with the dev team
machine-learning  ml  remediation  anomaly-detection  netflix  ops  time-series  clustering 
july 2015 by jm
a command line tool for JVM diagnostic troubleshooting and profiling.
java  jvm  monitoring  commandline  jmx  sjk  tools  ops 
june 2015 by jm
Cloudflare's open source CA/PKI infrastructure app
cloudflare  pki  ca  ssl  tls  ops 
june 2015 by jm
Docker at Shopify: From This-Looks-Fun to Production
Pragmatic evolution story, adding Docker as a packaging/deploy format for an existing production Capistrano/Rails fleet
docker  ops  deployment  packaging  shopify  slides 
june 2015 by jm
Google Cloud Platform announces new Container Registry
Yay. Sensible Docker registry pricing at last. Given the high prices, rough edges and slow performance of the other registry offerings, I'm quite happy to see this.
Google Container Registry helps make it easy for you to store your container images in a private and encrypted registry, built on Cloud Platform. Pricing for storing images in Container Registry is simple: you only pay Google Cloud Storage costs. Pushing images is free, and pulling Docker images within a Google Cloud Platform region is free (Cloud Storage egress cost when outside of a region).

Container Registry is now ready for production use:

* Encrypted and Authenticated - Your container images are encrypted at rest, and access is authenticated using Cloud Platform OAuth and transmitted over SSL
* Fast - Container Registry is fast and can handle the demands of your application, because it is built on Cloud Storage and Cloud Networking.
* Simple - If you’re using Docker, just tag your image with a tag and push it to the registry to get started.  Manage your images in the Google Developers Console.
* Local - If your cluster runs in Asia or Europe, you can now store your images in ASIA or EU specific repositories using asia.gcr.io and eu.gcr.io tags.
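
The tag-and-push flow really is that short -- something like the following, assuming your Docker client is already authenticated to the registry (the project ID, image name and tag below are made up):

```python
import subprocess

PROJECT = "my-gcp-project"   # hypothetical GCP project ID
IMAGE = "webapp"
TAG = "v1.2.3"
remote = "gcr.io/%s/%s:%s" % (PROJECT, IMAGE, TAG)

# Tag the locally built image with its registry-qualified name, then push it.
subprocess.check_call(["docker", "tag", "%s:%s" % (IMAGE, TAG), remote])
subprocess.check_call(["docker", "push", remote])
print("pushed %s" % remote)
```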
docker  registry  google  gcp  containers  cloud-storage  ops  deployment 
june 2015 by jm
Automated Nginx Reverse Proxy for Docker
Nice hack. An automated nginx reverse proxy which regenerates as the Docker containers update
nginx  reverse-proxy  proxies  web  http  ops  docker 
june 2015 by jm
Google Cloud Platform Blog: A look inside Google’s Data Center Networks
We used three key principles in designing our datacenter networks:
* We arrange our network around a Clos topology, a network configuration where a collection of smaller (cheaper) switches are arranged to provide the properties of a much larger logical switch.
* We use a centralized software control stack to manage thousands of switches within the data center, making them effectively act as one large fabric.
* We build our own software and hardware using silicon from vendors, relying less on standard Internet protocols and more on custom protocols tailored to the data center.
clos-networks  google  data-centers  networking  sdn  gcp  ops 
june 2015 by jm
Why I dislike systemd
Good post, and hard to disagree.
One of the "features" of systemd is that it allows you to boot a system without needing a shell at all. This seems like such a senseless manoeuvre that I can't help but think of it as a knee-jerk reaction to the perception of Too Much Shell in sysv init scripts.
In exactly which universe is it reasonable to assume that you have a running D-Bus service (or kdbus) and a filesystem containing unit files, all the binaries they refer to, all the libraries they link against, and all the configuration files any of them reference, but that you lack that most ubiquitous of UNIX binaries, /bin/sh?
history  linux  unix  systemd  bsd  system-v  init  ops  dbus 
june 2015 by jm
VPC Flow Logs
we are introducing Flow Logs for the Amazon Virtual Private Cloud.  Once enabled for a particular VPC, VPC subnet, or Elastic Network Interface (ENI), relevant network traffic will be logged to CloudWatch Logs for storage and analysis by your own applications or third-party tools.

You can create alarms that will fire if certain types of traffic are detected; you can also create metrics to help you to identify trends and patterns. The information captured includes information about allowed and denied traffic (based on security group and network ACL rules). It also includes source and destination IP addresses, ports, the IANA protocol number, packet and byte counts, a time interval during which the flow was observed, and an action (ACCEPT or REJECT).
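
Each record delivered to CloudWatch Logs is a single space-separated line with a fixed field order, so parsing it for your own analysis is trivial. A quick sketch assuming the default format; the sample record is made up:

```python
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

def parse_flow_record(line):
    """Split one space-separated flow log record into a dict; numeric fields
    are left as strings here for brevity."""
    return dict(zip(FIELDS, line.split()))

# An illustrative record: SSH traffic rejected by a security group / ACL rule.
record = parse_flow_record(
    "2 123456789010 eni-abc123de 172.31.16.139 172.31.16.21 "
    "20641 22 6 20 4249 1418530010 1418530070 REJECT OK"
)
if record["action"] == "REJECT":
    print("denied %s:%s -> %s:%s (proto %s)" % (
        record["srcaddr"], record["srcport"],
        record["dstaddr"], record["dstport"], record["protocol"]))
```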
ec2  aws  vpc  logging  tracing  ops  flow-logs  network  tcpdump  packets  packet-capture 
june 2015 by jm
How We Moved Our API From Ruby to Go and Saved Our Sanity
Parse on their ditching-Rails story. I haven't heard a nice thing about Ruby or Rails as an operational, production-quality platform in a long time :(
go  ruby  rails  ops  parse  languages  platforms 
june 2015 by jm
etcd Clustering in AWS
'a fully-automated solution to build auto-scaling etcd clusters in AWS'
aws  cluster  docker  etcd  asg  autoscaling  ops 
june 2015 by jm
1172401 – Add Amazon root certificates
Well, well -- looks like AWS is about to disrupt PKI, and about time too. If they come up with a Plex-style "provision a cert" API, it'll be revolutionary
pki  ssl  tls  amazon  aws  apis  web-services  ops 
june 2015 by jm
Eric Brewer interview on Kubernetes
What is the relationship between Kubernetes, Borg and Omega (the two internal resource-orchestration systems Google has built)?

I would say, kind of by definition, there’s no shared code but there are shared people.

You can think of Kubernetes — especially some of the elements around pods and labels — as being lessons learned from Borg and Omega that are, frankly, significantly better in Kubernetes. There are things that are going to end up being the same as Borg — like the way we use IP addresses is very similar — but other things, like labels, are actually much better than what we did internally.

I would say that’s a lesson we learned the hard way.
google  architecture  kubernetes  docker  containers  borg  omega  deployment  ops 
may 2015 by jm
Deploy a registry - Docker Documentation
Looks like it's pretty feasible to run a private Docker registry on every host, backed by S3 (according to the ECS team's AMA). SPOF-free -- handy
docker  registry  ops  deployment  s3 
may 2015 by jm
Migration to, Expectations, and Advanced Tuning of G1GC
Bookmarking for future reference. recommended by one of the GC experts, I can't recall exactly who ;)
gc  g1gc  jvm  java  tuning  performance  ops  migration 
may 2015 by jm
Patterns for building a resilient and scalable microservices platform on AWS
Some good details from Boyan Dimitrov at Hailo, on their orchestration, deployment, provisioning infra they've built
deployment  ops  devops  hailo  microservices  platform  patterns  slides 
may 2015 by jm
Why Loggly loves Apache Kafka
Some good factoids about Loggly's Kafka usage and scales
scalability  logging  loggly  kafka  queueing  ops  reliabilty 
may 2015 by jm
Cassandra moving to using G1 as the default recommended GC implementation
This is a big indicator that G1 is ready for primetime. CMS has long been the go-to GC for production usage, but requires careful, complex hand-tuning -- if G1 is getting to a stage where it's just a case of giving it enough RAM, that'd be great.

Also, looks like it'll be the JDK9 default.
cassandra  tuning  ops  g1gc  cms  gc  java  jvm  production  performance  memory 
april 2015 by jm
a web-based SSH console that centrally manages administrative access to systems. Web-based administration is combined with management and distribution of user's public SSH keys. Key management and administration is based on profiles assigned to defined users.

Administrators can login using two-factor authentication with FreeOTP or Google Authenticator . From there they can create and manage public SSH keys or connect to their assigned systems through a web-shell. Commands can be shared across shells to make patching easier and eliminate redundant command execution.
keybox  owasp  security  ssh  tls  ssl  ops 
april 2015 by jm
'Discover and discuss the best dev tools and cloud infrastructure services' -- fun!
stackshare  architecture  stack  ops  software  ranking  open-source 
april 2015 by jm
Kubernetes compared to Borg
'Here are four Kubernetes features that came from our experiences with Borg.'
google  ops  kubernetes  borg  containers  docker  networking 
april 2015 by jm
Cluster-Based Architectures Using Docker and Amazon EC2 Container Service
In this post, we’re going to take a deeper dive into the architectural concepts underlying cluster computing using container management frameworks such as ECS. We will show how these frameworks effectively abstract the low-level resources such as CPU, memory, and storage, allowing for highly efficient usage of the nodes in a compute cluster. Building on some of the concepts detailed in the earlier posts, we will discover why containers are such a good fit for this type of abstraction, and how the Amazon EC2 Container Service fits into the larger ecosystem of cluster management frameworks.
docker  aws  ecs  ec2  ops  hosting  containers  mesos  clusters 
april 2015 by jm
Amazon EC2 Container Service team AmA
a few answers here. Mostly people pointing out shortcomings and the team asking them to start a thread on their forum though :(
ec2  ecs  docker  aws  ops  ama  reddit 
april 2015 by jm
Etsy's Release Management process
Good info on how Etsy use their Deployinator tool, end-to-end.

Slide 11: git SHA is visible for each env, allowing easy verification of what code is deployed.

Slide 14: Code is deployed to "princess" staging env while CI tests are running; no need to wait for unit/CI tests to complete.

Slide 23: smoke tests of pre-prod "princess" (complete after 8 mins elapsed).

Slide 31: dashboard link for deployed code is posted during deploy; post-release prod smoke tests are run by Jenkins. (short ones! they complete in 42 seconds)
deployment  etsy  deploy  deployinator  princess  staging  ops  testing  devops  smoke-tests  production  jenkins 
april 2015 by jm
'Continuous Deployment: The Dirty Details'
Good slide deck from Etsy's Mike Brittain regarding their CD setup. Some interesting little-known details:

Slide 41: database schema changes are not CD'd -- they go out on "Schema change Thursdays".

Slide 44: only the webapp is CD'd -- PHP, Apache, memcache components (support and back-office tools, developer API, gearman async worker queues). The external "services" are not -- databases, Solr/JVM search (rolling restarts), photo storage (filters, proxy cache, S3), payments (PCI-DSS, controlled access).

They avoid schema changes and breaking changes using an approach they call "non-breaking expansions" -- expose new version in a service interface; support multiple versions in the consumer. Example from slides 50-63, based around a database schema migration.

Slide 66: "dev flags" (rollout oriented) are promoted to "feature flags" (long lived degradation control).

Slide 71: some architectural philosophies: deploying is cheap; releasing is cheap; gathering data should be cheap too; treat first iterations as experiments.

Slide 102: "Canary pools". They have multiple pools of users for testing in production -- the staff pool, users who have opted in to see prototypes/beta stuff, 0-100% gradual phased rollout.
cd  deploy  etsy  slides  migrations  database  schema  ops  ci  version-control  feature-flags 
april 2015 by jm
Internet Scale Services Checklist
good aspirational checklist, inspired heavily by James Hamilton's seminal 2007 paper, "On Designing And Deploying Internet-Scale Services"
james-hamilton  checklists  ops  internet-scale  architecture  operability  monitoring  reliability  availability  uptime  aspirations 
april 2015 by jm
Pinterest's Hadoop workflow manager; 'scalable, reliable, simple, extensible' apparently. Hopefully it allows upgrades of a workflow component without breaking an existing run in progress -- which is what LinkedIn's Azkaban does :(
python  pinterest  hadoop  workflows  ops  pinball  big-data  scheduling 
april 2015 by jm
'a secret management and distribution service [from Square] that is now available for everyone. Keywhiz helps us with infrastructure secrets, including TLS certificates and keys, GPG keyrings, symmetric keys, database credentials, API tokens, and SSH keys for external services — and even some non-secrets like TLS trust stores. Automation with Keywhiz allows us to seamlessly distribute and generate the necessary secrets for our services, which provides a consistent and secure environment, and ultimately helps us ship faster. [...]

Keywhiz has been extremely useful to Square. It’s supported both widespread internal use of cryptography and a dynamic microservice architecture. Initially, Keywhiz use decoupled many amalgamations of configuration from secret content, which made secrets more secure and configuration more accessible. Over time, improvements have led to engineers not even realizing Keywhiz is there. It just works. Please check it out.'
square  security  ops  keys  pki  key-distribution  key-rotation  fuse  linux  deployment  secrets  keywhiz 
april 2015 by jm