jm + ops   368

Introducing Winston
'Event driven Diagnostic and Remediation Platform' -- aka 'runbooks as code'
runbooks  winston  netflix  remediation  outages  mttr  ops  devops 
23 days ago by jm
AWS Case Study: mytaxi
ECS, Docker, ELB, SQS, SNS, RDS, VPC, and spot instances. Pretty canonical setup these days...
The mytaxi app is also now able to predict daily and weekly spikes. In addition, it has gained the elasticity required to meet demand during special events. Herzberg describes a typical situation on New Year's Eve: “Shortly before midnight everyone needs a taxi to get to parties, and after midnight people want to go home. In past years we couldn't keep up with the demand this generated, which was around three and a half times as high as normal. In November 2015 we moved our Docker container architecture to Amazon ECS, and for the first time ever in December we were able to celebrate a new year in which our system could handle the huge number of requests without any crashes or interruptions—an accomplishment that we were extremely proud of. We had faced the biggest night on the calendar without any downtime.”
mytaxi  aws  ecs  docker  elb  sqs  sns  rds  vpc  spot-instances  ops  architecture 
24 days ago by jm
My Philosophy on Alerting
'based on my observations while I was a Site Reliability Engineer at Google', courtesy of Rob Ewaschuk <rob@infinitepigeons.org>. Seems pretty reasonable
monitoring  sysadmin  alerting  alerts  nagios  pager  ops  sre  rob-ewaschuk 
6 weeks ago by jm
Raintank investing in Graphite
paying Jason Dixon to work on it, improving the backend, possibly replacing the creaky Whisper format. Great news!
graphite  metrics  monitoring  ops  open-source  grafana  raintank 
7 weeks ago by jm
USE Method: Linux Performance Checklist
Really late in bookmarking this, but it has some up-to-date sample command lines for sar, mpstat and iostat on Linux
linux  sar  iostat  mpstat  cli  ops  sysadmin  performance  tuning  use  metrics 
8 weeks ago by jm
Squeezing blood from a stone: small-memory JVM techniques for microservice sidecars
Reducing service memory usage from 500MB to 105MB:
We found two specific techniques to be the most beneficial: turning off one of the two JIT compilers enabled by default (the “C2” compiler), and using a 32-bit, rather than a 64-bit, JVM.
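(For what it's worth, the usual HotSpot knob for the first of those is -XX:TieredStopAtLevel=1, which keeps only the C1 compiler; whether the post uses exactly that flag isn't shown here, and the 32-bit part simply means running an i386 JVM build rather than setting a flag.)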
32bit  jvm  java  ops  memory  tuning  jit  linkerd 
10 weeks ago by jm
Some thoughts on operating containers
R.I.Pienaar talks about the conventions he uses when containerising; looks like a decent approach.
ops  containers  docker  ripienaar  packaging 
10 weeks ago by jm
Green/Blue Deployments with AWS Lambda and CloudFormation - done right
Basically, use a Lambda to put all instances from an ASG into the ELB, then remove the old ASG
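A minimal sketch of that swap in Python/boto3 (the ASG and ELB names are placeholders; the post's actual Lambda and CloudFormation wiring isn't reproduced here):

    import boto3

    asg = boto3.client('autoscaling')
    elb = boto3.client('elb')

    def handler(event, context):
        # find the instances in the freshly created ("green") ASG -- placeholder name
        group = asg.describe_auto_scaling_groups(
            AutoScalingGroupNames=['my-service-green'])['AutoScalingGroups'][0]
        instances = [{'InstanceId': i['InstanceId']} for i in group['Instances']]

        # attach them to the (classic) ELB serving live traffic
        elb.register_instances_with_load_balancer(
            LoadBalancerName='my-service-elb', Instances=instances)

        # a real version would wait for the new instances to pass ELB health
        # checks before tearing down the old ("blue") ASG
        asg.delete_auto_scaling_group(
            AutoScalingGroupName='my-service-blue', ForceDelete=True)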
asg  elb  aws  lambda  deployment  ops  blue-green-deploys 
may 2016 by jm
#825394 - systemd kill background processes after user logs out - Debian Bug report logs
Systemd breaks UNIX behaviour which has been standard practice for 30 years:
It is now indeed the case that any background processes that were still running are killed automatically when the user logs out of a session, whether it was a desktop session, a VT session, or when you SSHed into a machine. Now you can no longer expect long-running background processes to continue after logging out. I believe this breaks the expectations of many users. For example, you can no longer start a screen or tmux session, log out, and expect to come back to it.
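(The standard workaround, for what it's worth: the KillUserProcesses default flipped to "yes" in systemd 230 and can be flipped back in logind.conf; individual jobs can also be moved out of the session scope with something like "systemd-run --user --scope tmux".)

    # /etc/systemd/logind.conf -- restore the traditional behaviour
    [Login]
    KillUserProcesses=no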
systemd  ops  debian  linux  fail  background  cli  commandline 
may 2016 by jm
3 Reasons AWS Lambda Is Not Ready for Prime Time
This totally matches my own preconceptions ;)
When we at Datawire tried to actually use Lambda for a real-world HTTP-based microservice [...], we found some uncool things that make Lambda not yet ready for the world we live in:

Lambda is a building block, not a tool;
Lambda is not well documented;
Lambda is terrible at error handling

Lung skips these uncool things, which makes sense because they’d make the tutorial collapse under its own weight, but you can’t skip them if you want to work in the real world. (Note that if you’re using Lambda for event handling within the AWS world, your life will be easier. But the really interesting case in the microservice world is Lambda and HTTP.)
aws  lambda  microservices  datawire  http  api-gateway  apis  https  python  ops 
may 2016 by jm
Key Metrics for Amazon Aurora | AWS Partner Network (APN) Blog
Very DataDog-oriented, but some decent tips on monitorable metrics here
datadog  metrics  aurora  aws  rds  monitoring  ops 
may 2016 by jm
raboof/nethogs: Linux 'net top' tool
NetHogs is a small 'net top' tool. Instead of breaking the traffic down per protocol or per subnet, like most tools do, it groups bandwidth by process.
nethogs  cli  networking  performance  measurement  ops  linux  top 
may 2016 by jm
CoreOS and Prometheus: Building monitoring for the next generation of cluster infrastructure
Ooh, this is a great plan. :applause:
Enabling GIFEE — Google Infrastructure for Everyone Else — is a primary mission at CoreOS, and open source is key to that goal. [....]

Prometheus was initially created to handle monitoring and alerting in modern microservice architectures. It steadily grew to fit the wider idea of cloud native infrastructure. Though it was not intentional in the original design, Prometheus and Kubernetes conveniently share the key concept of identifying entities by labels, making the semantics of monitoring Kubernetes clusters simple. As we discussed previously on this blog, Prometheus metrics formed the basis of our analysis of Kubernetes scheduler performance, and led directly to improvements in that code. Metrics are essential not just to keep systems running, but also to analyze and improve application behavior.

All things considered, Prometheus was an obvious choice for the next open source project CoreOS wanted to support and improve with internal developers committed to the code base.
monitoring  coreos  prometheus  metrics  clustering  ops  gifee  google  kubernetes 
may 2016 by jm
Amazon S3 Transfer Acceleration
The AWS edge network has points of presence in more than 50 locations. Today, it is used to distribute content via Amazon CloudFront and to provide rapid responses to DNS queries made to Amazon Route 53. With today’s announcement, the edge network also helps to accelerate data transfers in to and out of Amazon S3. It will be of particular benefit to you if you are transferring data across or between continents, have a fast Internet connection, use large objects, or have a lot of content to upload.

You can think of the edge network as a bridge between your upload point (your desktop or your on-premises data center) and the target bucket. After you enable this feature for a bucket (by checking a checkbox in the AWS Management Console), you simply change the bucket’s endpoint to the form BUCKET_NAME.s3-accelerate.amazonaws.com. No other configuration changes are necessary! After you do this, your TCP connections will be routed to the best AWS edge location based on latency.  Transfer Acceleration will then send your uploads back to S3 over the AWS-managed backbone network using optimized network protocols, persistent connections from edge to origin, fully-open send and receive windows, and so forth.
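A rough Python/boto3 sketch of what that looks like in practice (bucket and file names are placeholders; the option names below come from the boto3/botocore docs rather than the announcement itself):

    import boto3
    from botocore.client import Config

    s3 = boto3.client('s3')
    # flip the bucket-level switch (same as the console checkbox)
    s3.put_bucket_accelerate_configuration(
        Bucket='my-bucket',
        AccelerateConfiguration={'Status': 'Enabled'})

    # route transfers via my-bucket.s3-accelerate.amazonaws.com
    accel = boto3.client('s3', config=Config(s3={'use_accelerate_endpoint': True}))
    accel.upload_file('big-dataset.tar', 'my-bucket', 'big-dataset.tar')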
aws  s3  networking  infrastructure  ops  internet  cdn 
april 2016 by jm
Google Cloud Status
Ouch, multi-region outage:
At 14:50 Pacific Time on April 11th, our engineers removed an unused GCE IP block from our network configuration, and instructed Google’s automated systems to propagate the new configuration across our network. By itself, this sort of change was harmless and had been performed previously without incident. However, on this occasion our network configuration management software detected an inconsistency in the newly supplied configuration. The inconsistency was triggered by a timing quirk in the IP block removal - the IP block had been removed from one configuration file, but this change had not yet propagated to a second configuration file also used in network configuration management. In attempting to resolve this inconsistency the network management software is designed to ‘fail safe’ and revert to its current configuration rather than proceeding with the new configuration. However, in this instance a previously-unseen software bug was triggered, and instead of retaining the previous known good configuration, the management software instead removed all GCE IP blocks from the new configuration and began to push this new, incomplete configuration to the network.

One of our core principles at Google is ‘defense in depth’, and Google’s networking systems have a number of safeguards to prevent them from propagating incorrect or invalid configurations in the event of an upstream failure or bug. These safeguards include a canary step where the configuration is deployed at a single site and that site is verified to still be working correctly, and a progressive rollout which makes changes to only a fraction of sites at a time, so that a novel failure can be caught at an early stage before it becomes widespread. In this event, the canary step correctly identified that the new configuration was unsafe. Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.
multi-region  outages  google  ops  postmortems  gce  cloud  ip  networking  cascading-failures  bugs 
april 2016 by jm
Open Sourcing Dr. Elephant: Self-Serve Performance Tuning for Hadoop and Spark
[LinkedIn] are proud to announce today that we are open sourcing Dr. Elephant, a powerful tool that helps users of Hadoop and Spark understand, analyze, and improve the performance of their flows.


neat, although I've been bitten too many times by LinkedIn OSS release quality at this point to jump in....
linkedin  oss  hadoop  spark  performance  tuning  ops 
april 2016 by jm
AWSume
'AWS Assume Made Awesome' -- 'Here at Trek10, we work with many clients, and thus work with multiple AWS accounts on a regular (daily) basis. We needed a way to make managing all our different accounts easier. We create a standard Trek10 administrator role in our clients’ accounts that we can assume. For security we require that the role assumer have multifactor authentication enabled.'
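The STS call that tooling like this wraps is straightforward -- a hedged Python/boto3 sketch (the role ARN, MFA serial and token code are placeholders; AWSume itself adds profile management and credential caching on top):

    import boto3

    sts = boto3.client('sts')
    creds = sts.assume_role(
        RoleArn='arn:aws:iam::123456789012:role/admin-role',    # placeholder
        RoleSessionName='awsume-demo',
        SerialNumber='arn:aws:iam::123456789012:mfa/alice',     # placeholder MFA device
        TokenCode='123456')['Credentials']

    # temporary credentials for subsequent calls into the client account
    ec2 = boto3.client('ec2',
                       aws_access_key_id=creds['AccessKeyId'],
                       aws_secret_access_key=creds['SecretAccessKey'],
                       aws_session_token=creds['SessionToken'])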
mfa  aws  awsume  credentials  accounts  ops 
april 2016 by jm
Dan Luu reviews the Site Reliability Engineering book
voluminous! still looks great, looking forward to reading our copy (via Tony Finch)
via:fanf  books  reading  devops  ops  google  sre  dan-luu 
april 2016 by jm
s3git
git for Cloud Storage. Create distributed, decentralized and versioned repositories that scale infinitely to 100s of millions of files and PBs of storage. Huge repos can be cloned on your local SSD for making changes, committing and pushing back. Oh yeah, and it dedupes too due to BLAKE2 Tree hashing. http://s3git.org
git  ops  storage  cloud  s3  disk  aws  version-control  blake2 
april 2016 by jm
The revenge of the listening sockets
More adventures in debugging the Linux kernel:
You can't have a very large number of bound TCP sockets and we learned that the hard way. We learned a bit about the Linux networking stack: the fact that LHTABLE is fixed size and is hashed by destination port only. Once again we showed a couple of powerful System Tap scripts.
ops  linux  networking  tcp  network  lhtable  kernel 
april 2016 by jm
Wired on the new O'Reilly SRE book
"Site Reliability Engineering: How Google Runs Production Systems", by Chris Jones, Betsy Beyer, Niall Richard Murphy, Jennifer Petoff. Go Niall!
google  sre  niall-murphy  ops  devops  oreilly  books  toread  reviews 
april 2016 by jm
Counting with domain specific databases — The Smyte Blog — Medium
whoa, pretty heavily engineered scalable counting system with Kafka, RocksDB and Kubernetes
kafka  rocksdb  kubernetes  counting  databases  storage  ops 
april 2016 by jm
A Decade Of Container Control At Google
The big thing that can be gleaned from the latest paper out of Google on its container controllers is that the shift from bare metal to containers is a profound one – something that may not be obvious to everyone seeking containers as a better way – and we think cheaper way – of doing server virtualization and driving up server utilization higher. Everything becomes application-centric rather than machine-centric, which is the nirvana that IT shops have been searching for. The workload schedulers, cluster managers, and container controllers work together to get the right capacity to the application when it needs it, whether it is a latency-sensitive job or a batch job that has some slack in it, and all that the site recovery engineers and developers care about is how the application is performing and they can easily see that because all of the APIs and metrics coming out of them collect data at the application level, not on a per-machine basis. To do this means adopting containers, period. There is no bare metal at Google, and let that be a lesson to HPC shops or other hyperscalers or cloud builders that think they need to run in bare metal mode.
google  containers  kubernetes  borg  bare-metal  ops 
april 2016 by jm
bcc
Dynamic tracing tools for Linux, a la dtrace, ktrace, etc. Built using BPF, using kernel features in the 4.x kernel series, requiring at least version 4.1 of the kernel
linux  tracing  bpf  dynamic  ops 
april 2016 by jm
Qualys SSL Server Test
pretty sure I had this bookmarked previously, but this is the current URL -- SSL/TLS quality report
ssl  tls  security  tests  ops  tools  testing 
march 2016 by jm
Charity Majors - AWS networking, VPC, environments and you
'VPC is the future and it is awesome, and unless you have some VERY SPECIFIC AND CONVINCING reasons to do otherwise, you should be spinning up a VPC per environment with orchestration and prob doing it from CI on every code commit, almost like it’s just like, you know, code.'
networking  ops  vpc  aws  environments  stacks  terraform 
march 2016 by jm
Ruby in Production: Lessons Learned — Medium
Based on the pain we've had trying to bring our Rails services up to the quality levels required, this looks pretty accurate in many respects. I'd augment this advice by saying: avoid RVM; use Docker.
rvm  docker  ruby  production  rails  ops 
march 2016 by jm
Seesaw: scalable and robust load balancing from Google
After evaluating a number of platforms, including existing open source projects, we were unable to find one that met all of our needs and decided to set about developing a robust and scalable load balancing platform. The requirements were not exactly complex - we needed the ability to handle traffic for unicast and anycast VIPs, perform load balancing with NAT and DSR (also known as DR), and perform adequate health checks against the backends. Above all we wanted a platform that allowed for ease of management, including automated deployment of configuration changes.

One of the two existing platforms was built upon Linux LVS, which provided the necessary load balancing at the network level. This was known to work successfully and we opted to retain this for the new platform. Several design decisions were made early on in the project — the first of these was to use the Go programming language, since it provided an incredibly powerful way to implement concurrency (goroutines and channels), along with easy interprocess communication (net/rpc). The second was to implement a modular multi-process architecture. The third was to simply abort and terminate a process if we ended up in an unknown state, which would ideally allow for failover and/or self-recovery.
seesaw  load-balancers  google  load-balancing  vips  anycast  nat  lbs  go  ops  networking 
january 2016 by jm
AWS Certificate Manager – Deploy SSL/TLS-Based Apps on AWS
Very nifty -- autodeploys free wildcard certs to ELBs and Cloudfront. HN discussion thread is pretty good: https://news.ycombinator.com/item?id=10947186
ssl  tls  certificates  ops  aws  cloudfront  elb 
january 2016 by jm
About Microservices, Containers and their Underestimated Impact on Network Performance
shock horror, Docker-SDN layers have terrible performance. Still pretty lousy perf impacts from basic Docker containerization, presumably without "--net=host" (which is apparently vital)
docker  performance  network  containers  sdn  ops  networking  microservices 
january 2016 by jm
Jepsen: RethinkDB 2.1.5
A good review of RethinkDB! Hopefully not just because this test is contract work on behalf of the RethinkDB team ;)
I’ve run hundreds of tests against RethinkDB at majority/majority, at various timescales, request rates, concurrencies, and with different types of failures. Consistent with the documentation, I have never found a linearization failure with these settings. If you use hard durability, majority writes, and majority reads, single-document ops in RethinkDB appear safe.
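Roughly what asking for those settings looks like from the Python driver (table name is a placeholder; note that write_acks lives in the table's config document rather than on individual queries, so that line is sketched from the RethinkDB docs rather than from the post):

    import rethinkdb as r

    conn = r.connect('localhost', 28015)

    # writes: majority acknowledgement, fsynced to disk before acking
    r.table('docs').config().update({'write_acks': 'majority'}).run(conn)
    r.table('docs').insert({'id': 1, 'body': 'hello'}, durability='hard').run(conn)

    # reads: only return data acknowledged by a majority of replicas
    r.table('docs', read_mode='majority').get(1).run(conn)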
rethinkdb  databases  stores  storage  ops  availability  cap  jepsen  tests  replication 
january 2016 by jm
BBC Digital Media Distribution: How we improved throughput by 4x
Replacing varnish with nginx. Nice deep-dive blog post covering kernel innards
nginx  performance  varnish  web  http  bbc  ops 
january 2016 by jm
How Completely Messed Up Practices Become Normal
on Normalization of Deviance, with a few anecdotes from Silicon Valley. “The gradual process through which unacceptable practice or standards become acceptable. As the deviant behavior is repeated without catastrophic results, it becomes the social norm for the organization.”
normalization-of-deviance  deviance  bugs  culture  ops  reliability  work  workplaces  processes  norms 
december 2015 by jm
2016 Wish List for AWS?
good thread of AWS' shortcomings -- so many services still don't handle VPC for instance
vpc  aws  ec2  ops  wishlist 
december 2015 by jm
Amazon EC2 Container Registry
hooray, Docker registry here at last
ecs  docker  registry  ops  containers  aws 
december 2015 by jm
Why We Chose Kubernetes Over ECS
3 months ago when we, at nanit.com, came to evaluate which Docker orchestration framework to use, we gave ECS the first priority. We were already familiar with AWS services, and since we already had our whole infrastructure there, it was the default choice. After testing the service for a while we had the feeling it was not mature enough and missing some key features we needed (more on that later), so we went to test another orchestration framework: Kubernetes. We were glad to discover that Kubernetes is far more comprehensive and had almost all the features we required. For us, Kubernetes won ECS on ECS’s home court, which is AWS.
kubernetes  ecs  docker  containers  aws  ec2  ops 
december 2015 by jm
AWS Api Gateway for Fun and Profit
good worked-through example of an API Gateway rewriting system
api-gateway  aws  api  http  services  ops  alerting  alarming  opsgenie  signalfx 
december 2015 by jm
Low-latency journalling file write latency on Linux
great research from LMAX: xfs/ext4 are the best choices, and they explain why in detail, referring to the code
linux  xfs  ext3  ext4  filesystems  lmax  performance  latency  journalling  ops 
december 2015 by jm
"Hidden Technical Debt in Machine-Learning Systems" [pdf]
Another great paper from Google, talking about the tradeoffs that must be considered in practice over the long term with running a complex ML system in production.
technical-debt  ml  machine-learning  ops  software  production  papers  pdf  google 
december 2015 by jm
A Gulp Workflow for Amazon Lambda
'any nontrivial development of Lambda functions will require a simple, automated build/deploy process that also fills a couple of Lambda’s gaps such as the use of node modules and environment variables.'

See also https://medium.com/@AdamRNeary/developing-and-testing-amazon-lambda-functions-e590fac85df4#.mz0a4qk3j : 'I am psyched about Amazon’s new Lambda service for asynchronous task processing, but the ideal development and testing cycle is really left to the engineer. While Amazon provides a web-based console, I prefer an approach that uses Mocha. Below you will find the gritty details using Kinesis events as a sample input.'
lambda  aws  services  testing  deployment  ops  mocha  gulp  javascript 
december 2015 by jm
Global Continuous Delivery with Spinnaker
Netflix' CD platform, post-Atlas. looks interesting
continuous-delivery  aws  netflix  cd  devops  ops  atlas  spinnaker 
november 2015 by jm
The impact of Docker containers on the performance of genomic pipelines [PeerJ]
In this paper, we have assessed the impact of Docker containers technology on the performance of genomic pipelines, showing that container “virtualization” has a negligible overhead on pipeline performance when it is composed of medium/long running tasks, which is the most common scenario in computational genomic pipelines.

Interestingly for these tasks the observed standard deviation is smaller when running with Docker. This suggests that the execution with containers is more “homogeneous,” presumably due to the isolation provided by the container environment.

The performance degradation is more significant for pipelines where most of the tasks have a fine or very fine granularity (a few seconds or milliseconds). In this case, the container instantiation time, though small, cannot be ignored and produces a perceptible loss of performance.
performance  docker  ops  genomics  papers 
november 2015 by jm
Alarm design: From nuclear power to WebOps
Imagine you are an operator in a nuclear power control room. An accident has started to unfold. During the first few minutes, more than 100 alarms go off, and there is no system for suppressing the unimportant signals so that you can concentrate on the significant alarms. Information is not presented clearly; for example, although the pressure and temperature within the reactor coolant system are shown, there is no direct indication that the combination of pressure and temperature mean that the cooling water is turning into steam. There are over 50 alarms lit in the control room, and the computer printer registering alarms is running more than 2 hours behind the events.

This was the basic scenario facing the control room operators during the Three Mile Island (TMI) partial nuclear meltdown in 1979. The Report of the President’s Commission stated that, “Overall, little attention had been paid to the interaction between human beings and machines under the rapidly changing and confusing circumstances of an accident” (p. 11). The TMI control room operator on the day, Craig Faust, recalled for the Commission his reaction to the incessant alarms: “I would have liked to have thrown away the alarm panel. It wasn’t giving us any useful information”. It was the first major illustration of the alarm problem, and the accident triggered a flurry of human factors/ergonomics (HF/E) activity.


A familiar topic for this ex-member of the Amazon network monitoring team...
ergonomics  human-factors  ui  ux  alarms  alerts  alerting  three-mile-island  nuclear-power  safety  outages  ops 
november 2015 by jm
Dynalite
Awesome new mock DynamoDB implementation:
An implementation of Amazon's DynamoDB, focussed on correctness and performance, and built on LevelDB (well, @rvagg's awesome LevelUP to be precise). This project aims to match the live DynamoDB instances as closely as possible (and is tested against them in various regions), including all limits and error messages.

Why not Amazon's DynamoDB Local? Because it's too buggy! And it differs too much from the live instances in a number of key areas.


We use DynamoDBLocal in our tests -- the availability of that tool is one of the key reasons we have adopted Dynamo so heavily, since we can safely test our code properly with it. This looks even better.
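Wiring tests at something like Dynalite is just an endpoint override -- a sketch in Python/boto3, assuming a local instance has been started on port 4567:

    import boto3

    # point the SDK at the local mock instead of the real DynamoDB endpoint
    ddb = boto3.client('dynamodb',
                       endpoint_url='http://localhost:4567',
                       region_name='us-east-1',
                       aws_access_key_id='fake',
                       aws_secret_access_key='fake')

    ddb.create_table(
        TableName='test-table',
        AttributeDefinitions=[{'AttributeName': 'id', 'AttributeType': 'S'}],
        KeySchema=[{'AttributeName': 'id', 'KeyType': 'HASH'}],
        ProvisionedThroughput={'ReadCapacityUnits': 1, 'WriteCapacityUnits': 1})

The same endpoint_url trick works against DynamoDB Local.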
dynamodb  testing  unit-tests  integration-testing  tests  ops  dynalite  aws  leveldb 
november 2015 by jm
Nobody Loves Graphite Anymore - VividCortex
Graphite has a place in our current monitoring stack, and together with StatsD will always have a special place in the hearts of DevOps practitioners everywhere, but it’s not representative of state-of-the-art in the last few years. Graphite is where the puck was in 2010. If you’re skating there, you’re missing the benefits of modern monitoring infrastructure.

The future I foresee is one where time series capabilities (the raw power needed, which I described in my time series requirements blog post, for example) are within everyone’s reach. That will be considered table stakes, whereas now it’s pretty revolutionary.


Like I've been saying -- we need Time Series As A Service! This should be undifferentiated heavy lifting.
graphite  tsd  time-series  vividcortex  statsd  ops  monitoring  metrics 
november 2015 by jm
How Facebook avoids failures
Great paper from Ben Maurer of Facebook in ACM Queue.
A "move-fast" mentality does not have to be at odds with reliability. To make these philosophies compatible, Facebook's infrastructure provides safety valves.


This is full of interesting techniques.

* Rapidly deployed configuration changes: Make everybody use a common configuration system; Statically validate configuration changes; Run a canary; Hold on to good configurations; Make it easy to revert.

* Hard dependencies on core services: Cache data from core services. Provide hardened APIs. Run fire drills.

* Increased latency and resource exhaustion: Controlled Delay (based on the anti-bufferbloat CoDel algorithm -- this is really cool); Adaptive LIFO (last-in, first-out) for queue busting; Concurrency Control (essentially a form of circuit breaker). There's a toy sketch of the controlled-delay/adaptive-LIFO combination after this list.

* Tools that Help Diagnose Failures: High-Density Dashboards with Cubism (horizon charts); What just changed?

* Learning from Failure: the DERP (!) methodology.
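A toy Python sketch of that controlled-delay + adaptive-LIFO combination -- not Facebook's code, just an illustration of the idea, with made-up thresholds:

    import collections, time

    class AdaptiveCodelQueue:
        TARGET_MS, INTERVAL_MS, LIFO_THRESHOLD = 5, 100, 100   # made-up numbers

        def __init__(self):
            self.q = collections.deque()
            self.last_empty = time.time()

        def put(self, item):
            self.q.append((time.time(), item))

        def get(self):
            if not self.q:
                self.last_empty = time.time()
                return None
            # adaptive LIFO: once badly backed up, serve the newest request first,
            # since the oldest requests have probably timed out client-side anyway
            newest_first = len(self.q) > self.LIFO_THRESHOLD
            enqueued_at, item = self.q.pop() if newest_first else self.q.popleft()
            # controlled delay (CoDel-ish): if the queue hasn't fully drained
            # recently there's a standing queue, so shed anything that has
            # already waited longer than the short target
            standing = (time.time() - self.last_empty) * 1000 > self.INTERVAL_MS
            timeout_ms = self.TARGET_MS if standing else self.INTERVAL_MS
            if (time.time() - enqueued_at) * 1000 > timeout_ms:
                return None   # drop: caller fails fast rather than queueing forever
            return item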
ben-maurer  facebook  reliability  algorithms  codel  circuit-breakers  derp  failure  ops  cubism  horizon-charts  charts  dependencies  soa  microservices  uptime  deployment  configuration  change-management 
november 2015 by jm
Structural and semantic deficiencies in the systemd architecture for real-world service management, a technical treatise
Despite its overarching abstractions, it is semantically non-uniform and its complicated transaction and job scheduling heuristics ordered around a dependently networked object system create pathological failure cases with little debugging context that would otherwise not necessarily occur on systems with less layers of indirection. The use of bus APIs complicate communication with the service manager and lead to duplication of the object model for little gain. Further, the unit file options often carry implicit state or are not sufficiently expressive. There is an imbalance with regards to features of an eager service manager and that of a lazy loading service manager, having rusty edge cases of both with non-generic, manager-specific facilities. The approach to logging and the circularly dependent architecture seem to imply that lots of prior art has been ignored or understudied.
analysis  systemd  linux  unix  ops  init  critiques  software  logging 
november 2015 by jm
Google Cloud Platform HTTP/HTTPS Load Balancing
GCE's LB product is pretty nice -- HTTP/2 support, and a built-in URL mapping feature (presumably modelled on how Google approaches that problem internally). I'm hoping AWS are taking notes for the next generation of ELB, if that ever happens
elb  gce  google  load-balancing  http  https  spdy  http2  urls  request-routing  ops  architecture  cloud 
october 2015 by jm
Google tears Symantec a new one on its CA failure
Symantec are getting a crash course in how to conduct an incident post-mortem to boot:
More immediately, we are requesting of Symantec that they further update their public incident report with:
* A post-mortem analysis that details why they did not detect the additional certificates that we found.
* Details of each of the failures to uphold the relevant Baseline Requirements and EV Guidelines and what they believe the individual root cause was for each failure.
We are also requesting that Symantec provide us with a detailed set of steps they will take to correct and prevent each of the identified failures, as well as a timeline for when they expect to complete such work. Symantec may consider this latter information to be confidential and so we are not requesting that this be made public.
google  symantec  ev  ssl  certificates  ca  security  postmortems  ops 
october 2015 by jm
Holistic Configuration Management at Facebook
How FB push config changes from Git (where it is code reviewed, version controlled, and history tracked with strong auth) to Zeus (their Zookeeper fork) and from there to live production servers.
facebook  configuration  zookeeper  git  ops  architecture 
october 2015 by jm
(ARC308) The Serverless Company: Using AWS Lambda
Describing PlayOn! Sports' Lambda setup. Sounds pretty productionizable
ops  lambda  aws  reinvent  slides  architecture 
october 2015 by jm
Designing the Spotify perimeter
How Spotify use nginx as a frontline for their sites and services
scaling  spotify  nginx  ops  architecture  ssl  tls  http  frontline  security 
october 2015 by jm
AWS re:Invent 2015 | (CMP406) Amazon ECS at Coursera - YouTube
Coursera are running user-submitted code in ECS! interesting stuff about how they use Docker security/resource-limiting features, forking the ecs-agent code, to run user-submitted code. :O
coursera  user-submitted-code  sandboxing  docker  security  ecs  aws  resource-limits  ops 
october 2015 by jm
SuperChief: From Apache Storm to In-House Distributed Stream Processing
Another sorry tale of Storm issues:
Storm has been successful at Librato, but we experienced many of the limitations cited in the Twitter Heron: Stream Processing at Scale paper and outlined here by Adrian Colyer, including:
* Inability to isolate, reason about, or debug performance issues due to the worker/executor/task paradigm. This led to building and configuring clusters specifically designed to attempt to mitigate these problems (i.e., separate clusters per topology, only running a worker per server.), which added additional complexity to development and operations and also led to over-provisioning.
* Ability of tasks to move around led to difficult to trace performance problems.
* Storm’s work provisioning logic led to some tasks serving more Kafka partitions than others. This in turn created latency and performance issues that were difficult to reason about. The initial solution was to over-provision in an attempt to get a better hashing/balancing of work, but eventually we just replaced the work allocation logic.
* Due to Storm’s architecture, it was very difficult to get a stack trace or heap dump because the processes that managed workers (Storm supervisor) would often forcefully kill a Java process while it was being investigated in this way.
* The propensity for unexpected and subsequently unhandled exceptions to take down an entire worker led to additional defensive verbose error handling everywhere.
* This nasty bug STORM-404 coupled with the aforementioned fact that a single exception can take down a worker led to several cascading failures in production, taking down entire topologies until we upgraded to 0.9.4.
Additionally, we found the performance we were getting from Storm for the amount of money we were spending on infrastructure was not in line with our expectations. Much of this is due to the fact that, depending upon how your topology is designed, a single tuple may make multiple hops across JVMs, and this is very expensive. For example, in our time series aggregation topologies a single tuple may be serialized/deserialized and shipped across the wire 3-4 times as it progresses through the processing pipeline.
scalability  storm  kafka  librato  architecture  heron  ops 
october 2015 by jm
Outage postmortem (2015-10-08 UTC) : Stripe: Help & Support
There was a breakdown in communication between the developer who requested the index migration and the database operator who deleted the old index. Instead of working on the migration together, they communicated in an implicit way through flawed tooling. The dashboard that surfaced the migration request was missing important context: the reason for the requested deletion, the dependency on another index’s creation, and the criticality of the index for API traffic. Indeed, the database operator didn’t have a way to check whether the index had recently been used for a query.


Good demo of how the Etsy-style chatops deployment approach would have helped avoid this risk.
stripe  postmortem  outages  databases  indexes  deployment  chatops  deploy  ops 
october 2015 by jm
How IFTTT develop with Docker
ugh, quite a bit of complexity here
docker  osx  dev  ops  building  coding  ifttt  dns  dnsmasq 
october 2015 by jm
The Totally Managed Analytics Pipeline: Segment, Lambda, and Dynamo
notable mainly for the details of Terraform support for Lambda: that's a significant improvement to Lambda's production-readiness
aws  pipelines  data  streaming  lambda  dynamodb  analytics  terraform  ops 
october 2015 by jm
Rebuilding Our Infrastructure with Docker, ECS, and Terraform
Good writeup of current best practices for a production AWS architecture
aws  ops  docker  ecs  ec2  prod  terraform  segment  via:marc 
october 2015 by jm
Elasticsearch and data loss
"@alexbfree @ThijsFeryn [ElasticSearch is] fine as long as data loss is acceptable. https://aphyr.com/posts/317-call-me-maybe-elasticsearch . We lose ~1% of all writes on average."
elasticsearch  data-loss  reliability  data  search  aphyr  jepsen  testing  distributed-systems  ops 
october 2015 by jm
traefik
Træfɪk is a modern HTTP reverse proxy and load balancer made to deploy microservices with ease. It supports several backends (Docker, Mesos/Marathon, Consul, Etcd, Rest API, file...) to manage its configuration automatically and dynamically.


Hot-reloading is notably much easier than with nginx/haproxy.
proxy  http  proxying  reverse-proxy  traefik  go  ops 
september 2015 by jm
Chaos Engineering Upgraded
some details on Netflix's Chaos Monkey, Chaos Kong and other aspects of their availability/failover testing
architecture  aws  netflix  ops  chaos-monkey  chaos-kong  testing  availability  failover  ha 
september 2015 by jm
Byteman
a tool which simplifies tracing and testing of Java programs. Byteman allows you to insert extra Java code into your application, either as it is loaded during JVM startup or even after it has already started running. The injected code is allowed to access any of your data and call any application methods, including where they are private. You can inject code almost anywhere you want and there is no need to prepare the original source code in advance nor do you have to recompile, repackage or redeploy your application. In fact you can remove injected code and reinstall different code while the application continues to execute. The simplest use of Byteman is to install code which traces what your application is doing. This can be used for monitoring or debugging live deployments as well as for instrumenting code under test so that you can be sure it has operated correctly. By injecting code at very specific locations you can avoid the overheads which often arise when you switch on debug or product trace. Also, you decide what to trace when you run your application rather than when you write it so you don't need 100% hindsight to be able to obtain the information you need.
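For a flavour of what that looks like, here's a minimal (made-up) Byteman rule script that traces entry to one method -- class, method and message are placeholders:

    RULE trace ticket purchase
    CLASS com.example.TicketService
    METHOD buyTicket
    AT ENTRY
    IF true
    DO traceln("buyTicket called with " + $1)
    ENDRULE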
tracing  java  byteman  injection  jvm  ops  debugging  testing 
september 2015 by jm
httpry
a specialized packet sniffer designed for displaying and logging HTTP traffic. It is not intended to perform analysis itself, but to capture, parse, and log the traffic for later analysis. It can be run in real-time displaying the traffic as it is parsed, or as a daemon process that logs to an output file. It is written to be as lightweight and flexible as possible, so that it can be easily adaptable to different applications.


via Eoin Brazil
via:eoinbrazil  httpry  http  networking  tools  ops  testing  tcpdump  tracing 
september 2015 by jm
Anatomy of a Modern Production Stack
Interesting post, but I think it falls into a common trap for the xoogler or ex-Amazonian -- assuming that all the BigCo mod cons are required to operate, when some are luxuries that can be skipped for a few years to get some real products built
architecture  ops  stack  docker  containerization  deployment  containers  rkt  coreos  prod  monitoring  xooglers 
september 2015 by jm
You're probably wrong about caching
Excellent cut-out-and-keep guide to why you should think twice before adding a caching layer. I've been following this practice for the past few years, after I realised that #6 (recovering from a failed cache is hard) is a killer -- I've seen a few large-scale outages where a production system had gained enough scale that it required a cache to operate, and once that cache was damaged, bringing the system back online required a painful rewarming protocol. Better to design for the non-cached case if possible.
architecture  caching  coding  design  caches  ops  production  scalability 
september 2015 by jm
Docker image creation, tagging and traceability in Shippable
this is starting to look quite impressive as a well-integrated Docker-meets-CI model; Shippable is basing its builds off Docker baselines and is automatically cutting Docker images of the post-CI stage. Must take another look
shippable  docker  ci  ops  dev  continuous-integration 
august 2015 by jm
Call me Maybe: Chronos
Chronos (the Mesos distributed scheduler) comes out looking pretty crappy here
aphyr  mesos  chronos  cron  scheduling  outages  ops  jepsen  testing  partitions  cap 
august 2015 by jm