jm + ops   443

HN thread on the new Network Load Balancer AWS product
looks like @colmmacc works on it. Lots and lots of good details here
nlb  aws  load-balancing  ops  architecture  lbs  tcp  ip 
13 days ago by jm
Going Multi-Cloud with AWS and GCP: Lessons Learned at Scale
Metamarkets splits across AWS and GCP, going into heavy detail here
aws  gcp  google  ops  hosting  multi-cloud 
4 weeks ago by jm
Linux Load Averages: Solving the Mystery
Nice bit of OS archaeology by Brendan Gregg.
In 1993, a Linux engineer found a nonintuitive case with load averages, and with a three-line patch changed them forever from "CPU load averages" to what one might call "system load averages." His change included tasks in the uninterruptible state, so that load averages reflected demand for disk resources and not just CPUs. These system load averages count the number of threads working and waiting to work, and are summarized as a triplet of exponentially-damped moving sum averages that use 1, 5, and 15 minutes as constants in an equation. This triplet of numbers lets you see if load is increasing or decreasing, and their greatest value may be for relative comparisons with themselves.
load  monitoring  linux  unix  performance  ops  brendan-gregg  history  cpu 
4 weeks ago by jm
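A minimal sketch of the exponentially-damped averaging described in the quote above, loosely following the shape of the kernel's classic calc_load() (sample every 5 seconds; the decay factor for each window is e^(-interval/period) for 1, 5 and 15 minutes). This is an illustration of the arithmetic, not the kernel code itself:

```python
import math

SAMPLE_INTERVAL = 5.0                # seconds between samples
PERIODS = (60.0, 300.0, 900.0)       # 1-, 5- and 15-minute windows
DECAY = [math.exp(-SAMPLE_INTERVAL / p) for p in PERIODS]

loadavg = [0.0, 0.0, 0.0]

def tick(active_tasks: int) -> None:
    """Fold one sample (runnable + uninterruptible tasks) into the averages."""
    for i, d in enumerate(DECAY):
        loadavg[i] = loadavg[i] * d + active_tasks * (1.0 - d)

# e.g. one busy minute (4 active tasks) followed by one idle minute:
for _ in range(12):
    tick(4)
for _ in range(12):
    tick(0)
print([round(x, 2) for x in loadavg])   # 1-min average decays fastest
```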
Arq Backs Up To B2!
Arq backup for OS X now supports Backblaze B2 (as well as S3) as a storage backend.
"it’s a super-cheap option ($0.005/GB per month) for storing your backups." (That's less than half the $0.0125/GB price of S3's Infrequent Access storage class; worked out below.)
s3  storage  b2  backblaze  backups  arq  macosx  ops 
5 weeks ago by jm
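To make the price gap concrete, a quick back-of-the-envelope comparison using the per-GB prices quoted above (storage only; request and retrieval charges are ignored):

```python
# Rough monthly storage cost for a 1 TB backup set.
backup_gb = 1000
b2_per_gb = 0.005      # Backblaze B2, $/GB-month
s3_ia_per_gb = 0.0125  # S3 Infrequent Access, $/GB-month

print(f"B2:    ${backup_gb * b2_per_gb:.2f}/month")     # B2:    $5.00/month
print(f"S3 IA: ${backup_gb * s3_ia_per_gb:.2f}/month")  # S3 IA: $12.50/month
```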
Working with multiple AWS accounts at Ticketea
AWS STS/multiple account best practice described
sts  aws  authz  ops  ticketea  dev 
5 weeks ago by jm
AWS Lambda Deployment using Terraform – Build ACL – Medium
Fairly persuasive argument that production usage of Lambda is much easier if you go full Terraform to manage and deploy.
A complete picture of what it takes to deploy your Lambda function to production with the same diligence you apply to any other codebase using Terraform. [...] There are many cases where frameworks such as SAM or Serverless are not enough. You need more than that for a highly integrated Lambda function. In such cases, it’s easier to simply use Terraform.
infrastructure  aws  lambda  serverless  ops  terraform  sam 
6 weeks ago by jm
Nextflow - A DSL for parallel and scalable computational pipelines
Data-driven computational pipelines

Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages. Its fluent DSL simplifies the implementation and the deployment of complex parallel and reactive workflows on clouds and clusters.

GPLv3 licensed, open source.
computation  workflows  pipelines  batch  docker  ops  open-source 
6 weeks ago by jm
EBS gp2 I/O BurstBalance exhaustion
When EBS volumes in EC2 exhaust their "burst" allocation, things go awry very quickly (the credit arithmetic is worked through below).
performance  aws  ebs  ec2  burst-balance  ops  debugging 
8 weeks ago by jm
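For context on how quickly the burst bucket drains, a back-of-the-envelope calculation using the published gp2 numbers (3 IOPS per GiB baseline, burst up to 3,000 IOPS, a bucket of 5.4 million I/O credits). Treat it as illustrative; watch the BurstBalance CloudWatch metric rather than doing this by hand:

```python
# How long can a 100 GiB gp2 volume sustain a 3,000 IOPS burst?
volume_gib = 100
baseline_iops = max(100, 3 * volume_gib)   # gp2 baseline: 3 IOPS/GiB, minimum 100
burst_iops = 3000                          # max burst for small gp2 volumes
bucket_credits = 5_400_000                 # initial/max I/O credit balance

drain_rate = burst_iops - baseline_iops    # net credits consumed per second while bursting
seconds = bucket_credits / drain_rate
print(f"{seconds / 60:.0f} minutes of full burst")            # ~33 minutes

# ...and how long to refill once the volume goes idle:
print(f"{bucket_credits / baseline_iops / 3600:.1f} hours")   # 5.0 hours
```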
Kubernetes Best Practices // Speaker Deck
A lot of these are general Docker/containerisation best practices, too.

(via Devops Weekly)
k8s  kubernetes  devops  ops  containers  docker  best-practices  tips  packaging 
8 weeks ago by jm
awslabs/aws-ec2rescue-linux
Amazon Web Services Elastic Compute Cloud (EC2) Rescue for Linux is a python-based tool that allows for the automatic diagnosis of common problems found on EC2 Linux instances.


Most of the modules appear to be log-greppers looking for common kernel issues.
ec2  aws  kernel  linux  ec2rl  ops 
9 weeks ago by jm
Wifi AP Placement [video]
'AP Placement - A Job For the Work Experience Kid? | Scott Stapleton | WLPC EU Budapest 2016'
ap  wifi  placement  layout  ops  wireless  home  presos 
9 weeks ago by jm
OVH suffer 24-hour outage (The Register)
Choice quotes:

'At 6:48pm, Thursday, June 29, in Room 3 of the P19 datacenter, due to a crack on a soft plastic pipe in our water-cooling system, a coolant leak causes fluid to enter the system';
'This process had been tested in principle but not at a 50,000-website scale'.
postmortems  ovh  outages  liquid-cooling  datacenters  dr  disaster-recovery  ops 
9 weeks ago by jm
Fastest syncing of S3 buckets
good tip for "aws s3 sync" performance
performance  aws  s3  copy  ops  tips 
10 weeks ago by jm
Scheduled Tasks (cron) - Amazon EC2 Container Service
ECS now does cron jobs. But where does AWS Batch fit in? confusing
aws  batch  ecs  cron  scheduling  recurrence  ops 
10 weeks ago by jm
Top 5 ways to improve your AWS EC2 performance
A couple of bits of excellent advice from Datadog (although this may be a slightly old post, from Oct 2016):

1. Unpredictable EBS disk I/O performance. Note that gp2 volumes do not appear to need as much warmup or priming as before.

2. EC2 Instance ECU Mismatch and Stolen CPU. advice: use bigger instances

The other 3 ways are a little obvious by comparison, but worth bookmarking for those two anyway.
ops  ec2  performance  datadog  aws  ebs  stolen-cpu  virtualization  metrics  tips 
11 weeks ago by jm
How Did I “Hack” AWS Lambda to Run Docker Containers?
Running Docker containers in Lambda using a usermode-docker hack -- hacky as hell but fun ;) Lambda should really support Docker natively, though.
docker  lambda  aws  serverless  ops  hacks  udocker 
june 2017 by jm
Open Guide to Amazon Web Services
'A lot of information on AWS is already written. Most people learn AWS by reading a blog or a “getting started guide” and referring to the standard AWS references. Nonetheless, trustworthy and practical information and recommendations aren’t easy to come by. AWS’s own documentation is a great but sprawling resource few have time to read fully, and it doesn’t include anything but official facts, so omits experiences of engineers. The information in blogs or Stack Overflow is also not consistently up to date. This guide is by and for engineers who use AWS. It aims to be a useful, living reference that consolidates links, tips, gotchas, and best practices. It arose from discussion and editing over beers by several engineers who have used AWS extensively.'
amazon  aws  guides  documentation  ops  architecture 
june 2017 by jm
usl4j And You | codahale.com
Coda Hale wrote a handy Java library implementing a Universal Scalability Law (USL) solver.
usl  scalability  java  performance  optimization  benchmarking  measurement  ops  coda-hale 
june 2017 by jm
Scaling Amazon Aurora at ticketea
Ticketing is a business in which extreme traffic spikes are the norm, rather than the exception. For Ticketea, this means that our traffic can increase by a factor of 60x in a matter of seconds. This usually happens when big events (which have a fixed, pre-announced 'sale start time') go on sale.
scaling  scalability  ops  aws  aurora  autoscaling  asg 
may 2017 by jm
Enough with the microservices
Good post!
Much has been written on the pros and cons of microservices, but unfortunately I’m still seeing them as something being pursued in a cargo cult fashion in the growth-stage startup world. At the risk of rewriting Martin Fowler’s Microservice Premium article, I thought it would be good to write up some thoughts so that I can send them to clients when the topic arises, and hopefully help people avoid some of the mistakes I’ve seen. The mistake of choosing a path towards a given architecture or technology on the basis of so-called best practices articles found online is a costly one, and if I can help a single company avoid it then writing this will have been worth it.
architecture  design  microservices  coding  devops  ops  monolith 
may 2017 by jm
Sorry
hosted status page / downtime banner service
banners  web  status  uptime  downtime  ops  reliability 
may 2017 by jm
Spotting a million dollars in your AWS account · Segment Blog
You can easily split your spend by AWS service per month and call it a day. Ten thousand dollars of EC2, one thousand to S3, five hundred dollars to network traffic, etc. But what’s still missing is a synthesis of which products and engineering teams are dominating your costs. 

Then, add in the fact that you may have hundreds of instances and millions of containers that come and go. Soon, what started as simple analysis problem has quickly become unimaginably complex. 

In this follow-up post, we’d like to share details on the toolkit we used. Our hope is to offer up a few ideas to help you analyze your AWS spend, no matter whether you’re running only a handful of instances, or tens of thousands.

segment  money  costs  billing  aws  ec2  ecs  ops 
may 2017 by jm
jantman/awslimitchecker

A script and python module to check your AWS service limits and usage, and warn when usage approaches limits.

Users building out scalable services in Amazon AWS often run into AWS' service limits - often at the least convenient time (i.e. mid-deploy or when autoscaling fails). Amazon's Trusted Advisor can help this, but even the version that comes with Business and Enterprise support only monitors a small subset of AWS limits and only alerts weekly. awslimitchecker provides a command line script and reusable package that queries your current usage of AWS resources and compares it to limits (hard-coded AWS defaults that you can override, API-based limits where available, or data from Trusted Advisor where available), notifying you when you are approaching or at your limits.


(via This Week in AWS)
aws  amazon  limits  scripts  ops 
may 2017 by jm
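Not the tool itself, but a hedged boto3 sketch of the basic idea -- compare current usage against a limit and warn at a threshold. Here it's just the EC2 "max-instances" account attribute; awslimitchecker covers far more services and limit sources:

```python
import boto3

THRESHOLD = 0.8  # warn at 80% of the limit
ec2 = boto3.client("ec2")

# The account-level on-demand instance limit, as reported by EC2 itself.
attrs = ec2.describe_account_attributes(AttributeNames=["max-instances"])
limit = int(attrs["AccountAttributes"][0]["AttributeValues"][0]["AttributeValue"])

# Count currently running instances.
paginator = ec2.get_paginator("describe_instances")
running = sum(
    len(r["Instances"])
    for page in paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for r in page["Reservations"]
)

usage = running / limit
print(f"EC2 instances: {running}/{limit} ({usage:.0%})")
if usage >= THRESHOLD:
    print("WARNING: approaching the EC2 instance limit")
```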
cristim/autospotting: Pay up to 10 times less on EC2 by automatically replacing on-demand AutoScaling group members with similar or larger identically configured spot instances.
A simple and easy to use tool designed to significantly lower your Amazon AWS costs by automating the use of the spot market.

Once enabled on an existing on-demand AutoScaling group, it launches an EC2 spot instance that is cheaper, at least as large and configured identically to your current on-demand instances. As soon as the new instance is ready, it is added to the group and an on-demand instance is detached from the group and terminated.

It continuously applies this process, gradually replacing any on-demand instances with spot instances until the group only consists of spot instances, but it can also be configured to keep some on-demand instances running.
aws  golang  ec2  autoscaling  asg  spot-instances  ops 
may 2017 by jm
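A heavily simplified boto3 sketch of the replacement step described above: launch an equivalent spot instance, attach it to the group, then detach and terminate one on-demand member. The real tool does much more (pricing, instance-type matching, grace periods), and the identifiers below are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")
asg = boto3.client("autoscaling")

GROUP = "my-asg"                       # placeholder ASG name
ON_DEMAND_ID = "i-0123456789abcdef0"   # placeholder on-demand instance to replace

# 1. Request a spot instance configured like the on-demand members.
req = ec2.request_spot_instances(
    SpotPrice="0.10",
    LaunchSpecification={"ImageId": "ami-12345678", "InstanceType": "m4.large"},
)
req_id = req["SpotInstanceRequests"][0]["SpotInstanceRequestId"]
ec2.get_waiter("spot_instance_request_fulfilled").wait(SpotInstanceRequestIds=[req_id])
spot_id = ec2.describe_spot_instance_requests(SpotInstanceRequestIds=[req_id])[
    "SpotInstanceRequests"][0]["InstanceId"]
ec2.get_waiter("instance_running").wait(InstanceIds=[spot_id])

# 2. Swap it into the group, then drop one on-demand member.
asg.attach_instances(AutoScalingGroupName=GROUP, InstanceIds=[spot_id])
asg.detach_instances(AutoScalingGroupName=GROUP, InstanceIds=[ON_DEMAND_ID],
                     ShouldDecrementDesiredCapacity=True)
ec2.terminate_instances(InstanceIds=[ON_DEMAND_ID])
```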
acksin/seespot: AWS Spot instance health check with termination and clean up support
When a Spot Instance is about to terminate there is a 2 minute window before the termination actually happens. SeeSpot is a utility for AWS Spot instances that handles the health check. If used with an AWS ELB it also handles cleanup of the instance when a Spot Termination notice is sent.
aws  elb  spot-instances  health-checks  golang  lifecycle  ops 
may 2017 by jm
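A rough sketch of the polling loop such a tool runs (not SeeSpot's actual code): the instance-metadata endpoint returns 404 until a termination is scheduled, then the termination timestamp appears and you have roughly two minutes to drain. The instance ID and ELB name below are placeholders:

```python
import time
import urllib.request
import urllib.error

import boto3

METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"
INSTANCE_ID = "i-0123456789abcdef0"   # placeholder; normally read from instance metadata
ELB_NAME = "my-classic-elb"           # placeholder

def termination_scheduled() -> bool:
    try:
        with urllib.request.urlopen(METADATA_URL, timeout=1) as resp:
            return resp.status == 200   # body is the scheduled termination timestamp
    except urllib.error.HTTPError:
        return False                    # 404: no termination pending

elb = boto3.client("elb")
while not termination_scheduled():
    time.sleep(5)                       # poll every few seconds

# ~2 minutes left: pull the instance out of the load balancer before it dies.
elb.deregister_instances_from_load_balancer(
    LoadBalancerName=ELB_NAME,
    Instances=[{"InstanceId": INSTANCE_ID}],
)
```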
NetSpot
'FREE WiFi Site Survey Software for MAC OS X & Windows'.
Sadly reviews from pals are that it is 'shite' :(
osx  wifi  network  survey  netspot  networking  ops  dataviz  wireless 
april 2017 by jm
Julia Evans on Twitter: "notes on this great "When the pager goes off" article"
'notes on this great "When the pager goes off" article from @incrementmag https://increment.com/on-call/when-the-pager-goes-off/ ' -- cartoon summarising a much longer article of common modern ops on-call response techniques. Still pretty consistent with the systems we used in Amazon
on-call  ops  incident-response  julia-evans  pager  increment-mag 
april 2017 by jm
Ubuntu on AWS gets serious performance boost with AWS-tuned kernel
interesting -- faster boots, CPU throttling resolved on t2.micros, other nice stuff
aws  ubuntu  ec2  kernel  linux  ops 
april 2017 by jm
Spotify’s Love/Hate Relationship with DNS
omg somebody at Spotify really really loves DNS. They even store a DHT hash ring in it. whyyyyyyyyyyy
spotify  networking  architecture  dht  insane  scary  dns  unbound  ops 
april 2017 by jm
Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites
Solid article proselytising runbooks/playbooks (or in this article's parlance, "Incident Models") for dev/ops handover and operational knowledge
ops  process  sre  devops  runbooks  playbooks  incident-models 
april 2017 by jm
Deep Dive on Amazon EBS Elastic Volumes
'March 2017 AWS Online Tech Talks' -- lots about the new volume types
aws  ebs  storage  architecture  ops  slides 
march 2017 by jm
Learn redis the hard way (in production) · trivago techblog
oh god this is pretty awful. this just reads like "don't try to use Redis at scale" to me
redis  scalability  ops  architecture  horror  trivago  php 
march 2017 by jm
ctop
Top for containers (ie Docker)
docker  containers  top  ops  go  monitoring  cpu 
march 2017 by jm
How to stop Ubuntu Xenial (16.04) from randomly killing your big processes
ugh.
Unfortunately, a bug was recently introduced into the allocator which made it sometimes not try hard enough to free kernel cache memory before giving up and invoking the OOM killer. In practice, this means that at random times, the OOM killer would strike at big processes when the kernel tries to allocate, say, 16 kilobytes of memory for a new process’s thread stack — even when there are many gigabytes of memory in reclaimable kernel caches!
oom-killer  ooms  linux  ops  16.04 
march 2017 by jm
Annotated tenets of SRE
A Google SRE annotates the Google SRE book with his own thoughts. The source material is great, but the commentary improves it alright.

Particularly good for the error budget concept.

Also: when did "runbooks" become "playbooks"? Don't particularly care either way, but needless renaming is annoying.
runbooks  playbooks  ops  google  sre  error-budget 
march 2017 by jm
The Occasional Chaos of AWS Lambda Runtime Performance
If our code has modest resource requirements, and can tolerate large changes in performance, then it makes sense to start with the least amount of memory necessary. On the other hand, if consistency is important, the best way to achieve that is by cranking the memory setting all the way up to 1536MB.
It’s also worth noting here that CPU-bound Lambdas may be cheaper to run over time with a higher memory setting, as Jim Conning describes in his article, “AWS Lambda: Faster is Cheaper”. In our tests, we haven’t seen conclusive evidence of that behavior, but much more data is required to draw any strong conclusions.
The other lesson learned is that Lambda benchmarks should be gathered over the course of days, not hours or minutes, in order to provide actionable information. Otherwise, it’s possible to see very impressive performance from a Lambda that might later dramatically change for the worse, and any decisions made based on that information will be rendered useless.
aws  lambda  amazon  performance  architecture  ops  benchmarks 
march 2017 by jm
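To illustrate the "faster is cheaper" point with rough numbers, using the long-standing $0.00001667 per GB-second Lambda price and assuming, purely for illustration, that a CPU-bound function speeds up roughly in proportion to its memory setting (billing actually rounds up to 100ms increments, ignored here):

```python
GB_SECOND_PRICE = 0.00001667   # Lambda price per GB-second

def invocation_cost(memory_mb: int, duration_s: float) -> float:
    return (memory_mb / 1024.0) * duration_s * GB_SECOND_PRICE

# Hypothetical CPU-bound function: 10s at 128MB vs ~1s at 1536MB
# (the 1536MB setting gets roughly 12x the CPU share of 128MB).
print(f"128MB  x 10.0s: ${invocation_cost(128, 10.0):.7f}")   # ~$0.0000208
print(f"1536MB x  1.0s: ${invocation_cost(1536, 1.0):.7f}")   # ~$0.0000250
# Roughly comparable cost, ten times faster -- consistent with the
# article's "more data required" caution rather than a clear win.
```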
S3 2017-02-28 outage post-mortem
The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.  One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.  
s3  postmortem  aws  post-mortem  outages  cms  ops 
march 2017 by jm
"I caused an outage" thread on twitter
Anil Dash: "What was the first time you took the website down or broke the build? I’m thinking of all the inadvertent downtime that comes with shipping."

Sample response: 'Pushed a fatal error in lib/display.php to all of FB’s production servers one Friday night in late 2005. Site loaded blank pages for 20min.'
outages  reliability  twitter  downtime  fail  ops  post-mortem 
march 2017 by jm
Gravitational Teleport
Teleport enables teams to easily adopt the best SSH practices like:

- Integrated SSH credentials with your organization Google Apps identities or other OAuth identity providers.
- No need to distribute keys: Teleport uses certificate-based access with automatic expiration time.
- Enforcement of 2nd factor authentication.
- Cluster introspection: every Teleport node becomes a part of a cluster and is visible on the Web UI.
- Record and replay SSH sessions for knowledge sharing and auditing purposes.
- Collaboratively troubleshoot issues through session sharing.
- Connect to clusters located behind firewalls without direct Internet access via SSH bastions.
ssh  teleport  ops  bastions  security  auditing  oauth  2fa 
february 2017 by jm
How-to Debug a Running Docker Container from a Separate Container
arguably this shouldn't be required -- building containers without /bin/sh, strace, gdb etc. is just silly
strace  docker  ops  debugging  containers 
february 2017 by jm
10 Most Common Reasons Kubernetes Deployments Fail
some real-world failure cases and how to fix them
kubernetes  docker  ops 
february 2017 by jm
Instapaper Outage Cause & Recovery
Hard to see this as anything other than a pretty awful documentation fail by the AWS RDS service:
Without knowledge of the pre-April 2014 file size limit, it was difficult to foresee and prevent this issue. As far as we can tell, there’s no information in the RDS console in the form of monitoring, alerts or logging that would have let us know we were approaching the 2TB file size limit, or that we were subject to it in the first place. Even now, there’s nothing to indicate that our hosted database has a critical issue.
limits  aws  rds  databases  mysql  filesystems  ops  instapaper  risks 
february 2017 by jm
sparkey
Spotify's read-only k/v store
spotify  sparkey  read-only  key-value  storage  ops  architecture 
february 2017 by jm
square/shift
'shift is a [web] application that helps you run schema migrations on MySQL databases'
databases  mysql  sql  migrations  ops  square  ddl  percona 
february 2017 by jm
A server with 24 years of uptime
wow. Stratus fault-tolerant systems ftw.

'This is a fault tolerant server, which means that hardware components are redundant. Over the years, disk drives, power supplies and some other components have been replaced but Hogan estimates that close to 80% of the system is original.'

(via internetofshit, which this isn't)
stratus  fault-tolerance  hardware  uptime  records  ops 
january 2017 by jm
Google - Site Reliability Engineering
The Google SRE book is now online, for free
sre  google  ops  books  reading 
january 2017 by jm
PagerDuty Incident Response Documentation
This documentation covers parts of the PagerDuty Incident Response process. It is a cut-down version of our internal documentation, used at PagerDuty for any major incidents, and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident, but also what to do during and after. It is intended to be used by on-call practitioners and those involved in an operational incident response process (or those wishing to enact a formal incident response process).


This is a really good set of processes -- quite similar to what we used in Amazon for high-severity outage response.
ops  process  outages  pagerduty  incident-response  incidents  on-call 
january 2017 by jm
Leap Smear  |  Public NTP  |  Google Developers
Google offers public NTP service with leap smearing -- I didn't realise! (thanks Keith)
google  clocks  time  ntp  leap-smearing  leap-second  ops 
january 2017 by jm
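Rough arithmetic on the smear itself: Google's public NTP spreads the extra second over a 24-hour window around the leap second, so every smeared second runs slightly long:

```python
window_s = 24 * 3600            # 24-hour smear window
extra_s = 1.0                   # one leap second to absorb
rate = extra_s / window_s       # fractional frequency offset during the smear
print(f"{rate * 1e6:.2f} ppm")  # ~11.57 ppm, i.e. each second is ~11.6 microseconds long
```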
AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205) - YouTube
Yelp talk about their Spot Fleet price optimization autoscaler app, FleetMiser
yelp  scaling  aws  spot-fleet  ops  spot-instances  money 
december 2016 by jm
Auto scaling Pinterest
notes on a second-system take on autoscaling -- Pinterest tried it once, it didn't take, and this is the rerun. I like the tandem ASG approach (spots and nonspots)
spot-instances  scaling  aws  scalability  ops  architecture  pinterest  via:highscalability 
december 2016 by jm
Syscall Auditing at Scale
auditd -> go-audit -> elasticsearch at Slack
elasticsearch  auditd  syscalls  auditing  ops  slack 
november 2016 by jm
Etsy Debriefing Facilitation Guide
by John Allspaw, Morgan Evans and Daniel Schauenberg; the Etsy blameless postmortem style crystallized into a detailed 27-page PDF ebook
etsy  postmortems  blameless  ops  production  debriefing  ebooks 
november 2016 by jm
Julia Evans reverse engineers Skyliner.io
simple usage of Docker, blue/green deploys, and AWS ALBs
docker  alb  aws  ec2  blue-green-deploys  deployment  ops  tools  skyliner  via:jgilbert 
november 2016 by jm
Testing Docker multi-host network performance - Percona Database Performance Blog
wow, Docker Swarm looks like a turkey right now if performance is important. Only "host" networking gives reasonable perf numbers
docker  networking  performance  ops  benchmarks  testing  swarm  overlay  calico  weave  bridge 
november 2016 by jm
Measuring Docker IO overhead - Percona Database Performance Blog
See also https://www.percona.com/blog/2016/02/05/measuring-docker-cpu-network-overhead/ for the CPU/Network equivalent. The good news is that nowadays it's virtually 0 when the correct settings are used
docker  percona  overhead  mysql  deployment  performance  ops  containers 
november 2016 by jm
The Square Root Staffing Law
The square root staffing law is a rule of thumb derived from queueing theory, useful for getting an estimate of the capacity you might need to serve an increased amount of traffic.
ops  capacity  planning  rules-of-thumb  qed-regime  efficiency  architecture 
november 2016 by jm
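A minimal sketch of the rule of thumb (my notation, not the article's): if the offered load is a = arrival rate / per-server service rate, provision roughly a + c*sqrt(a) servers, where c is a small safety factor chosen for your target quality of service (c around 1-2 is common):

```python
import math

def sqrt_staffing(arrival_rate: float, service_rate: float, c: float = 2.0) -> int:
    """Servers needed for offered load a = arrival_rate / service_rate."""
    a = arrival_rate / service_rate
    return math.ceil(a + c * math.sqrt(a))

# e.g. 900 requests/s, each server handles 10 req/s -> offered load a = 90:
# ~90 servers of base capacity plus ~19 of square-root headroom.
print(sqrt_staffing(900, 10))   # 109
```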
'Jupiter rising: A decade of Clos topologies and centralized control in Google’s datacenter networks'
Love the 'decade of' dig at FB and Amazon -- 'we were doing it first' ;)

Great details on how Google have built out and improved their DC networking. Includes a hint that they now use DCTCP (datacenter-optimized TCP congestion control) on their internal hosts....
datacenter  google  presentation  networks  networking  via:irldexter  ops  sre  clos-networks  fabrics  switching  history  datacenters 
october 2016 by jm
Best practices with Airflow
interesting presentation describing how to architect Airflow ETL setups; see also https://gtoonstra.github.io/etl-with-airflow/principles.html
etl  airflow  batch  architecture  systems  ops 
october 2016 by jm
Kafka Streams - Scaling up or down
this is a nice zero-config scaling story -- good work Kafka Streams
scaling  scalability  architecture  kafka  streams  ops 
october 2016 by jm
Charity Majors responds to the CleverTap Mongo outage war story
This is a great blog post, spot on:
You can’t just go “dudes it’s faster” and jump off a cliff.  This shit is basic.  Test real production workloads. Have a rollback plan.  (Not for *10 days* … try a month or two.)


The only thing I'd nitpick on is that it's all very well to say "buy my book" or "come see me talk at Blahcon", but a good blog post or webpage would be thousands of times more useful.
databases  stateful-services  services  ops  mongodb  charity-majors  rollback  state  storage  testing  dba 
october 2016 by jm
Airflow/AMI/ASG nightly-packaging workflow
Some tantalising discussion on twitter of an Airflow + AMI + ASG workflow for ML packaging:

'We build models using Airflow. We deploy new models as AMIs where each AMI is model + scoring code. The AMI is hence a version of code + model at a point in time : #immutable_infrastructure. It's natural for Airflow to build & deploy the model+code with each Airflow DAG Run corresponding to a versioned AMI. if there's a problem, we can simply roll back to the previous AMI & identify the problematic model building Dag run. Since we use ASGs, Airflow can execute a rolling deploy of new AMIs. We could also have it do a validation & ASG rollback of the AMI if validation fails. Airflow is being used for reliable Model build+validation+deployment.'
ml  packaging  airflow  asg  ami  deployment  ops  infrastructure  rollback 
september 2016 by jm
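A hypothetical Airflow DAG sketching the shape of that workflow -- build the model, bake it into an AMI, roll it out via the ASG. The DAG name, scripts and commands are placeholders, not anything from the thread:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="model_build_and_deploy",      # placeholder DAG name
    start_date=datetime(2016, 9, 1),
    schedule_interval="@daily",
)

build_model = BashOperator(
    task_id="build_model",
    bash_command="python build_model.py --output /models/{{ ds }}/",   # placeholder script
    dag=dag,
)

bake_ami = BashOperator(
    task_id="bake_ami",
    bash_command="packer build -var model_date={{ ds }} scoring-ami.json",  # placeholder
    dag=dag,
)

rolling_deploy = BashOperator(
    task_id="rolling_deploy",
    bash_command="./rolling_deploy.sh {{ ds }}",   # placeholder: swap the new AMI into the ASG
    dag=dag,
)

build_model >> bake_ami >> rolling_deploy
```

Each DAG run then corresponds to one versioned AMI, which is what makes the "roll back to the previous AMI" step in the quote straightforward.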
J1 2015 "Debugging Java Apps in Containers: No Heavy Welding Gear Required"
Some good slides with tips on running java apps in production in Docker
java  docker  ops  containers 
september 2016 by jm
Skyliner
Coda Hale's new gig on how they're using Docker, AWS, etc. I like this: "Use containers. Not too much. Mostly for packaging."
docker  aws  packaging  ops  devops  containers  skyliner 
september 2016 by jm
Auto Scaling for EC2 Spot Fleets
'we are enhancing the Spot Fleet model with the addition of Auto Scaling. You can now arrange to scale your fleet up and down based on a Amazon CloudWatch metric. The metric can originate from an AWS service such as EC2, Amazon EC2 Container Service, or Amazon Simple Queue Service (SQS). Alternatively, your application can publish a custom metric and you can use it to drive the automated scaling.'
asg  auto-scaling  ec2  spot-fleets  ops  scaling 
september 2016 by jm
Introducing Winston
'Event driven Diagnostic and Remediation Platform' -- aka 'runbooks as code'
runbooks  winston  netflix  remediation  outages  mttr  ops  devops 
august 2016 by jm
AWS Case Study: mytaxi
ECS, Docker, ELB, SQS, SNS, RDS, VPC, and spot instances. Pretty canonical setup these days...
The mytaxi app is also now able to predict daily and weekly spikes. In addition, it has gained the elasticity required to meet demand during special events. Herzberg describes a typical situation on New Year's Eve: “Shortly before midnight everyone needs a taxi to get to parties, and after midnight people want to go home. In past years we couldn't keep up with the demand this generated, which was around three and a half times as high as normal. In November 2015 we moved our Docker container architecture to Amazon ECS, and for the first time ever in December we were able to celebrate a new year in which our system could handle the huge number of requests without any crashes or interruptions—an accomplishment that we were extremely proud of. We had faced the biggest night on the calendar without any downtime.”
mytaxi  aws  ecs  docker  elb  sqs  sns  rds  vpc  spot-instances  ops  architecture 
august 2016 by jm
My Philosophy on Alerting
'based on my observations while I was a Site Reliability Engineer at Google', courtesy of Rob Ewaschuk <rob@infinitepigeons.org>. Seems pretty reasonable
monitoring  sysadmin  alerting  alerts  nagios  pager  ops  sre  rob-ewaschuk 
july 2016 by jm
Raintank investing in Graphite
paying Jason Dixon to work on it, improving the backend, possibly replacing the creaky Whisper format. great news!
graphite  metrics  monitoring  ops  open-source  grafana  raintank 
july 2016 by jm
USE Method: Linux Performance Checklist
Really late in bookmarking this, but has some up-to-date sample commandlines for sar, mpstat and iostat on linux
linux  sar  iostat  mpstat  cli  ops  sysadmin  performance  tuning  use  metrics 
june 2016 by jm