jm + ops   414

NetSpot
'FREE WiFi Site Survey Software for MAC OS X & Windows'.
Sadly, reviews from pals are that it is 'shite' :(
osx  wifi  network  survey  netspot  networking  ops  dataviz  wireless 
7 days ago by jm
Julia Evans on Twitter: "notes on this great "When the pager goes off" article"
'notes on this great "When the pager goes off" article from @incrementmag https://increment.com/on-call/when-the-pager-goes-off/ ' -- cartoon summarising a much longer article of common modern ops on-call response techniques. Still pretty consistent with the systems we used in Amazon
on-call  ops  incident-response  julia-evans  pager  increment-mag 
15 days ago by jm
Ubuntu on AWS gets serious performance boost with AWS-tuned kernel
interesting -- faster boots, CPU throttling resolved on t2.micros, other nice stuff
aws  ubuntu  ec2  kernel  linux  ops 
17 days ago by jm
Spotify’s Love/Hate Relationship with DNS
omg somebody at Spotify really really loves DNS. They even store a DHT hash ring in it. whyyyyyyyyyyy
spotify  networking  architecture  dht  insane  scary  dns  unbound  ops 
18 days ago by jm
Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites
Solid article proselytising runbooks/playbooks (or in this article's parlance, "Incident Models") for dev/ops handover and operational knowledge
ops  process  sre  devops  runbooks  playbooks  incident-models 
18 days ago by jm
Deep Dive on Amazon EBS Elastic Volumes
'March 2017 AWS Online Tech Talks' -- lots about the new volume types
aws  ebs  storage  architecture  ops  slides 
29 days ago by jm
Learn redis the hard way (in production) · trivago techblog
oh god this is pretty awful. this just reads like "don't try to use Redis at scale" to me
redis  scalability  ops  architecture  horror  trivago  php 
29 days ago by jm
ctop
Top for containers (i.e. Docker)
docker  containers  top  ops  go  monitoring  cpu 
6 weeks ago by jm
How to stop Ubuntu Xenial (16.04) from randomly killing your big processes
ugh.
Unfortunately, a bug was recently introduced into the allocator which made it sometimes not try hard enough to free kernel cache memory before giving up and invoking the OOM killer. In practice, this means that at random times, the OOM killer would strike at big processes when the kernel tries to allocate, say, 16 kilobytes of memory for a new process’s thread stack — even when there are many gigabytes of memory in reclaimable kernel caches!
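
(Not from the article, but for reference: a critical daemon can at least opt itself out of the OOM killer by lowering its oom_score_adj. A minimal sketch -- the value and use of "self" are illustrative, and lowering the score needs root:)

import os

# -1000 exempts the process from the OOM killer entirely; higher values
# (up to +1000) make it a more likely victim.
with open("/proc/self/oom_score_adj", "w") as f:
    f.write("-1000")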
oom-killer  ooms  linux  ops  16.04 
7 weeks ago by jm
Annotated tenets of SRE
A Google SRE annotates the Google SRE book with his own thoughts. The source material is great, but the commentary improves it further.

Particularly good for the error budget concept.

Also: when did "runbooks" become "playbooks"? Don't particularly care either way, but needless renaming is annoying.
runbooks  playbooks  ops  google  sre  error-budget 
7 weeks ago by jm
The Occasional Chaos of AWS Lambda Runtime Performance
If our code has modest resource requirements, and can tolerate large changes in performance, then it makes sense to start with the least amount of memory necessary. On the other hand, if consistency is important, the best way to achieve that is by cranking the memory setting all the way up to 1536MB.
It’s also worth noting here that CPU-bound Lambdas may be cheaper to run over time with a higher memory setting, as Jim Conning describes in his article, “AWS Lambda: Faster is Cheaper”. In our tests, we haven’t seen conclusive evidence of that behavior, but much more data is required to draw any strong conclusions.
The other lesson learned is that Lambda benchmarks should be gathered over the course of days, not hours or minutes, in order to provide actionable information. Otherwise, it’s possible to see very impressive performance from a Lambda that might later dramatically change for the worse, and any decisions made based on that information will be rendered useless.
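
If you want to act on the "crank it to 1536MB" advice, it's a one-liner with boto3 (the function name is a placeholder):

import boto3

lam = boto3.client("lambda")
# The memory setting also scales the CPU share the function gets, which is
# why higher settings give the more consistent runtimes described above.
lam.update_function_configuration(
    FunctionName="my-function",   # placeholder
    MemorySize=1536,
)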
aws  lambda  amazon  performance  architecture  ops  benchmarks 
7 weeks ago by jm
S3 2017-02-28 outage post-mortem
The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.  One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.  
s3  postmortem  aws  post-mortem  outages  cms  ops 
8 weeks ago by jm
"I caused an outage" thread on twitter
Anil Dash: "What was the first time you took the website down or broke the build? I’m thinking of all the inadvertent downtime that comes with shipping."

Sample response: 'Pushed a fatal error in lib/display.php to all of FB’s production servers one Friday night in late 2005. Site loaded blank pages for 20min.'
outages  reliability  twitter  downtime  fail  ops  post-mortem 
8 weeks ago by jm
Gravitational Teleport
Teleport enables teams to easily adopt the best SSH practices like:

Integrated SSH credentials with your organization's Google Apps identities or other OAuth identity providers.
No need to distribute keys: Teleport uses certificate-based access with automatic expiration time.
Enforcement of 2nd factor authentication.
Cluster introspection: every Teleport node becomes a part of a cluster and is visible on the Web UI.
Record and replay SSH sessions for knowledge sharing and auditing purposes.
Collaboratively troubleshoot issues through session sharing.
Connect to clusters located behind firewalls without direct Internet access via SSH bastions.
ssh  teleport  ops  bastions  security  auditing  oauth  2fa 
8 weeks ago by jm
How-to Debug a Running Docker Container from a Separate Container
arguably this shouldn't be required -- building containers without /bin/sh, strace, gdb etc. is just silly
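
(For reference, the technique boils down to running a second container inside the target's PID/network namespaces; a rough sketch with the docker Python SDK, where the image and container name are placeholders:)

import docker

client = docker.from_env()
# Run a throwaway debug container that shares the app container's namespaces,
# so strace/gdb can see its processes even though the app image ships neither.
out = client.containers.run(
    "my-debug-image",                # placeholder: any image with strace/gdb installed
    "strace -f -p 1",
    pid_mode="container:myapp",      # share the target container's PID namespace
    network_mode="container:myapp",  # and its network namespace
    cap_add=["SYS_PTRACE"],          # ptrace is blocked by the default seccomp profile
    remove=True,
)
print(out.decode())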
strace  docker  ops  debugging  containers 
9 weeks ago by jm
10 Most Common Reasons Kubernetes Deployments Fail
some real-world failure cases and how to fix them
kubernetes  docker  ops 
9 weeks ago by jm
Instapaper Outage Cause & Recovery
Hard to see this as anything other than a pretty awful documentation fail by the AWS RDS service:
Without knowledge of the pre-April 2014 file size limit, it was difficult to foresee and prevent this issue. As far as we can tell, there’s no information in the RDS console in the form of monitoring, alerts or logging that would have let us know we were approaching the 2TB file size limit, or that we were subject to it in the first place. Even now, there’s nothing to indicate that our hosted database has a critical issue.
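
(Worth watching yourself, since RDS won't -- a rough per-table size check; connection details are placeholders:)

import pymysql

conn = pymysql.connect(host="mydb.example.rds.amazonaws.com",   # placeholder endpoint
                       user="admin", password="...",
                       database="information_schema")
with conn.cursor() as cur:
    # data_length + index_length roughly tracks the on-disk .ibd file size;
    # alert long before any single table gets near the 2TB ceiling.
    cur.execute("""
        SELECT table_schema, table_name,
               ROUND((data_length + index_length) / POWER(1024, 3), 1) AS size_gb
        FROM tables ORDER BY size_gb DESC LIMIT 10
    """)
    for row in cur.fetchall():
        print(row)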
limits  aws  rds  databases  mysql  filesystems  ops  instapaper  risks 
10 weeks ago by jm
sparkey
Spotify's read-only k/v store
spotify  sparkey  read-only  key-value  storage  ops  architecture 
11 weeks ago by jm
square/shift
'shift is a [web] application that helps you run schema migrations on MySQL databases'
databases  mysql  sql  migrations  ops  square  ddl  percona 
12 weeks ago by jm
A server with 24 years of uptime
wow. Stratus fault-tolerant systems ftw.

'This is a fault tolerant server, which means that hardware components are redundant. Over the years, disk drives, power supplies and some other components have been replaced but Hogan estimates that close to 80% of the system is original.'

(via internetofshit, which this isn't)
stratus  fault-tolerance  hardware  uptime  records  ops 
12 weeks ago by jm
Google - Site Reliability Engineering
The Google SRE book is now online, for free
sre  google  ops  books  reading 
12 weeks ago by jm
PagerDuty Incident Response Documentation
This documentation covers parts of the PagerDuty Incident Response process. It is a cut-down version of our internal documentation, used at PagerDuty for any major incidents, and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident, but also what to do during and after. It is intended to be used by on-call practitioners and those involved in an operational incident response process (or those wishing to enact a formal incident response process).


This is a really good set of processes -- quite similar to what we used in Amazon for high-severity outage response.
ops  process  outages  pagerduty  incident-response  incidents  on-call 
january 2017 by jm
Leap Smear  |  Public NTP  |  Google Developers
Google offers a public NTP service with leap smearing -- I didn't realise! (thanks Keith)
google  clocks  time  ntp  leap-smearing  leap-second  ops 
january 2017 by jm
AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205) - YouTube
Yelp talk about their Spot Fleet price optimization autoscaler app, FleetMiser
yelp  scaling  aws  spot-fleet  ops  spot-instances  money 
december 2016 by jm
Auto scaling Pinterest
notes on a second-system take on autoscaling -- Pinterest tried it once, it didn't take, and this is the rerun. I like the tandem ASG approach (spots and nonspots)
spot-instances  scaling  aws  scalability  ops  architecture  pinterest  via:highscalability 
december 2016 by jm
Syscall Auditing at Scale
auditd -> go-audit -> elasticsearch at Slack
elasticsearch  auditd  syscalls  auditing  ops  slack 
november 2016 by jm
Etsy Debriefing Facilitation Guide
by John Allspaw, Morgan Evans and Daniel Schauenberg; the Etsy blameless postmortem style crystallized into a detailed 27-page PDF ebook
etsy  postmortems  blameless  ops  production  debriefing  ebooks 
november 2016 by jm
Julia Evans reverse engineers Skyliner.io
simple usage of Docker, blue/green deploys, and AWS ALBs
docker  alb  aws  ec2  blue-green-deploys  deployment  ops  tools  skyliner  via:jgilbert 
november 2016 by jm
Testing Docker multi-host network performance - Percona Database Performance Blog
wow, Docker Swarm looks like a turkey right now if performance is important. Only "host" networking gives reasonable perf numbers
docker  networking  performance  ops  benchmarks  testing  swarm  overlay  calico  weave  bridge 
november 2016 by jm
Measuring Docker IO overhead - Percona Database Performance Blog
See also https://www.percona.com/blog/2016/02/05/measuring-docker-cpu-network-overhead/ for the CPU/Network equivalent. The good news is that nowadays it's virtually 0 when the correct settings are used
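
(My reading of "the correct settings": host networking plus a bind-mounted volume for the data directory, so database I/O bypasses the copy-on-write storage driver. A hedged sketch with the docker SDK; image and paths are placeholders:)

import docker

client = docker.from_env()
client.containers.run(
    "percona:5.7",                       # placeholder image/tag
    detach=True,
    network_mode="host",                 # skip the bridge/NAT overhead
    volumes={"/data/mysql": {"bind": "/var/lib/mysql", "mode": "rw"}},  # data dir off the CoW layer
    environment={"MYSQL_ROOT_PASSWORD": "secret"},  # placeholder
)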
docker  percona  overhead  mysql  deployment  performance  ops  containers 
november 2016 by jm
The Square Root Staffing Law
The square root staffing law is a rule of thumb derived from queueing theory, useful for getting an estimate of the capacity you might need to serve an increased amount of traffic.
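
In its usual form: if R is the offered load (arrival rate x mean service time), you need roughly N = R + c*sqrt(R) servers, with c a small constant set by your quality-of-service target. A toy calculation:

import math

def square_root_staffing(arrival_rate, service_time, c=2.0):
    """Servers needed for offered load R = arrival_rate * service_time.

    c trades cost against queueing delay; c in the 1-2 range is a common start.
    """
    r = arrival_rate * service_time
    return math.ceil(r + c * math.sqrt(r))

# e.g. 500 req/s at 200ms each -> R = 100, so ~120 servers rather than a bare 100.
print(square_root_staffing(500, 0.2))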
ops  capacity  planning  rules-of-thumb  qed-regime  efficiency  architecture 
november 2016 by jm
'Jupiter rising: A decade of Clos topologies and centralized control in Google’s datacenter networks'
Love the 'decade of' dig at FB and Amazon -- 'we were doing it first' ;)

Great details on how Google have built out and improved their DC networking. Includes a hint that they now use DCTCP (datacenter-optimized TCP congestion control) on their internal hosts....
datacenter  google  presentation  networks  networking  via:irldexter  ops  sre  clos-networks  fabrics  switching  history  datacenters 
october 2016 by jm
Best practices with Airflow
interesting presentation describing how to architect Airflow ETL setups; see also https://gtoonstra.github.io/etl-with-airflow/principles.html
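
(For context, an Airflow ETL is just a Python DAG of operators; a minimal sketch in the Airflow 1.x style of the time, names and commands being placeholders, with the execution date templated in so any day's run can be repeated idempotently -- one of the principles the linked pages push.)

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG("daily_clicks_etl", default_args=default_args,
          start_date=datetime(2016, 10, 1), schedule_interval="@daily")

# Partition work by the templated execution date ({{ ds }}) so re-running a
# given day reproduces the same output instead of double-counting.
extract = BashOperator(task_id="extract",
                       bash_command="extract_clicks --date {{ ds }}",  # placeholder command
                       dag=dag)
load = BashOperator(task_id="load",
                    bash_command="load_clicks --date {{ ds }}",        # placeholder command
                    dag=dag)

extract >> load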
etl  airflow  batch  architecture  systems  ops 
october 2016 by jm
Kafka Streams - Scaling up or down
this is a nice zero-config scaling story -- good work Kafka Streams
scaling  scalability  architecture  kafka  streams  ops 
october 2016 by jm
Charity Majors responds to the CleverTap Mongo outage war story
This is a great blog post, spot on:
You can’t just go “dudes it’s faster” and jump off a cliff.  This shit is basic.  Test real production workloads. Have a rollback plan.  (Not for *10 days* … try a month or two.)


The only thing I'd nitpick on is that it's all very well to say "buy my book" or "come see me talk at Blahcon", but a good blog post or webpage would be thousands of times more useful.
databases  stateful-services  services  ops  mongodb  charity-majors  rollback  state  storage  testing  dba 
october 2016 by jm
Airflow/AMI/ASG nightly-packaging workflow
Some tantalising discussion on twitter of an Airflow + AMI + ASG workflow for ML packaging:

'We build models using Airflow. We deploy new models as AMIs where each AMI is model + scoring code. The AMI is hence a version of code + model at a point in time : #immutable_infrastructure. It's natural for Airflow to build & deploy the model+code with each Airflow DAG Run corresponding to a versioned AMI. if there's a problem, we can simply roll back to the previous AMI & identify the problematic model building Dag run. Since we use ASGs, Airflow can execute a rolling deploy of new AMIs. We could also have it do a validation & ASG rollback of the AMI if validation fails. Airflow is being used for reliable Model build+validation+deployment.'
ml  packaging  airflow  asg  ami  deployment  ops  infrastructure  rollback 
september 2016 by jm
J1 2015 "Debugging Java Apps in Containers: No Heavy Welding Gear Required"
Some good slides with tips on running java apps in production in Docker
java  docker  ops  containers 
september 2016 by jm
Skyliner
Coda Hale's new gig on how they're using Docker, AWS, etc. I like this: "Use containers. Not too much. Mostly for packaging."
docker  aws  packaging  ops  devops  containers  skyliner 
september 2016 by jm
Auto Scaling for EC2 Spot Fleets
'we are enhancing the Spot Fleet model with the addition of Auto Scaling. You can now arrange to scale your fleet up and down based on an Amazon CloudWatch metric. The metric can originate from an AWS service such as EC2, Amazon EC2 Container Service, or Amazon Simple Queue Service (SQS). Alternatively, your application can publish a custom metric and you can use it to drive the automated scaling.'
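
Under the hood this is Application Auto Scaling pointed at the fleet's TargetCapacity; a rough boto3 sketch (fleet ID, thresholds and policy details are placeholders):

import boto3

aas = boto3.client("application-autoscaling")

# Make the spot fleet's target capacity a scalable dimension...
# (depending on account setup you may also need to pass a RoleARN here)
aas.register_scalable_target(
    ServiceNamespace="ec2",
    ResourceId="spot-fleet-request/sfr-0123abcd",   # placeholder fleet id
    ScalableDimension="ec2:spot-fleet-request:TargetCapacity",
    MinCapacity=2,
    MaxCapacity=20,
)

# ...then attach a step-scaling policy for a CloudWatch alarm to trigger.
aas.put_scaling_policy(
    PolicyName="scale-out",
    ServiceNamespace="ec2",
    ResourceId="spot-fleet-request/sfr-0123abcd",
    ScalableDimension="ec2:spot-fleet-request:TargetCapacity",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "Cooldown": 300,
        "StepAdjustments": [{"MetricIntervalLowerBound": 0, "ScalingAdjustment": 2}],
    },
)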
asg  auto-scaling  ec2  spot-fleets  ops  scaling 
september 2016 by jm
Introducing Winston
'Event driven Diagnostic and Remediation Platform' -- aka 'runbooks as code'
runbooks  winston  netflix  remediation  outages  mttr  ops  devops 
august 2016 by jm
AWS Case Study: mytaxi
ECS, Docker, ELB, SQS, SNS, RDS, VPC, and spot instances. Pretty canonical setup these days...
The mytaxi app is also now able to predict daily and weekly spikes. In addition, it has gained the elasticity required to meet demand during special events. Herzberg describes a typical situation on New Year's Eve: “Shortly before midnight everyone needs a taxi to get to parties, and after midnight people want to go home. In past years we couldn't keep up with the demand this generated, which was around three and a half times as high as normal. In November 2015 we moved our Docker container architecture to Amazon ECS, and for the first time ever in December we were able to celebrate a new year in which our system could handle the huge number of requests without any crashes or interruptions—an accomplishment that we were extremely proud of. We had faced the biggest night on the calendar without any downtime.”
mytaxi  aws  ecs  docker  elb  sqs  sns  rds  vpc  spot-instances  ops  architecture 
august 2016 by jm
My Philosophy on Alerting
'based on my observations while I was a Site Reliability Engineer at Google', courtesy of Rob Ewaschuk <rob@infinitepigeons.org>. Seems pretty reasonable
monitoring  sysadmin  alerting  alerts  nagios  pager  ops  sre  rob-ewaschuk 
july 2016 by jm
Raintank investing in Graphite
paying Jason Dixon to work on it, improving the backend, possibly replacing the creaky Whisper format. great news!
graphite  metrics  monitoring  ops  open-source  grafana  raintank 
july 2016 by jm
USE Method: Linux Performance Checklist
Really late in bookmarking this, but has some up-to-date sample commandlines for sar, mpstat and iostat on linux
linux  sar  iostat  mpstat  cli  ops  sysadmin  performance  tuning  use  metrics 
june 2016 by jm
Squeezing blood from a stone: small-memory JVM techniques for microservice sidecars
Reducing service memory usage from 500MB to 105MB:
We found two specific techniques to be the most beneficial: turning off one of the two JIT compilers enabled by default (the “C2” compiler), and using a 32-bit, rather than a 64-bit, JVM.
32bit  jvm  java  ops  memory  tuning  jit  linkerd 
june 2016 by jm
Some thoughts on operating containers
R.I.Pienaar talks about the conventions he uses when containerising; looks like a decent approach.
ops  containers  docker  ripienaar  packaging 
june 2016 by jm
Green/Blue Deployments with AWS Lambda and CloudFormation - done right
Basically, use a Lambda to put all instances from an ASG into the ELB, then remove the old ASG
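
The Lambda body boils down to a couple of API calls; a hedged sketch against classic ELB (names are placeholders, and a real deploy needs health-check waits and error handling):

import boto3

asg = boto3.client("autoscaling")
elb = boto3.client("elb")

def handler(event, context):
    # Register every instance in the new ("green") ASG with the production ELB.
    group = asg.describe_auto_scaling_groups(
        AutoScalingGroupNames=["green-asg"])["AutoScalingGroups"][0]   # placeholder ASG name
    instances = [{"InstanceId": i["InstanceId"]} for i in group["Instances"]]
    elb.register_instances_with_load_balancer(
        LoadBalancerName="prod-elb", Instances=instances)              # placeholder ELB name

    # Once they pass health checks, the old ("blue") ASG can be drained and
    # deleted -- omitted here.
    return {"registered": len(instances)}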
asg  elb  aws  lambda  deployment  ops  blue-green-deploys 
may 2016 by jm
#825394 - systemd kill background processes after user logs out - Debian Bug report logs
Systemd breaks UNIX behaviour which has been standard practice for 30 years:
It is now indeed the case that any background processes that were still running are killed automatically when the user logs out of a session, whether it was a desktop session, a VT session, or when you SSHed into a machine. Now you can no longer expect a long-running background process to continue after logging out. I believe this breaks the expectations of many users. For example, you can no longer start a screen or tmux session, log out, and expect to come back to it.
systemd  ops  debian  linux  fail  background  cli  commandline 
may 2016 by jm
3 Reasons AWS Lambda Is Not Ready for Prime Time
This totally matches my own preconceptions ;)
When we at Datawire tried to actually use Lambda for a real-world HTTP-based microservice [...], we found some uncool things that make Lambda not yet ready for the world we live in:

Lambda is a building block, not a tool;
Lambda is not well documented;
Lambda is terrible at error handling

Lung skips these uncool things, which makes sense because they’d make the tutorial collapse under its own weight, but you can’t skip them if you want to work in the real world. (Note that if you’re using Lambda for event handling within the AWS world, your life will be easier. But the really interesting case in the microservice world is Lambda and HTTP.)
aws  lambda  microservices  datawire  http  api-gateway  apis  https  python  ops 
may 2016 by jm
Key Metrics for Amazon Aurora | AWS Partner Network (APN) Blog
Very DataDog-oriented, but some decent tips on monitorable metrics here
datadog  metrics  aurora  aws  rds  monitoring  ops 
may 2016 by jm
raboof/nethogs: Linux 'net top' tool
NetHogs is a small 'net top' tool. Instead of breaking the traffic down per protocol or per subnet, like most tools do, it groups bandwidth by process.
nethogs  cli  networking  performance  measurement  ops  linux  top 
may 2016 by jm
CoreOS and Prometheus: Building monitoring for the next generation of cluster infrastructure
Ooh, this is a great plan. :applause:
Enabling GIFEE — Google Infrastructure for Everyone Else — is a primary mission at CoreOS, and open source is key to that goal. [....]

Prometheus was initially created to handle monitoring and alerting in modern microservice architectures. It steadily grew to fit the wider idea of cloud native infrastructure. Though it was not intentional in the original design, Prometheus and Kubernetes conveniently share the key concept of identifying entities by labels, making the semantics of monitoring Kubernetes clusters simple. As we discussed previously on this blog, Prometheus metrics formed the basis of our analysis of Kubernetes scheduler performance, and led directly to improvements in that code. Metrics are essential not just to keep systems running, but also to analyze and improve application behavior.

All things considered, Prometheus was an obvious choice for the next open source project CoreOS wanted to support and improve with internal developers committed to the code base.
monitoring  coreos  prometheus  metrics  clustering  ops  gifee  google  kubernetes 
may 2016 by jm
Amazon S3 Transfer Acceleration
The AWS edge network has points of presence in more than 50 locations. Today, it is used to distribute content via Amazon CloudFront and to provide rapid responses to DNS queries made to Amazon Route 53. With today’s announcement, the edge network also helps to accelerate data transfers in to and out of Amazon S3. It will be of particular benefit to you if you are transferring data across or between continents, have a fast Internet connection, use large objects, or have a lot of content to upload.

You can think of the edge network as a bridge between your upload point (your desktop or your on-premises data center) and the target bucket. After you enable this feature for a bucket (by checking a checkbox in the AWS Management Console), you simply change the bucket’s endpoint to the form BUCKET_NAME.s3-accelerate.amazonaws.com. No other configuration changes are necessary! After you do this, your TCP connections will be routed to the best AWS edge location based on latency.  Transfer Acceleration will then send your uploads back to S3 over the AWS-managed backbone network using optimized network protocols, persistent connections from edge to origin, fully-open send and receive windows, and so forth.
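
From boto3 it's one bucket setting plus a client config flag (bucket and key names are placeholders):

import boto3
from botocore.client import Config

s3 = boto3.client("s3")
# One-off: enable acceleration on the bucket (the console checkbox mentioned above).
s3.put_bucket_accelerate_configuration(
    Bucket="my-bucket", AccelerateConfiguration={"Status": "Enabled"})

# Then route transfers via the *.s3-accelerate.amazonaws.com endpoint.
s3_fast = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3_fast.upload_file("big-dataset.tar.gz", "my-bucket", "uploads/big-dataset.tar.gz")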
aws  s3  networking  infrastructure  ops  internet  cdn 
april 2016 by jm
Google Cloud Status
Ouch, multi-region outage:
At 14:50 Pacific Time on April 11th, our engineers removed an unused GCE IP block from our network configuration, and instructed Google’s automated systems to propagate the new configuration across our network. By itself, this sort of change was harmless and had been performed previously without incident. However, on this occasion our network configuration management software detected an inconsistency in the newly supplied configuration. The inconsistency was triggered by a timing quirk in the IP block removal - the IP block had been removed from one configuration file, but this change had not yet propagated to a second configuration file also used in network configuration management. In attempting to resolve this inconsistency the network management software is designed to ‘fail safe’ and revert to its current configuration rather than proceeding with the new configuration. However, in this instance a previously-unseen software bug was triggered, and instead of retaining the previous known good configuration, the management software instead removed all GCE IP blocks from the new configuration and began to push this new, incomplete configuration to the network.

One of our core principles at Google is ‘defense in depth’, and Google’s networking systems have a number of safeguards to prevent them from propagating incorrect or invalid configurations in the event of an upstream failure or bug. These safeguards include a canary step where the configuration is deployed at a single site and that site is verified to still be working correctly, and a progressive rollout which makes changes to only a fraction of sites at a time, so that a novel failure can be caught at an early stage before it becomes widespread. In this event, the canary step correctly identified that the new configuration was unsafe. Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.
multi-region  outages  google  ops  postmortems  gce  cloud  ip  networking  cascading-failures  bugs 
april 2016 by jm
Open Sourcing Dr. Elephant: Self-Serve Performance Tuning for Hadoop and Spark
[LinkedIn] are proud to announce today that we are open sourcing Dr. Elephant, a powerful tool that helps users of Hadoop and Spark understand, analyze, and improve the performance of their flows.


neat, although I've been bitten too many times by LinkedIn OSS release quality at this point to jump in....
linkedin  oss  hadoop  spark  performance  tuning  ops 
april 2016 by jm
AWSume
'AWS Assume Made Awesome' -- 'Here at Trek10, we work with many clients, and thus work with multiple AWS accounts on a regular (daily) basis. We needed a way to make managing all our different accounts easier. We create a standard Trek10 administrator role in our clients’ accounts that we can assume. For security we require that the role assumer have multifactor authentication enabled.'
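
The underlying mechanics are just an STS AssumeRole call with an MFA token; a rough boto3 equivalent (the ARNs are placeholders):

import boto3

sts = boto3.client("sts")
resp = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/ClientAdmin",   # placeholder role
    RoleSessionName="jm-ops",
    SerialNumber="arn:aws:iam::999999999999:mfa/jm",        # placeholder MFA device
    TokenCode="123456",                                     # current code from the MFA device
    DurationSeconds=3600,
)
creds = resp["Credentials"]

# A session scoped to the client account, usable like any other boto3 session.
client_session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(client_session.client("sts").get_caller_identity()["Arn"])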
mfa  aws  awsume  credentials  accounts  ops 
april 2016 by jm
Dan Luu reviews the Site Reliability Engineering book
voluminous! still looks great, looking forward to reading our copy (via Tony Finch)
via:fanf  books  reading  devops  ops  google  sre  dan-luu 
april 2016 by jm
s3git
git for Cloud Storage. Create distributed, decentralized and versioned repositories that scale infinitely to 100s of millions of files and PBs of storage. Huge repos can be cloned on your local SSD for making changes, committing and pushing back. Oh yeah, and it dedupes too due to BLAKE2 Tree hashing. http://s3git.org
git  ops  storage  cloud  s3  disk  aws  version-control  blake2 
april 2016 by jm
The revenge of the listening sockets
More adventures in debugging the Linux kernel:
You can't have a very large number of bound TCP sockets and we learned that the hard way. We learned a bit about the Linux networking stack: the fact that LHTABLE is fixed size and is hashed by destination port only. Once again we showed a couple of powerful SystemTap scripts.
ops  linux  networking  tcp  network  lhtable  kernel 
april 2016 by jm
Wired on the new O'Reilly SRE book
"Site Reliability Engineering: How Google Runs Production Systems", by Chris Jones, Betsy Beyer, Niall Richard Murphy, Jennifer Petoff. Go Niall!
google  sre  niall-murphy  ops  devops  oreilly  books  toread  reviews 
april 2016 by jm
Counting with domain specific databases — The Smyte Blog — Medium
whoa, pretty heavily engineered scalable counting system with Kafka, RocksDB and Kubernetes
kafka  rocksdb  kubernetes  counting  databases  storage  ops 
april 2016 by jm
A Decade Of Container Control At Google
The big thing that can be gleaned from the latest paper out of Google on its container controllers is that the shift from bare metal to containers is a profound one – something that may not be obvious to everyone seeking containers as a better way – and we think cheaper way – of doing server virtualization and driving up server utilization higher. Everything becomes application-centric rather than machine-centric, which is the nirvana that IT shops have been searching for. The workload schedulers, cluster managers, and container controllers work together to get the right capacity to the application when it needs it, whether it is a latency-sensitive job or a batch job that has some slack in it, and all that the site recovery engineers and developers care about is how the application is performing and they can easily see that because all of the APIs and metrics coming out of them collect data at the application level, not on a per-machine basis. To do this means adopting containers, period. There is no bare metal at Google, and let that be a lesson to HPC shops or other hyperscalers or cloud builders that think they need to run in bare metal mode.
google  containers  kubernetes  borg  bare-metal  ops 
april 2016 by jm
bcc
Dynamic tracing tools for Linux, a la dtrace, ktrace, etc. Built using BPF, using kernel features in the 4.x kernel series, requiring at least version 4.1 of the kernel
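
The Python front-end keeps small tools terse; the canonical hello-world-style probe (kernel symbol names vary by version, so treat this as illustrative):

from bcc import BPF

# Attach a kprobe to the clone syscall and print a trace line each time it fires.
prog = """
int kprobe__sys_clone(void *ctx) {
    bpf_trace_printk("new process via clone()\\n");
    return 0;
}
"""
BPF(text=prog).trace_print()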
linux  tracing  bpf  dynamic  ops 
april 2016 by jm
Qualys SSL Server Test
pretty sure I had this bookmarked previously, but this is the current URL -- SSL/TLS quality report
ssl  tls  security  tests  ops  tools  testing 
march 2016 by jm
Charity Majors - AWS networking, VPC, environments and you
'VPC is the future and it is awesome, and unless you have some VERY SPECIFIC AND CONVINCING reasons to do otherwise, you should be spinning up a VPC per environment with orchestration and prob doing it from CI on every code commit, almost like it’s just like, you know, code.'
networking  ops  vpc  aws  environments  stacks  terraform 
march 2016 by jm
Ruby in Production: Lessons Learned — Medium
Based on the pain we've had trying to bring our Rails services up to the quality levels required, this looks pretty accurate in many respects. I'd augment this advice by saying: avoid RVM; use Docker.
rvm  docker  ruby  production  rails  ops 
march 2016 by jm
Seesaw: scalable and robust load balancing from Google
After evaluating a number of platforms, including existing open source projects, we were unable to find one that met all of our needs and decided to set about developing a robust and scalable load balancing platform. The requirements were not exactly complex - we needed the ability to handle traffic for unicast and anycast VIPs, perform load balancing with NAT and DSR (also known as DR), and perform adequate health checks against the backends. Above all we wanted a platform that allowed for ease of management, including automated deployment of configuration changes.

One of the two existing platforms was built upon Linux LVS, which provided the necessary load balancing at the network level. This was known to work successfully and we opted to retain this for the new platform. Several design decisions were made early on in the project — the first of these was to use the Go programming language, since it provided an incredibly powerful way to implement concurrency (goroutines and channels), along with easy interprocess communication (net/rpc). The second was to implement a modular multi-process architecture. The third was to simply abort and terminate a process if we ended up in an unknown state, which would ideally allow for failover and/or self-recovery.
seesaw  load-balancers  google  load-balancing  vips  anycast  nat  lbs  go  ops  networking 
january 2016 by jm