jm + monitoring   67

'Monitoring Cloudflare's planet-scale edge network with Prometheus' (preso)
from SRECon EMEA 2017; how Cloudflare are replacing Nagios with Prometheus and Grafana
metrics  monitoring  alerting  prometheus  grafana  nagios 
8 weeks ago by jm
Linux Load Averages: Solving the Mystery
Nice bit of OS archaeology by Brendan Gregg.
In 1993, a Linux engineer found a nonintuitive case with load averages, and with a three-line patch changed them forever from "CPU load averages" to what one might call "system load averages." His change included tasks in the uninterruptible state, so that load averages reflected demand for disk resources and not just CPUs. These system load averages count the number of threads working and waiting to work, and are summarized as a triplet of exponentially-damped moving sum averages that use 1, 5, and 15 minutes as constants in an equation. This triplet of numbers lets you see if load is increasing or decreasing, and their greatest value may be for relative comparisons with themselves.
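The triplet is just three exponentially-damped moving averages with different decay constants; a minimal Python sketch of the idea (sample numbers made up; the real kernel code uses fixed-point arithmetic and counts runnable plus uninterruptible tasks every 5 seconds):

```python
import math

# Decay factors for the 1, 5 and 15 minute averages, sampled every 5 seconds,
# mirroring the kernel's exp(-5s / period) constants.
SAMPLE_INTERVAL = 5.0
DECAY = {m: math.exp(-SAMPLE_INTERVAL / (m * 60)) for m in (1, 5, 15)}

loadavg = {1: 0.0, 5: 0.0, 15: 0.0}

def update(n_active):
    """Fold one sample (runnable + uninterruptible tasks) into the averages."""
    for minutes, d in DECAY.items():
        loadavg[minutes] = loadavg[minutes] * d + n_active * (1 - d)

# Hypothetical samples: a burst of demand, then idle.
for n in [4, 4, 4, 4, 0, 0, 0, 0]:
    update(n)
print({m: round(v, 2) for m, v in loadavg.items()})
```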
load  monitoring  linux  unix  performance  ops  brendan-gregg  history  cpu 
august 2017 by jm
ctop
Top for containers (ie Docker)
docker  containers  top  ops  go  monitoring  cpu 
march 2017 by jm
Beringei: A high-performance time series storage engine | Engineering Blog | Facebook Code
Beringei is different from other in-memory systems, such as memcache, because it has been optimized for storing time series data used specifically for health and performance monitoring. We designed Beringei to have a very high write rate and a low read latency, while being as efficient as possible in using RAM to store the time series data. In the end, we created a system that can store all the performance and monitoring data generated at Facebook for the most recent 24 hours, allowing for extremely fast exploration and debugging of systems and services as we encounter issues in production.

Data compression was necessary to help reduce storage overhead. We considered several existing compression schemes and rejected the techniques that applied only to integer data, used approximation techniques, or needed to operate on the entire dataset. Beringei uses a lossless streaming compression algorithm to compress points within a time series with no additional compression used across time series. Each data point is a pair of 64-bit values representing the timestamp and value of the counter at that time. Timestamps and values are compressed separately using information about previous values. Timestamp compression uses a delta-of-delta encoding, so regular time series use very little memory to store timestamps.

From analyzing the data stored in our performance monitoring system, we discovered that the value in most time series does not change significantly when compared to its neighboring data points. Further, many data sources only store integers (despite the system supporting floating point values). Knowing this, we were able to tune previous academic work to be easier to compute by comparing the current value with the previous value using XOR, and storing the changed bits. Ultimately, this algorithm resulted in compressing the entire data set by at least 90 percent.
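A simplified Python sketch of the two tricks -- delta-of-delta timestamps and XOR'd float values -- just counting how many points collapse to zero; Beringei's actual variable-length bit packing is omitted, and the series below is made up:

```python
import struct

def float_bits(x: float) -> int:
    """Reinterpret a 64-bit float as an integer so it can be XOR'd."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

# Hypothetical series: one point every 60s, values barely changing.
timestamps = [1700000000 + 60 * i for i in range(10)]
values     = [42.0, 42.0, 42.0, 42.5, 42.5, 42.5, 42.5, 43.0, 43.0, 43.0]

# Delta-of-delta: for a regular series, every entry after the first two is 0.
deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
dod    = [b - a for a, b in zip(deltas, deltas[1:])]

# XOR with the previous value: identical consecutive values XOR to 0,
# and similar values share most of their bits.
xors = [float_bits(b) ^ float_bits(a) for a, b in zip(values, values[1:])]

print("zero delta-of-deltas:", dod.count(0), "of", len(dod))
print("zero value XORs:     ", xors.count(0), "of", len(xors))
```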
beringei  compression  facebook  monitoring  tsd  time-series  storage  architecture 
february 2017 by jm
Is Pokémon GO down or not?
(a) 83.5% uptime over 24 hours. GOOD JOB

(b) excellent marketing by Datadog!
datadog  games  monitoring  pokemon-go  pokemon  uptime 
july 2016 by jm
My Philosophy on Alerting
'based on my observations while I was a Site Reliability Engineer at Google', courtesy of Rob Ewaschuk <rob@infinitepigeons.org>. Seems pretty reasonable
monitoring  sysadmin  alerting  alerts  nagios  pager  ops  sre  rob-ewaschuk 
july 2016 by jm
Raintank investing in Graphite
paying Jason Dixon to work on it, improving the backend, possibly replacing the creaky Whisper format. great news!
graphite  metrics  monitoring  ops  open-source  grafana  raintank 
july 2016 by jm
Key Metrics for Amazon Aurora | AWS Partner Network (APN) Blog
Very DataDog-oriented, but some decent tips on monitorable metrics here
datadog  metrics  aurora  aws  rds  monitoring  ops 
may 2016 by jm
CoreOS and Prometheus: Building monitoring for the next generation of cluster infrastructure
Ooh, this is a great plan. :applause:
Enabling GIFEE — Google Infrastructure for Everyone Else — is a primary mission at CoreOS, and open source is key to that goal. [....]

Prometheus was initially created to handle monitoring and alerting in modern microservice architectures. It steadily grew to fit the wider idea of cloud native infrastructure. Though it was not intentional in the original design, Prometheus and Kubernetes conveniently share the key concept of identifying entities by labels, making the semantics of monitoring Kubernetes clusters simple. As we discussed previously on this blog, Prometheus metrics formed the basis of our analysis of Kubernetes scheduler performance, and led directly to improvements in that code. Metrics are essential not just to keep systems running, but also to analyze and improve application behavior.

All things considered, Prometheus was an obvious choice for the next open source project CoreOS wanted to support and improve with internal developers committed to the code base.
monitoring  coreos  prometheus  metrics  clustering  ops  gifee  google  kubernetes 
may 2016 by jm
Observability at Twitter: technical overview, part II
Interesting to me mainly for this tidbit, which matches my own prejudices:
“Pull” vs “push” in metrics collection: At the time of our previous blog post, all our metrics were collected by “pulling” from our collection agents. We discovered two main issues:

* There is no easy way to differentiate service failures from collection agent failures. Service response timeouts and missed collection requests are both manifested as empty time series.
* There is a lack of service quality insulation in our collection pipeline. It is very difficult to set an optimal collection timeout for various services. A long collection time from a single service can cause a delay for other services that share the same collection agent.

In light of these issues, we switched our collection model from “pull” to “push” and increased our service isolation. Our collection agent on each host only collects metrics from services running on that specific host. Additionally, each collection agent sends separate collection status tracking metrics in addition to the metrics emitted by the services.

We have seen a significant improvement in collection reliability with these changes. However, as we moved to a self-service push model, it became harder to project request growth. To solve this problem, we plan to implement service quotas to address unpredictable/unbounded growth.
pull  push  metrics  tcp  stacks  monitoring  agents  twitter  fault-tolerance 
march 2016 by jm
Life360 testimonial for Prometheus
Now this is a BIG thumbs up:
'Prometheus has been known to us for a while, and we have been tracking it and reading about the active development, and at a point (a few months back) we decided to start evaluating it for production use. The PoC results were incredible. The monitoring coverage of MySQL was amazing, and we also loved the JMX monitoring for Cassandra, which had been sorely lacking in the past.'
metrics  monitoring  time-series  prometheus  testimonials  life360  cassandra  jmx  mysql 
march 2016 by jm
Heroic
Spotify wrote their own metrics store on Elasticsearch and Cassandra. Sounds very similar to Prometheus
cassandra  elasticsearch  spotify  monitoring  metrics  heroic 
december 2015 by jm
Why Percentiles Don’t Work the Way you Think
Baron Schwartz on metrics, percentiles, and aggregation. +1, although as a HN commenter noted, quantile digests are probably the better fix
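A quick made-up worked example of the core pitfall -- the average of per-minute p99s is nothing like the true p99:

```python
import random

random.seed(1)

def p99(samples):
    """Naive p99: the value 99% of the way through the sorted samples."""
    s = sorted(samples)
    return s[int(0.99 * (len(s) - 1))]

# Ten hypothetical one-minute windows of request latencies (ms); one window
# contains a slow burst that the others don't.
windows = [[random.gauss(100, 10) for _ in range(1000)] for _ in range(10)]
windows[3] = [random.gauss(400, 50) for _ in range(1000)]

per_minute_p99 = [p99(w) for w in windows]
all_samples = [x for w in windows for x in w]

print("mean of per-minute p99s:", round(sum(per_minute_p99) / 10, 1))
print("true p99 over the hour: ", round(p99(all_samples), 1))
```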
performance  percentiles  quantiles  statistics  metrics  monitoring  baron-schwartz  vividcortex 
december 2015 by jm
Floating car data
Floating car data (FCD), also known as floating cellular data, is a method to determine the traffic speed on the road network. It is based on the collection of localization data, speed, direction of travel and time information from mobile phones in vehicles that are being driven. These data are the essential source for traffic information and for most intelligent transportation systems (ITS). This means that every vehicle with an active mobile phone acts as a sensor for the road network. Based on these data, traffic congestion can be identified, travel times can be calculated, and traffic reports can be rapidly generated. In contrast to traffic cameras, number plate recognition systems, and induction loops embedded in the roadway, no additional hardware on the road network is necessary.
surveillance  cars  driving  mobile-phones  phones  travel  gsm  monitoring  anpr  alpr  traffic 
november 2015 by jm
Nobody Loves Graphite Anymore - VividCortex
Graphite has a place in our current monitoring stack, and together with StatsD will always have a special place in the hearts of DevOps practitioners everywhere, but it’s not representative of state-of-the-art in the last few years. Graphite is where the puck was in 2010. If you’re skating there, you’re missing the benefits of modern monitoring infrastructure.

The future I foresee is one where time series capabilities (the raw power needed, which I described in my time series requirements blog post, for example) are within everyone’s reach. That will be considered table stakes, whereas now it’s pretty revolutionary.


Like I've been saying -- we need Time Series As A Service! This should be undifferentiated heavy lifting.
graphite  tsd  time-series  vividcortex  statsd  ops  monitoring  metrics 
november 2015 by jm
Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region
Painful to read, but: tl;dr: monitoring oversight, followed by a transient network glitch triggering IPC timeouts, which increased load due to lack of circuit breakers, creating a cascading failure
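Not from the postmortem itself, but a minimal sketch of the circuit-breaker pattern it mentions: fail fast after repeated failures against a dependency, and only retry after a cool-off (the wrapped call is hypothetical):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    half-open (allow one trial call) after a cool-off period."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open; failing fast")
            # Cool-off elapsed: let one trial call through (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise
        self.failures = 0
        self.opened_at = None
        return result

# Usage sketch with a hypothetical flaky dependency:
# breaker = CircuitBreaker()
# breaker.call(fetch_metadata, partition_id)
```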
aws  postmortem  outages  dynamodb  ec2  post-mortems  circuit-breakers  monitoring 
september 2015 by jm
Anatomy of a Modern Production Stack
Interesting post, but I think it falls into a common trap for the xoogler or ex-Amazonian -- assuming that all the BigCo mod cons are required to operate, when some are luxuries that can be skipped for a few years to get some real products built
architecture  ops  stack  docker  containerization  deployment  containers  rkt  coreos  prod  monitoring  xooglers 
september 2015 by jm
Implementing Efficient and Reliable Producers with the Amazon Kinesis Producer Library - AWS Big Data Blog
Good advice on production-quality, decent-scale usage of Kinesis in Java with the official library: batching, retries, partial failures, backoff, and monitoring. (Also, jaysus, the AWS Cloudwatch API is awful, looking at this!)
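The KPL does this internally, but as a sketch of the retry-with-capped-backoff-and-jitter idea for anyone rolling their own producer (the put_record callable is hypothetical):

```python
import random
import time

def send_with_backoff(put_record, record, max_attempts=5, base=0.1, cap=5.0):
    """Retry a put with capped exponential backoff plus full jitter."""
    for attempt in range(max_attempts):
        try:
            return put_record(record)          # hypothetical put call
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
```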
kpl  aws  kinesis  tips  java  batching  streaming  production  cloudwatch  monitoring  coding 
august 2015 by jm
Introducing Nurse: Auto-Remediation at LinkedIn
Interesting to hear about auto-remediation in prod -- we built a (very targeted) auto-remediation system in Amazon on the Network Monitoring team, but this is much bigger in focus
nurse  auto-remediation  outages  linkedin  ops  monitoring 
august 2015 by jm
sjk
a command line tool for JVM diagnostic troubleshooting and profiling.
java  jvm  monitoring  commandline  jmx  sjk  tools  ops 
june 2015 by jm
Internet Scale Services Checklist
good aspirational checklist, inspired heavily by James Hamilton's seminal 2007 paper, "On Designing And Deploying Internet-Scale Services"
james-hamilton  checklists  ops  internet-scale  architecture  operability  monitoring  reliability  availability  uptime  aspirations 
april 2015 by jm
Introducing Vector: Netflix's On-Host Performance Monitoring Tool
It gives pinpoint real-time performance metric visibility to engineers working on specific hosts -- basically sending back system-level performance data to their browser, where a client-side renderer turns it into a usable dashboard. Essentially the idea is to replace having to ssh onto instances and run top, sysstat, iostat, and so on.
vector  netflix  performance  monitoring  sysstat  top  iostat  netstat  metrics  ops  dashboards  real-time  linux 
april 2015 by jm
Outages, PostMortems, and Human Error 101
Good basic pres from John Allspaw, covering the basics of tier-one tech incident response -- defining the 5 severity levels; root cause analysis techniques (to Five-Whys or not); and the importance of service metrics
devops  monitoring  ops  five-whys  allspaw  slides  etsy  codeascraft  incident-response  incidents  severity  root-cause  postmortems  outages  reliability  techops  tier-one-support 
april 2015 by jm
Heka
an open source stream processing software system developed by Mozilla. Heka is a “Swiss Army Knife” type tool for data processing, useful for a wide variety of different tasks, such as:

Loading and parsing log files from a file system.
Accepting statsd type metrics data for aggregation and forwarding to upstream time series data stores such as graphite or InfluxDB.
Launching external processes to gather operational data from the local system.
Performing real time analysis, graphing, and anomaly detection on any data flowing through the Heka pipeline.
Shipping data from one location to another via the use of an external transport (such as AMQP) or directly (via TCP).
Delivering processed data to one or more persistent data stores.


Via feylya on twitter. Looks potentially nifty
heka  mozilla  monitoring  metrics  via:feylya  ops  statsd  graphite  stream-processing 
march 2015 by jm
One year of InfluxDB and the road to 1.0
half of the [Monitorama] attendees were employees and entrepreneurs at monitoring, metrics, DevOps, and server analytics companies. Most of them had a story about how their metrics API was their key intellectual property that took them years to develop. The other half of the attendees were developers at larger organizations that were rolling their own DevOps stack from a collection of open source tools. Almost all of them were creating a “time series database” with a bunch of web services code on top of some other database or just using Graphite. When everyone is repeating the same work, it’s not key intellectual property or a differentiator, it’s a barrier to entry. Not only that, it’s something that is hindering innovation in this space since everyone has to spend their first year or two getting to the point where they can start building something real. It’s like building a web company in 1998. You have to spend millions of dollars and a year building infrastructure, racking servers, and getting everything ready before you could run the application. Monitoring and analytics applications should not be like this.
graphite  monitoring  metrics  tsd  time-series  analytics  influxdb  open-source 
february 2015 by jm
Backstage Blog - Prometheus: Monitoring at SoundCloud - SoundCloud Developers
whoa, this is pretty excellent. The major improvement over a graphite-based system would be the multi-dimensional tagging of metrics, which we currently have to do by expanding the graphite metric's name to encompass all those dimensions and then searching at query time, inefficiently.
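A tiny illustration of the difference -- dimensions flattened into a dotted graphite name versus Prometheus's first-class labels; the metric and label names here are made up:

```python
# Graphite: every dimension gets baked into the metric name, and slicing by
# one dimension means wildcard-searching the name at query time.
dims = {"region": "eu-west-1", "host": "web042", "handler": "api"}
graphite_name = "myapp.{region}.{host}.{handler}.requests.count".format(**dims)
print(graphite_name)   # myapp.eu-west-1.web042.api.requests.count

# Prometheus exposition format: the same dimensions as first-class labels,
# so queries can aggregate or filter on any of them directly.
labels = ",".join('%s="%s"' % kv for kv in sorted(dims.items()))
print("http_requests_total{%s} 1027" % labels)
```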
monitoring  soundcloud  prometheus  metrics  service-metrics  graphite  alerting 
february 2015 by jm
StatusPage.io
'Hosted Status Pages for Your Company'. We use these guys in $work, and their service is fantastic -- it's a line of javascript in the page template which will easily allow you to add a "service degraded" banner when things go pear-shaped, along with an external status site for when things get really messy. They've done a good clean job.
monitoring  server  status  outages  uptime  saas  infrastructure 
november 2014 by jm
Applying cardiac alarm management techniques to your on-call
An ops-focused take on a recent story about alarm fatigue, and how a Boston hospital dealt with it. When I was in Amazon, many of the teams in our division had a target to reduce false positive pages, with a definite monetary value attached to it, since many teams had "time off in lieu" payments for out-of-hours pages to the on-call staff. As a result, reducing false-positive pages was reasonably high priority and we dealt with this problem very proactively, with a well-developed sense of how to do so. It's interesting to see how the outside world is only just starting to look into its amelioration. (Another benefit of a TOIL policy ;)
ops  monitoring  sysadmin  alerts  alarms  nagios  alarm-fatigue  false-positives  pages 
september 2014 by jm
Profiling Hadoop jobs with Riemann
I’ve built a very simple distributed profiler for soft-real-time telemetry from hundreds to thousands of JVMs concurrently. It’s nowhere near as comprehensive in its analysis as, say, Yourkit, but it can tell you, across a distributed system, which functions are taking the most time, and what their dominant callers are.


Potentially useful.
riemann  profiling  aphyr  hadoop  emr  performance  monitoring 
august 2014 by jm
REST Commander: Scalable Web Server Management and Monitoring
We dynamically monitor and manage a large and rapidly growing number of web servers deployed on our infrastructure and systems. However, existing tools present major challenges when making REST/SOAP calls with server-specific requests to a large number of web servers, and then performing aggregated analysis on the responses. We therefore developed REST Commander, a parallel asynchronous HTTP client as a service to monitor and manage web servers. REST Commander on a single server can send requests to thousands of servers with response aggregation in a matter of seconds. And yes, it is open-sourced at http://www.restcommander.com.

Feature highlights:

Click-to-run with zero installation;
Generic HTTP request template supporting variable-based replacement for sending server-specific requests;
Ability to send the same request to different servers, different requests to different servers, and different requests to the same server;
Maximum concurrency control (throttling) to accommodate server capacity;
Commander itself is also “as a service”: with its powerful REST API, you can define ad-hoc target servers, an HTTP request template, variable replacement, and a regular expression all in a single call. In addition, intuitive step-by-step wizards help you achieve the same functionality through a GUI.
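REST Commander is its own service, but the core pattern -- fan a request out to many servers with bounded concurrency and aggregate the responses -- is easy to sketch in Python (hostnames and status path are placeholders):

```python
import concurrent.futures
import urllib.request

def check(host, path="/status", timeout=5):
    """Fetch one server's status endpoint and return (host, result)."""
    try:
        with urllib.request.urlopen("http://%s%s" % (host, path), timeout=timeout) as r:
            return host, r.status
    except Exception as e:
        return host, repr(e)

hosts = ["web%03d.example.com" % i for i in range(1, 201)]

# Throttle to 50 in-flight requests so we don't overwhelm the fleet (or ourselves).
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    results = dict(pool.map(check, hosts))

failures = {h: r for h, r in results.items() if r != 200}
print("%d/%d hosts unhealthy" % (len(failures), len(hosts)))
```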
rest  http  clients  load-testing  ebay  soap  async  testing  monitoring 
july 2014 by jm
mariusaeriksen/heapster
Heapster provides an agent library to do heap profiling for JVM processes with output compatible with Google perftools. The goal of Heapster is to be able to do meaningful (sampled) heap profiling in a production setting.


Used by Twitter in production, apparently.
heap  monitoring  memory  jvm  java  performance 
july 2014 by jm
Metrics-Driven Development
we believe MDD is equal parts engineering technique and cultural process. It separates the notion of monitoring from its traditional position of exclusivity as an operations thing and places it more appropriately next to its peers as an engineering process. Provided access to real-time production metrics relevant to them individually, both software engineers and operations engineers can validate hypotheses, assess problems, implement solutions, and improve future designs.


Broken down into the following principles: 'Instrumentation-as-Code', 'Single Source of Truth', 'Developers Curate Visualizations and Alerts', 'Alert on What You See', 'Show me the Graph', 'Don’t Measure Everything (YAGNI)'.

We do all of these at Swrve, naturally (a technique I happily stole from Amazon).
metrics  coding  graphite  mdd  instrumentation  yagni  alerting  monitoring  graphs 
july 2014 by jm
Boundary's new server monitoring free offering
'High resolution, 1 second intervals for all metrics; Fluid analytics, drag any graph to any point in time; Smart alarms to cut down on false positives; Embedded graphs and customizable dashboards; Up to 10 servers for free'

Pre-registration is open now. Could be interesting, although the limit of 10 machines is pretty small for any production usage
boundary  monitoring  network  ops  metrics  alarms  tcp  ip  netstat 
july 2014 by jm
interview with Google VP of SRE Ben Treynor
interviewed by Niall Murphy, no less ;). Some good info on what Google deems important from an ops/SRE perspective
sre  ops  devops  google  monitoring  interviews  ben-treynor 
may 2014 by jm
Dead Man's Snitch
a cron job monitoring tool that keeps an eye on your periodic processes and notifies you when something doesn't happen. Daily backups, monthly emails, or cron jobs you need to monitor? Dead Man's Snitch has you covered. Know immediately when one of these processes doesn't work.


via Marc.
alerts  cron  monitoring  sysadmin  ops  backups  alarms 
april 2014 by jm
Traffic Graph – Google Transparency Report
this is cool. Google are exposing an aggregated 'all services' hit count time-series graph, broken down by country, as part of their Transparency Report pages
transparency  filtering  web  google  http  graphs  monitoring  syria 
february 2014 by jm
10 Things We Forgot to Monitor
a list of not-so-common outage causes which are easy to overlook; swap rate, NTP drift, SSL expiration, fork rate, etc.
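For instance, the SSL-expiration item from that list boils down to a check like this (hostname is a placeholder):

```python
import socket
import ssl
import time

def days_until_cert_expiry(host, port=443, timeout=5):
    """Connect, pull the peer certificate, and return days until 'notAfter'."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - time.time()) / 86400

# e.g. alert if days_until_cert_expiry("www.example.com") < 21
```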
nagios  metrics  ops  monitoring  systems  ntp  bitly 
january 2014 by jm
Twitter tech talk video: "Profiling Java In Production"
In this talk Kaushik Srenevasan describes a new, low overhead, full-stack tool (based on the Linux perf profiler and infrastructure built into the Hotspot JVM) we've built at Twitter to solve the problem of dynamically profiling and tracing the behavior of applications (including managed runtimes) in production.


Looks very interesting. Haven't watched it yet though
twitter  tech-talks  video  presentations  java  jvm  profiling  testing  monitoring  service-metrics  performance  production  hotspot  perf 
december 2013 by jm
"What Should I Monitor?"
slides (lots of slides) from Baron Schwartz' talk at Velocity in NYC.
slides  monitoring  metrics  ops  devops  baron-schwartz  pdf  capacity 
october 2013 by jm
Observability at Twitter
Bit of detail into Twitter's TSD metric store.
There are separate online clusters for different data sets: application and operating system metrics, performance critical write-time aggregates, long term archives, and temporal indexes. A typical production instance of the time series database is based on four distinct Cassandra clusters, each responsible for a different dimension (real-time, historical, aggregate, index) due to different performance constraints. These clusters are amongst the largest Cassandra clusters deployed in production today and account for over 500 million individual metric writes per minute. Archival data is stored at a lower resolution for trending and long term analysis, whereas higher resolution data is periodically expired. Aggregation is generally performed at write-time to avoid extra storage operations for metrics that are expected to be immediately consumed. Indexing occurs along several dimensions–service, source, and metric names–to give users some flexibility in finding relevant data.
twitter  monitoring  metrics  service-metrics  tsd  time-series  storage  architecture  cassandra 
september 2013 by jm
DevOps Eye for the Coding Guy: Metrics
a pretty good description of the process of adding service metrics to a Django webapp using graphite and statsd. Bookmarking mainly for the great real-time graphing hack at the end...
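The statsd wire protocol underneath is just plain text over UDP, so the instrumentation side is tiny; a sketch with hypothetical host and metric names:

```python
import socket
import time

STATSD = ("statsd.example.com", 8125)   # hypothetical statsd host
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def incr(metric, n=1):
    """statsd counter: '<name>:<value>|c'."""
    sock.sendto(("%s:%d|c" % (metric, n)).encode(), STATSD)

def timing(metric, ms):
    """statsd timer: '<name>:<value>|ms'."""
    sock.sendto(("%s:%d|ms" % (metric, int(ms))).encode(), STATSD)

# e.g. inside a Django view:
# start = time.time()
# ...handle the request...
# incr("web.signup.hits")
# timing("web.signup.latency", (time.time() - start) * 1000)
```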
statsd  django  monitoring  metrics  python  graphite 
september 2013 by jm
Project Voldemort: measuring BDB space consumption
HOWTO measure this using the BDB-JE command line tools. this is exposed through JMX as the CleanerBacklog metric, too, I think, but good to bookmark just in case
voldemort  cleaner  bdb  ops  space  storage  monitoring  debug 
june 2013 by jm
Paper: "Root Cause Detection in a Service-Oriented Architecture" [pdf]
LinkedIn have implemented an automated root-cause detection system:

This paper introduces MonitorRank, an algorithm that can reduce the time, domain knowledge, and human effort required to find the root causes of anomalies in such service-oriented architectures. In the event of an anomaly, MonitorRank provides a ranked order list of possible root causes for monitoring teams to investigate. MonitorRank uses the historical and current time-series metrics of each sensor as its input, along with the call graph generated between sensors to build an unsupervised model for ranking. Experiments on real production outage data from LinkedIn, one of the largest online social networks, shows a 26% to 51% improvement in mean average precision in finding root causes compared to baseline and current state-of-the-art methods.


This is a topic close to my heart after working on something similar for 3 years in Amazon!

Looks interesting, although (a) I would have liked to see more case studies and examples of "real world" outages it helped with; and (b) it's very much a machine-learning paper rather than a systems one, and there is no discussion of fault tolerance in the design of the detection system, which would leave me worried that in the case of a large-scale outage event, the system itself will disappear when its help is most vital. (This was a major design influence on our team's work.)

Overall, particularly given those 2 issues, I suspect it's not in production yet. Ours certainly was ;)
linkedin  soa  root-cause  alarming  correlation  service-metrics  machine-learning  graphs  monitoring 
june 2013 by jm
Introducing Kale « Code as Craft
Etsy have implemented a tool to perform auto-correlation of service metrics, and detection of deviation from historic norms:
at Etsy, we really love to make graphs. We graph everything! Anywhere we can slap a StatsD call, we do. As a result, we’ve found ourselves with over a quarter million distinct metrics. That’s far too many graphs for a team of 150 engineers to watch all day long! And even if you group metrics into dashboards, that’s still an awful lot of dashboards if you want complete coverage. Of course, if a graph isn’t being watched, it might misbehave and no one would know about it. And even if someone caught it, lots of other graphs might be misbehaving in similar ways, and chances are low that folks would make the connection.

We’d like to introduce you to the Kale stack, which is our attempt to fix both of these problems. It consists of two parts: Skyline and Oculus. We first use Skyline to detect anomalous metrics. Then, we search for that metric in Oculus, to see if any other metrics look similar. At that point, we can make an informed diagnosis and hopefully fix the problem.


It'll be interesting to see if they can get this working well. I've found it can be tricky to get working with low false positives, without massive volume to "smooth out" spikes caused by normal activity. Amazon had one particularly successful version driving severity-1 order drop alarms, but it used massive event volumes and still had periodic false positives. Skyline looks like it will alarm on a single anomalous data point, and in the comments Abe notes "our algorithms err on the side of noise and so alerting would be very noisy."
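Not Skyline's actual algorithms, but a toy version of the trade-off being discussed: a plain three-sigma check fires on a single odd data point, whereas requiring several consecutive breaches trades detection delay for fewer false positives.

```python
from statistics import mean, stdev

def breaches(series, window=60, sigmas=3.0):
    """Yield indexes where a point sits more than `sigmas` deviations away
    from the mean of the preceding `window` points."""
    for i in range(window, len(series)):
        hist = series[i - window:i]
        m, s = mean(hist), stdev(hist)
        if s and abs(series[i] - m) > sigmas * s:
            yield i

def alarm_index(series, consecutive=3, **kw):
    """Return the index where an alarm would fire: the first point ending a
    run of `consecutive` breaches in a row, or None if it never happens."""
    run = 0
    breach_set = set(breaches(series, **kw))
    for i in range(len(series)):
        run = run + 1 if i in breach_set else 0
        if run >= consecutive:
            return i
    return None
```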
etsy  monitoring  service-metrics  alarming  deviation  correlation  data  search  graphs  oculus  skyline  kale  false-positives 
june 2013 by jm
graphite-metrics
metric collectors for various stuff not (or poorly) handled by other monitoring daemons

Core of the project is a simple daemon (harvestd), which collects metric values and sends them to graphite carbon daemon (and/or other configured destinations) once per interval. Includes separate data collection components ("collectors") for processing of:

/proc/slabinfo for useful-to-watch values, not everything (configurable).
/proc/vmstat and /proc/meminfo in a consistent way.
/proc/stat for irq, softirq, forks.
/proc/buddyinfo and /proc/pagetypeinfo (memory fragmentation).
/proc/interrupts and /proc/softirqs.
Cron log to produce start/finish events and duration for each job into a separate metrics, adapts jobs to metric names with regexes.
Per-system-service accounting using systemd and its cgroups.
sysstat data from sadc logs (use something like sadc -F -L -S DISK -S XDISK -S POWER 60 to have more stuff logged there) via sadf binary and its json export (sadf -j, supported since sysstat-10.0.something, iirc).
iptables rule "hits" packet and byte counters, taken from ip{,6}tables-save, mapped via separate "table chain_name rule_no metric_name" file, which should be generated along with firewall rules (I use this script to do that).


Pretty exhaustive list of system metrics -- could have some interesting ideas for Linux OS-level metrics to monitor in future.
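The carbon end of this is just the plaintext protocol -- 'name value timestamp' lines over TCP, port 2003 by default -- so a minimal meminfo collector is only a few lines (carbon host is a placeholder):

```python
import socket
import time

CARBON = ("carbon.example.com", 2003)
PREFIX = "servers.%s.meminfo" % socket.gethostname().split(".")[0]

def read_meminfo():
    """Parse /proc/meminfo lines like 'MemFree:  123456 kB' into a dict of kB values."""
    out = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            out[key] = int(rest.split()[0])
    return out

def send_once():
    now = int(time.time())
    lines = ["%s.%s %d %d" % (PREFIX, k, v, now) for k, v in read_meminfo().items()]
    with socket.create_connection(CARBON, timeout=5) as s:
        s.sendall(("\n".join(lines) + "\n").encode())

# run send_once() from cron or a loop, once per interval
```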
graphite  monitoring  metrics  unix  linux  ops  vm  iptables  sysadmin 
june 2013 by jm
My Philosophy on Alerting
'based on my observations while I was a Site Reliability Engineer at Google.' - by Rob Ewaschuk; very good, and matching the similar recommendations and best practices at Amazon for that matter
monitoring  ops  devops  alerting  alerts  pager-duty  via:jk 
may 2013 by jm
Monitoring the Status of Your EBS Volumes
Page in the AWS docs which describes their derived metrics and how they are computed -- these are visible in the AWS Management Console, and alarmable, but not viewable in the Cloudwatch UI. grr. (page-joshea!)
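The derived status checks aside, the regular per-volume CloudWatch metrics are at least scriptable; a hedged boto3 sketch (region, volume ID and metric choice are placeholders):

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.datetime.utcnow()
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EBS",
    MetricName="VolumeQueueLength",          # placeholder metric choice
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```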
ebs  aws  monitoring  metrics  ops  documentation  cloudwatch 
may 2013 by jm
Measure Anything, Measure Everything « Code as Craft
the classic Etsy pro-metrics "measure everything" post. Some good basic rules and mindset
etsy  monitoring  metrics  stats  ops  devops 
april 2013 by jm
paperplanes. Monitoring for Humans
A good contemplation of the state of ops monitoring, post-#monitorama. At one point, he contemplates the concept of automated anomaly detection:
This leads to another interesting question: if I need to create activity to measure it, and if my monitoring system requires me to generate this activity to be able to put a graph and an alert on it, isn't my monitoring system wrong? Are all the monitoring systems wrong? [...]

We spend an eternity looking at graphs, right after an alert was triggered because a certain threshold was crossed. Does that alert even mean anything, is it important right now? It's where a human operator still has to decide if it's worth the trouble or if they should just ignore the alert. As much as I enjoy staring at graphs, I'd much rather do something more important than that.

I'd love for my monitoring system to be able to tell me that something out of the ordinary is currently happening. It has all the information at hand to make that decision at least with a reasonable probability.


I like the concept of Holt-Winters-style forecasting and confidence bands etc., but my experience is that anomalies often aren't sufficiently bad news -- ie. when an anomalous event occurs, it may not indicate an outage. Anomaly detection is hard to turn into a reliable alarm. Having said that, I have seen it done (and indeed our team has done it!) where there is sufficiently massive volume to smooth out the "normal" anomalies, and leave real signs of impact.

Still, this is something that Baron Schwartz (ex-Percona) has been talking about too, so there are some pretty smart people thinking about it and it has a bright future.
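A much simpler cousin of Holt-Winters, to make the confidence-band idea concrete: track an exponentially weighted mean and deviation, and flag points that escape the band.

```python
def ewma_band(series, alpha=0.1, width=3.0):
    """Yield (value, lower, upper, anomalous) for each point, using an
    exponentially weighted mean and mean absolute deviation as the band."""
    mean = series[0]
    dev = 0.0
    for x in series:
        lower, upper = mean - width * dev, mean + width * dev
        anomalous = dev > 0 and not (lower <= x <= upper)
        yield x, lower, upper, anomalous
        # Update the smoothed estimates after judging the point.
        mean = alpha * x + (1 - alpha) * mean
        dev = alpha * abs(x - mean) + (1 - alpha) * dev
```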
monitoring  networks  holt-winters  forecasting  confidence-bands  anomaly-detection  ops  monitorama  baron-schwartz  false-positives 
march 2013 by jm
The first pillar of agile sysadmin: We alert on what we draw
'One of [the] purposes of monitoring systems was to provide data to allow us, as engineers, to detect patterns, and predict issues before they become production impacting. In order to do this, we need to be capturing data and storing it somewhere which allows us to analyse it. If we care about it - if the data could provide the kind of engineering insight which helps us to understand our systems and give early warning - we should be capturing it. ' .... 'There are a couple of weaknesses in [Nagios' design]. Assuming we’ve agreed that if we care about a metric enough to want to alert on it then we should be gathering that data for analysis, and graphing it, then we already have the data upon which to base our check. Furthermore, this data is not on the machine we’re monitoring, so our checks don’t in any way add further stress to that machine.' I would add that if we are alerting on a different set of data from what we collect for graphing, then using the graphs to investigate an alarm may run into problems if they don't sync up.
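In graphite terms that means the check queries the same render API the dashboards use, instead of poking the host again; a sketch with placeholder host and metric:

```python
import json
import urllib.request

GRAPHITE = "http://graphite.example.com"

def latest(target, window="-10min"):
    """Fetch a metric from graphite's render API and return the most recent
    non-null datapoint, i.e. exactly what the graph would show."""
    url = "%s/render?target=%s&from=%s&format=json" % (GRAPHITE, target, window)
    with urllib.request.urlopen(url, timeout=10) as r:
        series = json.load(r)
    points = [v for v, ts in series[0]["datapoints"] if v is not None]
    return points[-1] if points else None

# e.g. page if latest("stats.web.5xx_rate") > 5
```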
devops  monitoring  deployment  production  sysadmin  ops  alerting  metrics 
march 2013 by jm
Monitoring Apache Hadoop, Cassandra and Zookeeper using Graphite and JMXTrans
nice enough, but a lot of moving parts. It would be nice to see a simpler ZK+Graphite setup using the 'mntr' verb
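For the ZooKeeper side, 'mntr' already returns tab-separated key/value pairs over the client port, so a minimal bridge is roughly the following (hosts are placeholders; newer ZooKeeper versions need mntr added to the four-letter-word whitelist):

```python
import socket
import time

def zk_mntr(host, port=2181, timeout=5):
    """Send the 'mntr' four-letter word and parse the tab-separated reply."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(b"mntr")
        data = b""
        while chunk := s.recv(4096):
            data += chunk
    stats = {}
    for line in data.decode().splitlines():
        key, _, value = line.partition("\t")
        if value:
            stats[key] = value
    return stats

def to_carbon_lines(host, stats, prefix="zookeeper"):
    """Format the numeric mntr values as carbon plaintext-protocol lines."""
    now = int(time.time())
    return ["%s.%s.%s %s %d" % (prefix, host.split(".")[0], k, v, now)
            for k, v in stats.items() if v.replace(".", "", 1).isdigit()]

# e.g. send to_carbon_lines("zk1.example.com", zk_mntr("zk1.example.com"))
#      to carbon's plaintext port (2003)
```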
graphite  monitoring  ops  zookeeper  cassandra  hadoop  jmx  jmxtrans  graphs 
march 2013 by jm
Heroku finds out that distributed queueing is hard
Stage 3 of the Rap Genius/Heroku blog drama. Summary (as far as I can tell): Heroku gave up on a fully-synchronised load-balancing setup ("intelligent routing"), since it didn't scale, in favour of randomised queue selection; they didn't sufficiently inform their customers, and metrics and docs were not updated to make this change public; the pessimal case became pretty damn pessimal; a customer eventually noticed and complained publicly, creating a public shit-storm.

Comments: 1. this is why you monitor real HTTP request latency (scroll down for crazy graphs!). 2. include 90/99 percentiles to catch the "tail" of poorly-performing requests. 3. Load balancers are hard.

http://aphyr.com/posts/277-timelike-a-network-simulator has more info on the intricacies of distributed load balancing -- worth a read.
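A toy simulation in the spirit of the timelike post (not Heroku's actual router) shows how badly random assignment inflates queueing delay once a few requests are slow:

```python
import random

def simulate(route, n_requests=50000, n_dynos=50, arrival_rate=200.0, seed=42):
    """Assign Poisson arrivals to FIFO dynos and return sorted queueing delays.
    `route` picks a dyno index given the times at which each dyno becomes free."""
    rng = random.Random(seed)
    free_at = [0.0] * n_dynos
    t, waits = 0.0, []
    for _ in range(n_requests):
        t += rng.expovariate(arrival_rate)
        # Mostly fast requests, with the occasional very slow one (the killer case).
        service = 0.05 if rng.random() > 0.02 else 3.0
        d = route(free_at, rng)
        start = max(t, free_at[d])
        waits.append(start - t)            # time spent queued behind other requests
        free_at[d] = start + service
    return sorted(waits)

random_routing      = lambda free_at, rng: rng.randrange(len(free_at))
intelligent_routing = lambda free_at, rng: free_at.index(min(free_at))

for name, route in [("random", random_routing), ("least-busy", intelligent_routing)]:
    w = simulate(route)
    print("%-10s mean wait=%.3fs  p99 wait=%.2fs" % (
        name, sum(w) / len(w), w[int(0.99 * len(w))]))
```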
heroku  rap-genius  via:hn  networking  distcomp  distributed  load-balancing  ip  queueing  percentiles  monitoring 
february 2013 by jm
Passively Monitoring Network Round-Trip Times - Boundary
'how Boundary uses [TCP timestamps] to calculate round-trip times (RTTs) between any two hosts by passively monitoring TCP traffic flows, i.e., without actively launching ICMP echo requests (pings). The post is primarily an overview of this one aspect of TCP monitoring, it also outlines the mechanism we are using, and demonstrates its correctness.'
tcp  boundary  monitoring  network  ip  passive-monitoring  rtt  timestamping 
february 2013 by jm
paperplanes. The Virtues of Monitoring, Redux
A rather vague and touchy-feely "state of the union" post on monitoring. Good set of links at the end, though; I like the look of Sensu and Tasseo, but am still unconvinced about the value of Boundary's offering
monitoring  metrics  ops 
january 2013 by jm
Java tip: How to get CPU, system, and user time for benchmarking
a neat MXBean trick to get per-thread CPU usage in a running JVM (via Tatu Saloranta)
java  jvm  monitoring  cpu  metrics  threads 
november 2012 by jm
Facebook monitoring cache with Claspin
reasonably nice heatmap viz for large-scale instance monitoring. I like the "snake" pattern for racks
facebook  monitoring  dataviz  heatmaps  claspin  cache  memcached  ui 
september 2012 by jm
[tahoe-dev] erasure coding makes files more fragile, not less
Zooko says: "This monitoring and operations engineering is a lot of work!" amen to that
erasure-coding  replicas  fs  tahoe-lafs  zooko  monitoring  devops  ops 
march 2012 by jm
DTrace and Erlang
from Basho, via istvan. DTrace is becoming more compelling as a deep instrumentation/monitoring API -- I didn't realise disabled DTrace probes were virtually 0-overhead (a "2 NOOP instruction placeholder", apparently), that's nifty. Wonder if they've fixed the licensing mess, though
dtrace  monitoring  instrumentation  debugging  tracing  unix  erlang  via:istvan 
november 2011 by jm
Etsy's metrics infrastructure
I never really understood how useful a good metrics infrastructure could be for operational visibility until I joined Amazon. Here's a good demo of Etsy's metrics system (via Nelson)
via:nelson  metrics  deployment  change-monitoring  etsy  software  monitoring  ops  from delicious
december 2010 by jm
ioprofile
wraps strace(1) to summarise and aggregate I/O ops performed by a Linux process. looks pretty nifty (via Jeremy Zawodny)
via:jzawodny  io  strace  linux  monitoring  debugging  performance  profiling  sysadmin  ioprofile  unix  tools  from delicious
october 2010 by jm
Deployment is just a part of dev/ops cooperation, not the whole thing
metrics, monitoring, instrumentation, fault tolerance, load mitigation called out as other factors by Allspaw
ops  deployment  operations  engineering  metrics  devops  monitoring  fault-tolerance  load  from delicious
december 2009 by jm
glTail.rb - realtime logfile visualization
'View real-time data and statistics from any logfile on any server with SSH, in an intuitive and entertaining way', supporting postfix/spamd/clamd logs among loads of others. very cool if a little silly
dataviz  visualization  tail  gltail  opengl  linux  apache  spamd  spamassassin  logs  statistics  sysadmin  analytics  animation  analysis  server  ruby  monitoring  logging  logfiles 
july 2009 by jm
