jm + tsd   14

ASAP: Automatic Smoothing for Attention Prioritization in Streaming Time Series Visualization
Peter Bailis strikes again.

'Time series visualization of streaming telemetry (i.e., charting of
key metrics such as server load over time) is increasingly prevalent
in recent application deployments. Existing systems simply plot the
raw data streams as they arrive, potentially obscuring large-scale
deviations due to local variance and noise. We propose an alternative:
to better prioritize attention in time series exploration and
monitoring visualizations, smooth the time series as much as possible
to remove noise while still retaining large-scale structure. We
develop a new technique for automatically smoothing streaming
time series that adaptively optimizes this trade-off between noise
reduction (i.e., variance) and outlier retention (i.e., kurtosis). We
introduce metrics to quantitatively assess the quality of the choice
of smoothing parameter and provide an efficient streaming analytics
operator, ASAP, that optimizes these metrics by combining techniques
from stream processing, user interface design, and signal
processing via a novel autocorrelation-based pruning strategy and
pixel-aware preaggregation. We demonstrate that ASAP is able to
improve users’ accuracy in identifying significant deviations in time
series by up to 38.4% while reducing response times by up to 44.3%.
Moreover, ASAP delivers these results several orders of magnitude
faster than alternative optimization strategies.'
dataviz  graphs  metrics  peter-bailis  asap  smoothing  aggregation  time-series  tsd 
10 days ago by jm
Beringei: A high-performance time series storage engine | Engineering Blog | Facebook Code
Beringei is different from other in-memory systems, such as memcache, because it has been optimized for storing time series data used specifically for health and performance monitoring. We designed Beringei to have a very high write rate and a low read latency, while being as efficient as possible in using RAM to store the time series data. In the end, we created a system that can store all the performance and monitoring data generated at Facebook for the most recent 24 hours, allowing for extremely fast exploration and debugging of systems and services as we encounter issues in production.

Data compression was necessary to help reduce storage overhead. We considered several existing compression schemes and rejected the techniques that applied only to integer data, used approximation techniques, or needed to operate on the entire dataset. Beringei uses a lossless streaming compression algorithm to compress points within a time series with no additional compression used across time series. Each data point is a pair of 64-bit values representing the timestamp and value of the counter at that time. Timestamps and values are compressed separately using information about previous values. Timestamp compression uses a delta-of-delta encoding, so regular time series use very little memory to store timestamps.

From analyzing the data stored in our performance monitoring system, we discovered that the value in most time series does not change significantly when compared to its neighboring data points. Further, many data sources only store integers (despite the system supporting floating point values). Knowing this, we were able to tune previous academic work to be easier to compute by comparing the current value with the previous value using XOR, and storing the changed bits. Ultimately, this algorithm resulted in compressing the entire data set by at least 90 percent.
beringei  compression  facebook  monitoring  tsd  time-series  storage  architecture 
6 weeks ago by jm
Nobody Loves Graphite Anymore - VividCortex
Graphite has a place in our current monitoring stack, and together with StatsD will always have a special place in the hearts of DevOps practitioners everywhere, but it’s not representative of state-of-the-art in the last few years. Graphite is where the puck was in 2010. If you’re skating there, you’re missing the benefits of modern monitoring infrastructure.

The future I foresee is one where time series capabilities (the raw power needed, which I described in my time series requirements blog post, for example) are within everyone’s reach. That will be considered table stakes, whereas now it’s pretty revolutionary.

Like I've been saying -- we need Time Series As A Service! This should be undifferentiated heavy lifting.
graphite  tsd  time-series  vividcortex  statsd  ops  monitoring  metrics 
november 2015 by jm
The New InfluxDB Storage Engine: A Time Structured Merge Tree
The new engine has similarities with LSM Trees (like LevelDB and Cassandra’s underlying storage). It has a write ahead log, index files that are read only, and it occasionally performs compactions to combine index files. We’re calling it a Time Structured Merge Tree because the index files keep contiguous blocks of time and the compactions merge those blocks into larger blocks of time. Compression of the data improves as the index files are compacted. Once a shard becomes cold for writes it will be compacted into as few files as possible, which yield the best compression.
influxdb  storage  lsm-trees  leveldb  tsm-trees  data-structures  algorithms  time-series  tsd  compression 
october 2015 by jm
How We Scale VividCortex's Backend Systems - High Scalability
Excellent post from Baron Schwartz about their large-scale, 1-second-granularity time series database storage system
time-series  tsd  storage  mysql  sql  baron-schwartz  ops  performance  scalability  scaling  go 
march 2015 by jm
One year of InfluxDB and the road to 1.0
half of the [Monitorama] attendees were employees and entrepreneurs at monitoring, metrics, DevOps, and server analytics companies. Most of them had a story about how their metrics API was their key intellectual property that took them years to develop. The other half of the attendees were developers at larger organizations that were rolling their own DevOps stack from a collection of open source tools. Almost all of them were creating a “time series database” with a bunch of web services code on top of some other database or just using Graphite. When everyone is repeating the same work, it’s not key intellectual property or a differentiator, it’s a barrier to entry. Not only that, it’s something that is hindering innovation in this space since everyone has to spend their first year or two getting to the point where they can start building something real. It’s like building a web company in 1998. You have to spend millions of dollars and a year building infrastructure, racking servers, and getting everything ready before you could run the application. Monitoring and analytics applications should not be like this.
graphite  monitoring  metrics  tsd  time-series  analytics  influxdb  open-source 
february 2015 by jm
Observability at Twitter
Bit of detail into Twitter's TSD metric store.
There are separate online clusters for different data sets: application and operating system metrics, performance critical write-time aggregates, long term archives, and temporal indexes. A typical production instance of the time series database is based on four distinct Cassandra clusters, each responsible for a different dimension (real-time, historical, aggregate, index) due to different performance constraints. These clusters are amongst the largest Cassandra clusters deployed in production today and account for over 500 million individual metric writes per minute. Archival data is stored at a lower resolution for trending and long term analysis, whereas higher resolution data is periodically expired. Aggregation is generally performed at write-time to avoid extra storage operations for metrics that are expected to be immediately consumed. Indexing occurs along several dimensions–service, source, and metric names–to give users some flexibility in finding relevant data.
twitter  monitoring  metrics  service-metrics  tsd  time-series  storage  architecture  cassandra 
september 2013 by jm
Blueflood by rackerlabs
Rackspace's large-scale TSD storage system, built on Cassandra, Java, ASL2
cassandra  tsd  storage  time-series  data  open-source  java  rackspace 
september 2013 by jm
Boundary Product Update: Trends Dashboard Now Available
Boundary implement week-on-week trend display. Pity they use silly "giant number" dashboard boxes showing comparisons of the current datapoint with the previous week's datapoint; there's no indication of smoothing being applied, and "giant number" dashboards are basically useless anyway compared to a time-series graph, for unsmoothed time-series data. Also, no prediction bands. :(
boundary  time-series  tsd  prediction  metrics  smoothing  dataviz  dashboards 
april 2013 by jm
Boundary Techtalk - Large-scale OLAP with Kobayashi
Boundary on their TSD-on-Riak store.
Dietrich Featherston, Engineer at Boundary, walks through the process of designing Kobayashi, the time-series analytics database behind our network metrics. He goes through the false-starts and lessons learned in effectively using Riak as the storage layer for a large-scale OLAP database.  The system is ultimately capable of answering complex, ad-hoc queries at interactive latencies.
video  boundary  tsd  riak  eventual-consistency  storage  kobayashi  olap  time-series 
april 2013 by jm
Metric Collection and Storage with Cassandra | DataStax
DataStax' documentation on how they store TSD data in Cass. Pretty generic
datastax  nosql  metrics  analytics  cassandra  tsd  time-series  storage 
march 2013 by jm
'a D3 plugin for visualizing time series. Use Cubism to construct better realtime dashboards.' Apache-licensed; nice realtime update style; overlays multiple data sources well. I think I now have a good use-case for this
javascript  library  visualization  dataviz  tsd  data  apache  open-source 
april 2012 by jm
'Our goal is to create the world's fastest extendable, non-transactional time series database for big data (you know, for kids)! Log file indexing is our initial focus. For example append only ASCII files produced by libraries like Log4J, or containing FIX messages or JSON objects. Occursions was built by a small team sick of creating hacks to remotely copy and/or grep through tons of large log files. We use it to index around a terabyte of new log data per day. Occursions asynchronously tails log files and indexes the individual lines in each log file as each line is written to disk so you don't even have to wait for a second after an event happens to search for it. Occursions uses custom disk backed data structures to create and search its indexes so it is very efficient at using CPU, memory and disk.'
logs  search  tsd  big-data  log4j  via:proggit 
march 2012 by jm
dygraphs JavaScript Visualization Library
'an open source JavaScript library that produces produces interactive, zoomable charts of time series. It is designed to display dense data sets and enable users to explore and interpret them.' quite pretty
time-series  data  tsd  graphs  charts  javascript  via:reddit  dataviz  visualization  opensource  dygraphs  from delicious
december 2009 by jm

Copy this bookmark: