When Boring is Awesome: Building a scalable time-series database on PostgreSQL
april 2017 by jm
Nice. We built something along these lines atop MySQL before -- partitioning by timestamp is the key. (via Nelson)
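For anyone curious what that looks like in practice, here's a minimal sketch of the inheritance-based approach current as of this writing (PostgreSQL 10's declarative partitioning landed later); the table and column names (metrics, ts) are made up for illustration, not taken from the article:

```python
# Generate DDL for one monthly child partition of a hypothetical
# "metrics" table. The CHECK constraint on "ts" is what lets the
# planner skip partitions whose time range can't match a query.
from datetime import date

def month_partition_ddl(parent: str, month: date) -> str:
    nxt = date(month.year + (month.month == 12), month.month % 12 + 1, 1)
    child = f"{parent}_{month:%Y_%m}"
    return (
        f"CREATE TABLE {child} (\n"
        f"    CHECK (ts >= '{month}' AND ts < '{nxt}')\n"
        f") INHERITS ({parent});\n"
        f"CREATE INDEX ON {child} (ts);"
    )

print(month_partition_ddl("metrics", date(2017, 4, 1)))
```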
database
postgresql
postgres
timeseries
tsd
storage
state
via:nelson
ASAP: Automatic Smoothing for Attention Prioritization in Streaming Time Series Visualization
march 2017 by jm
Peter Bailis strikes again.
'Time series visualization of streaming telemetry (i.e., charting of
key metrics such as server load over time) is increasingly prevalent
in recent application deployments. Existing systems simply plot the
raw data streams as they arrive, potentially obscuring large-scale
deviations due to local variance and noise. We propose an alternative:
to better prioritize attention in time series exploration and
monitoring visualizations, smooth the time series as much as possible
to remove noise while still retaining large-scale structure. We
develop a new technique for automatically smoothing streaming
time series that adaptively optimizes this trade-off between noise
reduction (i.e., variance) and outlier retention (i.e., kurtosis). We
introduce metrics to quantitatively assess the quality of the choice
of smoothing parameter and provide an efficient streaming analytics
operator, ASAP, that optimizes these metrics by combining techniques
from stream processing, user interface design, and signal
processing via a novel autocorrelation-based pruning strategy and
pixel-aware preaggregation. We demonstrate that ASAP is able to
improve users’ accuracy in identifying significant deviations in time
series by up to 38.4% while reducing response times by up to 44.3%.
Moreover, ASAP delivers these results several orders of magnitude
faster than alternative optimization strategies.'
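The core trade-off is easy to render as a brute-force sketch: minimize roughness subject to keeping kurtosis at least that of the raw series. This skips the paper's actual contributions (the autocorrelation-based pruning and pixel-aware preaggregation); the window search here is naive.

```python
# Naive ASAP-flavoured window selection: among candidate moving-average
# windows, pick the one minimizing roughness (std of first differences)
# while preserving kurtosis, i.e. not smoothing away the outliers a
# viewer should still see.
import numpy as np
from scipy.stats import kurtosis

def sma(x: np.ndarray, w: int) -> np.ndarray:
    return np.convolve(x, np.ones(w) / w, mode="valid")

def pick_window(x: np.ndarray, max_w: int = 50) -> int:
    k0 = kurtosis(x)
    best_w, best_rough = 1, np.std(np.diff(x))
    for w in range(2, max_w + 1):
        s = sma(x, w)
        if kurtosis(s) >= k0 and np.std(np.diff(s)) < best_rough:
            best_w, best_rough = w, np.std(np.diff(s))
    return best_w
```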
dataviz
graphs
metrics
peter-bailis
asap
smoothing
aggregation
time-series
tsd
Beringei: A high-performance time series storage engine | Engineering Blog | Facebook Code
beringei
compression
facebook
monitoring
tsd
time-series
storage
architecture
february 2017 by jm
Beringei is different from other in-memory systems, such as memcache, because it has been optimized for storing time series data used specifically for health and performance monitoring. We designed Beringei to have a very high write rate and a low read latency, while being as efficient as possible in using RAM to store the time series data. In the end, we created a system that can store all the performance and monitoring data generated at Facebook for the most recent 24 hours, allowing for extremely fast exploration and debugging of systems and services as we encounter issues in production.
Data compression was necessary to help reduce storage overhead. We considered several existing compression schemes and rejected the techniques that applied only to integer data, used approximation techniques, or needed to operate on the entire dataset. Beringei uses a lossless streaming compression algorithm to compress points within a time series with no additional compression used across time series. Each data point is a pair of 64-bit values representing the timestamp and value of the counter at that time. Timestamps and values are compressed separately using information about previous values. Timestamp compression uses a delta-of-delta encoding, so regular time series use very little memory to store timestamps.
From analyzing the data stored in our performance monitoring system, we discovered that the value in most time series does not change significantly when compared to its neighboring data points. Further, many data sources only store integers (despite the system supporting floating point values). Knowing this, we were able to tune previous academic work to be easier to compute by comparing the current value with the previous value using XOR, and storing the changed bits. Ultimately, this algorithm resulted in compressing the entire data set by at least 90 percent.
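Both tricks are easy to demonstrate in miniature. This toy sketch shows why the encodings shrink; what it leaves out is Beringei's variable-length bit packing of the results.

```python
# Delta-of-delta timestamps: a regular series collapses to near-zero
# values, which pack into very few bits. XOR values: near-identical
# neighbouring doubles share sign/exponent/mantissa prefix bits, so
# the XOR is mostly zeros.
import struct

def delta_of_deltas(timestamps):
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

def xor_with_previous(values):
    bits = [struct.unpack("<Q", struct.pack("<d", v))[0] for v in values]
    return [b ^ a for a, b in zip(bits, bits[1:])]

print(delta_of_deltas([60, 120, 180, 240, 310]))        # [0, 0, 10]
print([f"{x:016x}" for x in xor_with_previous([12.0, 12.0, 12.5])])
# ['0000000000000000', '0001000000000000'] -- mostly zero bits
```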
Nobody Loves Graphite Anymore - VividCortex
Like I've been saying -- we need Time Series As A Service! This should be undifferentiated heavy lifting.
graphite
tsd
time-series
vividcortex
statsd
ops
monitoring
metrics
november 2015 by jm
Graphite has a place in our current monitoring stack, and together with StatsD will always have a special place in the hearts of DevOps practitioners everywhere, but it’s not representative of state-of-the-art in the last few years. Graphite is where the puck was in 2010. If you’re skating there, you’re missing the benefits of modern monitoring infrastructure.
The future I foresee is one where time series capabilities (the raw power needed, which I described in my time series requirements blog post, for example) are within everyone’s reach. That will be considered table stakes, whereas now it’s pretty revolutionary.
The New InfluxDB Storage Engine: A Time Structured Merge Tree
influxdb
storage
lsm-trees
leveldb
tsm-trees
data-structures
algorithms
time-series
tsd
compression
october 2015 by jm
The new engine has similarities with LSM Trees (like LevelDB and Cassandra’s underlying storage). It has a write ahead log, index files that are read only, and it occasionally performs compactions to combine index files. We’re calling it a Time Structured Merge Tree because the index files keep contiguous blocks of time and the compactions merge those blocks into larger blocks of time. Compression of the data improves as the index files are compacted. Once a shard becomes cold for writes it will be compacted into as few files as possible, which yield the best compression.
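A toy model of that shape, purely to illustrate the moving parts -- this is not InfluxDB's actual on-disk format:

```python
# Writes accumulate in a WAL-backed buffer; flushing produces an
# immutable block covering a contiguous time range; compaction merges
# blocks into one larger block spanning a wider range.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Block:                 # stands in for a read-only TSM index file
    t_min: int
    t_max: int
    points: tuple            # sorted (timestamp, value) pairs

@dataclass
class ToyTSM:
    wal: list = field(default_factory=list)
    blocks: list = field(default_factory=list)

    def write(self, ts: int, value: float) -> None:
        self.wal.append((ts, value))   # real engine also appends to a disk WAL

    def flush(self) -> None:
        if not self.wal:
            return
        pts = tuple(sorted(self.wal))
        self.blocks.append(Block(pts[0][0], pts[-1][0], pts))
        self.wal = []

    def compact(self) -> None:
        pts = tuple(sorted(p for b in self.blocks for p in b.points))
        self.blocks = [Block(pts[0][0], pts[-1][0], pts)]
```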
How We Scale VividCortex's Backend Systems - High Scalability
march 2015 by jm
Excellent post from Baron Schwartz about their large-scale, 1-second-granularity time series database storage system
time-series
tsd
storage
mysql
sql
baron-schwartz
ops
performance
scalability
scaling
go
One year of InfluxDB and the road to 1.0
graphite
monitoring
metrics
tsd
time-series
analytics
influxdb
open-source
february 2015 by jm
half of the [Monitorama] attendees were employees and entrepreneurs at monitoring, metrics, DevOps, and server analytics companies. Most of them had a story about how their metrics API was their key intellectual property that took them years to develop. The other half of the attendees were developers at larger organizations that were rolling their own DevOps stack from a collection of open source tools. Almost all of them were creating a “time series database” with a bunch of web services code on top of some other database or just using Graphite. When everyone is repeating the same work, it’s not key intellectual property or a differentiator, it’s a barrier to entry. Not only that, it’s something that is hindering innovation in this space since everyone has to spend their first year or two getting to the point where they can start building something real. It’s like building a web company in 1998. You have to spend millions of dollars and a year building infrastructure, racking servers, and getting everything ready before you could run the application. Monitoring and analytics applications should not be like this.
Observability at Twitter
september 2013 by jm
A bit of detail on Twitter's TSD metric store.
twitter
monitoring
metrics
service-metrics
tsd
time-series
storage
architecture
cassandra
There are separate online clusters for different data sets: application and operating system metrics, performance critical write-time aggregates, long term archives, and temporal indexes. A typical production instance of the time series database is based on four distinct Cassandra clusters, each responsible for a different dimension (real-time, historical, aggregate, index) due to different performance constraints. These clusters are amongst the largest Cassandra clusters deployed in production today and account for over 500 million individual metric writes per minute. Archival data is stored at a lower resolution for trending and long term analysis, whereas higher resolution data is periodically expired. Aggregation is generally performed at write-time to avoid extra storage operations for metrics that are expected to be immediately consumed. Indexing occurs along several dimensions–service, source, and metric names–to give users some flexibility in finding relevant data.
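The write-time-aggregation point is worth dwelling on; in miniature it's just rolling each point into its bucket at ingest so the read side is a lookup, not a scan. Names here are illustrative, not Twitter's.

```python
# Aggregate at write time: each point lands in its minutely bucket as
# it arrives, so reading the aggregate never touches raw points.
from collections import defaultdict

class WriteTimeAggregator:
    def __init__(self, resolution_s: int = 60):
        self.resolution = resolution_s
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def write(self, metric: str, ts: int, value: float) -> None:
        bucket = (metric, ts - ts % self.resolution)
        self.sums[bucket] += value
        self.counts[bucket] += 1

    def mean(self, metric: str, ts: int) -> float:
        bucket = (metric, ts - ts % self.resolution)
        return self.sums[bucket] / self.counts[bucket]
```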
Blueflood by rackerlabs
september 2013 by jm
Rackspace's large-scale TSD storage system: built on Cassandra, written in Java, ASL2-licensed
cassandra
tsd
storage
time-series
data
open-source
java
rackspace
Boundary Product Update: Trends Dashboard Now Available
april 2013 by jm
Boundary implement week-on-week trend display. Pity they use silly "giant number" dashboard boxes showing comparisons of the current datapoint with the previous week's datapoint; there's no indication of smoothing being applied, and for unsmoothed time-series data, "giant number" dashboards are basically useless anyway compared to a time-series graph. Also, no prediction bands. :(
boundary
time-series
tsd
prediction
metrics
smoothing
dataviz
dashboards
Boundary Techtalk - Large-scale OLAP with Kobayashi
april 2013 by jm
Boundary on their TSD-on-Riak store.
video
boundary
tsd
riak
eventual-consistency
storage
kobayashi
olap
time-series
Dietrich Featherston, Engineer at Boundary, walks through the process of designing Kobayashi, the time-series analytics database behind our network metrics. He goes through the false-starts and lessons learned in effectively using Riak as the storage layer for a large-scale OLAP database. The system is ultimately capable of answering complex, ad-hoc queries at interactive latencies.
Metric Collection and Storage with Cassandra | DataStax
march 2013 by jm
DataStax's documentation on how they store TSD data in Cassandra. Pretty generic.
datastax
nosql
metrics
analytics
cassandra
tsd
time-series
storage
Cubism.js
april 2012 by jm
'a D3 plugin for visualizing time series. Use Cubism to construct better realtime dashboards.' Apache-licensed; nice realtime update style; overlays multiple data sources well. I think I now have a good use-case for this.
javascript
library
visualization
dataviz
tsd
data
apache
open-source
Occursions
march 2012 by jm
'Our goal is to create the world's fastest extendable, non-transactional time series database for big data (you know, for kids)! Log file indexing is our initial focus. For example append only ASCII files produced by libraries like Log4J, or containing FIX messages or JSON objects. Occursions was built by a small team sick of creating hacks to remotely copy and/or grep through tons of large log files. We use it to index around a terabyte of new log data per day. Occursions asynchronously tails log files and indexes the individual lines in each log file as each line is written to disk so you don't even have to wait for a second after an event happens to search for it. Occursions uses custom disk backed data structures to create and search its indexes so it is very efficient at using CPU, memory and disk.'
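The tail-and-index loop it describes might look something like this in miniature; purely illustrative, since the custom disk-backed index structures are exactly the part this sketch waves away.

```python
# Follow an append-only log file, recording (arrival_time, byte_offset)
# for each new line; a time-range query then becomes a bisect over the
# index plus a seek per matching line.
import bisect
import time

def tail_and_index(path: str, index: list) -> None:
    with open(path, "rb") as f:
        f.seek(0, 2)                     # start at the current end of file
        while True:
            offset = f.tell()
            if f.readline():
                bisect.insort(index, (time.time(), offset))
            else:
                time.sleep(0.1)          # no new data yet; poll again
```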
logs
search
tsd
big-data
log4j
via:proggit
dygraphs JavaScript Visualization Library
december 2009 by jm
'an open source JavaScript library that produces interactive, zoomable charts of time series. It is designed to display dense data sets and enable users to explore and interpret them.' Quite pretty.
time-series
data
tsd
graphs
charts
javascript
via:reddit
dataviz
visualization
opensource
dygraphs
from delicious