**jm + histograms**
11 bookmarks

Developing a time-series "database" based on HdrHistogram

april 2017 by jm

Histogram aggregation is definitely a sensible way to store this kind of data
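
The core trick, sketched as toy Python (my own illustration, not the project's code): per-interval histograms merge by bucket-wise addition, so percentiles over any time range can be computed after the fact from stored bucket counts.

```python
from collections import Counter

def record(hist, value_ms, bucket_width=5):
    """Add one latency sample to a bucketed histogram (a Counter)."""
    hist[int(value_ms // bucket_width) * bucket_width] += 1

def merge(histograms):
    """Histograms merge by bucket-wise addition -- the property that
    makes storing one histogram per time interval work: any range of
    intervals can be combined after the fact."""
    total = Counter()
    for h in histograms:
        total.update(h)
    return total

def percentile(hist, p):
    """Approximate percentile: walk buckets in ascending order until
    the cumulative count reaches p% of all samples."""
    n = sum(hist.values())
    cum = 0
    for bucket in sorted(hist):
        cum += hist[bucket]
        if cum >= p / 100.0 * n:
            return bucket
    return None
```

Raw samples are never kept, only bucket counts, which is what makes this cheap to store per minute and aggregate per day.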

storage
elasticsearch
metrics
hdrhistogram
histograms
tideways

You CAN Average Percentiles

july 2016 by jm

John Rauser on this oft-cited dictum of percentile usage in monitoring: when it's wrong, when it's actually possible to reason about averaged percentiles, and when that reasoning breaks down.
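
A toy numeric illustration of the failure mode (my own example, not Rauser's): averaging per-shard p99s is only safe when the shards are identically distributed.

```python
def p99(samples):
    """99th percentile by a simple rank rule: the value at rank
    99*n//100 of the sorted sample."""
    s = sorted(samples)
    return s[99 * len(s) // 100]

# Two shards with very different tails:
shard_a = [10] * 99 + [500]   # one slow request
shard_b = [10] * 100          # none

avg_of_p99s = (p99(shard_a) + p99(shard_b)) / 2   # 255 -- misleading
true_p99 = p99(shard_a + shard_b)                 # 10

# When the shards ARE identically distributed, averaging is harmless:
shard_c = list(shard_a)
assert p99(shard_a + shard_c) == (p99(shard_a) + p99(shard_c)) / 2
```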

statistics
percentiles
quantiles
john-rauser
histograms
averaging
mean
p99

HdrHistogram: A better latency capture method

february 2015 by jm

An excellent intro to HdrHistogram usage
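
The key idea is recording values with a bounded *relative* error at every magnitude. A minimal sketch of that guarantee (this uses geometric buckets as in DDSketch; the real HdrHistogram uses base-2 exponents with linear sub-buckets and a different API, but the flavour of the guarantee is the same):

```python
import math

class SketchHistogram:
    """Toy histogram with a fixed relative-error guarantee, in the
    spirit of HdrHistogram. NOT the HdrHistogram API -- a geometric-
    bucket sketch of the same idea."""

    def __init__(self, relative_error=0.01):
        self.gamma = (1 + relative_error) / (1 - relative_error)
        self.counts = {}
        self.total = 0

    def record(self, value):
        # bucket i covers (gamma**(i-1), gamma**i]
        i = math.ceil(math.log(value, self.gamma))
        self.counts[i] = self.counts.get(i, 0) + 1
        self.total += 1

    def value_at_percentile(self, p):
        target = p / 100.0 * self.total
        cum = 0
        for i in sorted(self.counts):
            cum += self.counts[i]
            if cum >= target:
                # the bucket midpoint is within relative_error of
                # every value the bucket holds
                return 2 * self.gamma ** i / (1 + self.gamma)
        raise ValueError("empty histogram")
```

Storage is one counter per occupied bucket, so microsecond and minute-scale latencies coexist in the same structure with the same relative precision.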

hdrhistogram
hdr
histograms
statistics
latency
measurement
metrics
percentiles
quantiles
gil-tene
nitsan-wakart

'Histogram-based Outlier Score (HBOS): A fast Unsupervised Anomaly Detection Algorithm' [PDF]

november 2014 by jm

'Unsupervised anomaly detection is the process of finding outliers in data sets without prior training. In this paper, a histogram-based outlier detection (HBOS) algorithm is presented, which scores records in linear time. It assumes independence of the features making it much faster than multivariate approaches at the cost of less precision. A comparative evaluation on three UCI data sets and 10 standard algorithms show, that it can detect global outliers as reliable as state-of-the-art algorithms, but it performs poor on local outlier problems. HBOS is in our experiments up to 5 times faster than clustering based algorithms and up to 7 times faster than nearest-neighbor based methods.'
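
A minimal reading of the algorithm as the abstract describes it (my own sketch; the bin count and normalization details are assumptions): one histogram per feature, and a record's score is the sum over features of log-inverse bin heights, so records that fall in rare bins score high.

```python
import math

def hbos_scores(data, bins=5):
    """Histogram-Based Outlier Score, minimal sketch: build one
    equal-width histogram per feature, then score each record in a
    single linear pass. Assumes feature independence."""
    n_features = len(data[0])
    hists = []
    for f in range(n_features):
        col = [row[f] for row in data]
        lo, hi = min(col), max(col)
        width = (hi - lo) / bins or 1.0
        counts = [0] * bins
        for v in col:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        max_count = max(counts)
        hists.append((lo, width, [c / max_count for c in counts]))
    scores = []
    for row in data:
        s = 0.0
        for f, (lo, width, heights) in enumerate(hists):
            idx = min(int((row[f] - lo) / width), bins - 1)
            # rare bins have small height -> large log(1/height)
            s += math.log(1.0 / max(heights[idx], 1e-9))
        scores.append(s)
    return scores
```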

histograms
anomaly-detection
anomalies
machine-learning
algorithms
via:paperswelove
outliers
unsupervised-learning
hbos

173 million 2013 NYC taxi rides shared on BigQuery : bigquery

july 2014 by jm

Interesting! (a) there's a subreddit for Google BigQuery, with links to interesting data sets, like this one; (b) the entire 173-million-row dataset for NYC taxi rides in 2013 is available for querying; and (c) the tip percentage histogram is cool.

datasets
bigquery
sql
google
nyc
new-york
taxis
data
big-data
histograms
tipping

"Effective Computation of Biased Quantiles over Data Streams" [paper]

statistics
quantiles
percentiles
stream-processing
skew
papers
histograms
latency
algorithms

november 2013 by jm

Skew is prevalent in many data sources such as IP traffic streams. To continually summarize the distribution of such data, a high-biased set of quantiles (e.g., 50th, 90th and 99th percentiles) with finer error guarantees at higher ranks (e.g., errors of 5, 1 and 0.1 percent, respectively) is more useful than uniformly distributed quantiles (e.g., 25th, 50th and 75th percentiles) with uniform error guarantees. In this paper, we address the following two problems. First, can we compute quantiles with finer error guarantees for the higher ranks of the data distribution effectively, using less space and computation time than computing all quantiles uniformly at the finest error? Second, if specific quantiles and their error bounds are requested a priori, can the necessary space usage and computation time be reduced? We answer both questions in the affirmative by formalizing them as the “high-biased” quantiles and the “targeted” quantiles problems, respectively, and presenting algorithms with provable guarantees that perform significantly better than previously known solutions for these problems. We implemented our algorithms in the Gigascope data stream management system, and evaluated alternate approaches for maintaining the relevant summary structures. Our experimental results on real and synthetic IP data streams complement our theoretical analyses, and highlight the importance of lightweight, non-blocking implementations when maintaining summary structures over high-speed data streams.

Implemented as a timer-histogram storage system in http://armon.github.io/statsite/.
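
Not the paper's CKMS algorithm, but a back-of-envelope calculation of why the biased guarantee is affordable: if each stored rank may answer queries within relative rank error eps, geometric spacing covers all n ranks with O(log n / eps) stored items.

```python
def summary_size(n, eps):
    """Count how many stored ranks cover every rank r in [1, n] within
    relative rank error eps*r -- the space intuition behind the biased
    quantiles guarantee. Geometric spacing gives roughly
    log(n) / (2*eps) items instead of the ~n items that exact tail
    quantiles would need."""
    kept, r = 0, 1.0
    while r <= n:
        kept += 1
        # a rank stored at r*(1+eps) answers queries up to r*(1+eps)/(1-eps)
        r = r * (1 + eps) / (1 - eps)
    return kept
```

For n = 10^6 and eps = 0.01 this keeps roughly 700 ranks instead of a million.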


shades

june 2013 by jm

A command-line utility in Ruby to perform (a) OLAP cubing and (b) histogramming, given whitespace-delimited line data
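
For flavour, the histogramming half of that job in a few lines of Python (a stand-in I wrote, not shades itself):

```python
from collections import Counter

def histogram(lines, column=0, bucket=10):
    """Pick one whitespace-delimited column, bucket it, and count --
    roughly what a tool like shades does with line data."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        value = float(fields[column])
        counts[int(value // bucket) * bucket] += 1
    return counts

def render(counts, mark="#"):
    """ASCII bar chart, one row per bucket."""
    return "\n".join(f"{b:>6} {mark * counts[b]}" for b in sorted(counts))
```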

ruby
olap
number-crunching
data
histograms
cli

_Dynamic Histograms: Capturing Evolving Data Sets_ [pdf]

(via d2fn)
via:d2fn
histograms
streaming
big-data
data
dvo
dc
sado
dado
dynamic-histograms
papers
toread

may 2013 by jm

Currently, histograms are static structures: they are created from scratch periodically and their creation is based on looking at the entire data distribution as it exists each time. This creates problems, however, as data stored in DBMSs usually varies with time. If new data arrives at a high rate and old data is likewise deleted, a histogram’s accuracy may deteriorate fast as the histogram becomes older, and the optimizer’s effectiveness may be lost. Hence, how often a histogram is reconstructed becomes very critical, but choosing the right period is a hard problem, as the following trade-off exists: If the period is too long, histograms may become outdated. If the period is too short, updates of the histogram may incur a high overhead.

In this paper, we propose what we believe is the most elegant solution to the problem, i.e., maintaining dynamic histograms within given limits of memory space. Dynamic histograms are continuously updateable, closely tracking changes to the actual data. We consider two of the best static histograms proposed in the literature [9], namely V-Optimal and Compressed, and modify them. The new histograms are naturally called Dynamic V-Optimal (DVO) and Dynamic Compressed (DC). In addition, we modified V-Optimal’s partition constraint to create the Static Average-Deviation Optimal (SADO) and Dynamic Average-Deviation Optimal (DADO) histograms.
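
A toy version of the continuous-maintenance idea (my own sketch, not the paper's DVO/DC constructions; the split threshold and midpoint split are assumptions): buckets split when they get too heavy, and the lightest adjacent pair merges to stay within a fixed bucket budget, so the histogram tracks inserts without periodic full rebuilds.

```python
import bisect

class DynamicHistogram:
    """Continuously updated equi-width-ish histogram within a fixed
    memory budget: heavy buckets split at their midpoint (assuming
    uniformity inside the bucket), and the lightest adjacent pair
    merges when the budget is exceeded."""

    def __init__(self, lo, hi, max_buckets=8):
        self.bounds = [lo, hi]      # bucket i covers [bounds[i], bounds[i+1])
        self.counts = [0]
        self.max_buckets = max_buckets

    def _find(self, x):
        i = bisect.bisect_right(self.bounds, x) - 1
        return min(max(i, 0), len(self.counts) - 1)

    def insert(self, x):
        i = self._find(x)
        self.counts[i] += 1
        fair = max(2, sum(self.counts) // self.max_buckets)
        if self.counts[i] > 2 * fair:
            # split at the midpoint, halving the count
            mid = (self.bounds[i] + self.bounds[i + 1]) / 2.0
            half = self.counts[i] // 2
            self.bounds.insert(i + 1, mid)
            self.counts[i:i + 1] = [self.counts[i] - half, half]
            if len(self.counts) > self.max_buckets:
                # merge the adjacent pair with the smallest combined count
                j = min(range(len(self.counts) - 1),
                        key=lambda k: self.counts[k] + self.counts[k + 1])
                self.counts[j:j + 2] = [self.counts[j] + self.counts[j + 1]]
                del self.bounds[j + 1]

    def estimate(self, lo, hi):
        """Selectivity estimate for [lo, hi), assuming uniformity
        within each bucket -- the optimizer's use case."""
        total = 0.0
        for i, c in enumerate(self.counts):
            b_lo, b_hi = self.bounds[i], self.bounds[i + 1]
            overlap = max(0.0, min(hi, b_hi) - max(lo, b_lo))
            if b_hi > b_lo:
                total += c * overlap / (b_hi - b_lo)
        return total
```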


good blog post on histogram-estimation stream processing algorithms

streams
streaming
stream-processing
histograms
percentiles
estimation
waves
statistics
algorithms

february 2013 by jm

After reviewing several dozen papers, a score or so in depth, I identified two data structures that appear to enable us to answer these recency and frequency queries: exponential histograms (from "Maintaining Stream Statistics Over Sliding Windows" by Datar et al.) and waves (from "Distributed Streams Algorithms for Sliding Windows" by Gibbons and Tirthapura). Both of these data structures are used to solve the so-called counting problem, the problem of determining, with a bound on the relative error, the number of 1s in the last N units of time. In other words, the data structures are able to answer the question: how many 1s appeared in the last n units of time within a factor of Error (e.g., 50%). The algorithms are neat, so I'll present them briefly.
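
A compact implementation of the exponential-histogram idea from Datar et al. (the merge threshold k is simplified here): buckets hold power-of-two counts; when more than k buckets share a size, the two oldest merge into one of double size; only the oldest bucket can straddle the window edge, so half of it is counted.

```python
from collections import deque

class ExpHistogram:
    """Approximate count of 1s in the last `window` time units using
    O(log n) buckets, after Datar et al.'s exponential histograms."""

    def __init__(self, window, k=4):
        self.window = window
        self.k = k
        self.buckets = deque()   # (newest_timestamp, size), newest first

    def _expire(self, t):
        while self.buckets and self.buckets[-1][0] <= t - self.window:
            self.buckets.pop()

    def add(self, t):
        """Record a 1 arriving at time t (timestamps nondecreasing)."""
        self._expire(t)
        self.buckets.appendleft((t, 1))
        size = 1
        while True:
            same = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(same) <= self.k:
                break
            i, j = same[-1], same[-2]          # the two oldest of this size
            self.buckets[j] = (self.buckets[j][0], size * 2)
            del self.buckets[i]                # i > j, so index j is unaffected
            size *= 2                          # the merge may cascade upward

    def estimate(self, t):
        self._expire(t)
        if not self.buckets:
            return 0
        total = sum(s for _, s in self.buckets)
        # the oldest bucket may straddle the window edge: count half of it
        return total - self.buckets[-1][1] // 2
```

The error is bounded by half the oldest bucket's size, so k trades memory for accuracy.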


'Medians and Beyond: New Aggregation Techniques for Sensor Networks' [paper, PDF]

february 2013 by jm

'We introduce Quantile Digest or q-digest, a novel data structure which provides provable guarantees on approximation error and maximum resource consumption. In more concrete terms, if the values returned by the sensors are integers in the range [1;n], then using q-digest we can answer quantile queries using message size m within an error of O(log(n)/m). We also outline how we can use q-digest to answer other queries such as range queries, most frequent items and histograms. Another notable property of q-digest is that in addition to the theoretical worst case bound error, the structure carries with itself an estimate of error for this particular query.'
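
Not q-digest itself, but the mergeable-summary property it relies on can be shown with the crudest possible quantile summary (evenly spaced ranks; all names here are mine, and the merge assumes both nodes summarize equal-sized populations):

```python
def compress(sorted_values, m):
    """Keep m evenly spaced ranks from a sorted list. Rank error is
    about n/m -- a much weaker tradeoff than q-digest's log(n)/m, but
    the same shape: message size m buys accuracy."""
    n = len(sorted_values)
    return [sorted_values[i * (n - 1) // (m - 1)] for i in range(m)]

def merge_and_query(summary_a, summary_b, phi):
    """Sensor-network style aggregation: merge two node summaries and
    read off an approximate phi-quantile."""
    merged = sorted(summary_a + summary_b)
    return merged[int(phi * (len(merged) - 1))]
```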

q-digest
algorithms
streams
approximation
histograms
median
percentiles
quantiles

High Scalability - Analyzing billions of credit card transactions and serving low-latency insights in the cloud

february 2013 by jm

Hadoop, a batch-generated read-only Voldemort cluster, and an intriguing optimal-storage histogram bucketing algorithm:

The optimal histogram is computed using a random-restart hill climbing approximated algorithm.

The algorithm has been shown very fast and accurate: we achieved 99% accuracy compared to an exact dynamic algorithm, with a speed increase of one factor. [...] The amount of information to serve in Voldemort for one year of BBVA's credit card transactions on Spain is 270 GB. The whole processing flow would run in 11 hours on a cluster of 24 "m1.large" instances. The whole infrastructure, including the EC2 instances needed to serve the resulting data would cost approximately $3500/month.

scalability
scaling
voldemort
hadoop
batch
algorithms
histograms
statistics
bucketing
percentiles

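
A guess at what "random-restart hill climbing" over histogram boundaries might look like (my own sketch, certainly not BBVA's code): minimise within-bucket squared error by nudging cut points one position at a time, restarting from random cuts to escape local minima.

```python
import random

def sse(values, boundaries):
    """Total squared error of approximating each bucket by its mean.
    `boundaries` are cut indices into the sorted value list."""
    total = 0.0
    cuts = [0] + sorted(boundaries) + [len(values)]
    for lo, hi in zip(cuts, cuts[1:]):
        bucket = values[lo:hi]
        if bucket:
            mean = sum(bucket) / len(bucket)
            total += sum((v - mean) ** 2 for v in bucket)
    return total

def hill_climb_histogram(values, n_buckets=3, restarts=20, seed=0):
    """Random-restart hill climbing: from random cut points, greedily
    move one cut left or right while the error drops; keep the best
    local minimum found across restarts."""
    rng = random.Random(seed)
    values = sorted(values)
    n = len(values)
    best, best_err = None, float("inf")
    for _ in range(restarts):
        cuts = sorted(rng.sample(range(1, n), n_buckets - 1))
        err = sse(values, cuts)
        improved = True
        while improved:
            improved = False
            for i in range(len(cuts)):
                for delta in (-1, 1):
                    trial = sorted(set(cuts[:i] + [cuts[i] + delta] + cuts[i + 1:]))
                    if len(trial) != len(cuts) or trial[0] < 1 or trial[-1] > n - 1:
                        continue
                    e = sse(values, trial)
                    if e < err:
                        cuts, err, improved = trial, e, True
        if err < best_err:
            best, best_err = cuts, err
    return best, best_err
```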
