**jm + estimation**
24

The curious case of disappearing buses

5 weeks ago by jm

Nice investigation into some dodgy pseudo-real-time bus location data in the Bristol real time passenger information system (via Tony Finch)

via:fanf
bristol
buses
public-transport
rtpi
estimation
open-data
So what have we learned? One thing we are sure is that data of different qualities – genuinely real-time, pseudo real-time (Type 2 and Type 1), and non-real-time (scheduled) data – all present in the data stream.

Among these the most interesting are Type 2 pseudo real-time data. They appear to be the root cause of the phenomenon of disappearing buses.

Type 2 pseudo-real-time data are not totally bogus. One possible explanation of their existence can be this. The bus company has limited but not full tracking information on some of their buses. For example, it may know the location of a bus only when the bus leaves the bus terminal. Instead of not showing any data at all about the bus, the bus company uses interpolation to predict the locations of the bus, and reports these as if those are real-time data.

5 weeks ago by jm

HyperBitBit

march 2017 by jm

jomsdev notes:

'Last year, in the AofA’16 conference Robert Sedgewick proposed a new algorithm for cardinality estimation. Robert Sedgwick is a professor at Princeton with a long track of publications on combinatorial/randomized algorithms. He was a good friend of Philippe Flajolet (creator of Hyperloglog) and HyperBitBit it's based on the same ideas. However, it uses less memory than Hyperloglog and can provide the same results. On practical data, HyperBitBit, for N < 2^64 estimates cardinality within 10% using only 128 + 6 bits.'

algorithms
programming
cs
hyperloglog
estimation
cardinality
counting
hyperbitbit
'Last year, in the AofA’16 conference Robert Sedgewick proposed a new algorithm for cardinality estimation. Robert Sedgwick is a professor at Princeton with a long track of publications on combinatorial/randomized algorithms. He was a good friend of Philippe Flajolet (creator of Hyperloglog) and HyperBitBit it's based on the same ideas. However, it uses less memory than Hyperloglog and can provide the same results. On practical data, HyperBitBit, for N < 2^64 estimates cardinality within 10% using only 128 + 6 bits.'

march 2017 by jm

tdunning/t-digest

Super-nice feature is that it's mergeable, so amenable to parallel usage across multiple hosts if required. Java implementation, ASL licensing.
data-structures
algorithms
java
t-digest
statistics
quantiles
percentiles
aggregation
digests
estimation
ranking

december 2016 by jm

A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. The t-digest algorithm is also very parallel friendly making it useful in map-reduce and parallel streaming applications.

The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to product a data structure that is related to the Q-digest. This t-digest data structure can be used to estimate quantiles or compute other rank statistics. The advantage of the t-digest over the Q-digest is that the t-digest can handle floating point values while the Q-digest is limited to integers. With small changes, the t-digest can handle any values from any ordered set that has something akin to a mean. The accuracy of quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by Q-digests in spite of the fact that t-digests are more compact when stored on disk.

Super-nice feature is that it's mergeable, so amenable to parallel usage across multiple hosts if required. Java implementation, ASL licensing.

december 2016 by jm

Fast Forward Labs: Probabilistic Data Structure Showdown: Cuckoo Filters vs. Bloom Filters

november 2016 by jm

Nice comparison of a counting Bloom filter and a Cuckoo Filter, implemented in Python:

algorithms
probabilistic
approximation
bloom-filters
cuckoo-filters
sets
estimation
python
This post provides an update by exploring Cuckoo filters, a new probabilistic data structure that improves upon the standard Bloom filter. The Cuckoo filter provides a few advantages: 1) it enables dynamic deletion and addition of items 2) it can be easily implemented compared to Bloom filter variants with similar capabilities, and 3) for similar space constraints, the Cuckoo filter provides lower false positives, particularly at lower capacities. We provide a python implementation of the Cuckoo filter here, and compare it to a counting Bloom filter (a Bloom filter variant).

november 2016 by jm

"last seen" sketch

july 2015 by jm

a new sketch algorithm from Baron Schwartz and Preetam Jinka of VividCortex; similar to Count-Min but with last-seen timestamp instead of frequency.

sketch
algorithms
estimation
approximation
sampling
streams
big-data
july 2015 by jm

hyperlogsandwich

(via Patrick McFadin)
via:patrickmcfadin
hyperloglog
cardinality
data-structures
algorithms
hyperlogsandwich
counting
estimation
lossy
multisets

may 2015 by jm

A probabilistic data structure for frequency/k-occurrence cardinality estimation of multisets. Sample implementation

(via Patrick McFadin)

may 2015 by jm

OpenJDK: jol

february 2015 by jm

'JOL (Java Object Layout) is the tiny toolbox to analyze object layout schemes in JVMs. These tools are using Unsafe, JVMTI, and Serviceability Agent (SA) heavily to decoder the actual object layout, footprint, and references. This makes JOL much more accurate than other tools relying on heap dumps, specification assumptions, etc.'

Recommended by Nitsan Wakart, looks pretty useful for JVM devs

java
jvm
tools
scala
memory
estimation
ram
object-layout
debugging
via:nitsan
Recommended by Nitsan Wakart, looks pretty useful for JVM devs

february 2015 by jm

'Uncertain<T>: A First-Order Type for Uncertain Data' [paper, PDF]

december 2014 by jm

'Emerging applications increasingly use estimates such as sensor

data (GPS), probabilistic models, machine learning, big

data, and human data. Unfortunately, representing this uncertain

data with discrete types (floats, integers, and booleans)

encourages developers to pretend it is not probabilistic, which

causes three types of uncertainty bugs. (1) Using estimates

as facts ignores random error in estimates. (2) Computation

compounds that error. (3) Boolean questions on probabilistic

data induce false positives and negatives.

This paper introduces Uncertain<T>, a new programming

language abstraction for uncertain data. We implement a

Bayesian network semantics for computation and conditionals

that improves program correctness. The runtime uses sampling

and hypothesis tests to evaluate computation and conditionals

lazily and efficiently. We illustrate with sensor and

machine learning applications that Uncertain<T> improves

expressiveness and accuracy.'

(via Tony Finch)

via:fanf
uncertainty
estimation
types
strong-typing
coding
probability
statistics
machine-learning
sampling
data (GPS), probabilistic models, machine learning, big

data, and human data. Unfortunately, representing this uncertain

data with discrete types (floats, integers, and booleans)

encourages developers to pretend it is not probabilistic, which

causes three types of uncertainty bugs. (1) Using estimates

as facts ignores random error in estimates. (2) Computation

compounds that error. (3) Boolean questions on probabilistic

data induce false positives and negatives.

This paper introduces Uncertain<T>, a new programming

language abstraction for uncertain data. We implement a

Bayesian network semantics for computation and conditionals

that improves program correctness. The runtime uses sampling

and hypothesis tests to evaluate computation and conditionals

lazily and efficiently. We illustrate with sensor and

machine learning applications that Uncertain<T> improves

expressiveness and accuracy.'

(via Tony Finch)

december 2014 by jm

CausalImpact: A new open-source package for estimating causal effects in time series

causal-inference
r
google
time-series
models
bayes
adwords
advertising
statistics
estimation
metrics

september 2014 by jm

How can we measure the number of additional clicks or sales that an AdWords campaign generated? How can we estimate the impact of a new feature on app downloads? How do we compare the effectiveness of publicity across countries?

In principle, all of these questions can be answered through causal inference.

In practice, estimating a causal effect accurately is hard, especially when a randomised experiment is not available. One approach we've been developing at Google is based on Bayesian structural time-series models. We use these models to construct a synthetic control — what would have happened to our outcome metric in the absence of the intervention. This approach makes it possible to estimate the causal effect that can be attributed to the intervention, as well as its evolution over time.

We've been testing and applying structural time-series models for some time at Google. For example, we've used them to better understand the effectiveness of advertising campaigns and work out their return on investment. We've also applied the models to settings where a randomised experiment was available, to check how similar our effect estimates would have been without an experimental control.

Today, we're excited to announce the release of CausalImpact, an open-source R package that makes causal analyses simple and fast. With its release, all of our advertisers and users will be able to use the same powerful methods for estimating causal effects that we've been using ourselves.

Our main motivation behind creating the package has been to find a better way of measuring the impact of ad campaigns on outcomes. However, the CausalImpact package could be used for many other applications involving causal inference. Examples include problems found in economics, epidemiology, or the political and social sciences.

september 2014 by jm

HyperLogLog - Intersection Arithmetic

hyperloglog
hll
hyperloglogplus
streamlib
intersections
sets
estimation
algorithms

april 2014 by jm

'In general HLL intersection in StreamLib works. |A INTERSECT B|

= |A| + |B| - |A UNION B|. Timon's article on intersection is

important to read though. The usefulness of HLL intersection depends

on the features of the HLLs you are intersecting.'

april 2014 by jm

Druid | How We Scaled HyperLogLog: Three Real-World Optimizations

april 2014 by jm

3 optimizations Druid.io have made to the HLL algorithm to scale it up for production use in Metamarkets: compacting registers (fixes a bug with unions of multiple HLLs); a sparse storage format (to optimize space); faster lookups using a lookup table.

druid.io
metamarkets
scaling
hyperloglog
hll
algorithms
performance
optimization
counting
estimation
april 2014 by jm

Redis adds support for HyperLogLog

april 2014 by jm

good comment thread on HN, discussing hlld and bloomd as well

hll
bloom-filters
hyperloglog
redis
data-structures
estimation
cardinality
probabilistic
probability
hashing
random
april 2014 by jm

Recordinality

august 2013 by jm

a new, and interesting, sketching algorithm, with a Java implementation:

sketching
coding
algorithms
recordinality
cardinality
estimation
hll
hashing
murmurhash
java
Recordinality is unique in that it provides cardinality estimation like HLL, but also offers "distinct value sampling." This means that Recordinality can allow us to fetch a random sample of distinct elements in a stream, invariant to cardinality. Put more succinctly, given a stream of elements containing 1,000,000 occurrences of 'A' and one occurrence each of 'B' - 'Z', the probability of any letter appearing in our sample is equal. Moreover, we can also efficiently store the number of times elements in our distinct sample have been observed. This can help us to understand the distribution of occurrences of elements in our stream. With it, we can answer questions like "do the elements we've sampled present in a power law-like pattern, or is the distribution of occurrences relatively even across the set?"

august 2013 by jm

Sketch of the Day: K-Minimum Values

june 2013 by jm

Another sketching algorithm -- this one supports set union and intersection operations more easily than HyperLogLog when there are more than 2 sets

algorithms
coding
space-saving
cardinality
streams
stream-processing
estimation
sets
sketching
june 2013 by jm

hlld

(via:cscotta)
hyper-log-log
hlld
hll
data-structures
memcached
daemons
sketching
estimation
big-data
cardinality
algorithms
via:cscotta

june 2013 by jm

a high-performance C server which is used to expose HyperLogLog sets and operations over them to networked clients. It uses a simple ASCII protocol which is human readable, and similar to memcached.

HyperLogLog's are a relatively new sketching data structure. They are used to estimate cardinality, i.e. the unique number of items in a set. They are based on the observation that any bit in a "good" hash function is indepedenent of any other bit and that the probability of getting a string of N bits all set to the same value is 1/(2^N). There is a lot more in the math, but that is the basic intuition. What is even more incredible is that the storage required to do the counting is log(log(N)). So with a 6 bit register, we can count well into the trillions. For more information, its best to read the papers referenced at the end. TL;DR: HyperLogLogs enable you to have a set with about 1.6% variance, using 3280 bytes, and estimate sizes in the trillions.

(via:cscotta)

june 2013 by jm

Approximate Heavy Hitters -The SpaceSaving Algorithm

may 2013 by jm

nice, readable intro to SpaceSaving (which I've linked to before) -- a simple stream-processing cardinality top-K estimation algorithm with bounded error.

algorithms
coding
space-saving
cardinality
streams
stream-processing
estimation
may 2013 by jm

Big-O Algorithm Complexity Cheat Sheet

may 2013 by jm

nicely done, very readable

algorithms
reference
cheat-sheet
big-o
complexity
estimation
coding
may 2013 by jm

good blog post on histogram-estimation stream processing algorithms

streams
streaming
stream-processing
histograms
percentiles
estimation
waves
statistics
algorithms

february 2013 by jm

After reviewing several dozen papers, a score or so in depth, I identified two data structures that appear to enable us to answer these recency and frequency queries: exponential histograms (from "Maintaining Stream Statistics Over Sliding Windows" by Datar et al.) and waves (from "Distributed Streams Algorithms for Sliding Windows" by Gibbons and Tirthapura). Both of these data structures are used to solve the so-called counting problem, the problem of determining, with a bound on the relative error, the number of 1s in the last N units of time. In other words, the data structures are able to answer the question: how many 1s appeared in the last n units of time within a factor of Error (e.g., 50%). The algorithms are neat, so I'll present them briefly.

february 2013 by jm

Distributed Streams Algorithms for Sliding Windows [PDF]

february 2013 by jm

'Massive data sets often arise as physically distributed, parallel data streams, and it is important to estimate various aggregates and statistics on the union of these streams. This paper presents algorithms for estimating aggregate

functions over a “sliding window” of the N most recent data items in one or more streams. [...] Our results are obtained using a novel family of synopsis data structures called waves.'

waves
papers
streaming
algorithms
percentiles
histogram
distcomp
distributed
aggregation
statistics
estimation
streams
functions over a “sliding window” of the N most recent data items in one or more streams. [...] Our results are obtained using a novel family of synopsis data structures called waves.'

february 2013 by jm

Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure

february 2013 by jm

includes a nice javascript demo of HLL

hyperloglog
loglog
algorithms
stream-processing
streams
estimation
demos
javascript
february 2013 by jm

HyperLogLog++: Google’s Take On Engineering HLL

february 2013 by jm

Google and AggregateKnowledge's improvements to the HyperLogLog cardinality estimation algorithm

hyperloglog
cardinality
estimation
streaming
stream-processing
cep
february 2013 by jm

clearspring / stream-lib

february 2013 by jm

ASL-licensed open source library of stream-processing/approximation algorithms: count-min sketch, space-saving top-k, cardinality estimation, LogLog, HyperLogLog, MurmurHash, lookup3 hash, Bloom filters, q-digest, stochastic top-k

algorithms
coding
streams
cep
stream-processing
approximation
probabilistic
space-saving
top-k
cardinality
estimation
bloom-filters
q-digest
loglog
hyperloglog
murmurhash
lookup3
february 2013 by jm

Probabilistic Data Structures for Web Analytics and Data Mining « Highly Scalable Blog

may 2012 by jm

Stream summary, count-min sketches, loglog counting, linear counters. Some nifty algorithms for probabilistic estimation of element frequencies and data-set cardinality (via proggit)

via:proggit
algorithms
probability
probabilistic
count-min
stream-summary
loglog-counting
linear-counting
estimation
big-data
may 2012 by jm

The best "why estimation is hard" parable I've read this week

february 2012 by jm

'A tense silence falls between us. The phone call goes unmade. I'll call tomorrow once my comrade regains his senses and is willing to commit to something reasonable.'

agile
development
management
programming
teams
estimation
tasks
software
february 2012 by jm

**related tags**

Copy this bookmark: