Differential Privacy

june 2016 by jm

Apple have announced they plan to use it; Google use a DP algorithm called RAPPOR in Chrome usage statistics. In summary: "novel privacy technology that allows inferring statistics about populations while preserving the privacy of individual users".
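The "infer population statistics while preserving individual privacy" idea is easiest to see in randomized response, the classic mechanism RAPPOR builds on. A minimal sketch (parameter choices mine, not Google's implementation):

```python
import random

def randomized_response(truth: bool, p: float = 0.75) -> bool:
    """Report the true bit with probability p; otherwise report a fair
    coin flip. Any single answer is plausibly deniable."""
    if random.random() < p:
        return truth
    return random.random() < 0.5

def estimate_true_rate(reports, p: float = 0.75) -> float:
    """Invert the noise: E[observed] = p * rate + (1 - p) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p) * 0.5) / p

# Simulate 100,000 users, 30% of whom have the sensitive attribute;
# the aggregate estimate recovers roughly 0.30 despite per-user noise.
random.seed(42)
users = [random.random() < 0.3 for _ in range(100_000)]
reports = [randomized_response(u) for u in users]
estimate = estimate_true_rate(reports)
```

RAPPOR layers Bloom-filter encoding and a second, "permanent" randomization on top of this, but the aggregate estimation step is the same inversion idea.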

apple
privacy
anonymization
google
rappor
algorithms
sampling
populations
statistics
differential-privacy
june 2016 by jm

Very Fast Reservoir Sampling

december 2015 by jm

via Tony Finch. 'In this post I will demonstrate how to do reservoir sampling orders of magnitude faster than the traditional “naive” reservoir sampling algorithm, using a fast high-fidelity approximation to the reservoir sampling-gap distribution.'
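The "gap" being approximated is the number of stream elements to skip between successive reservoir replacements. An exact gap-skipping variant in the same spirit (often called Algorithm L; a sketch, not necessarily the post's algorithm):

```python
import math
import random

def reservoir_sample(stream, k):
    """Uniform sample of k items from a stream of unknown length.
    Instead of rolling a die for every element, draw how many elements
    to skip before the next replacement, so work is proportional to
    the number of replacements rather than the stream length."""
    it = iter(stream)
    reservoir = []
    for item in it:
        reservoir.append(item)
        if len(reservoir) == k:
            break
    if len(reservoir) < k:
        return reservoir  # stream shorter than k: return everything
    w = math.exp(math.log(random.random()) / k)
    while True:
        # Draw the gap until the next replacement.
        gap = int(math.log(random.random()) / math.log(1.0 - w)) + 1
        item = None
        try:
            for _ in range(gap):
                item = next(it)
        except StopIteration:
            return reservoir
        reservoir[random.randrange(k)] = item
        w *= math.exp(math.log(random.random()) / k)

print(reservoir_sample(range(1_000_000), 5))
```

For a million-element stream and k = 5, the loop body runs only a few dozen times.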

statistics
reservoir-sampling
sampling
algorithms
poisson
bernoulli
performance
december 2015 by jm

The reusable holdout: Preserving validity in adaptive data analysis

august 2015 by jm

Useful stats hack from Google: "We show how to safely reuse a holdout data set many times to validate the results of adaptively chosen analyses."
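The paper's mechanism, Thresholdout, fits in a few lines; a sketch with illustrative parameter values (the real guarantees depend on choosing the threshold and noise scale per the paper's analysis):

```python
import random

def thresholdout(train_val, holdout_val, threshold=0.04, sigma=0.01):
    """Answer one adaptive query about a statistic.  While train and
    holdout agree, answer from the training set and reveal nothing
    about the holdout; only when they diverge, return a *noised*
    holdout value.  The noise is what lets the holdout survive many
    adaptively chosen queries."""
    if abs(train_val - holdout_val) > threshold + random.gauss(0, sigma):
        return holdout_val + random.gauss(0, sigma)
    return train_val
```

Here `train_val` and `holdout_val` might be, say, a candidate feature's correlation with the label measured on each set.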

statistics
google
reusable-holdout
training
ml
machine-learning
data-analysis
holdout
corpus
sampling
august 2015 by jm

"last seen" sketch

july 2015 by jm

A new sketch algorithm from Baron Schwartz and Preetam Jinka of VividCortex; similar to Count-Min, but recording a last-seen timestamp instead of a frequency.
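The shape is easy to guess at from the description: keep the Count-Min grid of hashed cells, but take the max of timestamps on update and the min across rows on query. A sketch (my reconstruction, not VividCortex's code):

```python
import random
import time

class LastSeenSketch:
    """Count-Min-shaped sketch recording last-seen timestamps.
    Updates take the max per cell; queries take the min across rows,
    because a hash collision can only push a cell's timestamp forward,
    never backward.  The answer is therefore never *earlier* than the
    true last-seen time."""
    def __init__(self, depth=4, width=1024, seed=0):
        rnd = random.Random(seed)
        self.salts = [rnd.getrandbits(32) for _ in range(depth)]
        self.width = width
        self.rows = [[0.0] * width for _ in range(depth)]

    def _cells(self, key):
        for row, salt in zip(self.rows, self.salts):
            yield row, hash((salt, key)) % self.width

    def touch(self, key, ts=None):
        ts = time.time() if ts is None else ts
        for row, i in self._cells(key):
            row[i] = max(row[i], ts)

    def last_seen(self, key):
        return min(row[i] for row, i in self._cells(key))
```

Like Count-Min, the error is one-sided: collisions can only make a key look more recently seen than it was.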

sketch
algorithms
estimation
approximation
sampling
streams
big-data
july 2015 by jm

'Uncertain<T>: A First-Order Type for Uncertain Data' [paper, PDF]

december 2014 by jm

'Emerging applications increasingly use estimates such as sensor data (GPS), probabilistic models, machine learning, big data, and human data. Unfortunately, representing this uncertain data with discrete types (floats, integers, and booleans) encourages developers to pretend it is not probabilistic, which causes three types of uncertainty bugs. (1) Using estimates as facts ignores random error in estimates. (2) Computation compounds that error. (3) Boolean questions on probabilistic data induce false positives and negatives.

This paper introduces Uncertain<T>, a new programming language abstraction for uncertain data. We implement a Bayesian network semantics for computation and conditionals that improves program correctness. The runtime uses sampling and hypothesis tests to evaluate computation and conditionals lazily and efficiently. We illustrate with sensor and machine learning applications that Uncertain<T> improves expressiveness and accuracy.'

(via Tony Finch)
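The flavour of the abstraction fits in a few lines. A toy sketch of the idea (not the authors' API): a value is a sampler, arithmetic composes samplers, and comparisons become questions about probability mass rather than plain booleans.

```python
import random

class Uncertain:
    """A distribution represented by a sampling function."""
    def __init__(self, sampler):
        self.sample = sampler

    def __add__(self, other):
        # Computation composes samplers, propagating error instead of
        # pretending the operands are exact.
        return Uncertain(lambda: self.sample() + other.sample())

    def gt(self, threshold, n=1000, confidence=0.95):
        """Is Pr[value > threshold] above `confidence`?  Evaluated by
        sampling (the paper uses proper sequential hypothesis tests)."""
        hits = sum(self.sample() > threshold for _ in range(n))
        return hits / n > confidence

# A noisy GPS speed estimate: true speed 55, sensor noise sigma 5.
speed = Uncertain(lambda: random.gauss(55, 5))
# A single noisy sample might read 62; asking about the distribution
# instead avoids that false positive:
print(speed.gt(60))  # almost certainly False
```

This is the paper's third bug class in action: a plain `speed > 60` on one sample would periodically fire even though the true speed never exceeds the limit.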

via:fanf
uncertainty
estimation
types
strong-typing
coding
probability
statistics
machine-learning
sampling

december 2014 by jm

Tehuti

october 2014 by jm

An embryonic metrics library for Java/Scala from Felix GV at LinkedIn, extracted from Kafka's metric implementation and in the new Voldemort release. It fixes the major known problems with the Meter/Timer implementations in Coda-Hale/Dropwizard/Yammer Metrics.

'Regarding Tehuti: it has been extracted from Kafka's metric implementation. The code was originally written by Jay Kreps, and then maintained and improved by some Kafka and Voldemort devs, so it definitely is not the work of just one person. It is in my repo at the moment but I'd like to put it in a more generally available (git and maven) repo in the future. I just haven't had the time yet...

As for comparing with CodaHale/Yammer, there were a few concerns with it, but the main one was that we didn't like the exponentially decaying histogram implementation. While that implementation is very appealing in terms of (low) memory usage, it has several misleading characteristics (a lack of incoming data points makes old measurements linger longer than they should, and there's also a fairly high possibility of losing interesting outlier data points). This makes the exp decaying implementation robust in high-throughput, fairly constant workloads, but unreliable in sparse or spiky workloads. The Tehuti implementation provides semantics that we find easier to reason about, with a small code footprint (which we consider a plus in terms of maintainability). Of course, it is still a fairly young project, so it could be improved further.'
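The lingering-measurement complaint boils down to decay being driven by arrivals rather than by the clock. A strawman illustration of that failure mode (a simplified toy, not Dropwizard's or Tehuti's actual code):

```python
class DecayOnUpdateEwma:
    """An exponentially weighted average that only decays when a new
    sample arrives.  Under heavy, steady traffic it behaves well; when
    traffic stops, nothing triggers decay, so the last busy-period
    measurement lingers indefinitely."""
    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.value = 0.0

    def record(self, sample):
        self.value = self.alpha * sample + (1 - self.alpha) * self.value

    def read(self):
        return self.value

m = DecayOnUpdateEwma()
for _ in range(50):
    m.record(100.0)  # busy period: the reading converges near 100
busy = m.read()
# ...traffic stops entirely; with no incoming data points the reading
# never decays back toward zero:
assert m.read() == busy
```

The same dynamic, applied to a decaying sample reservoir, is why quantiles from a sparse or spiky workload can reflect long-gone measurements.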

More background at the kafka-dev thread: http://mail-archives.apache.org/mod_mbox/kafka-dev/201402.mbox/%3C131A7649-ED57-45CB-B4D6-F34063267664@linkedin.com%3E

kafka
metrics
dropwizard
java
scala
jvm
timers
ewma
statistics
measurement
latency
sampling
tehuti
voldemort
linkedin
jay-kreps

october 2014 by jm

Philippe Flajolet’s contribution to streaming algorithms [preso]

november 2013 by jm

Nice deck covering HyperLogLog and its origins, plus a slide at the end covering the Flajolet/Wegman Adaptive Sampling algorithm ("how do you count the number of elements which appear only once in stream using constant size memory?")
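Adaptive sampling itself is sketchable in a few lines: keep only elements whose hash starts with at least p zero bits, raising the bar whenever the sample overflows; tracking a count per kept element answers the appears-only-once question. A sketch of the idea (my reconstruction, not the slides' code):

```python
import hashlib

def _level(x):
    """Leading zero bits in a 32-bit hash of x (fixed per element)."""
    h = int.from_bytes(
        hashlib.blake2b(str(x).encode(), digest_size=4).digest(), "big")
    return 32 - h.bit_length()

def adaptive_sample(stream, capacity=64):
    """Flajolet/Wegman-style adaptive sampling.  The kept elements are
    a uniform sample of the *distinct* elements, so scaling any
    per-sample statistic by 2**p estimates the population value --
    using only `capacity` entries of memory."""
    p = 0
    counts = {}
    for x in stream:
        if _level(x) >= p:
            counts[x] = counts.get(x, 0) + 1
            while len(counts) > capacity:
                p += 1  # raise the bar and evict non-qualifiers
                counts = {k: v for k, v in counts.items()
                          if _level(k) >= p}
    scale = 2 ** p
    distinct = len(counts) * scale
    singletons = sum(v == 1 for v in counts.values()) * scale
    return distinct, singletons
```

Because an element's level never changes, anything still in the sample has been counted since its first occurrence, so the per-element counts are exact and the singleton estimate is just the in-sample singleton count scaled up.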

algorithms
sketching
hyperloglog
flajolet
wegman
adaptive-sampling
sampling
presentations
slides
november 2013 by jm

HdrHistogram by giltene

october 2013 by jm

A Histogram that supports recording and analyzing sampled data value counts across a configurable integer value range with configurable value precision within the range. Value precision is expressed as the number of significant digits in the value recording, and provides control over value quantization behavior across the value range and the subsequent value resolution at any given level.

hdr
histogram
data-structures
coding
gil-tene
sampling
measuring
october 2013 by jm
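The "significant digits" precision model amounts to quantizing values so that relative error, not absolute error, is bounded across the whole range. A decimal toy of the idea (HdrHistogram itself buckets in powers of two; this just shows the quantization behaviour):

```python
def quantize(value, significant_digits=3):
    """Round a non-negative integer down so at most `significant_digits`
    leading digits survive: absolute resolution grows with the value,
    keeping relative error below 10**(1 - significant_digits)."""
    limit = 10 ** significant_digits
    if value < limit:
        return value  # small values are representable exactly
    scale = 1
    while value >= limit:
        value //= 10
        scale *= 10
    return value * scale

print(quantize(1_234_567))  # 1230000 -- within ~0.4% of the input
print(quantize(987))        # 987 -- exact
```

That bounded relative error is what lets one histogram cover, say, microseconds to hours of latency in a fixed-size structure.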
