jm + sampling   8

Differential Privacy
Apple have announced they plan to use it; Google use a DP algorithm called RAPPOR in Chrome usage statistics. In summary: "novel privacy technology that allows inferring statistics about populations while preserving the privacy of individual users".
apple  privacy  anonymization  google  rappor  algorithms  sampling  populations  statistics  differential-privacy 
june 2016 by jm
Very Fast Reservoir Sampling
via Tony Finch. 'In this post I will demonstrate how to do reservoir sampling orders of magnitude faster than the traditional “naive” reservoir sampling algorithm, using a fast high-fidelity approximation to the reservoir sampling-gap distribution.'
statistics  reservoir-sampling  sampling  algorithms  poisson  bernoulli  performance 
december 2015 by jm
The reusable holdout: Preserving validity in adaptive data analysis
Useful stats hack from Google: "We show how to safely reuse a holdout data set many times to validate the results of adaptively chosen analyses."
statistics  google  reusable-holdout  training  ml  machine-learning  data-analysis  holdout  corpus  sampling 
august 2015 by jm
"last seen" sketch
a new sketch algorithm from Baron Schwartz and Preetam Jinka of VividCortex; similar to Count-Min but with last-seen timestamp instead of frequency.
sketch  algorithms  estimation  approximation  sampling  streams  big-data 
july 2015 by jm
'Uncertain<T>: A First-Order Type for Uncertain Data' [paper, PDF]
'Emerging applications increasingly use estimates such as sensor
data (GPS), probabilistic models, machine learning, big
data, and human data. Unfortunately, representing this uncertain
data with discrete types (floats, integers, and booleans)
encourages developers to pretend it is not probabilistic, which
causes three types of uncertainty bugs. (1) Using estimates
as facts ignores random error in estimates. (2) Computation
compounds that error. (3) Boolean questions on probabilistic
data induce false positives and negatives.
This paper introduces Uncertain<T>, a new programming
language abstraction for uncertain data. We implement a
Bayesian network semantics for computation and conditionals
that improves program correctness. The runtime uses sampling
and hypothesis tests to evaluate computation and conditionals
lazily and efficiently. We illustrate with sensor and
machine learning applications that Uncertain<T> improves
expressiveness and accuracy.'

(via Tony Finch)
via:fanf  uncertainty  estimation  types  strong-typing  coding  probability  statistics  machine-learning  sampling 
december 2014 by jm
An embryonic metrics library for Java/Scala from Felix GV at LinkedIn, extracted from Kafka's metric implementation and in the new Voldemort release. It fixes the major known problems with the Meter/Timer implementations in Coda-Hale/Dropwizard/Yammer Metrics.

'Regarding Tehuti: it has been extracted from Kafka's metric implementation. The code was originally written by Jay Kreps, and then maintained improved by some Kafka and Voldemort devs, so it definitely is not the work of just one person. It is in my repo at the moment but I'd like to put it in a more generally available (git and maven) repo in the future. I just haven't had the time yet...

As for comparing with CodaHale/Yammer, there were a few concerns with it, but the main one was that we didn't like the exponentially decaying histogram implementation. While that implementation is very appealing in terms of (low) memory usage, it has several misleading characteristics (a lack of incoming data points makes old measurements linger longer than they should, and there's also a fairly high possiblity of losing interesting outlier data points). This makes the exp decaying implementation robust in high throughput fairly constant workloads, but unreliable in sparse or spiky workloads. The Tehuti implementation provides semantics that we find easier to reason with and with a small code footprint (which we consider a plus in terms of maintainability). Of course, it is still a fairly young project, so it could be improved further.'

More background at the kafka-dev thread:
kafka  metrics  dropwizard  java  scala  jvm  timers  ewma  statistics  measurement  latency  sampling  tehuti  voldemort  linkedin  jay-kreps 
october 2014 by jm
Philippe Flajolet’s contribution to streaming algorithms [preso]
Nice deck covering HyperLogLog and its origins, plus a slide at the end covering the Flajolet/Wegman Adaptive Sampling algorithm ("how do you count the number of elements which appear only once in stream using constant size memory?")
algorithms  sketching  hyperloglog  flajolet  wegman  adaptive-sampling  sampling  presentations  slides 
november 2013 by jm
HdrHistogram by giltene
A Histogram that supports recording and analyzing sampled data value counts across a configurable integer value range with configurable value precision within the range. Value precision is expressed as the number of significant digits in the value recording, and provides control over value quantization behavior across the value range and the subsequent value resolution at any given level.
hdr  histogram  data-structures  coding  gil-tene  sampling  measuring 
october 2013 by jm

Copy this bookmark: