**jm + probability**
11

The Impenetrable Program Transforming How Courts Treat DNA Evidence | WIRED

november 2017 by jm

'So the lab turned to TrueAllele, a program sold by Cybergenetics, a small company dedicated to helping law enforcement analyze DNA where regular lab tests fail. They do it with something called probabilistic genotyping, which uses complex mathematical formulas to examine the statistical likelihood that a certain genotype comes from one individual over another. It’s a type of DNA testing that’s becoming increasingly popular in courtrooms. '

[...] 'But now legal experts, along with Johnson’s advocates, are joining forces to argue to a California court that TrueAllele—the seemingly magic software that helped law enforcement analyze the evidence that tied Johnson to the crimes—should be forced to reveal the code that sent Johnson to prison. This code, they say, is necessary in order to properly evaluate the technology. In fact, they say, justice from an unknown algorithm is no justice at all.'

law
justice
trueallele
software
dna
evidence
statistics
probability
code-review
auditing
[...] 'But now legal experts, along with Johnson’s advocates, are joining forces to argue to a California court that TrueAllele—the seemingly magic software that helped law enforcement analyze the evidence that tied Johnson to the crimes—should be forced to reveal the code that sent Johnson to prison. This code, they say, is necessary in order to properly evaluate the technology. In fact, they say, justice from an unknown algorithm is no justice at all.'

november 2017 by jm

Darts, Dice, and Coins

(via Marc Brooker)
via:marcbrooker
algorithms
probability
algorithm
coding
data-structures
alias
dice
random

april 2016 by jm

Earlier this year, I asked a question on Stack Overflow about a data structure for loaded dice. Specifically, I was interested in answering this question: "You are given an n-sided die where side i has probability pi of being rolled. What is the most efficient data structure for simulating rolls of the die?"

This data structure could be used for many purposes. For starters, you could use it to simulate rolls of a fair, six-sided die by assigning probability 1616 to each of the sides of the die, or a to simulate a fair coin by simulating a two-sided die where each side has probability 1212 of coming up. You could also use this data structure to directly simulate the total of two fair six-sided dice being thrown by having an 11-sided die (whose faces were 2, 3, 4, ..., 12), where each side was appropriately weighted with the probability that this total would show if you used two fair dice. However, you could also use this data structure to simulate loaded dice. For example, if you were playing craps with dice that you knew weren't perfectly fair, you might use the data structure to simulate many rolls of the dice to see what the optimal strategy would be. You could also consider simulating an imperfect roulette wheel in the same way.

Outside the domain of game-playing, you could also use this data structure in robotics simulations where sensors have known failure rates. For example, if a range sensor has a 95% chance of giving the right value back, a 4% chance of giving back a value that's too small, and a 1% chance of handing back a value that's too large, you could use this data structure to simulate readings from the sensor by generating a random outcome and simulating the sensor reading in that case.

The answer I received on Stack Overflow impressed me for two reasons. First, the solution pointed me at a powerful technique called the alias method that, under certain reasonable assumptions about the machine model, is capable of simulating rolls of the die in O(1)O(1) time after a simple preprocessing step. Second, and perhaps more surprisingly, this algorithm has been known for decades, but I had not once encountered it! Considering how much processing time is dedicated to simulation, I would have expected this technique to be better- known. A few quick Google searches turned up a wealth of information on the technique, but I couldn't find a single site that compiled together the intuition and explanation behind the technique.

(via Marc Brooker)

april 2016 by jm

The general birthday problem

february 2016 by jm

Good explanation and scipy code for the birthday paradox and hash collisions

hashing
hashes
collisions
birthday-problem
birthday-paradox
coding
probability
statistics
february 2016 by jm

Birthday problem calculator

december 2015 by jm

I keep having to google this, so here's a good one which works -- unlike Wolfram Alpha!

birthday
birthday-paradox
birthday-problem
hashes
hash-collision
attacks
security
collisions
calculators
probability
statistcs
december 2015 by jm

'Uncertain<T>: A First-Order Type for Uncertain Data' [paper, PDF]

december 2014 by jm

'Emerging applications increasingly use estimates such as sensor

data (GPS), probabilistic models, machine learning, big

data, and human data. Unfortunately, representing this uncertain

data with discrete types (floats, integers, and booleans)

encourages developers to pretend it is not probabilistic, which

causes three types of uncertainty bugs. (1) Using estimates

as facts ignores random error in estimates. (2) Computation

compounds that error. (3) Boolean questions on probabilistic

data induce false positives and negatives.

This paper introduces Uncertain<T>, a new programming

language abstraction for uncertain data. We implement a

Bayesian network semantics for computation and conditionals

that improves program correctness. The runtime uses sampling

and hypothesis tests to evaluate computation and conditionals

lazily and efficiently. We illustrate with sensor and

machine learning applications that Uncertain<T> improves

expressiveness and accuracy.'

(via Tony Finch)

via:fanf
uncertainty
estimation
types
strong-typing
coding
probability
statistics
machine-learning
sampling
data (GPS), probabilistic models, machine learning, big

data, and human data. Unfortunately, representing this uncertain

data with discrete types (floats, integers, and booleans)

encourages developers to pretend it is not probabilistic, which

causes three types of uncertainty bugs. (1) Using estimates

as facts ignores random error in estimates. (2) Computation

compounds that error. (3) Boolean questions on probabilistic

data induce false positives and negatives.

This paper introduces Uncertain<T>, a new programming

language abstraction for uncertain data. We implement a

Bayesian network semantics for computation and conditionals

that improves program correctness. The runtime uses sampling

and hypothesis tests to evaluate computation and conditionals

lazily and efficiently. We illustrate with sensor and

machine learning applications that Uncertain<T> improves

expressiveness and accuracy.'

(via Tony Finch)

december 2014 by jm

What's the probability of a hash collision?

november 2014 by jm

Handy calculator

probability
hashing
hashes
collision
risk
md5
sha
sha1
calculators
november 2014 by jm

Redis adds support for HyperLogLog

april 2014 by jm

good comment thread on HN, discussing hlld and bloomd as well

hll
bloom-filters
hyperloglog
redis
data-structures
estimation
cardinality
probabilistic
probability
hashing
random
april 2014 by jm

How the search for flight AF447 used Bayesian inference

march 2014 by jm

Via jgc, the search for the downed Air France flight was optimized using this technique:

'Metron’s approach to this search planning problem is rooted in classical Bayesian inference,

which allows organization of available data with associated uncertainties and computation of the

Probability Distribution Function (PDF) for target location given these data. In following this

approach, the first step was to gather the available information about the location of the impact site

of the aircraft. This information was sometimes contradictory and filled with ambiguities and

uncertainties. Using a Bayesian approach we organized this material into consistent scenarios,

quantified the uncertainties with probability distributions, weighted the relative likelihood of each

scenario, and performed a simulation to produce a prior PDF for the location of the wreck.'

metron
bayes
bayesian-inference
machine-learning
statistics
via:jgc
air-france
disasters
probability
inference
searching
'Metron’s approach to this search planning problem is rooted in classical Bayesian inference,

which allows organization of available data with associated uncertainties and computation of the

Probability Distribution Function (PDF) for target location given these data. In following this

approach, the first step was to gather the available information about the location of the impact site

of the aircraft. This information was sometimes contradictory and filled with ambiguities and

uncertainties. Using a Bayesian approach we organized this material into consistent scenarios,

quantified the uncertainties with probability distributions, weighted the relative likelihood of each

scenario, and performed a simulation to produce a prior PDF for the location of the wreck.'

march 2014 by jm

Probabilistic Data Structures for Web Analytics and Data Mining « Highly Scalable Blog

may 2012 by jm

Stream summary, count-min sketches, loglog counting, linear counters. Some nifty algorithms for probabilistic estimation of element frequencies and data-set cardinality (via proggit)

via:proggit
algorithms
probability
probabilistic
count-min
stream-summary
loglog-counting
linear-counting
estimation
big-data
may 2012 by jm

Wired: how a Toronto statistician cracked the state lottery

february 2011 by jm

'The tic-tac-toe lottery was seriously flawed. It took a few hours of studying his tickets and some statistical sleuthing, but he discovered a defect in the game: The visible numbers turned out to reveal essential information about the digits hidden under the latex coating. Nothing needed to be scratched off—the ticket could be cracked if you knew the secret code.'

toronto
hacks
money
statistics
probability
wired
tic-tac-toe
singleton
from delicious
february 2011 by jm

Colm argues against the 'sleep rand % 3600' hack

september 2009 by jm

it's not sufficiently evenly-distributed, apparently. Also: got linked from Hack The Planet!

scheduling
probability
sleep
unix
updating
cron
random
from delicious
september 2009 by jm

**related tags**

Copy this bookmark: