
Physical separation of cyclists from traffic “crucial” to dropping injury rates, shows U.S. study
Citing a further study of differing types of cycling infrastructure in Canada, the editorial notes an 89% improvement in safety on streets with physical separation compared to streets with no such infrastructure. Unprotected cycling space was found to be 53% safer.

In 2014 there were 902 recorded cyclist fatalities in America and 35,206 serious injuries, a fatality rate of 4.7 per 100 million kilometres cycled. In the Netherlands and Denmark those rates sit at 1 and 1.1, respectively.
cycling  infrastructure  roads  safety  accidents  cars  statistics  us  canada 
10 weeks ago by jm
tdunning/t-digest
A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. The t-digest algorithm is also very parallel-friendly, making it useful in map-reduce and parallel streaming applications.

The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a data structure that is related to the Q-digest. This t-digest data structure can be used to estimate quantiles or compute other rank statistics. The advantage of the t-digest over the Q-digest is that the t-digest can handle floating point values while the Q-digest is limited to integers. With small changes, the t-digest can handle any values from any ordered set that has something akin to a mean. The accuracy of quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by Q-digests in spite of the fact that t-digests are more compact when stored on disk.


Super-nice feature is that it's mergeable, so amenable to parallel usage across multiple hosts if required. Java implementation, ASL licensing.
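A minimal sketch of the mergeable-quantiles workflow, using the third-party Python port (the `tdigest` PyPI package, assumed here; the canonical implementation is the Java library above):

```python
# pip install tdigest  -- CamDavidsonPilon's Python port, an assumption here;
# the canonical implementation is the tdunning Java library bookmarked above.
import random
from tdigest import TDigest

# Two hosts each accumulate a digest over their local observations...
shard_a, shard_b = TDigest(), TDigest()
for _ in range(100_000):
    shard_a.update(random.expovariate(1.0))
    shard_b.update(random.expovariate(1.0))

# ...and a coordinator merges them and queries global quantiles.
merged = shard_a + shard_b
print("p50 ~", merged.percentile(50))   # median of the combined stream
print("p99 ~", merged.percentile(99))   # tail quantile, where t-digest shines
```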
data-structures  algorithms  java  t-digest  statistics  quantiles  percentiles  aggregation  digests  estimation  ranking 
december 2016 by jm
The Fall of BIG DATA – arg min blog
Strongly agreed with this -- particularly the second of the three major failures, specifically:
Our community has developed remarkably effective tools to microtarget advertisements. But if you use ad models to deliver news, that’s propaganda. And just because we didn’t intend to spread rampant misinformation doesn’t mean we are not responsible.
big-data  analytics  data-science  statistics  us-politics  trump  data  science  propaganda  facebook  silicon-valley 
november 2016 by jm
How One 19-Year-Old Illinois Man Is Distorting National Polling Averages - The New York Times
One "outlier" voter—a 19-year old black Trump supporter—was weighted so heavily that it shifted the whole poll significantly. Stats fail
statistics  nytimes  politics  via:reddit  donald-trump  hilary-clinton  polling  panels  polls 
october 2016 by jm
MRI software bugs could upend years of research - The Register
In their paper at PNAS, they write: “the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%. These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results.”

For example, a bug that's been sitting in a package called 3dClustSim for 15 years, fixed in May 2015, produced bad results (3dClustSim is part of the AFNI suite; the others are SPM and FSL). That's not a gentle nudge that some results might be overstated: it's more like making a bonfire of thousands of scientific papers.

Further: “Our results suggest that the principal cause of the invalid cluster inferences is spatial autocorrelation functions that do not follow the assumed Gaussian shape”.

The researchers used published fMRI results, and along the way they swipe the fMRI community for their “lamentable archiving and data-sharing practices” that prevent most of the discipline's body of work being re-analysed.
fmri  science  mri  statistics  cluster-inference  autocorrelation  data  papers  medicine  false-positives  fps  neuroimaging 
july 2016 by jm
You CAN Average Percentiles
John Rauser on this oft-cited dictum of percentile usage in monitoring: when it's wrong and it's actually possible to reason with averaged percentiles, and when it breaks down.
statistics  percentiles  quantiles  john-rauser  histograms  averaging  mean  p99 
july 2016 by jm
Differential Privacy
Apple have announced they plan to use it; Google use a DP algorithm called RAPPOR in Chrome usage statistics. In summary: "novel privacy technology that allows inferring statistics about populations while preserving the privacy of individual users".
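RAPPOR's core building block is classic randomized response. A minimal sketch of that building block (not Google's full algorithm, which layers Bloom filters and permanent/instantaneous randomization on top):

```python
import random

P_TRUTH = 0.75  # probability a user reports their true bit

def report(true_bit: bool) -> bool:
    """Each user randomizes locally, so any single report is deniable."""
    if random.random() < P_TRUTH:
        return true_bit
    return random.random() < 0.5  # otherwise answer uniformly at random

def estimate_rate(reports: list) -> float:
    """Invert the noise: E[observed] = P_TRUTH*x + (1-P_TRUTH)*0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - P_TRUTH) * 0.5) / P_TRUTH

# 10,000 users, 30% of whom truly have the sensitive attribute.
truth = [random.random() < 0.3 for _ in range(10_000)]
print(estimate_rate([report(t) for t in truth]))  # ~0.30
```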
apple  privacy  anonymization  google  rappor  algorithms  sampling  populations  statistics  differential-privacy 
june 2016 by jm
The NSA’s SKYNET program may be killing thousands of innocent people
Death by Random Forest: this project is a horrible misapplication of machine learning. Truly appalling, when a false positive means death:

The NSA evaluates the SKYNET program using a subset of 100,000 randomly selected people (identified by their MSIDN/MSI pairs of their mobile phones), and a known group of seven terrorists. The NSA then trained the learning algorithm by feeding it six of the terrorists and tasking SKYNET to find the seventh. This data provides the percentages for false positives in the slide above.

"First, there are very few 'known terrorists' to use to train and test the model," Ball said. "If they are using the same records to train the model as they are using to test the model, their assessment of the fit is completely bullshit. The usual practice is to hold some of the data out of the training process so that the test includes records the model has never seen before. Without this step, their classification fit assessment is ridiculously optimistic."

The reason is that the 100,000 citizens were selected at random, while the seven terrorists are from a known cluster. Under the random selection of a tiny subset of less than 0.1 percent of the total population, the density of the social graph of the citizens is massively reduced, while the "terrorist" cluster remains strongly interconnected. Scientifically-sound statistical analysis would have required the NSA to mix the terrorists into the population set before random selection of a subset—but this is not practical due to their tiny number.

This may sound like a mere academic problem, but, Ball said, is in fact highly damaging to the quality of the results, and thus ultimately to the accuracy of the classification and assassination of people as "terrorists." A quality evaluation is especially important in this case, as the random forest method is known to overfit its training sets, producing results that are overly optimistic. The NSA's analysis thus does not provide a good indicator of the quality of the method.
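Ball's point about testing on the training set is easy to demonstrate. A sketch using scikit-learn, with purely synthetic data (nothing here relates to the actual SKYNET features):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 20))      # pure noise features
y = rng.integers(0, 2, size=1000)    # labels independent of X

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Evaluating on the training data looks spectacular even though there is
# literally no signal to learn; the held-out score tells the truth (~0.5).
print("train accuracy:", model.score(X_tr, y_tr))     # ~1.0
print("held-out accuracy:", model.score(X_te, y_te))  # ~0.5
```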
terrorism  surveillance  nsa  security  ai  machine-learning  random-forests  horror  false-positives  classification  statistics 
february 2016 by jm
The general birthday problem
Good explanation and scipy code for the birthday paradox and hash collisions
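The standard approximation is a one-liner. A sketch for collisions among n items hashed into d buckets:

```python
import math

def p_collision(n: int, d: int) -> float:
    """P(at least one collision) for n items uniform over d buckets,
    using the standard exponential approximation 1 - exp(-n(n-1)/2d)."""
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * d))

print(p_collision(23, 365))        # classic birthday paradox: ~0.50 (exact: 0.507)
print(p_collision(77_000, 2**32))  # 77k items into a 32-bit hash: ~0.50
```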
hashing  hashes  collisions  birthday-problem  birthday-paradox  coding  probability  statistics 
february 2016 by jm
The Guinness Brewer Who Revolutionized Statistics
William S. Gosset, discoverer of the Student's T-Test. Amazon should have taken note of this trick:
Upon completing his work on the t-distribution, Gosset was eager to make his work public. It was an important finding, and one he wanted to share with the wider world. The managers of Guinness were not so keen on this. They realized they had an advantage over the competition by using this method, and were not excited about relinquishing that leg up. If Gosset were to publish the paper, other breweries would be on to them. So they came to a compromise. Guinness agreed to allow Gosset to publish the finding, as long as he used a pseudonym. This way, competitors would not be able to realize that someone on Guinness’s payroll was doing such research, and figure out that the company’s scientifically enlightened approach was key to their success.
statistics  william-gosset  history  guinness  brewing  t-test  pseudonyms  dublin 
january 2016 by jm
Placebo effects are weak: regression to the mean is the main reason ineffective treatments appear to work
“Statistical regression to the mean predicts that patients selected for abnormalcy will, on the average, tend to improve. We argue that most improvements attributed to the placebo effect are actually instances of statistical regression.”
medicine  science  statistics  placebo  evidence  via:hn  regression-to-the-mean 
december 2015 by jm
Very Fast Reservoir Sampling
via Tony Finch. 'In this post I will demonstrate how to do reservoir sampling orders of magnitude faster than the traditional “naive” reservoir sampling algorithm, using a fast high-fidelity approximation to the reservoir sampling-gap distribution.'
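For reference, a sketch of the naive Algorithm R that the post speeds up; it burns one random draw per item, which is exactly the per-item cost the gap-distribution trick avoids:

```python
import random

def reservoir_sample(stream, k: int) -> list:
    """Naive Algorithm R: keeps a uniform sample of k items from a stream
    of unknown length, at the cost of one random draw per item."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randrange(i + 1)  # uniform over 0..i
            if j < k:
                reservoir[j] = item      # replace with probability k/(i+1)
            # The fast variant instead samples how many items to *skip*
            # before the next replacement, touching the RNG far less often.
    return reservoir

print(reservoir_sample(range(1_000_000), 10))
```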
statistics  reservoir-sampling  sampling  algorithms  poisson  bernoulli  performance 
december 2015 by jm
Why Percentiles Don’t Work the Way you Think
Baron Schwartz on metrics, percentiles, and aggregation. +1, although as a HN commenter noted, quantile digests are probably the better fix
performance  percentiles  quantiles  statistics  metrics  monitoring  baron-schwartz  vividcortex 
december 2015 by jm
The reusable holdout: Preserving validity in adaptive data analysis
Useful stats hack from Google: "We show how to safely reuse a holdout data set many times to validate the results of adaptively chosen analyses."
statistics  google  reusable-holdout  training  ml  machine-learning  data-analysis  holdout  corpus  sampling 
august 2015 by jm
Dublin Bike Theft Survey Results
Dublin Cycling Campaign's survey results: estimated 20,000 bikes stolen per year in Dublin; only 1% of thefts results in a conviction
dublin  bikes  cycling  theft  crime  statistics  infographics  dcc 
may 2015 by jm
Ask the Decoder: Did I sign up for a global sleep study?
How meaningful is this corporate data science, anyway? Given the tech-savvy people in the Bay Area, Jawbone likely had a very dense sample of Jawbone wearers to draw from for its Napa earthquake analysis. That allowed it to look at proximity to the epicenter of the earthquake from location information.

Jawbone boasts its sample population of roughly “1 million Up wearers who track their sleep using Up by Jawbone.” But when looking into patterns county by county in the U.S., Jawbone states, it takes certain statistical liberties to show granularity while accounting for places where there may not be many Jawbone users.

So while Jawbone data can show us interesting things about sleep patterns across a very large population, we have to remember how selective that population is. Jawbone wearers are people who can afford a $129 wearable fitness gadget and the smartphone or computer to interact with the output from the device.

Jawbone is sharing what it learns with the public, but think of all the public health interests or other third parties that might be interested in other research questions from a large scale data set. Yet this data is not collected with scientific processes and controls and is not treated with the rigor and scrutiny that a scientific study requires.

Jawbone and other fitness trackers don’t give us the option to use their devices while opting out of contributing to the anonymous data sets they publish. Maybe that ought to change.
jawbone  privacy  data-protection  anonymization  aggregation  data  medicine  health  earthquakes  statistics  iot  wearables 
march 2015 by jm
Schneier on Security: Why Data Mining Won't Stop Terror
A good reference URL to cut-and-paste when "scanning internet traffic for terrorist plots" rears its head:
This unrealistically accurate system will generate 1 billion false alarms for every real terrorist plot it uncovers. Every day of every year, the police will have to investigate 27 million potential plots in order to find the one real terrorist plot per month. Raise that false-positive accuracy to an absurd 99.9999 percent and you're still chasing 2,750 false alarms per day -- but that will inevitably raise your false negatives, and you're going to miss some of those 10 real plots.


Also, Ben Goldacre saying the same thing: http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/
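The base-rate arithmetic behind those numbers, as a sketch using Schneier's stated assumptions (a trillion events per year, ten real plots):

```python
EVENTS_PER_YEAR = 1_000_000_000_000  # Schneier's assumption: 1 trillion events
REAL_PLOTS = 10                      # ...of which 10 are real plots

def false_alarms_per_day(specificity: float) -> float:
    """False positives generated on the overwhelmingly innocent events."""
    innocent = EVENTS_PER_YEAR - REAL_PLOTS
    return innocent * (1.0 - specificity) / 365

print(false_alarms_per_day(0.99))      # ~27 million per day
print(false_alarms_per_day(0.999999))  # still ~2,740 per day
```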
internet  scanning  filtering  specificity  statistics  data-mining  terrorism  law  nsa  gchq  false-positives  false-negatives 
january 2015 by jm
Introducing practical and robust anomaly detection in a time series
Twitter open-sources an anomaly-spotting R package:
Early detection of anomalies plays a key role in ensuring high-fidelity data is available to our own product teams and those of our data partners. This package helps us monitor spikes in user engagement on the platform surrounding holidays, major sporting events or during breaking news. Beyond surges in social engagement, exogenic factors – such as bots or spammers – may cause an anomaly in number of favorites or followers. The package can be used to find such bots or spam, as well as detect anomalies in system metrics after a new software release. We’re open-sourcing AnomalyDetection because we’d like the public community to evolve the package and learn from it as we have.
statistics  twitter  r  anomaly-detection  outliers  metrics  time-series  spikes  holt-winters 
january 2015 by jm
UncertML
a conceptual model, with accompanying XML schema, that may be used to quantify and exchange complex uncertainties in data. The interoperable model can be used to describe uncertainty in a variety of ways including:

Samples
Statistics including mean, variance, standard deviation and quantile
Probability distributions including marginal and joint distributions and mixture models
via:conor  uncertainty  statistics  xml  formats 
january 2015 by jm
'Uncertain<T>: A First-Order Type for Uncertain Data' [paper, PDF]
'Emerging applications increasingly use estimates such as sensor data (GPS), probabilistic models, machine learning, big data, and human data. Unfortunately, representing this uncertain data with discrete types (floats, integers, and booleans) encourages developers to pretend it is not probabilistic, which causes three types of uncertainty bugs. (1) Using estimates as facts ignores random error in estimates. (2) Computation compounds that error. (3) Boolean questions on probabilistic data induce false positives and negatives.

This paper introduces Uncertain<T>, a new programming language abstraction for uncertain data. We implement a Bayesian network semantics for computation and conditionals that improves program correctness. The runtime uses sampling and hypothesis tests to evaluate computation and conditionals lazily and efficiently. We illustrate with sensor and machine learning applications that Uncertain<T> improves expressiveness and accuracy.'

(via Tony Finch)
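The core idea ports to any language with closures. A toy sketch (mine, not the paper's implementation) of a sampling-based uncertain value whose computations compose and whose comparisons return evidence rather than a flat boolean:

```python
import random

class Uncertain:
    """Toy uncertain value: a distribution represented by a sampler."""
    def __init__(self, sampler):
        self.sample = sampler

    def __add__(self, other):
        # Computation composes samplers, so error propagates correctly.
        return Uncertain(lambda: self.sample() + other.sample())

    def prob_greater(self, threshold: float, n: int = 10_000) -> float:
        """A conditional yields evidence, not a false certainty."""
        return sum(self.sample() > threshold for _ in range(n)) / n

# A GPS-like speed reading: true value 10.0, noisy with sigma 4.0.
gps_speed = Uncertain(lambda: random.gauss(10.0, 4.0))
# Naive code would ask "speed > 12?" and act on one noisy sample;
# here we ask how *likely* it is, and can demand e.g. 95% confidence.
print(gps_speed.prob_greater(12.0))  # ~0.31, nowhere near certain
```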
via:fanf  uncertainty  estimation  types  strong-typing  coding  probability  statistics  machine-learning  sampling 
december 2014 by jm
Life expectancy increases are due mainly to healthier children, not longer old age
Interesting -- I hadn't expected this.

'Life expectancy at birth [in the US] in 1930 was indeed only 58 for men and 62 for women, and the retirement age was 65. But life expectancy at birth in the early decades of the 20th century was low due mainly to high infant mortality, and someone who died as a child would never have worked and paid into Social Security. A more appropriate measure is probably life expectancy after attainment of adulthood.' .... 'Men who attained age 65 could expect to collect Social Security benefits for almost 13 years (and the numbers are even higher for women).'

In Ireland, life expectancy at birth has increased 18.4 years since 1926 -- but life expectancy for men aged 65 (the pension age) has only increased by 3.8 years. This means that increased life expectancy figures are not particularly relevant to the "pension crunch" story.

Via Fred Logue: https://twitter.com/fplogue/status/532093184646873089
via:fplogue  statistics  taxes  life-expectancy  pensions  infant-mortality  health  1930s 
november 2014 by jm
FelixGV/tehuti
Felix says:

'Like I said, I'd like to move it to a more general / non-personal repo in the future, but haven't had the time yet. Anyway, you can still browse the code there for now. It is not a big code base so not that hard to wrap one's mind around it.

It is Apache licensed and both Kafka and Voldemort are using it so I would say it is pretty self-contained (although Kafka has not moved to Tehuti proper, it is essentially the same code they're using, minus a few small fixes missing that we added).

Tehuti is a bit lower level than CodaHale (i.e.: you need to choose exactly which stats you want to measure and the boundaries of your histograms), but this is the type of stuff you would build a wrapper for and then re-use within your code base. For example: the Voldemort RequestCounter class.'
asl2  apache  open-source  tehuti  metrics  percentiles  quantiles  statistics  measurement  latency  kafka  voldemort  linkedin 
october 2014 by jm
Tehuti
An embryonic metrics library for Java/Scala from Felix GV at LinkedIn, extracted from Kafka's metric implementation and in the new Voldemort release. It fixes the major known problems with the Meter/Timer implementations in Coda-Hale/Dropwizard/Yammer Metrics.

'Regarding Tehuti: it has been extracted from Kafka's metric implementation. The code was originally written by Jay Kreps, and then maintained and improved by some Kafka and Voldemort devs, so it definitely is not the work of just one person. It is in my repo at the moment but I'd like to put it in a more generally available (git and maven) repo in the future. I just haven't had the time yet...

As for comparing with CodaHale/Yammer, there were a few concerns with it, but the main one was that we didn't like the exponentially decaying histogram implementation. While that implementation is very appealing in terms of (low) memory usage, it has several misleading characteristics (a lack of incoming data points makes old measurements linger longer than they should, and there's also a fairly high possibility of losing interesting outlier data points). This makes the exp decaying implementation robust in high-throughput, fairly constant workloads, but unreliable in sparse or spiky workloads. The Tehuti implementation provides semantics that we find easier to reason with and with a small code footprint (which we consider a plus in terms of maintainability). Of course, it is still a fairly young project, so it could be improved further.'

More background at the kafka-dev thread: http://mail-archives.apache.org/mod_mbox/kafka-dev/201402.mbox/%3C131A7649-ED57-45CB-B4D6-F34063267664@linkedin.com%3E
kafka  metrics  dropwizard  java  scala  jvm  timers  ewma  statistics  measurement  latency  sampling  tehuti  voldemort  linkedin  jay-kreps 
october 2014 by jm
tinystat - GoDoc
tinystat is used to compare two or more sets of measurements (e.g., multiple runs of benchmarks of two possible implementations) and determine if they are statistically different, using Student's t-test. It's inspired largely by FreeBSD's ministat (written by Poul-Henning Kamp).
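The equivalent check in a few lines of Python via scipy's Welch t-test (tinystat itself is Go; this is just the same statistical test, not its CLI):

```python
from scipy import stats

# Benchmark timings (ms) from two implementations -- made-up numbers.
old = [102.1, 99.8, 101.4, 100.9, 103.2, 100.1, 101.7]
new = [ 97.3, 98.1,  96.8,  99.0,  97.9,  98.4,  97.1]

t, p = stats.ttest_ind(old, new, equal_var=False)  # Welch's t-test
if p < 0.05:
    print(f"statistically different (p={p:.4f})")
else:
    print(f"no significant difference (p={p:.4f})")
```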
t-test  student  statistics  go  coda-hale  tinystat  stats  tools  command-line  unix 
september 2014 by jm
CausalImpact: A new open-source package for estimating causal effects in time series
How can we measure the number of additional clicks or sales that an AdWords campaign generated? How can we estimate the impact of a new feature on app downloads? How do we compare the effectiveness of publicity across countries?

In principle, all of these questions can be answered through causal inference.

In practice, estimating a causal effect accurately is hard, especially when a randomised experiment is not available. One approach we've been developing at Google is based on Bayesian structural time-series models. We use these models to construct a synthetic control — what would have happened to our outcome metric in the absence of the intervention. This approach makes it possible to estimate the causal effect that can be attributed to the intervention, as well as its evolution over time.

We've been testing and applying structural time-series models for some time at Google. For example, we've used them to better understand the effectiveness of advertising campaigns and work out their return on investment. We've also applied the models to settings where a randomised experiment was available, to check how similar our effect estimates would have been without an experimental control.

Today, we're excited to announce the release of CausalImpact, an open-source R package that makes causal analyses simple and fast. With its release, all of our advertisers and users will be able to use the same powerful methods for estimating causal effects that we've been using ourselves.

Our main motivation behind creating the package has been to find a better way of measuring the impact of ad campaigns on outcomes. However, the CausalImpact package could be used for many other applications involving causal inference. Examples include problems found in economics, epidemiology, or the political and social sciences.
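A drastically simplified sketch of the synthetic-control idea, using a plain OLS counterfactual instead of the package's Bayesian structural time-series model:

```python
import numpy as np

# Synthetic data: a control market, and a test market that gets an
# intervention (e.g. a campaign) at t=70 worth about +10 units.
rng = np.random.default_rng(0)
control = 50 + np.cumsum(rng.normal(0, 1, 100))
test = 1.2 * control + rng.normal(0, 1, 100)
test[70:] += 10  # the causal effect we want to recover

# Fit the pre-period relationship, then project it forward as the
# counterfactual: what would test have done without the intervention?
slope, intercept = np.polyfit(control[:70], test[:70], 1)
counterfactual = slope * control[70:] + intercept
effect = test[70:] - counterfactual
print("estimated lift:", effect.mean())  # ~10
```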
causal-inference  r  google  time-series  models  bayes  adwords  advertising  statistics  estimation  metrics 
september 2014 by jm
Punished for Being Poor: Big Data in the Justice System
This is awful. Totally the wrong tool for the job -- a false positive rate which is minuscule for something like spam filtering could translate to a really horrible outcome for a human life.
Currently, over 20 states use data-crunching risk-assessment programs for sentencing decisions, usually consisting of proprietary software whose exact methods are unknown, to determine which individuals are most likely to re-offend. The Senate and House are also considering similar tools for federal sentencing. These data programs look at a variety of factors, many of them relatively static, like criminal and employment history, age, gender, education, finances, family background, and residence. Indiana, for example, uses the LSI-R, the legality of which was upheld by the state’s supreme court in 2010. Other states use a model called COMPAS, which uses many of the same variables as LSI-R and even includes high school grades. Others are currently considering the practice as a way to reduce the number of inmates and ensure public safety. (Many more states use or endorse similar assessments when sentencing sex offenders, and the programs have been used in parole hearings for years.) Even the American Law Institute has embraced the practice, adding it to the Model Penal Code, attesting to the tool’s legitimacy.



(via stroan)
via:stroan  statistics  false-positives  big-data  law  law-enforcement  penal-code  risk  sentencing 
august 2014 by jm
Monitoring Reactive Applications with Kamon
"quality monitoring tools for apps built in Akka, Spray and Play!". Uses Gil Tene's HDRHistogram and dropwizard Metrics under the hood.
metrics  dropwizard  hdrhistogram  gil-tene  kamon  akka  spray  play  reactive  statistics  java  scala  percentiles  latency 
may 2014 by jm
Daylight saving time linked to heart attacks, study finds
Switching over to daylight saving time, and losing one hour of sleep, raised the risk of having a heart attack the following Monday by 25 per cent, compared to other Mondays during the year, according to a new US study released today. [...] The study found that heart attack risk fell 21 per cent later in the year, on the Tuesday after the clock was returned to standard time, and people got an extra hour’s sleep.

One clear answer: we need 25-hour days.

More details: http://www.sciencedaily.com/releases/2014/03/140329175108.htm --
Researchers used Michigan's BMC2 database, which collects data from all non-federal hospitals across the state, to identify admissions for heart attacks requiring percutaneous coronary intervention from Jan. 1, 2010 through Sept. 15, 2013. A total of 42,060 hospital admissions occurring over 1,354 days were included in the analysis. Total daily admissions were adjusted for seasonal and weekday variation, as the rate of heart attacks peaks in the winter and is lowest in the summer and is also greater on Mondays and lower over the weekend. The hospitals included in this study admit an average of 32 patients having a heart attack on any given Monday. But on the Monday immediately after springing ahead there were on average an additional eight heart attacks. There was no difference in the total weekly number of percutaneous coronary interventions performed for either the fall or spring time changes compared to the weeks before and after the time change.
daylight  dst  daylight-savings  time  dates  calendar  science  health  heart-attacks  michigan  hospitals  statistics 
march 2014 by jm
Analyzing Citibike Usage
Abe Stanway crunches the stats on Citibike usage in NYC, compared to the weather data from Wunderground.
data  correlation  statistics  citibike  cycling  nyc  data-science  weather 
march 2014 by jm
How the search for flight AF447 used Bayesian inference
Via jgc, the search for the downed Air France flight was optimized using this technique:

'Metron’s approach to this search planning problem is rooted in classical Bayesian inference, which allows organization of available data with associated uncertainties and computation of the Probability Distribution Function (PDF) for target location given these data. In following this approach, the first step was to gather the available information about the location of the impact site of the aircraft. This information was sometimes contradictory and filled with ambiguities and uncertainties. Using a Bayesian approach we organized this material into consistent scenarios, quantified the uncertainties with probability distributions, weighted the relative likelihood of each scenario, and performed a simulation to produce a prior PDF for the location of the wreck.'
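The update at the heart of Bayesian search theory is one line: an unsuccessful search scales the searched cell's probability by one minus its detection probability, then renormalises. A sketch:

```python
import numpy as np

# Prior PDF over a 10x10 search grid (from the weighted scenarios).
prior = np.random.default_rng(1).random((10, 10))
prior /= prior.sum()

def search_failed(pdf, cell, p_detect):
    """Bayes update after searching `cell` and finding nothing:
    posterior(cell) is proportional to prior(cell) * (1 - p_detect)."""
    post = pdf.copy()
    post[cell] *= (1 - p_detect)
    return post / post.sum()

pdf = prior
best = np.unravel_index(pdf.argmax(), pdf.shape)  # search the likeliest cell
pdf = search_failed(pdf, best, p_detect=0.9)
# Mass shifts away from the searched cell and the plan is re-ranked; this
# is how repeated unsuccessful searches reshaped the AF447 PDF over time.
print(pdf[best], prior[best])
```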
metron  bayes  bayesian-inference  machine-learning  statistics  via:jgc  air-france  disasters  probability  inference  searching 
march 2014 by jm
Sacked Google worker says staff ratings fixed to fit template
Allegations of fixing to fit the stack-ranking curve: 'someone at Google always had to get a low score “of 2.9”, so the unit could match the bell curve. She said senior staff “calibrated” the ratings supplied by line managers to ensure conformity with the template and these calibrations could reduce a line manager’s assessment of an employee, in effect giving them the poisoned score of less than three.'
stack-ranking  google  ireland  employment  work  bell-curve  statistics  eric-schmidt 
march 2014 by jm
"A data scientist is a ..."
"A data scientist is a statistician who lives in San Francisco" - slide from Monkigras this year. lols
data-scientist  statistics  statistician  funny  jokes  san-francisco  tech  monkigras 
february 2014 by jm
Nassim Taleb: retire Standard Deviation
'Use the mean absolute deviation [...] it corresponds to "real life" much better than the first—and to reality. In fact, whenever people make decisions after being supplied with the standard deviation number, they act as if it were the expected mean deviation.'

Graydon Hoare in turn recommends the median absolute deviation. I prefer percentiles, anyway ;)
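A sketch comparing the three measures on a sample with one outlier, showing how the squared term makes standard deviation the most outlier-sensitive:

```python
import numpy as np

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.1, 55.0])  # one outlier

std  = data.std()                                  # standard deviation
mad  = np.abs(data - data.mean()).mean()           # Taleb: mean absolute deviation
mmad = np.median(np.abs(data - np.median(data)))   # Hoare: median absolute deviation

print(f"std={std:.2f}  mean-abs-dev={mad:.2f}  median-abs-dev={mmad:.2f}")
# std ~14, mean-abs-dev ~8.9, median-abs-dev 0.1: the squaring makes the
# standard deviation by far the most sensitive to the single outlier.
```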
statistics  standard-deviation  stddev  maths  nassim-taleb  deviation  volatility  rmse  distributions 
january 2014 by jm
Statsite
A C reimplementation of Etsy's statsd, with some interesting memory optimizations.
Statsite is designed to be both highly performant, and very flexible. To achieve this, it implements the stats collection and aggregation in pure C, using libev to be extremely fast. This allows it to handle hundreds of connections, and millions of metrics. After each flush interval expires, statsite performs a fork/exec to start a new stream handler invoking a specified application. Statsite then streams the aggregated metrics over stdin to the application, which is free to handle the metrics as it sees fit. This allows statsite to aggregate metrics and then ship metrics to any number of sinks (Graphite, SQL databases, etc). There is an included Python script that ships metrics to graphite.
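A sink is just a program reading aggregated metrics on stdin at each flush. A toy sketch, assuming statsite's documented key|value|timestamp line format (treat the format as an assumption):

```python
#!/usr/bin/env python3
# Toy statsite sink: statsite fork/execs this at each flush interval and
# streams aggregated metrics on stdin, one per line. The key|value|timestamp
# layout is per the statsite docs -- an assumption here, not verified.
import sys

for line in sys.stdin:
    try:
        key, value, timestamp = line.strip().split("|")
    except ValueError:
        continue  # skip malformed lines rather than dying mid-flush
    # Ship to graphite, a SQL database, etc.; here we just echo.
    print(f"{timestamp} {key} = {value}")
```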
statsd  graphite  statsite  performance  statistics  service-metrics  metrics  ops 
november 2013 by jm
"Effective Computation of Biased Quantiles over Data Streams" [paper]

Skew is prevalent in many data sources such as IP traffic streams. To continually summarize the distribution of such data, a high-biased set of quantiles (e.g., 50th, 90th and 99th percentiles) with finer error guarantees at higher ranks (e.g., errors of 5, 1 and 0.1 percent, respectively) is more useful than uniformly distributed quantiles (e.g., 25th, 50th and 75th percentiles) with uniform error guarantees. In this paper, we address the following two problems. First, can we compute quantiles with finer error guarantees for the higher ranks of the data distribution effectively, using less space and computation time than computing all quantiles uniformly at the finest error? Second, if specific quantiles and their error bounds are requested a priori, can the necessary space usage and computation time be reduced? We answer both questions in the affirmative by formalizing them as the “high-biased” quantiles and the “targeted” quantiles problems, respectively, and presenting algorithms with provable guarantees, that perform significantly better than previously known solutions for these problems. We implemented our algorithms in the Gigascope data stream management system, and evaluated alternate approaches for maintaining the relevant summary structures. Our experimental results on real and synthetic IP data streams complement our theoretical analyses, and highlight the importance of lightweight, non-blocking implementations when maintaining summary structures over high-speed data streams.


Implemented as a timer-histogram storage system in http://armon.github.io/statsite/ .
statistics  quantiles  percentiles  stream-processing  skew  papers  histograms  latency  algorithms 
november 2013 by jm
_Availability in Globally Distributed Storage Systems_ [pdf]
empirical BigTable and GFS failure numbers from Google are orders of magnitude higher than naïve independent-failure models. (via kragen)
via:kragen  failure  bigtable  gfs  statistics  outages  reliability 
september 2013 by jm
Fat Tails
Nice d3.js demo of the fat-tailed distribution:
A fat-tailed distribution looks normal but the parts far away from the average are thicker, meaning a higher chance of huge deviations. [...] Fat tails don't mean more variance; just different variance. For a given variance, a higher chance of extreme deviations implies a lower chance of medium ones.
dataviz  via:hn  statistics  visualization  distributions  fat-tailed  kurtosis  d3.js  javascript  variance  deviation 
july 2013 by jm
Boundary's Early Warnings alarm
Anomaly detection on network throughput metrics, alarming if throughputs on selected flows deviate by 1, 2, or 3 standard deviations from a historical baseline.
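The underlying check is simple. A sketch of a k-sigma deviation alarm against a historical baseline:

```python
import statistics

def sigma_alarm(baseline: list, observed: float) -> int:
    """Return 0-3: how many standard deviations `observed` sits from the
    historical baseline mean (capped at 3, like escalating alert levels)."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    deviations = abs(observed - mu) / sigma
    return min(int(deviations), 3)

history = [120.0, 118.5, 121.2, 119.8, 120.5, 122.1, 119.0]  # Mbps, say
print(sigma_alarm(history, 120.9))  # 0: within normal variation
print(sigma_alarm(history, 124.0))  # 3: several sigma out, highest alert
```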
network-monitoring  throughput  boundary  service-metrics  alarming  ops  statistics 
june 2013 by jm
Not the ‘best in the world’ - The Medical Independent
Debunking this prolife talking point:
'Our maternity services are amongst the best in the world’. This phrase has been much hackneyed since the heartbreaking death of Savita Halappanavar was revealed in mid October. James Reilly and other senior politicians are particularly guilty of citing this inaccurate position. So what is the state of Irish maternity services and how do our figures compare with other comparable countries? Let’s start with the statistics.


The bottom line:
Eight deaths per 100,000 is not bad, but it ranks our maternity services far from the best in the world and below countries such as Slovakia and Poland.
pro-choice  ireland  savita  medicine  health  maternity  morbidity  statistics 
april 2013 by jm
good blog post on histogram-estimation stream processing algorithms
After reviewing several dozen papers, a score or so in depth, I identified two data structures that appear to enable us to answer these recency and frequency queries: exponential histograms (from "Maintaining Stream Statistics Over Sliding Windows" by Datar et al.) and waves (from "Distributed Streams Algorithms for Sliding Windows" by Gibbons and Tirthapura). Both of these data structures are used to solve the so-called counting problem, the problem of determining, with a bound on the relative error, the number of 1s in the last N units of time. In other words, the data structures are able to answer the question: how many 1s appeared in the last n units of time within a factor of Error (e.g., 50%). The algorithms are neat, so I'll present them briefly.
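A compact toy sketch of the exponential-histogram idea from Datar et al.: keep buckets of exponentially growing sizes, merge the two oldest buckets of a size once too many share that size, and accept bounded error on the single oldest bucket:

```python
from collections import deque

class ExpHistogram:
    """Toy exponential histogram (after Datar et al.): approximate count of
    1s in the last `window` time units; k trades space for accuracy."""
    def __init__(self, window: int, k: int = 2):
        self.window, self.k = window, k
        self.buckets = deque()  # (timestamp, size), newest at the left

    def add(self, t: int):
        # Drop buckets whose newest element has left the sliding window.
        while self.buckets and self.buckets[-1][0] <= t - self.window:
            self.buckets.pop()
        self.buckets.appendleft((t, 1))
        # Allow at most k+1 buckets per size; merge the two oldest of a
        # size into one of double size, cascading upward.
        size = 1
        while True:
            same = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(same) <= self.k + 1:
                break
            i, j = same[-2], same[-1]        # two oldest buckets of this size
            self.buckets[j] = (self.buckets[i][0], size * 2)  # keep newer ts
            del self.buckets[i]
            size *= 2

    def count(self) -> float:
        if not self.buckets:
            return 0.0
        total = sum(s for _, s in self.buckets)
        return total - self.buckets[-1][1] / 2  # oldest bucket is half-guessed

eh = ExpHistogram(window=100, k=2)
for t in range(1000):
    eh.add(t)          # a 1 arrives every tick
print(eh.count())      # close to the true 100, in O(log N) buckets
```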
streams  streaming  stream-processing  histograms  percentiles  estimation  waves  statistics  algorithms 
february 2013 by jm
Distributed Streams Algorithms for Sliding Windows [PDF]
'Massive data sets often arise as physically distributed, parallel data streams, and it is important to estimate various aggregates and statistics on the union of these streams. This paper presents algorithms for estimating aggregate
functions over a “sliding window” of the N most recent data items in one or more streams. [...] Our results are obtained using a novel family of synopsis data structures called waves.'
waves  papers  streaming  algorithms  percentiles  histogram  distcomp  distributed  aggregation  statistics  estimation  streams 
february 2013 by jm
Cycling in Dublin City: the numbers
7.6% of the Dublin commuter population "mainly cycle". some interesting stats here
statistics  dublin  ireland  cycling  commuting  travel 
february 2013 by jm
High Scalability - Analyzing billions of credit card transactions and serving low-latency insights in the cloud
Hadoop, a batch-generated read-only Voldemort cluster, and an intriguing optimal-storage histogram bucketing algorithm:
The optimal histogram is computed using a random-restart hill climbing approximated algorithm.
The algorithm has been shown very fast and accurate: we achieved 99% accuracy compared to an exact dynamic algorithm, with a speed increase of one factor. [...] The amount of information to serve in Voldemort for one year of BBVA's credit card transactions on Spain is 270 GB. The whole processing flow would run in 11 hours on a cluster of 24 "m1.large" instances. The whole infrastructure, including the EC2 instances needed to serve the resulting data would cost approximately $3500/month.
scalability  scaling  voldemort  hadoop  batch  algorithms  histograms  statistics  bucketing  percentiles 
february 2013 by jm
Reddit’s ranking algorithms
so Reddit uses the Wilson score confidence interval approach, it turns out; more details here (via Toby diPasquale)
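The quantity Reddit sorts by is the lower bound of the Wilson score interval. A sketch:

```python
import math

def wilson_lower_bound(ups: int, downs: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score 95% confidence interval for the
    true upvote fraction; rewards both a high ratio and a large sample."""
    n = ups + downs
    if n == 0:
        return 0.0
    p = ups / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre - spread) / denom

# 60% positive with few votes ranks below 55% positive with many votes:
print(wilson_lower_bound(6, 4))      # ~0.31
print(wilson_lower_bound(550, 450))  # ~0.52
```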
ranking  rating  algorithms  popularity  python  wilson-score-interval  sorting  statistics  confidence-sort 
january 2013 by jm
Dan McKinley :: Whom the Gods Would Destroy, They First Give Real-time Analytics
'It's important to divorce the concepts of operational metrics and product analytics. [..] Funny business with timeframes can coerce most A/B tests into statistical significance.' 'The truth is that there are very few product decisions that can be made in real time.'

HN discussion: http://news.ycombinator.com/item?id=5032588
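The 'funny business with timeframes' is easy to reproduce: run an A/A test (no real difference between arms), peek repeatedly, and stop at the first significant-looking moment. A simulation sketch:

```python
import math
import random

def two_prop_pvalue(a: int, na: int, b: int, nb: int) -> float:
    """Two-sided two-proportion z-test p-value (normal approximation)."""
    p = (a + b) / (na + nb)
    se = math.sqrt(p * (1 - p) * (1 / na + 1 / nb))
    if se == 0:
        return 1.0
    z = abs(a / na - b / nb) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def aa_test_with_peeking(n_looks: int = 20, per_look: int = 500) -> bool:
    """A/A test: both arms convert at an identical 5%. Peek after every
    batch and declare victory at the first p < 0.05 (a false positive)."""
    a = b = n = 0
    for _ in range(n_looks):
        a += sum(random.random() < 0.05 for _ in range(per_look))
        b += sum(random.random() < 0.05 for _ in range(per_look))
        n += per_look
        if two_prop_pvalue(a, n, b, n) < 0.05:
            return True
    return False

runs = 500
hits = sum(aa_test_with_peeking() for _ in range(runs))
print(f"false positive rate with peeking: {hits / runs:.0%}")  # well above 5%
```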
real-time  analytics  statistics  a-b-testing 
january 2013 by jm
English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU
Amazing how consistent the n-gram counts are between Peter Norvig's analysis (here) of the 20120701 Google Books corpus and Mark Mayzner's 20,000-word corpus from the early 1960s
english  statistics  n-grams  words  etaoin-shrdlu  peter-norvig  mark-mayzner 
january 2013 by jm
Dan McKinley :: Effective Web Experimentation as a Homo Narrans
Good demo from Etsy's A/B testing of how the human brain can retrofit a story onto statistically-insignificant results. To fix: 'avoid building tooling that enables fishing expeditions; limit our post-hoc rationalization by explicitly constraining it before the experiment. Whenever we test a feature on Etsy, we begin the process by identifying metrics that we believe will change if we 1) understand what is happening and 2) get the effect we desire.'
testing  etsy  statistics  a-b-testing  fishing  ulysses-contract  brain  experiments 
january 2013 by jm
Universal properties of mythological networks - Abstract - EPL (Europhysics Letters) - IOPscience
Abstract:

As in statistical physics, the concept of universality plays an important, albeit qualitative, role in the field of comparative mythology. Here we apply statistical mechanical tools to analyse the networks underlying three iconic mythological narratives with a view to identifying common and distinguishing quantitative features. Of the three narratives, an Anglo-Saxon and a Greek text are mostly believed by antiquarians to be partly historically based while the third, an Irish epic [jm: "An Táin Bó Cúailnge", The Tain, to be specific], is often considered to be fictional. Here we use network analysis in an attempt to discriminate real from imaginary social networks and place mythological narratives on the spectrum between them. This suggests that the perceived artificiality of the Irish narrative can be traced back to anomalous features associated with six characters. Speculating that these are amalgams of several entities or proxies, renders the plausibility of the Irish text comparable to the others from a network-theoretic point of view.


Here's what the Irish Times said:

The society in the 1st century story of the Táin Bó Cúailnge looked artificial at first analysis of the networks between 404 characters in the story. However, the researchers found the society reflected real rather than fictional networks when the weakest links to six of the characters are removed.

These six characters included Medb, Queen of Connacht; Conchobor, King of Ulster and Cúchulainn. They were "similar to superheroes of the Marvel universe" and are "too superhuman" or too well-connected to be real, researchers said. The researchers suggest that each of these superhuman characters may be an amalgam of many which became fused and exaggerated as the story was passed down orally through generations.
networks  society  the-tain  epics  history  mythology  ireland  statistics  network-analysis  papers 
july 2012 by jm
Determining response times with tcprstat
'Tcprstat is a free, open-source TCP analysis tool that watches network traffic and computes the delay between requests and responses. From this it derives response-time statistics and prints them out.' Computes percentiles, too
tcp  tcprstat  tcp-ip  networking  measurement  statistics  performance  instrumentation  linux  unix  tools  cli 
november 2011 by jm
Bayes' theorem ruled inadmissible in UK law courts
Bayes' theorem, and 'similar statistical analysis', ruled inadmissible in UK law courts (via Tony Finch)
uk  law  guardian  via:fanf  bayes  maths  statistics  legal 
october 2011 by jm
Microsoft's new IE "Ribbon" debunked
'nobody — almost literally 0% of users — uses the menu bar, and only 10% of users use the command bar. Nearly everybody is using the context menu or hotkeys. So the solution, obviously, is to make both the menu bar and the command bar bigger and more prominent. Right?
Microsoft UI has officially entered the realm of self-parody.' (via Nelson)
design  hci  microsoft  ui  statistics  user-hostile  ribbon  windows 
august 2011 by jm
bump2babe - The Consumer Guide to Maternity Services in Ireland
wow, they've done a really good job on the statistics collation here
statistics  birth  childbirth  ireland  health  maternity 
june 2011 by jm
Dublin bikes revisited
Fantastic comparative number crunching on the JC Decaux Dublin Bikes scheme, compared to their other European cities (Brussels, Lyons, Paris, Seville), times of day, busiest stations, rainfall, etc.
bikes  dublin-bikes  cycling  dublin  ireland  jc-decaux  number-crunching  analysis  statistics  from delicious
february 2011 by jm
Wired: how a Toronto statistician cracked the state lottery
'The tic-tac-toe lottery was seriously flawed. It took a few hours of studying his tickets and some statistical sleuthing, but he discovered a defect in the game: The visible numbers turned out to reveal essential information about the digits hidden under the latex coating. Nothing needed to be scratched off—the ticket could be cracked if you knew the secret code.'
toronto  hacks  money  statistics  probability  wired  tic-tac-toe  singleton  from delicious
february 2011 by jm
Eirgrid System Demand
'The system demand displayed here represents the electricity production required to meet [Irish] national electricity consumption, including system losses, but net of generators' requirements. It includes power imported via the interconnector and an estimate of the power produced by wind generators, but excludes some non-centrally monitored generation (i.e. small scale CHP).' via Juan Flynn
via:juanflynn  eirgrid  national-grid  ireland  power  charts  statistics  from delicious
august 2010 by jm
On The Record » Guest post – 500 Words of Summer – Mumblin’ Deaf Ro
Dublin-based musician Ronan Hession argues that the illegal-downloading bogeyman is vapour with a bunch of persuasive stats
mumblin-deaf-ro  statistics  music  irma  music-industry  piracy  filesharing  from delicious
august 2010 by jm
SEO Is Mostly Quack Science
'There is no hypothesis being tested here. It's just graphs, and misleading graphs at that. The sad part is, SEOMoz is as close as the SEO industry comes to real science. They may be presenting specious results in hopes of looking like they know what they're talking about, but at least they are collecting some sort of data. Everything else in the field is either anecdotal hocus-pocus or a decree from Matt Cutts. When you hire an SEO consultant, what you are really paying for is domain experience in the not-failing-at-web-design field.'
seo  ted-dziuba  rants  science  seomoz  quality  correlation  statistics  google  from delicious
june 2010 by jm
PeteSearch: How to split up the US
wow. fascinating results from social-network cluster analysis of Facebook, splitting up the entire USA into 7 clusters
clusters  facebook  data  statistics  maps  culture  analytics  datamining  demographics  socialnetworking  graph  dataviz  from delicious
february 2010 by jm
glTail.rb - realtime logfile visualization
'View real-time data and statistics from any logfile on any server with SSH, in an intuitive and entertaining way', supporting postfix/spamd/clamd logs among loads of others. very cool if a little silly
dataviz  visualization  tail  gltail  opengl  linux  apache  spamd  spamassassin  logs  statistics  sysadmin  analytics  animation  analysis  server  ruby  monitoring  logging  logfiles 
july 2009 by jm
