cshalizi + calibration   26

Empirical confidence interval calibration for population-level effect estimation studies in observational healthcare data | PNAS
"Observational healthcare data, such as electronic health records and administrative claims, offer potential to estimate effects of medical products at scale. Observational studies have often been found to be nonreproducible, however, generating conflicting results even when using the same database to answer the same question. One source of discrepancies is error, both random (caused by sampling variability) and systematic (for example, because of confounding, selection bias, and measurement error). Only random error is typically quantified, but it converges to zero as databases become larger, whereas systematic error persists independent of sample size and therefore increases in relative importance. Negative controls are exposure–outcome pairs, where one believes no causal effect exists; they can be used to detect multiple sources of systematic error, but interpreting their results is not always straightforward. Previously, we have shown that an empirical null distribution can be derived from a sample of negative controls and used to calibrate P values, accounting for both random and systematic error. Here, we extend this work to calibration of confidence intervals (CIs). CIs require positive controls, which we synthesize by modifying negative controls. We show that our CI calibration restores nominal characteristics, such as 95% coverage of the true effect size by the 95% CI. We furthermore show that CI calibration reduces disagreement in replications of two pairs of conflicting observational studies: one related to dabigatran, warfarin, and gastrointestinal bleeding and one related to selective serotonin reuptake inhibitors and upper gastrointestinal bleeding. We recommend CI calibration to improve reproducibility of observational studies."
to:NB  statistics  confidence_sets  madigan.david  calibration 
may 2018 by cshalizi
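--- The basic empirical-null idea can be sketched in a few lines. This is a toy simplification, not the authors' actual method (which fits the null by maximum likelihood, weighting each negative control by its own standard error); all numbers below are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical negative-control estimates: log hazard ratios whose true
# value is 0, contaminated by systematic error (bias 0.1, spread 0.15)
# plus sampling noise.
neg_controls = rng.normal(0.1, 0.15, size=50) + rng.normal(0, 0.05, size=50)

# Fit the empirical null: a normal distribution over the negative controls.
mu, tau = neg_controls.mean(), neg_controls.std(ddof=1)

def calibrated_p(beta, se):
    """Two-sided p-value against the empirical null, combining the
    systematic-error spread tau with the estimate's own standard error."""
    z = (beta - mu) / np.sqrt(tau**2 + se**2)
    return 2 * stats.norm.sf(abs(z))

beta_hat, se_hat = 0.25, 0.05            # a new, hypothetical effect estimate
naive_p = 2 * stats.norm.sf(abs(beta_hat / se_hat))
print(naive_p, calibrated_p(beta_hat, se_hat))
```

The calibrated p-value is far less extreme than the naive one, because the systematic-error spread estimated from the negative controls dominates the sampling standard error.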
[1709.02012v1] On Fairness and Calibration
"The machine learning community has become increasingly concerned with the potential for bias and discrimination in predictive models, and this has motivated a growing line of work on what it means for a classification procedure to be "fair." In particular, we investigate the tension between minimizing error disparity across different population groups while maintaining calibrated probability estimates. We show that calibration is compatible only with a single error constraint (i.e. equal false-negatives rates across groups), and show that any algorithm that satisfies this relaxation is no better than randomizing a percentage of predictions for an existing classifier. These unsettling findings, which extend and generalize existing results, are empirically confirmed on several datasets."
to:NB  to_read  calibration  prediction  classifiers  kleinberg.jon  via:arsyed 
september 2017 by cshalizi
Foster: Prediction in the Worst Case
"A predictor is a method of estimating the probability of future events over an infinite data sequence. One predictor is as strong as another if for all data sequences the former has at most the mean square error (MSE) of the latter. Given any countable set 𝒟 of predictors, we explicitly construct a predictor S that is at least as strong as every element of 𝒟. Finite sample bounds are also given which hold uniformly on the space of all possible data."
to:NB  individual_sequence_prediction  low-regret_learning  prediction  ensemble_methods  have_read  foster.dean_p.  statistics  calibration 
april 2016 by cshalizi
[1505.05314] Cross-calibration of probabilistic forecasts
"When providing probabilistic forecasts for uncertain future events, it is common to strive for calibrated forecasts, that is, the predictive distribution should be compatible with the observed outcomes. Several notions of calibration are available in the case of a single forecaster alongside with diagnostic tools and statistical tests to assess calibration in practice. Often, there is more than one forecaster providing predictions, and these forecasters may use information of the others and therefore influence one another. We extend common notions of calibration, where each forecaster is analysed individually, to notions of cross-calibration where each forecaster is analysed with respect to the other forecasters in a natural way. It is shown theoretically and in simulation studies that cross-calibration is a stronger requirement on a forecaster than calibration. Analogously to calibration for individual forecasters, we provide diagnostic tools and statistical tests to assess forecasters in terms of cross-calibration. The methods are illustrated in simulation examples and applied to probabilistic forecasts for inflation rates by the Bank of England."
to:NB  prediction  statistics  calibration  ensemble_methods 
june 2015 by cshalizi
Evaluation of Probabilistic Forecasts: Proper Scoring Rules and Moments
"The paper provides an overview of probabilistic forecasting and discusses a theoretical framework for evaluation of probabilistic forecasts which is based on proper scoring rules and moments. An artificial example of predicting second-order autoregression and an example of predicting the RTSI stock index are used as illustrations."
to:NB  to_read  prediction  statistics  calibration 
march 2015 by cshalizi
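--- A proper scoring rule is one under which honesty is optimal: the expected score is minimized by reporting the true probability. A minimal check with the Brier (quadratic) score, using a made-up true probability:

```python
import numpy as np

def brier(p, y):           # quadratic score for a binary outcome y in {0, 1}
    return (p - y) ** 2

def expected_brier(p, q):  # expected Brier score of forecast p when truth is q
    return q * brier(p, 1) + (1 - q) * brier(p, 0)

q = 0.3                    # true (hypothetical) probability of the event
grid = np.linspace(0.01, 0.99, 99)
scores = [expected_brier(p, q) for p in grid]
best = grid[int(np.argmin(scores))]
print(best)                # the honest forecast p = q minimizes the expected score
```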
Accuracy of forecasts in strategic intelligence
"The accuracy of 1,514 strategic intelligence forecasts abstracted from intelligence reports was assessed. The results show that both discrimination and calibration of forecasts was very good. Discrimination was better for senior (versus junior) analysts and for easier (versus harder) forecasts. Miscalibration was mainly due to underconfidence such that analysts assigned more uncertainty than needed given their high level of discrimination. Underconfidence was more pronounced for harder (versus easier) forecasts and for forecasts deemed more (versus less) important for policy decision making. Despite the observed underconfidence, there was a paucity of forecasts in the least informative 0.4–0.6 probability range. Recalibrating the forecasts substantially reduced underconfidence. The findings offer cause for tempered optimism about the accuracy of strategic intelligence forecasts and indicate that intelligence producers aim to promote informativeness while avoiding overstatement."
to:NB  prediction  to_read  decision-making  calibration  psychology  intelligence_(spying) 
july 2014 by cshalizi
Forecast aggregation via recalibration
"It is known that the average of many forecasts about a future event tends to outperform the individual assessments. With the goal of further improving forecast performance, this paper develops and compares a number of models for calibrating and aggregating forecasts that exploit the well-known fact that individuals exhibit systematic biases during judgment and elicitation. All of the models recalibrate judgments or mean judgments via a two-parameter calibration function, and differ in terms of whether (1) the calibration function is applied before or after the averaging, (2) averaging is done in probability or log-odds space, and (3) individual differences are captured via hierarchical modeling. Of the non-hierarchical models, the one that first recalibrates the individual judgments and then averages them in log-odds is the best relative to simple averaging, with 26.7 % improvement in Brier score and better performance on 86 % of the individual problems. The hierarchical version of this model does slightly better in terms of mean Brier score (28.2 %) and slightly worse in terms of individual problems (85 %)."
in_NB  calibration  prediction  ensemble_methods 
june 2014 by cshalizi
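--- A sketch of the winning non-hierarchical recipe, "recalibrate, then average in log-odds", with an invented two-parameter calibration function and hypothetical judgments; the paper fits a and b to data, here they are just plugged in (b > 1 extremizes, counteracting judges' usual conservatism):

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + np.exp(-x))

def recalibrate(p, a=0.0, b=2.0):
    """Two-parameter recalibration in log-odds space: shift a, stretch b."""
    return inv_logit(a + b * logit(p))

# Hypothetical judgments from five forecasters about one event:
judgments = np.array([0.6, 0.7, 0.65, 0.55, 0.7])

# Recalibrate each judgment, then average in log-odds space:
pooled = inv_logit(np.mean(logit(recalibrate(judgments))))
print(pooled)
```

With b = 2 the pooled forecast is pushed further from 0.5 than the simple average, which is the intended effect of the recalibration step.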
[1403.3920] Minimum scoring rule inference
"Proper scoring rules are methods for encouraging honest assessment of probability distributions. Just like likelihood, a proper scoring rule can be applied to supply an unbiased estimating equation for any statistical model, and the theory of such equations can be applied to understand the properties of the associated estimator. In this paper we develop some basic scoring rule estimation theory, and explore robustness and interval estimation properties by means of theory and simulations."
to:NB  estimation  prediction  calibration  statistics  dawid.philip 
march 2014 by cshalizi
[1401.0398] Theory and Applications of Proper Scoring Rules
"We give an overview of some uses of proper scoring rules in statistical inference, including frequentist estimation theory and Bayesian model selection with improper priors."
to:NB  statistics  prediction  calibration  dawid.philip  model_selection 
march 2014 by cshalizi
Probabilistic Forecasting - Annual Review of Statistics and Its Application, 1(1):125
"A probabilistic forecast takes the form of a predictive probability distribution over future quantities or events of interest. Probabilistic forecasting aims to maximize the sharpness of the predictive distributions, subject to calibration, on the basis of the available information set. We formalize and study notions of calibration in a prediction space setting. In practice, probabilistic calibration can be checked by examining probability integral transform (PIT) histograms. Proper scoring rules such as the logarithmic score and the continuous ranked probability score serve to assess calibration and sharpness simultaneously. As a special case, consistent scoring functions provide decision-theoretically coherent tools for evaluating point forecasts. We emphasize methodological links to parametric and nonparametric distributional regression techniques, which attempt to model and to estimate conditional distribution functions; we use the context of statistically postprocessed ensemble forecasts in numerical weather prediction as an example. Throughout, we illustrate concepts and methodologies in data examples."
to:NB  prediction  calibration  statistics  gneiting.tilmann 
january 2014 by cshalizi
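--- The PIT check mentioned in the abstract is easy to demonstrate on synthetic data: evaluate each predictive CDF at the realized outcome; if the forecaster is probabilistically calibrated, the resulting values are uniform on (0, 1). A toy comparison of a calibrated and an overdispersed forecaster:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Truth: observations are N(0, 1). The PIT is the predictive CDF at the outcome.
y = rng.normal(0, 1, size=5000)

pit_good = stats.norm.cdf(y, loc=0, scale=1)   # calibrated: PIT ~ Uniform(0, 1)
pit_wide = stats.norm.cdf(y, loc=0, scale=2)   # overdispersed: PIT piles up near 0.5

# A uniform PIT has variance 1/12; an overdispersed forecaster's PIT values
# are less variable than that (a hump-shaped PIT histogram).
print(pit_good.var(), pit_wide.var())
```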
Schefzik, Thorarinsdottir, Gneiting: Uncertainty Quantification in Complex Simulation Models Using Ensemble Copula Coupling
"Critical decisions frequently rely on high-dimensional output from complex computer simulation models that show intricate cross-variable, spatial and temporal dependence structures, with weather and climate predictions being key examples. There is a strongly increasing recognition of the need for uncertainty quantification in such settings, for which we propose and review a general multi-stage procedure called ensemble copula coupling (ECC), proceeding as follows:
"1. Generate a raw ensemble, consisting of multiple runs of the computer model that differ in the inputs or model parameters in suitable ways.
"2. Apply statistical postprocessing techniques, such as Bayesian model averaging or nonhomogeneous regression, to correct for systematic errors in the raw ensemble, to obtain calibrated and sharp predictive distributions for each univariate output variable individually.
"3. Draw a sample from each postprocessed predictive distribution.
"4. Rearrange the sampled values in the rank order structure of the raw ensemble to obtain the ECC postprocessed ensemble.
"The use of ensembles and statistical postprocessing have become routine in weather forecasting over the past decade. We show that seemingly unrelated, recent advances can be interpreted, fused and consolidated within the framework of ECC, the common thread being the adoption of the empirical copula of the raw ensemble. Depending on the use of Quantiles, Random draws or Transformations at the sampling stage, we distinguish the ECC-Q, ECC-R and ECC-T variants, respectively. We also describe relations to the Schaake shuffle and extant copula-based techniques. In a case study, the ECC approach is applied to predictions of temperature, pressure, precipitation and wind over Germany, based on the 50-member European Centre for Medium-Range Weather Forecasts (ECMWF) ensemble."
to:NB  to_read  statistics  prediction  ensemble_methods  calibration  copulas  model_checking  gneiting.tilmann  to_teach:data-mining  to_teach:undergrad-ADA 
january 2014 by cshalizi
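--- The four-step recipe above can be sketched for the ECC-Q variant. Everything here is a toy stand-in (Gaussian "model runs", a guessed postprocessed normal per variable); the point is step 4, the rank-order rearrangement that transplants the raw ensemble's empirical copula onto the postprocessed margins:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

m, d = 10, 3                                    # ensemble size, number of variables
raw = rng.normal(0, 1, size=(m, d))             # step 1: raw ensemble (toy stand-in)

# Step 2 (assumed already done): calibrated predictive normal per variable.
post = [stats.norm(loc=raw[:, j].mean(), scale=1.2) for j in range(d)]

# Step 3, ECC-Q: take equidistant quantiles from each postprocessed margin.
levels = (np.arange(m) + 0.5) / m
samples = np.column_stack([post[j].ppf(levels) for j in range(d)])

# Step 4: rearrange each column into the rank order of the raw ensemble,
# so the ECC ensemble inherits its cross-variable dependence structure.
ecc = np.empty_like(samples)
for j in range(d):
    ranks = stats.rankdata(raw[:, j], method="ordinal").astype(int) - 1
    ecc[:, j] = np.sort(samples[:, j])[ranks]
print(ecc.shape)
```

After step 4, member i is k-th smallest in variable j exactly when raw member i was k-th smallest in variable j.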
[1310.0236] Assessing the calibration of high-dimensional ensemble forecasts using rank histograms
"Any decision making process that relies on a probabilistic forecast of future events necessarily requires a calibrated forecast. This paper proposes new methods for empirically assessing forecast calibration in a multivariate setting where the probabilistic forecast is given by an ensemble of equally probable forecast scenarios. Multivariate properties are mapped to a single dimension through a pre-rank function and the calibration is subsequently assessed visually through a histogram of the ranks of the observation's pre-ranks. Average ranking assigns a pre-rank based on the average univariate rank while band depth ranking employs the concept of functional band depth where the centrality of the observation within the forecast ensemble is assessed. Several simulation examples and a case study of temperature forecast trajectories at Berlin Tegel Airport in Germany demonstrate that both multivariate ranking methods can successfully detect various sources of miscalibration and scale efficiently to high dimensional settings."
to:NB  prediction  calibration  statistics  copulas  high-dimensional_statistics 
october 2013 by cshalizi
Phys. Rev. E 86, 016213 (2012): Parameter estimation through ignorance
"Dynamical modeling lies at the heart of our understanding of physical systems. Its role in science is deeper than mere operational forecasting, in that it allows us to evaluate the adequacy of the mathematical structure of our models. Despite the importance of model parameters, there is no general method of parameter estimation outside linear systems. A relatively simple method of parameter estimation for nonlinear systems is introduced, based on variations in the accuracy of probability forecasts. It is illustrated on the logistic map, the Hénon map, and the 12-dimensional Lorenz96 flow, and its ability to outperform linear least squares in these systems is explored at various noise levels and sampling rates. As expected, it is more effective when the forecast error distributions are non-Gaussian. The method selects parameter values by minimizing a proper, local skill score for continuous probability forecasts as a function of the parameter values. This approach is easier to implement in practice than alternative nonlinear methods based on the geometry of attractors or the ability of the model to shadow the observations. Direct measures of inadequacy in the model, the “implied ignorance,” and the information deficit are introduced."
to:NB  calibration  prediction  estimation  statistical_inference_for_stochastic_processes  time_series  statistics  smith.leonard 
september 2013 by cshalizi
[1307.7650] Copula Calibration
"We propose notions of calibration for probabilistic forecasts of general multivariate quantities. Probabilistic copula calibration is a natural analogue of probabilistic calibration in the univariate setting. It can be assessed empirically by checking for the uniformity of the copula probability integral transform (CopPIT), which is invariant under coordinate permutations and coordinatewise strictly monotone transformations of the predictive distribution and the outcome. The CopPIT histogram can be interpreted as a generalization and variant of the multivariate rank histogram, which has been used to check the calibration of ensemble forecasts. Climatological copula calibration is an analogue of marginal calibration in the univariate setting. Methods and tools are illustrated in a simulation study and applied to compare raw numerical model and statistically postprocessed ensemble forecasts of bivariate wind vectors."
in_NB  calibration  prediction  copulas  gneiting.tilmann  statistics  have_read  to_teach:undergrad-ADA 
july 2013 by cshalizi
[1306.4943] Failure of Calibration is Typical
"Schervish (1985b) showed that every forecasting system is noncalibrated for uncountably many data sequences that it might see. This result is strengthened here: from a topological point of view, failure of calibration is typical and calibration rare. Meanwhile, Bayesian forecasters are certain that they are calibrated---this invites worries about the connection between Bayesianism and rationality."

--- Are _large_ failures of calibration typical? Or are these trivial violations?
calibration  prediction  bayesianism  bayesian_consistency  statistics  in_NB  blogged  have_read 
june 2013 by cshalizi
Calibration results for Bayesian model specification
"When the goal is inference about an unknown θ and prediction of future data D∗ on the basis of data D and background assumptions/judgments B, the process of Bayesian model specification involves two ingredients: the conditional probability distributions p(θ|B) and p(D|θ, B). Here we focus on specifying p(D|θ,B), and we argue that calibration considerations — paying attention to how often You get the right answer — should be an integral part of this specification process. After contrasting Bayes-factor-based and predictive model-choice criteria, we present some calibration results, in fixed- and random-effects Poisson models, relevant to addressing two of the basic questions that arise in Bayesian model specification: (Q1) Is model Mj better than Mj′? and (Q2) Is model Mj∗ good enough? In particular, we show that LSFS, a full-sample log score predictive model-choice criterion, has better small-sample model discrimination performance than either DIC or a cross-validation-style log-scoring criterion, in the simulation setting we consider; we examine the large-sample behavior of LSFS; and we (a) demonstrate that the popular posterior predictive tail-area method for answering a question related to Q2 can be poorly calibrated and (b) document the success of a method for calibrating it."
to:NB  to_read  model_selection  bayesianism  calibration  model_checking  hypothesis_testing  re:phil-of-bayes_paper  misspecification 
april 2013 by cshalizi
Conditional transformation models - Hothorn - 2013 - Journal of the Royal Statistical Society: Series B (Statistical Methodology) - Wiley Online Library
"The ultimate goal of regression analysis is to obtain information about the conditional distribution of a response given a set of explanatory variables. This goal is, however, seldom achieved because most established regression models estimate only the conditional mean as a function of the explanatory variables and assume that higher moments are not affected by the regressors. The underlying reason for such a restriction is the assumption of additivity of signal and noise. We propose to relax this common assumption in the framework of transformation models. The novel class of semiparametric regression models proposed herein allows transformation functions to depend on explanatory variables. These transformation functions are estimated by regularized optimization of scoring rules for probabilistic forecasts, e.g. the continuous ranked probability score. The corresponding estimated conditional distribution functions are consistent. Conditional transformation models are potentially useful for describing possible heteroscedasticity, comparing spatially varying distributions, identifying extreme events, deriving prediction intervals and selecting variables beyond mean regression effects. An empirical investigation based on a heteroscedastic varying-coefficient simulation model demonstrates that semiparametric estimation of conditional distribution functions can be more beneficial than kernel-based non-parametric approaches or parametric generalized additive models for location, scale and shape."
in_NB  to_read  regression  statistics  prediction  to_teach:undergrad-ADA  buhlmann.peter  density_estimation  calibration 
march 2013 by cshalizi
538's Uncertainty Estimates Are As Good As They Get - A.C. Thomas, Scientist
[take] "each prediction and its associated uncertainty, calculate the probability that the observed value (vote share) is greater than a simulated draw from this distribution. The key is that for a large number of independent prediction-uncertainty pairs, we should see a uniform distribution of p-values between 0 and 1.
"I grabbed the estimates from FiveThirtyEight and Votamatic (at this time, I have only estimates, not uncertainties, for PEC or HuffPost) and calculated the respective p-values assuming a normal distribution in each case. Media coverage suggested that Nate Silver's intervals were too conservative; if this were the case, we would expect a higher concentration of p-values around 50%. (If too anti-conservative, the p-values would be more extreme, towards 0 or 1.)
"On the contrary, the 538 distribution is nearly uniform."

- For once, I am quite certain about the to_teach tag.
to_teach:undergrad-ADA  calibration  prediction  statistics  us_politics  model_checking  silver.nate  thomas.andrew_c. 
november 2012 by cshalizi
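--- Thomas's check is reproducible on synthetic data (all numbers below are invented, not the 2012 forecasts): compute the predictive-CDF p-value of each outcome under the stated normal forecast, then test the p-values for uniformity. Honest uncertainties give a uniform histogram; too-conservative intervals pile p-values up around 0.5.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 200

mu = rng.normal(50, 5, size=n)       # hypothetical predicted vote shares
sigma = 3.0                          # stated forecast uncertainty

# Honest forecaster: outcomes really are draws from the stated distribution,
# so these p-values are Uniform(0, 1).
honest = rng.normal(mu, sigma)
p_honest = stats.norm.cdf(honest, mu, sigma)

# Too-conservative forecaster: stated sigma is twice the real spread,
# so p-values cluster around 0.5.
conservative = rng.normal(mu, sigma / 2)
p_conservative = stats.norm.cdf(conservative, mu, sigma)

# Kolmogorov–Smirnov tests against the uniform distribution:
print(stats.kstest(p_honest, "uniform").pvalue,
      stats.kstest(p_conservative, "uniform").pvalue)
```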
Assessing gross domestic product and inflation probability forecasts derived from Bank of England fan charts - Galbraith - 2011 - Journal of the Royal Statistical Society: Series A (Statistics in Society) - Wiley Online Library
"Density forecasts, including the pioneering Bank of England ‘fan charts’, are often used to produce forecast probabilities of a particular event. We use the Bank of England's forecast densities to calculate the forecast probability that annual rates of inflation and output growth exceed given thresholds. We subject these implicit probability forecasts to graphical and numerical diagnostic checks. We measure both their calibration and their resolution, providing both statistical and graphical interpretations of the results. The results reinforce earlier evidence on limitations of these forecasts and provide new evidence on their information content and on the relative performance of inflation and gross domestic product growth forecasts. In particular, gross domestic product forecasts show little or no ability to predict periods of low growth beyond the current quarter, in part because of the important role of data revisions."
to:NB  prediction  statistics  calibration  macroeconomics  to_teach:undergrad-ADA 
april 2012 by cshalizi
Lai, Gross, Shen: Evaluating probability forecasts
"Probability forecasts of events are routinely used in climate predictions, in forecasting default probabilities on bank loans or in estimating the probability of a patient’s positive response to treatment. Scoring rules have long been used to assess the efficacy of the forecast probabilities after observing the occurrence, or nonoccurrence, of the predicted events. We develop herein a statistical theory for scoring rules and propose an alternative approach to the evaluation of probability forecasts. This approach uses loss functions relating the predicted to the actual probabilities of the events and applies martingale theory to exploit the temporal structure between the forecast and the subsequent occurrence or nonoccurrence of the event."
in_NB  statistics  prediction  calibration  to_read  to_teach:undergrad-ADA 
november 2011 by cshalizi
Calibration and Econometric Non-Practice
DeLong is missing a trick. The rational-expectations dogmatist could simply insist that the true probability of an event like 2008 in 2008 _was_ 0.02%, and we were just unlucky.
macroeconomics  econometrics  rational_expectations  calibration  re:phil-of-bayes_paper  statistics  delong.brad  model_checking 
october 2011 by cshalizi
Making and Evaluating Point Forecasts (Gneiting)
"Typically, point forecasting methods are compared and assessed by means of an error measure or scoring function, with the absolute error and the squared error being key examples. The individual scores are averaged over forecast cases, to result in a summary measure of the predictive performance, such as the mean absolute error or the mean squared error. I demonstrate that this common practice can lead to grossly misguided inferences, unless the scoring function and the forecasting task are carefully matched...."
prediction  statistics  calibration  machine_learning  decision_theory  gneiting.tilmann  have_read 
july 2011 by cshalizi
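--- Gneiting's warning in one synthetic experiment: for a skewed distribution, the forecaster who reports the mean wins under squared error while the forecaster who reports the median wins under absolute error, so ranking them with the score that doesn't match the target functional reverses the verdict. (Lognormal example and all numbers are mine, not the paper's.)

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.lognormal(0, 1, size=100_000)     # skewed outcomes

mean_fc = float(np.exp(0.5))              # true mean of LogNormal(0, 1)
median_fc = 1.0                           # true median of LogNormal(0, 1)

mse = lambda f: np.mean((f - y) ** 2)     # squared-error score, averaged
mae = lambda f: np.mean(np.abs(f - y))    # absolute-error score, averaged

# The mean forecast wins under squared error; the median forecast wins
# under absolute error.
print(mse(mean_fc) < mse(median_fc), mae(median_fc) < mae(mean_fc))
```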
The Monkey Cage: Forecasting Fallacies?
What we have here, boy, is a failure to calibrate: " 'Around 74% of companies have beat forecasts, versus the long-term average of 61% (emphasis added) and the all-time record of 73%, reached in the first quarter of 2004.' Now I might be missing something here, but if the forecasters were good at their jobs, shouldn’t the long term average of companies beating forecasts be the same as the long term average of companies doing worse than the forecasts?" --- Actually, isn't this compatible with the forecasters minimizing squared error under an asymmetric (but mean zero) noise distribution? (A more plausible explanation, to my mind, has to do with corrupt practices, where the same firms solicit investment-banking business from companies and purport to advise investors on what those companies are worth. But that's my cynicism.)
calibration  prediction  financial_markets  to_teach:data-mining  statistics 
august 2009 by cshalizi
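--- The squared-error point in the comment above checks out in simulation (a toy model, with an invented left-skewed error distribution): if forecasters report the mean and the mean-zero error is usually a small beat but occasionally a large miss, companies beat the forecast well over half the time.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

forecast = 1.0                               # forecasters report the mean
# Left-skewed, mean-zero noise: earnings usually exceed the forecast a
# little, occasionally fall far short of it.
noise = 1.0 - rng.exponential(1.0, size=n)
earnings = forecast + noise

beat_rate = np.mean(earnings > forecast)
print(beat_rate)                             # well above 50%, near 1 - 1/e
```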
