**cshalizi + calibration**

Empirical confidence interval calibration for population-level effect estimation studies in observational healthcare data | PNAS

may 2018 by cshalizi

"Observational healthcare data, such as electronic health records and administrative claims, offer potential to estimate effects of medical products at scale. Observational studies have often been found to be nonreproducible, however, generating conflicting results even when using the same database to answer the same question. One source of discrepancies is error, both random (caused by sampling variability) and systematic (for example, because of confounding, selection bias, and measurement error). Only random error is typically quantified but converges to zero as databases become larger, whereas systematic error persists independent from sample size and therefore, increases in relative importance. Negative controls are exposure–outcome pairs, where one believes no causal effect exists; they can be used to detect multiple sources of systematic error, but interpreting their results is not always straightforward. Previously, we have shown that an empirical null distribution can be derived from a sample of negative controls and used to calibrate P values, accounting for both random and systematic error. Here, we extend this work to calibration of confidence intervals (CIs). CIs require positive controls, which we synthesize by modifying negative controls. We show that our CI calibration restores nominal characteristics, such as 95% coverage of the true effect size by the 95% CI. We furthermore show that CI calibration reduces disagreement in replications of two pairs of conflicting observational studies: one related to dabigatran, warfarin, and gastrointestinal bleeding and one related to selective serotonin reuptake inhibitors and upper gastrointestinal bleeding. We recommend CI calibration to improve reproducibility of observational studies."
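A minimal sketch of the idea, not the authors' actual estimator (which fits the null by maximum likelihood, weights each control by its own sampling error, and uses synthesized positive controls): fit a normal systematic-error distribution to log-effect estimates from negative controls, then shift and widen a nominal CI accordingly. All numbers below are illustrative.

```python
import math

def fit_empirical_null(estimates):
    """Fit a normal systematic-error distribution N(mu, tau^2) to
    log-effect estimates from negative controls (true effect = 0)."""
    n = len(estimates)
    mu = sum(estimates) / n
    tau2 = sum((x - mu) ** 2 for x in estimates) / n
    return mu, tau2

def calibrated_ci(estimate, se, mu, tau2, z=1.96):
    """Shift a point estimate by the estimated systematic bias and widen
    the interval so it absorbs both random and systematic error."""
    total_se = math.sqrt(se ** 2 + tau2)
    center = estimate - mu
    return center - z * total_se, center + z * total_se
```

The calibrated interval is always at least as wide as the nominal one, which is how calibration restores coverage at the cost of apparent precision.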

to:NB
statistics
confidence_sets
madigan.david
calibration
may 2018 by cshalizi

[1709.02012v1] On Fairness and Calibration

september 2017 by cshalizi

"The machine learning community has become increasingly concerned with the potential for bias and discrimination in predictive models, and this has motivated a growing line of work on what it means for a classification procedure to be "fair." In particular, we investigate the tension between minimizing error disparity across different population groups while maintaining calibrated probability estimates. We show that calibration is compatible only with a single error constraint (i.e. equal false-negatives rates across groups), and show that any algorithm that satisfies this relaxation is no better than randomizing a percentage of predictions for an existing classifier. These unsettling findings, which extend and generalize existing results, are empirically confirmed on several datasets."

to:NB
to_read
calibration
prediction
classifiers
kleinberg.jon
via:arsyed
september 2017 by cshalizi

Foster: Prediction in the Worst Case

april 2016 by cshalizi

"A predictor is a method of estimating the probability of future events over an infinite data sequence. One predictor is as strong as another if for all data sequences the former has at most the mean square error (MSE) of the latter. Given any countable set 𝒟 of predictors, we explicitly construct a predictor S that is at least as strong as every element of 𝒟. Finite sample bounds are also given which hold uniformly on the space of all possible data."

to:NB
individual_sequence_prediction
low-regret_learning
prediction
ensemble_methods
have_read
foster.dean_p.
statistics
calibration
april 2016 by cshalizi

[1505.05314] Cross-calibration of probabilistic forecasts

june 2015 by cshalizi

"When providing probabilistic forecasts for uncertain future events, it is common to strive for calibrated forecasts, that is, the predictive distribution should be compatible with the observed outcomes. Several notions of calibration are available in the case of a single forecaster alongside with diagnostic tools and statistical tests to assess calibration in practice. Often, there is more than one forecaster providing predictions, and these forecasters may use information of the others and therefore influence one another. We extend common notions of calibration, where each forecaster is analysed individually, to notions of cross-calibration where each forecaster is analysed with respect to the other forecasters in a natural way. It is shown theoretically and in simulation studies that cross-calibration is a stronger requirement on a forecaster than calibration. Analogously to calibration for individual forecasters, we provide diagnostic tools and statistical tests to assess forecasters in terms of cross-calibration. The methods are illustrated in simulation examples and applied to probabilistic forecasts for inflation rates by the Bank of England."

to:NB
prediction
statistics
calibration
ensemble_methods
june 2015 by cshalizi

Evaluation of Probabilistic Forecasts: Proper Scoring Rules and Moments

march 2015 by cshalizi

"The paper provides an overview of probabilistic forecasting and discusses a theoretical framework for evaluation of probabilistic forecasts which is based on proper scoring rules and moments. An artificial example of predicting second-order autoregression and an example of predicting the RTSI stock index are used as illustrations."

to:NB
to_read
prediction
statistics
calibration
march 2015 by cshalizi

Accuracy of forecasts in strategic intelligence

july 2014 by cshalizi

"The accuracy of 1,514 strategic intelligence forecasts abstracted from intelligence reports was assessed. The results show that both discrimination and calibration of forecasts was very good. Discrimination was better for senior (versus junior) analysts and for easier (versus harder) forecasts. Miscalibration was mainly due to underconfidence such that analysts assigned more uncertainty than needed given their high level of discrimination. Underconfidence was more pronounced for harder (versus easier) forecasts and for forecasts deemed more (versus less) important for policy decision making. Despite the observed underconfidence, there was a paucity of forecasts in the least informative 0.4–0.6 probability range. Recalibrating the forecasts substantially reduced underconfidence. The findings offer cause for tempered optimism about the accuracy of strategic intelligence forecasts and indicate that intelligence producers aim to promote informativeness while avoiding overstatement."

to:NB
prediction
to_read
decision-making
calibration
psychology
intelligence_(spying)
july 2014 by cshalizi

Forecast aggregation via recalibration

june 2014 by cshalizi

"It is known that the average of many forecasts about a future event tends to outperform the individual assessments. With the goal of further improving forecast performance, this paper develops and compares a number of models for calibrating and aggregating forecasts that exploit the well-known fact that individuals exhibit systematic biases during judgment and elicitation. All of the models recalibrate judgments or mean judgments via a two-parameter calibration function, and differ in terms of whether (1) the calibration function is applied before or after the averaging, (2) averaging is done in probability or log-odds space, and (3) individual differences are captured via hierarchical modeling. Of the non-hierarchical models, the one that first recalibrates the individual judgments and then averages them in log-odds is the best relative to simple averaging, with 26.7 % improvement in Brier score and better performance on 86 % of the individual problems. The hierarchical version of this model does slightly better in terms of mean Brier score (28.2 %) and slightly worse in terms of individual problems (85 %)."
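A sketch of the best-performing non-hierarchical variant described in the abstract: recalibrate each judgment with a two-parameter function, then average in log-odds space. The linear-in-log-odds (LLO) form below is a standard choice for such a calibration function; whether it matches the paper's exact parameterization is an assumption, and in practice gamma and delta would be fit to training data.

```python
import math

def recalibrate(p, gamma, delta):
    """Two-parameter linear-in-log-odds (LLO) calibration function.
    gamma stretches probabilities away from (or toward) 1/2; delta
    shifts the log-odds. gamma = delta = 1 is the identity."""
    num = delta * p ** gamma
    return num / (num + (1 - p) ** gamma)

def aggregate(probs, gamma, delta):
    """Recalibrate each individual judgment, average in log-odds space,
    then map the mean log-odds back to a probability."""
    logits = []
    for p in probs:
        q = recalibrate(p, gamma, delta)
        logits.append(math.log(q / (1 - q)))
    mean_logit = sum(logits) / len(logits)
    return 1 / (1 + math.exp(-mean_logit))
```

Averaging in log-odds rather than probability space is what lets extreme, well-calibrated judgments pull the aggregate away from 1/2.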

in_NB
calibration
prediction
ensemble_methods
june 2014 by cshalizi

[1403.3920] Minimum scoring rule inference

march 2014 by cshalizi

"Proper scoring rules are methods for encouraging honest assessment of probability distributions. Just like likelihood, a proper scoring rule can be applied to supply an unbiased estimating equation for any statistical model, and the theory of such equations can be applied to understand the properties of the associated estimator. In this paper we develop some basic scoring rule estimation theory, and explore robustness and interval estimation properties by means of theory and simulations."

to:NB
estimation
prediction
calibration
statistics
dawid.philip
march 2014 by cshalizi

[1401.0398] Theory and Applications of Proper Scoring Rules

march 2014 by cshalizi

"We give an overview of some uses of proper scoring rules in statistical inference, including frequentist estimation theory and Bayesian model selection with improper priors."

to:NB
statistics
prediction
calibration
dawid.philip
model_selection
march 2014 by cshalizi

Probabilistic Forecasting - Annual Review of Statistics and Its Application, 1(1):125

january 2014 by cshalizi

"A probabilistic forecast takes the form of a predictive probability distribution over future quantities or events of interest. Probabilistic forecasting aims to maximize the sharpness of the predictive distributions, subject to calibration, on the basis of the available information set. We formalize and study notions of calibration in a prediction space setting. In practice, probabilistic calibration can be checked by examining probability integral transform (PIT) histograms. Proper scoring rules such as the logarithmic score and the continuous ranked probability score serve to assess calibration and sharpness simultaneously. As a special case, consistent scoring functions provide decision-theoretically coherent tools for evaluating point forecasts. We emphasize methodological links to parametric and nonparametric distributional regression techniques, which attempt to model and to estimate conditional distribution functions; we use the context of statistically postprocessed ensemble forecasts in numerical weather prediction as an example. Throughout, we illustrate concepts and methodologies in data examples."
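The PIT check mentioned in the abstract is easy to sketch: evaluate each forecast CDF at the realized outcome, and look for uniformity. The toy forecaster below is hypothetical; it issues the true N(0, 1) distribution, so its PIT values should be flat.

```python
import random
import statistics
from math import erf, sqrt

def pit_values(forecast_cdfs, observations):
    """Probability integral transform: z_t = F_t(y_t). Probabilistic
    calibration means the z_t are uniform on [0, 1]."""
    return [F(y) for F, y in zip(forecast_cdfs, observations)]

def normal_cdf(y):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + erf(y / sqrt(2)))

random.seed(0)
obs = [random.gauss(0, 1) for _ in range(2000)]
z = pit_values([normal_cdf] * len(obs), obs)
# A flat PIT histogram has mean ~ 1/2 and variance ~ 1/12.
assert abs(statistics.mean(z) - 0.5) < 0.05
```

A U-shaped PIT histogram would indicate underdispersed (overconfident) forecasts; a hump-shaped one, overdispersed forecasts.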

to:NB
prediction
calibration
statistics
gneiting.tilmann
january 2014 by cshalizi

Schefzik, Thorarinsdottir, Gneiting: Uncertainty Quantification in Complex Simulation Models Using Ensemble Copula Coupling

january 2014 by cshalizi

"Critical decisions frequently rely on high-dimensional output from complex computer simulation models that show intricate cross-variable, spatial and temporal dependence structures, with weather and climate predictions being key examples. There is a strongly increasing recognition of the need for uncertainty quantification in such settings, for which we propose and review a general multi-stage procedure called ensemble copula coupling (ECC), proceeding as follows:

"1. Generate a raw ensemble, consisting of multiple runs of the computer model that differ in the inputs or model parameters in suitable ways.

"2. Apply statistical postprocessing techniques, such as Bayesian model averaging or nonhomogeneous regression, to correct for systematic errors in the raw ensemble, to obtain calibrated and sharp predictive distributions for each univariate output variable individually.

"3. Draw a sample from each postprocessed predictive distribution.

"4. Rearrange the sampled values in the rank order structure of the raw ensemble to obtain the ECC postprocessed ensemble.

"The use of ensembles and statistical postprocessing have become routine in weather forecasting over the past decade. We show that seemingly unrelated, recent advances can be interpreted, fused and consolidated within the framework of ECC, the common thread being the adoption of the empirical copula of the raw ensemble. Depending on the use of Quantiles, Random draws or Transformations at the sampling stage, we distinguish the ECC-Q, ECC-R and ECC-T variants, respectively. We also describe relations to the Schaake shuffle and extant copula-based techniques. In a case study, the ECC approach is applied to predictions of temperature, pressure, precipitation and wind over Germany, based on the 50-member European Centre for Medium-Range Weather Forecasts (ECMWF) ensemble."
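The rearrangement in step 4 can be sketched as follows, in a minimal single-variable version (the ECC-Q, ECC-R and ECC-T variants differ only in how the step-3 sample is drawn):

```python
def ecc_rearrange(raw_ensemble, postprocessed_samples):
    """Step 4 of ECC: impose the rank order (empirical copula) of the
    raw ensemble on an equally sized sample from the postprocessed
    predictive distribution, for one output variable at a time."""
    m = len(raw_ensemble)
    sorted_samples = sorted(postprocessed_samples)
    # Indices of the raw members, from smallest to largest value.
    order = sorted(range(m), key=lambda i: raw_ensemble[i])
    out = [None] * m
    for rank, i in enumerate(order):
        out[i] = sorted_samples[rank]
    return out
```

Applied variable by variable, this transfers the raw ensemble's cross-variable, spatial and temporal dependence structure onto the calibrated univariate margins.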

to:NB
to_read
statistics
prediction
ensemble_methods
calibration
copulas
model_checking
gneiting.tilmann
to_teach:data-mining
to_teach:undergrad-ADA

january 2014 by cshalizi

[1310.0236] Assessing the calibration of high-dimensional ensemble forecasts using rank histograms

october 2013 by cshalizi

"Any decision making process that relies on a probabilistic forecast of future events necessarily requires a calibrated forecast. This paper proposes new methods for empirically assessing forecast calibration in a multivariate setting where the probabilistic forecast is given by an ensemble of equally probable forecast scenarios. Multivariate properties are mapped to a single dimension through a pre-rank function and the calibration is subsequently assessed visually through a histogram of the ranks of the observation's pre-ranks. Average ranking assigns a pre-rank based on the average univariate rank while band depth ranking employs the concept of functional band depth where the centrality of the observation within the forecast ensemble is assessed. Several simulation examples and a case study of temperature forecast trajectories at Berlin Tegel Airport in Germany demonstrate that both multivariate ranking methods can successfully detect various sources of miscalibration and scale efficiently to high dimensional settings."

to:NB
prediction
calibration
statistics
copulas
high-dimensional_statistics
october 2013 by cshalizi

Phys. Rev. E 86, 016213 (2012): Parameter estimation through ignorance

september 2013 by cshalizi

"Dynamical modeling lies at the heart of our understanding of physical systems. Its role in science is deeper than mere operational forecasting, in that it allows us to evaluate the adequacy of the mathematical structure of our models. Despite the importance of model parameters, there is no general method of parameter estimation outside linear systems. A relatively simple method of parameter estimation for nonlinear systems is introduced, based on variations in the accuracy of probability forecasts. It is illustrated on the logistic map, the Henon map, and the 12-dimensional Lorenz96 flow, and its ability to outperform linear least squares in these systems is explored at various noise levels and sampling rates. As expected, it is more effective when the forecast error distributions are non-Gaussian. The method selects parameter values by minimizing a proper, local skill score for continuous probability forecasts as a function of the parameter values. This approach is easier to implement in practice than alternative nonlinear methods based on the geometry of attractors or the ability of the model to shadow the observations. Direct measures of inadequacy in the model, the “implied ignorance,” and the information deficit are introduced."

to:NB
calibration
prediction
estimation
statistical_inference_for_stochastic_processes
time_series
statistics
smith.leonard
september 2013 by cshalizi

[1307.7650] Copula Calibration

july 2013 by cshalizi

"We propose notions of calibration for probabilistic forecasts of general multivariate quantities. Probabilistic copula calibration is a natural analogue of probabilistic calibration in the univariate setting. It can be assessed empirically by checking for the uniformity of the copula probability integral transform (CopPIT), which is invariant under coordinate permutations and coordinatewise strictly monotone transformations of the predictive distribution and the outcome. The CopPIT histogram can be interpreted as a generalization and variant of the multivariate rank histogram, which has been used to check the calibration of ensemble forecasts. Climatological copula calibration is an analogue of marginal calibration in the univariate setting. Methods and tools are illustrated in a simulation study and applied to compare raw numerical model and statistically postprocessed ensemble forecasts of bivariate wind vectors."

in_NB
calibration
prediction
copulas
gneiting.tilmann
statistics
have_read
to_teach:undergrad-ADA
july 2013 by cshalizi

[1306.4943] Failure of Calibration is Typical

june 2013 by cshalizi

"Schervish (1985b) showed that every forecasting system is noncalibrated for uncountably many data sequences that it might see. This result is strengthened here: from a topological point of view, failure of calibration is typical and calibration rare. Meanwhile, Bayesian forecasters are certain that they are calibrated---this invites worries about the connection between Bayesianism and rationality."

--- Are _large_ failures of calibration typical? Or are these trivial violations?

calibration
prediction
bayesianism
bayesian_consistency
statistics
in_NB
blogged
have_read

june 2013 by cshalizi

Calibration results for Bayesian model specification

april 2013 by cshalizi

"When the goal is inference about an unknown θ and prediction of future data D∗ on the basis of data D and background assumptions/judgments B, the process of Bayesian model specification involves two ingredients: the conditional probability distributions p(θ|B) and p(D|θ, B). Here we focus on specifying p(D|θ, B), and we argue that calibration considerations — paying attention to how often You get the right answer — should be an integral part of this specification process. After contrasting Bayes-factor-based and predictive model-choice criteria, we present some calibration results, in fixed- and random-effects Poisson models, relevant to addressing two of the basic questions that arise in Bayesian model specification: (Q1) Is model Mj better than Mj′? and (Q2) Is model Mj∗ good enough? In particular, we show that LS_FS, a full-sample log score predictive model-choice criterion, has better small-sample model discrimination performance than either DIC or a cross-validation-style log-scoring criterion, in the simulation setting we consider; we examine the large-sample behavior of LS_FS; and we (a) demonstrate that the popular posterior predictive tail-area method for answering a question related to Q2 can be poorly calibrated and (b) document the success of a method for calibrating it."

to:NB
to_read
model_selection
bayesianism
calibration
model_checking
hypothesis_testing
re:phil-of-bayes_paper
misspecification
april 2013 by cshalizi

Conditional transformation models - Hothorn - 2013 - Journal of the Royal Statistical Society: Series B (Statistical Methodology) - Wiley Online Library

march 2013 by cshalizi

"The ultimate goal of regression analysis is to obtain information about the conditional distribution of a response given a set of explanatory variables. This goal is, however, seldom achieved because most established regression models estimate only the conditional mean as a function of the explanatory variables and assume that higher moments are not affected by the regressors. The underlying reason for such a restriction is the assumption of additivity of signal and noise. We propose to relax this common assumption in the framework of transformation models. The novel class of semiparametric regression models proposed herein allows transformation functions to depend on explanatory variables. These transformation functions are estimated by regularized optimization of scoring rules for probabilistic forecasts, e.g. the continuous ranked probability score. The corresponding estimated conditional distribution functions are consistent. Conditional transformation models are potentially useful for describing possible heteroscedasticity, comparing spatially varying distributions, identifying extreme events, deriving prediction intervals and selecting variables beyond mean regression effects. An empirical investigation based on a heteroscedastic varying-coefficient simulation model demonstrates that semiparametric estimation of conditional distribution functions can be more beneficial than kernel-based non-parametric approaches or parametric generalized additive models for location, scale and shape."

in_NB
to_read
regression
statistics
prediction
to_teach:undergrad-ADA
buhlmann.peter
density_estimation
calibration
march 2013 by cshalizi

538's Uncertainty Estimates Are As Good As They Get - A.C. Thomas, Scientist

november 2012 by cshalizi

[take] "each prediction and its associated uncertainty, calculate the probability that the observed value (vote share) is greater than a simulated draw from this distribution. The key is that for a large number of independent prediction-uncertainty pairs, we should see a uniform distribution of p-values between 0 and 1.

"I grabbed the estimates from FiveThirtyEight and Votamatic (at this time, I have only estimates, not uncertainties, for PEC or HuffPost) and calculated the respective p-values assuming a normal distribution in each case. Media coverage suggested that Nate Silver's intervals were too conservative; if this were the case, we would expect a higher concentration of p-values around 50%. (If too anti-conservative, the p-values would be more extreme, towards 0 or 1.)

"On the contrary, the 538 distribution is nearly uniform."

- For once, I am quite certain about the to_teach tag.
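The uniformity check A.C. Thomas describes can be sketched in a few lines; the forecast numbers below are made up for illustration. Each p-value is just the normal CDF of the standardized residual, i.e. the PIT of the observation; honest uncertainties give a uniform histogram, while overconfident (too-narrow) intervals pile p-values up near 0 and 1.

```python
import random
import statistics
from math import erf, sqrt

def p_value(pred, se, observed):
    """P(observed > draw) for a draw from N(pred, se^2): the PIT of
    the observation under the forecast distribution."""
    return 0.5 * (1 + erf((observed - pred) / (se * sqrt(2))))

# Hypothetical simulation: true vote shares really are N(50, 3^2).
random.seed(1)
truth = [random.gauss(50, 3) for _ in range(2000)]
well = [p_value(50, 3, y) for y in truth]    # honest uncertainty
narrow = [p_value(50, 1, y) for y in truth]  # overconfident intervals
# Uniform(0, 1) has variance 1/12; overconfidence inflates it.
assert statistics.pvariance(narrow) > statistics.pvariance(well)
```

Conversely, too-conservative intervals would concentrate p-values around 1/2, shrinking the variance below 1/12 — the pattern the media coverage predicted and the 538 data did not show.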

to_teach:undergrad-ADA
calibration
prediction
statistics
us_politics
model_checking
silver.nate
thomas.andrew_c.

november 2012 by cshalizi

Assessing gross domestic product and inflation probability forecasts derived from Bank of England fan charts - Galbraith - 2011 - Journal of the Royal Statistical Society: Series A (Statistics in Society) - Wiley Online Library

april 2012 by cshalizi

"Density forecasts, including the pioneering Bank of England ‘fan charts’, are often used to produce forecast probabilities of a particular event. We use the Bank of England's forecast densities to calculate the forecast probability that annual rates of inflation and output growth exceed given thresholds. We subject these implicit probability forecasts to graphical and numerical diagnostic checks. We measure both their calibration and their resolution, providing both statistical and graphical interpretations of the results. The results reinforce earlier evidence on limitations of these forecasts and provide new evidence on their information content and on the relative performance of inflation and gross domestic product growth forecasts. In particular, gross domestic product forecasts show little or no ability to predict periods of low growth beyond the current quarter, in part because of the important role of data revisions."

to:NB
prediction
statistics
calibration
macroeconomics
to_teach:undergrad-ADA
april 2012 by cshalizi

Lai, Gross, Shen: Evaluating probability forecasts

november 2011 by cshalizi

"Probability forecasts of events are routinely used in climate predictions, in forecasting default probabilities on bank loans or in estimating the probability of a patient’s positive response to treatment. Scoring rules have long been used to assess the efficacy of the forecast probabilities after observing the occurrence, or nonoccurrence, of the predicted events. We develop herein a statistical theory for scoring rules and propose an alternative approach to the evaluation of probability forecasts. This approach uses loss functions relating the predicted to the actual probabilities of the events and applies martingale theory to exploit the temporal structure between the forecast and the subsequent occurrence or nonoccurrence of the event."

in_NB
statistics
prediction
calibration
to_read
to_teach:undergrad-ADA
november 2011 by cshalizi

Calibration and Econometric Non-Practice

october 2011 by cshalizi

DeLong is missing a trick. The rational-expectations dogmatist could simply insist that the true probability of an event like 2008 in 2008 _was_ 0.02%, and we were just unlucky.

macroeconomics
econometrics
rational_expectations
calibration
re:phil-of-bayes_paper
statistics
delong.brad
model_checking
october 2011 by cshalizi

Making and Evaluating Point Forecasts (Gneiting)

july 2011 by cshalizi

"Typically, point forecasting methods are compared and assessed by means of an error measure or scoring function, with the absolute error and the squared error being key examples. The individual scores are averaged over forecast cases, to result in a summary measure of the predictive performance, such as the mean absolute error or the mean squared error. I demonstrate that this common practice can lead to grossly misguided inferences, unless the scoring function and the forecasting task are carefully matched...."
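Gneiting's point is easy to reproduce with a toy example: under squared error the optimal point forecast is the predictive mean, under absolute error it is the predictive median, so scoring forecasters with a mismatched function rewards the "wrong" report. The numbers below are illustrative.

```python
import statistics

# A right-skewed predictive distribution, so the mean and median differ.
sample = [1, 1, 1, 2, 2, 3, 10, 20]
mean, median = statistics.mean(sample), statistics.median(sample)

def avg_score(point, outcomes, score):
    """Average score of a single point forecast over the outcomes."""
    return sum(score(point, y) for y in outcomes) / len(outcomes)

def squared(x, y):
    return (x - y) ** 2

def absolute(x, y):
    return abs(x - y)

# The mean wins under squared error; the median wins under absolute error.
assert avg_score(mean, sample, squared) < avg_score(median, sample, squared)
assert avg_score(median, sample, absolute) < avg_score(mean, sample, absolute)
```

Hence the paper's prescription: announce the scoring function in advance, or announce the functional (mean, median, quantile) the forecast is meant to be.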

prediction
statistics
calibration
machine_learning
decision_theory
gneiting.tilmann
have_read
july 2011 by cshalizi

The Monkey Cage: Forecasting Fallacies?

august 2009 by cshalizi

What we have here, boy, is a failure to calibrate: " 'Around 74% of companies have beat forecasts, versus the long-term average of 61% (emphasis added) and the all-time record of 73%, reached in the first quarter of 2004.' Now I might be missing something here, but if the forecasters were good at their jobs, shouldn’t the long term average of companies beating forecasts be the same as the long term average of companies doing worse than the forecasts?" --- Actually, isn't this compatible with the forecasters minimizing squared error under an asymmetric (but mean zero) noise distribution? (A more plausible explanation, to my mind, has to do with corrupt practices, where the same firms solicit investment-banking business from companies and purport to advise investors on what those companies are worth. But that's my cynicism.)
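The asymmetric-noise point can be checked with a two-line simulation (purely illustrative numbers): a surprise distribution with zero mean but heavy left skew makes most firms beat an unbiased forecast.

```python
import random

random.seed(2)

def surprise():
    """Zero-mean but left-skewed earnings surprise: a small beat 80% of
    the time, a large miss 20% of the time (0.8*0.5 - 0.2*2.0 = 0)."""
    return 0.5 if random.random() < 0.8 else -2.0

beats = sum(surprise() > 0 for _ in range(10_000)) / 10_000
# ~80% of "companies" beat the forecast, yet the forecast is unbiased
# in the mean-squared-error sense.
assert 0.75 < beats < 0.85
```

So a 61% long-run beat rate is consistent with honest mean forecasting; it is only damning evidence against forecasters of the *median*.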

calibration
prediction
financial_markets
to_teach:data-mining
statistics
august 2009 by cshalizi
