cshalizi + lumley.thomas   16

Fast Generalized Linear Models by Database Sampling and One-Step Polishing: Journal of Computational and Graphical Statistics: Vol 0, No 0
"In this article, I show how to fit a generalized linear model to N observations on p variables stored in a relational database, using one sampling query and one aggregation query, as long as N^{1/2+δ} observations can be stored in memory, for some δ>0. The resulting estimator is fully efficient and asymptotically equivalent to the maximum likelihood estimator, and so its variance can be estimated from the Fisher information in the usual way. A proof-of-concept implementation uses R with MonetDB and with SQLite, and could easily be adapted to other popular databases. I illustrate the approach with examples of taxi-trip data in New York City and factors related to car color in New Zealand. "
to:NB  computational_statistics  linear_regression  regression  databases  lumley.thomas  to_teach:statcomp
8 weeks ago by cshalizi
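- A minimal sketch of the sample-then-polish idea in the Lumley abstract above, in Python with an in-memory SQLite table rather than the paper's R implementation. Everything here is an assumption made for illustration: the table `obs`, the simulated logistic data, the helper names `expit`, `solve2` and `fit_sample`, the sample size, and in particular the choice to take the Fisher information from the sample and rescale it by N/n. The score at the pilot estimate comes from a single aggregation query over the full table.

```python
import math
import random
import sqlite3

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def solve2(a, b, c, d, u, v):
    """Solve [[a, b], [c, d]] @ [s, t] = [u, v] by Cramer's rule."""
    det = a * d - b * c
    return (u * d - b * v) / det, (a * v - c * u) / det

def fit_sample(rows, iters=25):
    """Newton-Raphson logistic regression (intercept + slope) on in-memory (x, y) rows.

    Returns the estimate and the information entries from the last iteration,
    which is effectively the information at the converged estimate.
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        u0 = u1 = i00 = i01 = i11 = 0.0
        for x, y in rows:
            mu = expit(b0 + b1 * x)
            w = mu * (1.0 - mu)
            u0 += y - mu
            u1 += x * (y - mu)
            i00 += w
            i01 += w * x
            i11 += w * x * x
        s0, s1 = solve2(i00, i01, i01, i11, u0, u1)
        b0, b1 = b0 + s0, b1 + s1
    return (b0, b1), (i00, i01, i11)

# Hypothetical data: N simulated rows stored in a database table.
random.seed(1)
conn = sqlite3.connect(":memory:")
conn.create_function("expit", 1, expit)  # Python UDF, so no SQLite math extension is needed
conn.execute("CREATE TABLE obs (x REAL, y INTEGER)")
N = 20000
rows = []
for _ in range(N):
    x = random.gauss(0.0, 1.0)
    rows.append((x, 1 if random.random() < expit(-0.5 + 1.0 * x) else 0))
conn.executemany("INSERT INTO obs VALUES (?, ?)", rows)

# Query 1 (sampling): pull n rows into memory and fit the pilot estimate there.
n = 2000  # assumed sample size, well above sqrt(N)
sample = conn.execute(
    "SELECT x, y FROM obs ORDER BY RANDOM() LIMIT ?", (n,)
).fetchall()
(b0, b1), (i00, i01, i11) = fit_sample(sample)

# Query 2 (aggregation): full-data score evaluated at the pilot estimate.
u0, u1 = conn.execute(
    "SELECT SUM(y - expit(? + ? * x)), SUM(x * (y - expit(? + ? * x))) FROM obs",
    (b0, b1, b0, b1),
).fetchone()

# One Newton step using the sample information rescaled to the full data size.
k = N / n
s0, s1 = solve2(k * i00, k * i01, k * i01, k * i11, u0, u1)
one_step = (b0 + s0, b1 + s1)
print("pilot:", (b0, b1))
print("one-step:", one_step)
```

Because the sample is drawn with `ORDER BY RANDOM()`, the numbers change from run to run; the one-step estimate should land much closer to the full-data maximum likelihood fit than the pilot estimate does.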
Confidence intervals: not a very strong property - Biased and Inefficient
Cute. (The "Gygax intervals" in paragraph 2 are what I use in teaching to say that coverage, while essential, isn't _enough_.)
statistics  confidence_sets  lumley.thomas  to_teach
9 weeks ago by cshalizi
Recognising when you don’t know - Biased and Inefficient
(Some nice shade is thrown on the difference between machine learning and statistics --- excuse me, "data science".)
february 2019 by cshalizi
The importance of the normality assumption in large public health data sets. - PubMed - NCBI
"It is widely but incorrectly believed that the t-test and linear regression are valid only for Normally distributed outcomes. The t-test and linear regression compare the mean of an outcome variable for different subjects. While these are valid even in very small samples if the outcome variable is Normally distributed, their major usefulness comes from the fact that in large samples they are valid for any distribution. We demonstrate this validity by simulation in extremely non-Normal data. We discuss situations in which in other methods such as the Wilcoxon rank sum test and ordinal logistic regression (proportional odds model) have been recommended, and conclude that the t-test and linear regression often provide a convenient and practical alternative. The major limitation on the t-test and linear regression for inference about associations is not a distributional one, but whether detecting and estimating a difference in the mean of the outcome answers the scientific question at hand."
to:NB  linear_regression  statistics  to_teach:linear_models  lumley.thomas
july 2017 by cshalizi
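- A small simulation in the spirit of the abstract above, not a reproduction of the paper's own study. The assumptions are mine: exponential (strongly right-skewed) outcomes, 500 subjects per group, 1000 replications, a Welch-type t statistic, and the large-sample critical value 1.96. Both groups share the same distribution, so every rejection is a false positive.

```python
import math
import random

def mean_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v

def welch_t(a, b):
    """Two-sample t statistic with unpooled variances."""
    ma, va = mean_var(a)
    mb, vb = mean_var(b)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def false_positive_rate(n, reps, rng, crit=1.96):
    """Share of replications where |t| exceeds crit although the null is true."""
    rejections = 0
    for _ in range(reps):
        a = [rng.expovariate(1.0) for _ in range(n)]
        b = [rng.expovariate(1.0) for _ in range(n)]
        if abs(welch_t(a, b)) > crit:
            rejections += 1
    return rejections / reps

rng = random.Random(2017)
rate = false_positive_rate(n=500, reps=1000, rng=rng)
print("empirical type I error:", rate)
```

If the large-sample argument holds, the printed rate should sit near the nominal 0.05 even though the outcomes are far from Normal.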
Biased and Inefficient - What’s the right proof of the Continuous Mapping Theorem?
"A lot of the time I’m happy to treat advanced probability theory as a black box and just use it to call in air strikes on obstacles in the proof."

- Need to think of where to quote this in _Almost None_...
probability  funny:geeky  re:almost_none  lumley.thomas
may 2015 by cshalizi
Biased and Inefficient - At risk of vanishing
"A degree in science, in addition to specific facts about squid, neutrinos, or palladium-catalysed cross-couplings, should teach students what to do with questions about the world. In particular, they should learn to think about what the implications would be of each answer to the question, and know how we might use these implications to rule out some of the answers and reduce our uncertainty about others.
"A degree in the humanities, in addition to specific facts about tenses in French, resource-allocation procedures in village societies, or the development of the Sangam literature,should teach students what to with questions about the world. In particular, they should learn to think about what questions should be asked on a particular topic, the different ways these could be answered, and whose interests are served by systems that promote one question or answer over another."

- I have no opinion about the NZ controversy this post is actually about, but I wanted to preserve those two excellent paragraphs.
education  science  humanities  lumley.thomas
december 2013 by cshalizi
What the NSA can’t do by data mining | Stats Chat
In which T. Lumley takes his turn banging his head against the wall.
national_surveillance_state  data_mining  debunking  lumley.thomas
june 2013 by cshalizi
