**cshalizi + lumley.thomas**
Fast Generalized Linear Models by Database Sampling and One-Step Polishing: Journal of Computational and Graphical Statistics: Vol 0, No 0

8 weeks ago by cshalizi

"In this article, I show how to fit a generalized linear model to N observations on p variables stored in a relational database, using one sampling query and one aggregation query, as long as N^{1/2+δ} observations can be stored in memory, for some δ>0. The resulting estimator is fully efficient and asymptotically equivalent to the maximum likelihood estimator, and so its variance can be estimated from the Fisher information in the usual way. A proof-of-concept implementation uses R with MonetDB and with SQLite, and could easily be adapted to other popular databases. I illustrate the approach with examples of taxi-trip data in New York City and factors related to car color in New Zealand."
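- A rough sketch of the general one-step idea the abstract describes, not the paper's implementation (which uses R against a database). Assumptions here: a two-parameter logistic regression, simulated in-memory data standing in for the database table, and hypothetical helper names (`score_and_info`, `newton_step`, `fit_mle`). The sample plays the role of the "sampling query"; the single pass that sums score and information over all rows plays the role of the "aggregation query".

```python
import math
import random


def score_and_info(beta, rows):
    """Sum the logistic score vector and Fisher information over rows (x, y)."""
    b0, b1 = beta
    u0 = u1 = i00 = i01 = i11 = 0.0
    for x, y in rows:
        mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        w = mu * (1.0 - mu)
        u0 += y - mu
        u1 += (y - mu) * x
        i00 += w
        i01 += w * x
        i11 += w * x * x
    return (u0, u1), (i00, i01, i11)


def newton_step(beta, rows):
    """One Fisher-scoring step: beta + I(beta)^{-1} U(beta), with a 2x2 inverse."""
    (u0, u1), (i00, i01, i11) = score_and_info(beta, rows)
    det = i00 * i11 - i01 * i01
    return (beta[0] + (i11 * u0 - i01 * u1) / det,
            beta[1] + (i00 * u1 - i01 * u0) / det)


def fit_mle(rows, iters=25):
    """Iterate Fisher scoring from zero; adequate for this well-behaved toy data."""
    beta = (0.0, 0.0)
    for _ in range(iters):
        beta = newton_step(beta, rows)
    return beta


# Simulated "table" (an assumption; the paper works against a real database).
rng = random.Random(0)
full = []
for _ in range(20000):
    x = rng.gauss(0.0, 1.0)
    p = 1.0 / (1.0 + math.exp(-(-0.5 + 1.0 * x)))
    full.append((x, 1 if rng.random() < p else 0))

sample = rng.sample(full, 2000)               # "sampling query"
beta_sample = fit_mle(sample)                 # full fit, but only on the sample
beta_onestep = newton_step(beta_sample, full) # one "aggregation query" over all rows
beta_full = fit_mle(full)                     # reference: iterate on everything

print("sample  :", beta_sample)
print("one-step:", beta_onestep)
print("full MLE:", beta_full)
```

The point of the comparison is that the one-step estimate needs only a single pass over the full data, whereas the reference fit makes a full pass on every iteration.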

to:NB
computational_statistics
linear_regression
regression
databases
lumley.thomas
to_teach:statcomp
8 weeks ago by cshalizi

Confidence intervals: not a very strong property - Biased and Inefficient

9 weeks ago by cshalizi

Cute. (The "Gygax intervals" in paragraph 2 are what I use in teaching to say that coverage, while essential, isn't _enough_.)

statistics
confidence_sets
lumley.thomas
to_teach
9 weeks ago by cshalizi

Recognising when you don’t know - Biased and Inefficient

february 2019 by cshalizi

(Some nice shade is thrown on the difference between machine learning and statistics --- excuse me, "data science".)

classifiers
mushrooms
statistics
to_teach:data-mining
to_teach:undergrad-ADA
lumley.thomas
february 2019 by cshalizi

Thomas Lumley on Twitter: "A graph has the 'casual Markov property' if it's plausible at first glance that none of the omitted edges correspond to real effects."

august 2018 by cshalizi

"A graph has the 'casual Markov property' if it's plausible at first glance that none of the omitted edges correspond to real effects."

graphical_models
funny:geeky
lumley.thomas
re:ADAfaEPoV
august 2018 by cshalizi

The importance of the normality assumption in large public health data sets. - PubMed - NCBI

july 2017 by cshalizi

"It is widely but incorrectly believed that the t-test and linear regression are valid only for Normally distributed outcomes. The t-test and linear regression compare the mean of an outcome variable for different subjects. While these are valid even in very small samples if the outcome variable is Normally distributed, their major usefulness comes from the fact that in large samples they are valid for any distribution. We demonstrate this validity by simulation in extremely non-Normal data. We discuss situations in which other methods such as the Wilcoxon rank sum test and ordinal logistic regression (proportional odds model) have been recommended, and conclude that the t-test and linear regression often provide a convenient and practical alternative. The major limitation on the t-test and linear regression for inference about associations is not a distributional one, but whether detecting and estimating a difference in the mean of the outcome answers the scientific question at hand."
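- A small simulation in the spirit of the abstract's claim, not a reproduction of the paper's own simulations. Assumptions: lognormal outcomes as the "extremely non-Normal" data, two groups drawn from the same distribution (so the null of equal means holds), a Welch-type t statistic, and the large-sample Normal critical value 1.96. The function names are hypothetical. If the claim holds, the rejection rate should sit near the nominal 5%.

```python
import math
import random


def welch_t(a, b):
    """Two-sample t statistic with separate variance estimates per group."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((v - ma) ** 2 for v in a) / (na - 1)
    vb = sum((v - mb) ** 2 for v in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)


def rejection_rate(n, reps, rng, crit=1.96):
    """Fraction of null simulations in which |t| exceeds the critical value."""
    rejections = 0
    for _ in range(reps):
        a = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
        b = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
        if abs(welch_t(a, b)) > crit:
            rejections += 1
    return rejections / reps


rate = rejection_rate(n=300, reps=2000, rng=random.Random(1))
print("estimated type I error rate:", rate)
```

This only checks validity under the null. It says nothing about the abstract's last point, whether a difference in means is the right scientific question.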

to:NB
linear_regression
statistics
to_teach:linear_models
lumley.thomas
july 2017 by cshalizi

Biased and Inefficient - What’s the right proof of the Continuous Mapping Theorem?

may 2015 by cshalizi

"A lot of the time I’m happy to treat advanced probability theory as a black box and just use it to call in air strikes on obstacles in the proof."

- Need to think of where to quote this in _Almost None_...

probability
funny:geeky
re:almost_none
lumley.thomas
may 2015 by cshalizi

Biased and Inefficient - At risk of vanishing

december 2013 by cshalizi

"A degree in science, in addition to specific facts about squid, neutrinos, or palladium-catalysed cross-couplings, should teach students what to do with questions about the world. In particular, they should learn to think about what the implications would be of each answer to the question, and know how we might use these implications to rule out some of the answers and reduce our uncertainty about others.

"A degree in the humanities, in addition to specific facts about tenses in French, resource-allocation procedures in village societies, or the development of the Sangam literature, should teach students what to do with questions about the world. In particular, they should learn to think about what questions should be asked on a particular topic, the different ways these could be answered, and whose interests are served by systems that promote one question or answer over another."

- I have no opinion about the NZ controversy this post is actually about, but I wanted to preserve those two excellent paragraphs.

education
science
humanities
lumley.thomas
december 2013 by cshalizi

What the NSA can’t do by data mining | Stats Chat

june 2013 by cshalizi

In which T. Lumley takes his turn banging his head against the wall.

national_surveillance_state
data_mining
debunking
lumley.thomas
june 2013 by cshalizi
