bad-data   16

Hispanics missing from voter purge list - News - Sarasota Herald-Tribune - Sarasota, FL
A data quirk in the state’s controversial effort to purge convicted felons from the voter rolls appears to have excluded Hispanics in greater numbers than other races.

Only 61 of the 47,763 names on the potential purge list are classified as Hispanic. Hispanics make up 17 percent of the state population, but a little more than one-tenth of 1 percent of the names on the list.

The missing Hispanics could feed into the Democratic Party’s contention that the purge is Jeb Bush’s plan to help his brother win Florida in the November presidential election.

All but one of the state’s Hispanic legislators are Republicans. And Cubans, who make up the largest single segment of the state’s Hispanic population, have traditionally supported the GOP.

“It’s sloppy work to say the least,” said Allie Merzer, spokeswoman for the Florida Democratic Party. “Is it intent? I don’t know. But something doesn’t smell right.”
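
One plausible mechanism for the quirk described above is a category mismatch: if the felon records come from a database with no “Hispanic” race category while the voter file has one, and the matching requires race to agree, Hispanic voters can never match. A minimal sketch of that failure mode in pandas — the records and category codes below are invented for illustration:

```python
# Minimal sketch of the join failure described above. Column names and
# category codes are hypothetical; the point is that requiring an exact
# race match silently drops records when one source lacks a "Hispanic"
# category.
import pandas as pd

# Voter roll: records ethnicity with a Hispanic category ("H")
voters = pd.DataFrame({
    "name": ["JOSE GARCIA", "JOHN SMITH"],
    "dob":  ["1970-01-01", "1965-05-05"],
    "race": ["H", "W"],  # "H" = Hispanic, "W" = white
})

# Felon database: no Hispanic category, so the same person is coded "W"
felons = pd.DataFrame({
    "name": ["JOSE GARCIA", "JOHN SMITH"],
    "dob":  ["1970-01-01", "1965-05-05"],
    "race": ["W", "W"],
})

# Matching on name + dob + race excludes the Hispanic voter entirely
matched = voters.merge(felons, on=["name", "dob", "race"])
print(matched)  # only JOHN SMITH matches; JOSE GARCIA falls through
```
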
census  demographics  voting  joins  bad-data 
june 2018 by danwin
Handling Data about Race and Ethnicity - Learning - Source: An OpenNews project
Here’s what happened. The state—led by Republican superstar Jeb Bush—decided it should purge felons who were legally ineligible to vote from the voter rolls. I could spend the rest of this case study going into the nuances here, but the tl;dr is this: the state developed a list of 47,000 people it said were felons and told local elections officials to remove them. Because the list leaned heavily to the left, Democrats cried politics and disenfranchisement. We reporters set out to find out what was what. Were people improperly being stripped of their voting rights? Were felons illegally voting? And remember: the 2000 presidential election in Florida was decided by 537 votes, so removing 47,000 voters could tip an election.

So we have a big list of names and the fate of democracy in the balance (cough). No problem. This is what data journalism is all about! Take dataset A, compare it to dataset B, and voilà! Magic happens.


Being a competitive guy at a Florida news org, I wanted to do this big. I wanted to show how accurate, or not, this felons list was, with statistical validity. I wanted to use actual social science to investigate it. A couple of anecdotes and a bunch of quotes wasn’t good enough for the state’s largest newspaper, the St. Petersburg Times (which is now the Tampa Bay Times). So I devised a method that would give us percentages of accuracy with a margin of error.

In short, we were going to take a representative sample of names on the list—359 of them—and background check them, all in a day. Each background check cost the paper between $50 and $100, depending on how much information we needed to verify. At a minimum, we needed full names, dates of birth, previous addresses, and a criminal history from the state.

I had an army of incredibly talented news researchers working with me, and by the end of the day, we found that 59 percent of the list was easily correct, 37 percent was murky, and 4 percent, or 220 people, were falsely being targeted for purging. We even talked to a man who faced losing his voting rights because he had the same first and last name and date of birth as another man with a Florida criminal conviction. With a massive amount of work and in less than a day, we proved that the list the state released that day was flawed.
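
For context on that sample size, here is a back-of-the-envelope margin-of-error calculation for 359 names drawn from a list of 47,763, assuming simple random sampling at 95 percent confidence (the piece doesn’t spell out its exact methodology):

```python
# Margin of error for a simple random sample of 359 names drawn from a
# list of ~47,000, at 95% confidence with worst-case p = 0.5.
# A back-of-the-envelope sketch, not the paper's actual methodology.
import math

N = 47_763   # population: names on the purge list
n = 359      # sample size
p = 0.5      # worst-case proportion
z = 1.96     # z-score for 95% confidence

se = math.sqrt(p * (1 - p) / n)       # standard error of a proportion
fpc = math.sqrt((N - n) / (N - 1))    # finite-population correction
moe = z * se * fpc
print(f"margin of error: ±{moe:.1%}") # roughly ±5%
```

So a one-day background check of 359 names buys you list-wide percentages good to about five points either way, which is exactly the kind of statistical validity the piece describes.
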
data  racial  statistics  data-anecdote  bad-data  census  demographics  data-journalism  best 
june 2018 by danwin
CheXNet: an in-depth review – Luke Oakden-Rayner
"Despite the few (fairly minor) misgivings I have about this work … I believe the results. Taking a dataset with poor accuracy, training a deep learning model on it, and then applying it to data with a better ground truth (even if that is a bit ill-defined) … works.
[...]
This ability to ignore bad labels (often called label noise) is a known property of many machine learning systems, and deep learning in particular seems good at it.
[...]
It is undoubtedly true that label noise is always a negative, but how much of a negative depends on how much noise there is and how it is distributed. In many situations deep learning will do nearly as well with low quality labels as it does with perfect labels. This is great for teams who are facing huge costs to gather large datasets, but it shouldn’t lead to complacency. The better the labels, the better your results. There is a trade-off between labelling effort and performance that is going to be asymptotic, but even small improvements can make or break medical systems."
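
The uniform-noise half of that claim is easy to demonstrate on toy data. A minimal sketch (dataset, model, and noise rates are my choices, not the post’s) that flips a fraction of training labels at random and watches test accuracy:

```python
# Toy demonstration: flip a fraction of training labels at random and
# compare test accuracy. Uniform random noise often costs surprisingly
# little; the structured noise the post warns about (not simulated here)
# is far more damaging.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
for noise in [0.0, 0.1, 0.2, 0.3]:
    y_noisy = y_tr.copy()
    flip = rng.random(len(y_noisy)) < noise   # flip labels at rate `noise`
    y_noisy[flip] = 1 - y_noisy[flip]
    acc = LogisticRegression(max_iter=1000).fit(X_tr, y_noisy).score(X_te, y_te)
    print(f"label noise {noise:.0%}: test accuracy {acc:.3f}")
```

Accuracy degrades far more slowly than the noise rate, which is the trade-off between labelling effort and performance the post describes.
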
deep-learning  label-noise  structured-noise  bad-data  robustness  chest  xray 
march 2018 by arsyed
Exploring the ChestXray14 dataset: problems – Luke Oakden-Rayner
"I think this is caused by several things; medical images are large, complex, and share many common elements. But even more, the automated method of mining these labels does not inject random noise when it is inaccurate. The programmatic nature of text mining can lead to consistent, unexpected dependencies or stratification in the data."

"I want to expand this last point briefly, because it is a really important issue for anyone working with medical image data. Radiology reports are not objective, factual descriptions of images. The goal of a radiology report is to provide useful, actionable information to their referrer, usually another doctor. In some ways, the radiologist is guessing what information the referrer wants, and culling the information that will be irrelevant.
[...]
This means that two reports of the same image can contain different ‘labels’, based on the clinical setting, the past history, and who the referrer is (often tailored to the preferences of individual referrers), and who the radiologist is. A hugely abnormal study will often be reported “no obvious change”. A report to a specialist might describe the classic findings for a serious disease, but never mention the disease by name so it doesn’t force the specialist into a particular treatment strategy. A report to a general practitioner might list several possible diseases in a differential and include a treatment strategy. There are so many factors that go into how every single radiology report is framed, and all of it adds structured noise to radiology reports. Each little cluster of cases may have distinct image features that are learnable."

"Medical researchers have been dealing with clinical data stratification for a long time. It is why they spend so much time describing the demographics of their dataset; things like age, sex, income, diet, exercise, and many other things can lead to “hidden” stratification. At a basic level, we should be doing this too; checking that the easily identified demographic characteristics of your train and test data are roughly similar, and reporting them in publications. But it isn’t enough. We also need to roughly know that the distribution of visual appearances is similar across cohorts, which means you need to look at the images."
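
A minimal version of that demographic sanity check, sketched in pandas/scipy under the assumption that your split metadata lives in tabular form (the column names here are hypothetical):

```python
# Sketch of a train/test demographic comparison: summarize a categorical
# attribute (e.g. sex or age band) in each split and chi-square test
# whether its distribution differs. Column names are hypothetical.
import pandas as pd
from scipy.stats import chi2_contingency

def compare_splits(train: pd.DataFrame, test: pd.DataFrame, col: str):
    counts = pd.DataFrame({
        "train": train[col].value_counts(),
        "test": test[col].value_counts(),
    }).fillna(0)
    chi2, pval, _, _ = chi2_contingency(counts.T)
    print(counts.div(counts.sum()))           # per-split proportions
    print(f"chi-square p-value for {col!r}: {pval:.3f}")

# e.g. compare_splits(train_df, test_df, "sex")
```

A small p-value flags a split whose demographics differ; as the post stresses, matching demographics is necessary but not sufficient — you still have to look at the images.
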
chest  xray  datasets  critique  bad-data  deep-learning  text-mining  annotation  structured-noise 
january 2018 by arsyed
UK secondary schools full due to migration baby boom | Daily Mail Online
https://twitter.com/clairemilleruk/status/791416547357196288

Almost all the best secondary schools in some areas are now over-subscribed as a baby boom fuelled by migration takes its toll.

New figures show schools rated ‘outstanding’ by Ofsted receive more applicants than there are places in nearly 100 per cent of cases in the most overpopulated regions.
bad-data 
october 2016 by danwin
Bad data PR: how the NSPCC sunk to a new low in data churnalism
when the NSPCC sent out a press release saying that one in ten 12-13-year-olds [in the UK] are worried that they are addicted to porn and 12% have participated in sexually explicit videos, dozens of journalists appear to have simply played along – despite there being no report and little explanation of where the figures came from. [....]

It turns out the study was conducted by a “creative market research” [i.e. pay-per-survey] group called OnePoll. “Generate content and news angles with a OnePoll PR survey, and secure exposure for your brand,” reads the company’s blurb. “Our PR survey team can help draft questions, find news angles, design infographics, write and distribute your story.” The OnePoll survey included just 11 multiple-choice questions, which could be filled in online. Children were recruited via their parents, who were already signed up to OnePoll.


The NSPCC spends £25 million per year on "child protection advice and awareness", so they have the money to do this right. Disappointing.
nspcc  bad-science  bad-data  methodology  surveys  porn  uk  kids  addiction  onepoll  pr  market-research 
april 2015 by jm
The Heilbronn DNA Mixup (Science and Stuff)
"DNA traces of an unknown eastern European woman had been found at almost 17 crime scenes, including two murders (one of them of a 22-year-old police officer), but also carjackings, unprofessional break-ins, and on a bullet fired in a marital dispute. The crimes were spread across a large area including south-west Germany, France, and Switzerland. It now turns out that the several-hundred-strong task force might really have been chasing a phantom. ... The likeliest culprits now are the cotton swabs used to collect evidence at the crime scenes. All the swabs used in the forensic work were sourced from the same supplier, a company in northern Germany that employs several eastern European women who would fit the profile."

https://en.wikipedia.org/wiki/Phantom_of_Heilbronn
crime  forensics  dna  contamination  germany  bad-data 
may 2010 by arsyed
