Trump Literally Doesn't Understand Time Zones
It seems that there is no way to overestimate our president’s ignorance. Yet it’s still hard to comprehend the fact that, according to a new Politico article, he has no idea how time zones work.

Politico reports that Trump’s time zone confusion came up on a “constant basis.” For example, he’d ask to call Shinzo Abe, the Prime Minister of Japan, during the afternoon in Washington—the middle of the night Japan time.

“He wasn’t great with recognizing that the leader of a country might be 80 or 85 years old and isn’t going to be awake or in the right place at 10:30 or 11 p.m. their time,” a former Trump NSC official told Politico. “When he wants to call someone, he wants to call someone. He’s more impulsive that way. He doesn’t think about what time it is or who it is,” another source reported.

“He’s the president of the United States. He’s not stopping to add up [time differences],” a source told Politico. “I don’t think anybody would expect him or Obama or Bush or Clinton or anybody to do that. That’s the whole reason you have a staff to say ‘Yes, we’ll set it up,’ and then they find a time that makes most sense.” Ok sure, but I’m going to guess that Obama and Clinton were at least aware of the fact that only one side of Earth faces towards the sun at any given time.
Peter Campbell · Why does it take so long to mend an escalator? · LRB 7 March 2002
Stepping onto an escalator is an act of faith. From time to time you see people poised at the top, advised by instinct not to launch themselves onto the river of treads. Riding the moving stairs is an adventure for the toddling young and a challenge to the tottering old. Natural hesitancy puts a limit on throughput. London Underground escalators carry passengers at a top speed of 145 feet per minute – close to the maximum allowed under the British Standard specification. There is little temptation to run the machines faster, as trials show that above 160 feet per minute so many people pause timidly that fewer are carried. In the early days they had to be persuaded to get on at all. A one-legged man, ‘Bumper’ Harris, was hired to ride for a whole day on the first installation – it was at Earls Court – to show how easy it was. Some people were sceptical (how had he lost his leg?) but others broke their journey there just to ride up and down.
8 days ago
A Spectre is Haunting Unicode
In 1978 Japan's Ministry of Economy, Trade and Industry established the encoding that would later be known as JIS X 0208, which still serves as an important reference for all Japanese encodings. However, after the JIS standard was released, people noticed something strange: several of the added characters had no obvious sources, and nobody could tell what they meant, how they should be pronounced, or where they came from. These are what came to be known as the ghost characters (幽霊文字).
16 days ago
'Biking while black': Chicago minority areas see the most bike tickets
As Chicago police ramp up their ticketing of bicyclists, more than twice as many citations are being written in African-American communities as in white or Latino areas, a Tribune review of police statistics has found.

The top 10 community areas for bike tickets from 2008 to Sept. 22, 2016, include seven that are majority African-American and three that are majority Latino. From the areas with the most tickets written to the least, they are Austin, North Lawndale, Humboldt Park, South Lawndale, Chicago Lawn, West Englewood, Roseland, West Garfield Park, New City and South Chicago.

Not a single majority-white area ranked in the top 10, despite biking's popularity in white areas such as West Town and Lincoln Park.

African-American cyclist Patric McCoy, 70, said he's experienced the heightened enforcement firsthand.
sdss  policing  mapping 
17 days ago
A Song of Ice and Databases: A Game of Thrones Data Model
Relevant to my Wire spreadsheet:

Starting from episode one, the storyline was intense, dynamic, and full of twists. George R.R. Martin did a great job of writing A Song of Ice and Fire, the multi-book series on which Game of Thrones is based. Only five of the projected seven books in the series are currently completed, and the TV series’ storyline is now ahead of the published books.

We can’t find out what will happen in the next season until it airs. In the meantime, let’s try something completely different: let’s create the Game of Thrones database model.
sdss  data-modeling  spreadsheets 
18 days ago

Good comment: We're used to simple things in the United States like "123 Maple Lane" but even those addresses can be awfully complex. And then you get oddities like Portland's 0234 SW Bancroft St; the leading 0 is significant. Hawaii addresses are like "96-3208 Maile St, Pahala"
reference  geocoding  geospatial  data-munging 
25 days ago
Download the Gang Databases We Got From Illinois State… — ProPublica
There’s info that’s unverified, subjective and simply wrong, yet government officials can access and use it, with potentially troubling consequences.
policing  databases  publicrecords  foia  sdss 
25 days ago
What data on 20 million traffic stops can tell us about ‘driving while black’
The book is based on data on 20 million traffic stops in North Carolina. Where did those data come from and what kinds of information do they contain?

In the late 1990s, the concept of “driving while black” began getting national attention. North Carolina became the first state to mandate the collection of traffic-stop data in 1999, thanks in large part to efforts by black representatives in the state legislature.

The database includes information on why the driver was pulled over, the outcome of the stop and demographic information about the driver. It also has an anonymous identification number for each officer as well as the time of the stop and the police agency that conducted it.

The initial law focused only on the State Highway Patrol, but it was expanded two years later to cover almost every police agency in the state. As a result, we have a record of virtually every traffic stop in the state since 2002.
policing  data-journalism  sdss 
25 days ago
The Absurdly Underestimated Dangers of CSV Injection
I’ve been doing the local usergroup circuit with this lately and have been asked to write it up.

In some ways this is old news, but in other ways…well, I think few realize how absolutely devastating and omnipresent this vulnerability can be. It is an attack vector available in every application I’ve ever seen that takes user input and allows administrators to bulk export to CSV.

That is just about every application.

Edit: Credit where due, I’ve been pointed to this article from 2014 by an actual security pro which discusses some of these vectors. And another one.

So let’s set the scene - imagine a time- or ticket-tracking app. Users enter their time (or tickets) but cannot view those of other users. A site administrator then comes along and exports entries to a CSV file, opening it in a spreadsheet application. Pretty standard stuff.
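A minimal sketch of both the attack and the usual mitigation (the usernames and the DDE-style payload are illustrative, and Python's csv module stands in for whatever export code the app actually uses):

```python
import csv
import io

# Cells beginning with these characters are treated as formulas
# by Excel and most other spreadsheet applications.
FORMULA_TRIGGERS = ("=", "+", "-", "@")

def defuse(cell: str) -> str:
    """Prefix formula-triggering cells with a single quote so the
    spreadsheet renders them as literal text instead of executing them."""
    if cell.startswith(FORMULA_TRIGGERS):
        return "'" + cell
    return cell

# A malicious "time entry" a user might submit: exported raw and opened
# in Excel, a DDE-style payload like this can launch a program.
rows = [["alice", "worked on tickets"],
        ["mallory", "=cmd|' /C calc'!A0"]]

buf = io.StringIO()
writer = csv.writer(buf)
for row in rows:
    writer.writerow([defuse(cell) for cell in row])

print(buf.getvalue())
```

The quote-prefix trick is the common defense precisely because the danger lives in the spreadsheet application, not in the CSV format itself.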
security  excel  spreadsheets  netsec 
26 days ago
Training for manipulating all kinds of things: Using Multi-byte Characters To Nullify SQL Injection Sanitizing
There are a number of hazards that using multiple character sets and multi-byte character sets can expose web applications to. This article will examine the normal method of sanitizing strings in SQL statements, research into multi-byte character sets, and the hazards they can introduce.

SQL Injection and Sanitizing
Web applications sanitize the apostrophe (') character in strings coming from user input before passing them to SQL statements, using an escape (\) character. The hex code for the escape character is 0x5c. When an attacker puts an apostrophe into user input, the ' is turned into \' during the sanitizing process. The DBMS does not treat \' as a string delimiter, and thus the attacker (in normal circumstances) is prevented from terminating the string and injecting malicious SQL into the statement.
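The classic demonstration uses GBK. A sketch of how the inserted 0x5c byte gets swallowed into a multi-byte character (Python here only plays the role of the decoder; the byte replacement stands in for the naive sanitizer described above):

```python
# The attacker submits a high byte followed by an apostrophe (0x27).
payload = b"\xbf\x27"

# Naive sanitizing inserts a backslash (0x5c) before the apostrophe:
escaped = payload.replace(b"\x27", b"\x5c\x27")   # -> b"\xbf\x5c\x27"

# A GBK-aware DBMS decodes 0xbf 0x5c as ONE multi-byte character,
# which leaves the apostrophe bare -- free to terminate the string.
decoded = escaped.decode("gbk")
print(repr(decoded))
assert len(decoded) == 2        # one CJK character, then a lone '
assert decoded.endswith("'")
```

The escape byte survives, but the character-set decoding absorbs it into the preceding character, so the sanitizing is nullified without the attacker ever sending a backslash.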
sql  unicode  databases  hacking 
26 days ago
TSA Third Party Prescreening - Federal Business Opportunities: Opportunities

pay $$$
& hand over personal info
for a _chance_
at a faster security line.
i paid,
i handed over,
i got my boarding pass!
i did not get that chance,
which, bah but whatever—
it makes me wonder—
who wrote the algorithm
that gets to decide??
government  algorithms  compciv 
4 weeks ago
Neural networks, explained – Physics World
Users of neural networks also have to make sure their algorithm has actually solved the correct problem. Otherwise, undetected biases in the input datasets may produce unintended results. For example, Roberto Novoa, a clinical dermatologist at Stanford University in the US, has described a time when he and his colleagues designed an algorithm to recognize skin cancer – only to discover that they’d accidentally designed a ruler detector instead, because the largest tumours had been photographed with rulers next to them for scale. Another group, this time at the University of Washington, demonstrated a deliberately bad algorithm that was, in theory, supposed to classify husky dogs and wolves, but actually functioned as a snow detector: they’d trained their algorithm with a dataset in which most of the wolf pictures had snowy backgrounds.
neural-networks  AI  machine-learning 
5 weeks ago
CIA archives document Agency’s decades of ASCII woes • MuckRock
In the ‘60s, the US federal government saw a need for a unified standard for digitally encoding information. Lyndon Johnson’s 1968 executive order on computer standards directed federal agencies to convert all of their databases to the new character encoding standard: the American Standard Code for Information Interchange, or ASCII.

Although more powerful and flexible standards have since appeared - most notably Unicode, created to enable people to use computers in any language - ASCII became ubiquitous, and remains foundational to computing. It was the most popular encoding on the web until 2007.

The new requirement applied to all federal agencies, including the Central Intelligence Agency. At first the Agency had no objections. In a November 1965 letter to the Secretary of Commerce uncovered in CREST, Director William Raborn signalled the CIA’s support of the standardization effort.
unicode  text  foia 
5 weeks ago
Python 3 at Facebook [LWN.net]
Fried started working at Facebook in 2013 and he quickly found that he needed to teach himself Python because it was much easier to get code reviewed if it was in Python. At some point later, he found that he was the driving force behind Python 3 adoption at Facebook. He never had a plan to do that, it just came about as he worked with Python more and more.
facebook  python 
6 weeks ago
How Florida ignited the heroin epidemic: A Palm Beach Post investigation

Palm Beach Post reporter Pat Beall, who presented at the IRE conference on using health data, has a new project out showing what you can do with it:


Florida provided the spark that ignited the heroin epidemic, Pat found. Her analysis of CDC data on fatal overdoses shows related trends – other opioid deaths dropping as heroin deaths rise. Those relationships get stronger closer to Florida, and are tied to when Florida began cracking down on its pill mills.

Pat used DEA reports and court records to show how, after other states had implemented prescription drug monitoring programs, Florida’s pill mills were supplying most states east of the Mississippi. An unprecedented and admittedly illegal marketing campaign from Purdue Pharma helped stoke demand for opioid pills.

And when Florida finally cracked down on the pill mills of South Florida, El Chapo was ready with a heroin supply. Reporter Lawrence Mower was told that Florida’s pills stopped one day in West Virginia, and the next day heroin was on the streets.

Pat wrote something like 40,000 words, about half a book. There are wonderful explanatory stories.

If you’re more into a graphical presentation, take a look at data reporter Mahima Singh’s approach:
investigations  sdss 
6 weeks ago
Timeless Debugging of Complex Software | Root Cause Analysis of a Non-Deterministic JavaScriptCore Bug | Ret2 Systems Blog
In software security, root cause analysis (RCA) is the process used to “remove the mystery” from irregular software execution and measure the security impact of such anomalies. This process often involves some form of user-controlled input (a Proof-of-Concept) that causes a target application to crash or otherwise misbehave.

This post documents the process of performing root cause analysis against a non-deterministic bug we discovered while fuzzing JavaScriptCore for Pwn2Own 2018. Utilizing advanced record-replay debugging technology from Mozilla, we will identify the underlying bug and use our understanding of the issue to speculate on its exploitability.
debugging  security 
7 weeks ago
Selecting comma separated data as multiple rows with SQLite
A while back I needed to split data stored in one column as a comma separated string into multiple rows in a SQL query from a SQLite database.
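One way to do it, sketched with Python's sqlite3 driver and a recursive CTE that repeatedly peels off the text before the first comma (the table and column names are made up; SQLite 3.8.3 or later is required for WITH RECURSIVE):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, tags TEXT)")
conn.execute("INSERT INTO t VALUES (1, 'red,green,blue'), (2, 'solo')")

rows = conn.execute("""
    WITH RECURSIVE split(id, value, rest) AS (
        -- seed: empty value, the CSV string with a trailing comma
        SELECT id, '', tags || ',' FROM t
        UNION ALL
        -- step: take everything before the first comma as the value,
        -- keep everything after it as the remainder
        SELECT id,
               substr(rest, 1, instr(rest, ',') - 1),
               substr(rest, instr(rest, ',') + 1)
        FROM split
        WHERE rest <> ''
    )
    SELECT id, value FROM split WHERE value <> ''
""").fetchall()

print(rows)   # every comma-separated tag is now its own row
```

Appending the trailing comma up front keeps the recursion uniform, so the last element needs no special case.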
sqlite  sql  snippets 
7 weeks ago
Why journalists should cover local jails | Poynter
While the nation's attention is focused on immigration detention centers along the U.S. border, more than 11 million people will spend time in local jails. They are caught in a complex and expensive system that treats poor people and minorities more severely. Most people in American jails have not been convicted of a crime. Many cannot afford even a few hundred dollars in bail to get out while awaiting trial.
journalism  justice  crime 
7 weeks ago
A Beginner's Guide to Firewalling with pf
This guide is written for the person very new to firewalling. Please realize that the sample firewall we build should not be considered appropriate for actual use. I just try to cover a few basics that took me a while to grasp from the better-known (and more detailed) documentation referenced below.

It's my hope that this guide will not only get you started, but give you enough of a grasp of using pf so that you will then be able to go to those more advanced guides and perfect your firewalling skills.

The pf packet filter was developed for OpenBSD but is now included in FreeBSD, which is where I've used it. Having it run at boot and the like is covered in the various documents; however, I'll quickly run through the steps for FreeBSD.
security  netsec  linux  guide 
7 weeks ago
I discovered a browser bug - JakeArchibald.com
I accidentally discovered a huge browser bug a few months ago and I'm pretty excited about it. Security engineers always seem like the "cool kids" to me, so I'm hoping that now I can be part of the club, and y'know, get into the special parties or whatever.

I've noticed that a lot of these security disclosure things are only available as PDFs. Personally, I prefer the web, but if you're a SecOps PDF addict, check out the PDF version of this post.

Oh, I guess the vulnerability needs an extremely tenuous name and logo right? Here goes:
security  http  chrome 
7 weeks ago
Twitter as Data
The rise of the internet and mobile telecommunications has created the possibility of using large datasets to understand behavior at unprecedented levels of temporal and geographic resolution. Online social networks attract the most users, though users of these new technologies provide their data through multiple sources, e.g. call detail records, blog posts, web forums, and content aggregation sites. These data allow scholars to adjudicate between competing theories as well as develop new ones, much as the microscope facilitated the development of the germ theory of disease. Of those networks, Twitter presents an ideal combination of size, international reach, and data accessibility that make it the preferred platform in academic studies. Acquiring, cleaning, and analyzing these data, however, require new tools and processes. This Element introduces these methods to social scientists and provides scripts and examples for downloading, processing, and analyzing Twitter data. All data and code for this Element is available at www.cambridge.org/twitter-as-data
book  twitter  data-mining 
8 weeks ago
David Eads
Hi, I'm David Eads. My work connects journalism, data, and social issues. I build and teach simple, direct solutions that help journalists effectively tell their stories on the web. I contribute to and organize projects that strive for democracy, diversity, and sustainability.

I make Internet journalism, most recently for ProPublica Illinois. I speak and teach about technology. I developed the Tarbell publishing platform. When I lived in Chicago, I organized a community data journalism workshop, and helped start and build FreeGeek Chicago.
8 weeks ago
Walt Hickey
I’m down to work with groups big and small about all sorts of topics related to my work, whether it’s walking undergrads in a stats course through how an article was written with the very techniques they’re learning or speaking in a corporate setting about how to effectively communicate compelling numbers.
8 weeks ago
Most Maps of the New Ebola Outbreak Are Wrong - The Atlantic
On Thursday, the World Health Organization released a map showing parts of the Democratic Republic of the Congo that are currently being affected by Ebola. The map showed four cases in Wangata, one of three “health zones” in the large city of Mbandaka. Wangata, according to the map, lies north of the main city, in a forested area on the other side of a river.

That is not where Wangata is.

#DRC #Ebola cases per Health Zone in Equateur province as of 15 May 2018

— Peter Salama (@PeteSalama) May 17, 2018
“It’s actually here, in the middle of Mbandaka city,” says Cyrus Sinai, indicating a region about 8 miles farther south, on a screen that he shares with me over Skype.

Almost all the maps of the outbreak zone that have thus far been released contain mistakes of this kind. Different health organizations all seem to use their own maps, most of which contain significant discrepancies. Things are roughly in the right place, but their exact positions can be off by miles, as can the boundaries between different regions.
mapping  maps  compciv  messy-data 
8 weeks ago
Lost in Migration: The American Chinese Menu
This essay is an analysis of 693 restaurant menus in seven American Chinatowns, exploring what the words “Chinese food” really mean and represent.

For Chinatown, food is complicated. Historically, Chinese restaurants were at first considered “pest holes” by white America, plagued with disease and rats. Slowly, however, dining establishments became fascinating to non-Chinese Americans, especially as they began touring Chinese settlements. Over time, ethnic dishes grew to be central to the food businesses that support Chinatown’s economy.
8 weeks ago
Methods of Comparison, Compared / Observable
I know it’s disappointing, but: none of them. No method is better universally, and none of them is “the best” even in the context of the dataset. (There are also a number of methods I did not cover, such as the relative difference.) What’s best depends on what you are trying to show. I’d favor absolute difference here as the simplest option, but log ratio might work if you want to show rate of growth.

There’s another important variable here which we’re ignoring, but which might influence our understanding of the data: population counts. This data is per capita (deaths per 100,000 people per year), which is helpful for understanding how likely any individual is to die, but not the number of people affected. Populations vary widely from county to county, and populations move over time. This makes it especially hard to understand trends that vary both geographically and temporally.
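The comparison methods being weighed can each be computed in a line; the two rates below are invented placeholders, not the article's data:

```python
import math

# Hypothetical county death rates (deaths per 100,000 people per year)
# at two points in time, purely to illustrate the comparison methods.
rate_then, rate_now = 12.0, 18.0

absolute_difference = rate_now - rate_then                 # per-100k change
relative_difference = (rate_now - rate_then) / rate_then   # fractional change
log_ratio = math.log2(rate_now / rate_then)                # doublings

print(absolute_difference, relative_difference, log_ratio)
```

The same pair of numbers reads as "+6 deaths per 100,000", "+50%", or "about 0.58 doublings" depending on the method, which is exactly why the choice has to follow from what you are trying to show.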
maps  visualizations 
8 weeks ago
[1805.12002] Why Is My Classifier Discriminatory?
Recent attempts to achieve fairness in predictive models focus on the balance between fairness and accuracy. In sensitive applications such as healthcare or criminal justice, this trade-off is often undesirable as any increase in prediction error could have devastating consequences. In this work, we argue that the fairness of predictions should be evaluated in the context of the data, and that unfairness induced by inadequate sample sizes or unmeasured predictive variables should be addressed through data collection, rather than by constraining the model. We decompose cost-based metrics of discrimination into bias, variance, and noise, and propose actions aimed at estimating and reducing each term. Finally, we perform case studies on prediction of income, mortality, and review ratings, confirming the value of this analysis. We find that data collection is often a means to reduce discrimination without sacrificing accuracy.
8 weeks ago
Invisible asymptotes — Remains of the Day
Great ideas are only obvious in retrospect. Amazon Prime – the subscription that ensures you don’t pay for shipping – was a stroke of marketing genius. Former employee Eugene Wei’s blog has the story of how it came about.


My first job at Amazon was as the first analyst in strategic planning, the forward-looking counterpart to accounting, which records what already happened. We maintained several time horizons for our forward forecasts, from granular monthly forecasts to quarterly and annual forecasts to even five and ten year forecasts for the purposes of fund-raising and, well, strategic planning.
amazon  business 
8 weeks ago
[JDK-8203360] Release Note: Japanese New Era Implementation - Java Bug System

Emperor of Japan is the only governor in the world whose enthronement requires changes in #Java core library.


The government is likely to announce the name of the next Imperial era around April 1, 2019, a month before Crown Prince Naruhito becomes the next emperor, Chief Cabinet Secretary Yoshihide Suga said Thursday.

The government will begin preparations for the change of gengō (era name) on the assumption that the new one will be announced about a month ahead of Naruhito’s ascension to the Chrysanthemum Throne on May 1, according to Suga.

“It takes roughly one month to adjust information systems to the new name in the public and private sectors,” Suga said, adding that they are working under an assumed timeline, and that the government has not decided the date when the name will be released.
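A toy sketch (not how the Java core library actually implements it) of why a new era forces a code change: the era table is just data that must be updated by hand, and the new entry cannot be written until the government announces the name.

```python
import datetime

# Era start dates and names. The final entry is blocked on the
# announcement expected around April 1, 2019:
#   (datetime.date(2019, 5, 1), "????")
ERAS = [
    (datetime.date(1868, 1, 25), "Meiji"),
    (datetime.date(1912, 7, 30), "Taishō"),
    (datetime.date(1926, 12, 25), "Shōwa"),
    (datetime.date(1989, 1, 8), "Heisei"),
]

def to_era(d: datetime.date) -> str:
    """Convert a Gregorian date to its Japanese era name and year
    (year 1 is the calendar year in which the era begins)."""
    for start, name in reversed(ERAS):
        if d >= start:
            return f"{name} {d.year - start.year + 1}"
    raise ValueError("date precedes the era table")

print(to_era(datetime.date(2018, 6, 1)))   # Heisei 30
print(to_era(datetime.date(1989, 1, 7)))   # Shōwa 64 -- the day before Heisei
```

Every system that formats Japanese dates carries some version of this table, which is why the roughly one-month adjustment window Suga describes matters so much.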
datetime  naming-things  java 
8 weeks ago
Predicting Gender Using Historical Data
A common problem for researchers who work with data, especially historians, is that a dataset has a list of people with names but does not identify the gender of the person. Since first names often indicate gender, it should be possible to predict gender using names. However, the gender associated with names can change over time. To illustrate, take the names Madison, Hillary, Jordan, and Monroe. For babies born in the United States, the predominant gender associated with those names has changed over time.
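A toy sketch of the approach: key the lookup by name *and* birth decade, not name alone. The shares below are invented placeholders; a real analysis would compute them from the Social Security Administration's baby-name counts.

```python
# Hypothetical share of babies with a given name recorded as female,
# keyed by (name, birth decade). Invented numbers for illustration only.
FEMALE_SHARE = {
    ("madison", 1930): 0.01,
    ("madison", 2000): 0.99,
    ("hillary", 1940): 0.20,
    ("hillary", 1990): 0.95,
}

def predict_gender(name: str, decade: int, threshold: float = 0.5) -> str:
    """Predict gender from a first name and birth decade; refuse to
    guess when the (name, decade) pair is not in the table."""
    share = FEMALE_SHARE.get((name.lower(), decade))
    if share is None:
        return "unknown"
    return "female" if share >= threshold else "male"

print(predict_gender("Madison", 1930))
print(predict_gender("Madison", 2000))
```

The same name yields opposite predictions in different decades, which is the whole argument for carrying the date through the lookup.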
data  statistics 
8 weeks ago
Giorgia Lupi: How we can find ourselves in data | TED Talk
Giorgia Lupi uses data to tell human stories, adding nuance to numbers. In this charming talk, she shares how we can bring personality to data, visualizing even the mundane details of our daily lives and transforming the abstract and uncountable into something that can be seen, felt and directly reconnected to our lives.
8 weeks ago
TensorFlow.js — a practical guide – YellowAnt
Recently, Google introduced its most popular machine learning library, TensorFlow, in JavaScript. With the help of TensorFlow.js, one can train and deploy ML models in the browser.

Goodbye to spending eons on complicated steps…
Before you start, I would recommend going through the docs of TensorFlow.js, to get a basic understanding of the context required for this article.
tensorflow  machine-learning 
8 weeks ago
Ahmed BESBES - Data Science Portfolio – Overview and benchmark of traditional and deep learning models in text classification
This article is an extension of a previous one I wrote when I was experimenting with sentiment analysis on Twitter data. Back then, I explored a simple model: a two-layer feed-forward neural network trained with Keras. The input tweets were represented as document vectors resulting from a weighted average of the embeddings of the words composing the tweet.

The embedding I used was a word2vec model I trained from scratch on the corpus using gensim. The task was binary classification, and with this setting I was able to achieve 79% accuracy.

The goal of this post is to explore other NLP models trained on the same dataset and then benchmark their respective performance on a given test set.

We'll go through different models, from simple ones relying on a bag-of-words representation to heavy machinery deploying convolutional/recurrent networks. We'll see if we can score more than 79% accuracy!
NLP  text-mining  deep-learning 
8 weeks ago
Suicide is desperate. It is hostile. It is tragic. But mostly, it is a bloody mess.
The blood was like Jell-O. That is what blood gets like, after you die, before they tidy up.

Somehow, I had expected it would be gone. The police and coroner spent more than an hour behind the closed door; surely it was someone’s job to clean it up. But when they left, it still covered the kitchen floor like the glazing on a candy apple.

You couldn’t mop it. You needed a dustpan and a bucket.

I got on my knees, slid the pan against the linoleum and lifted chunks to the bucket. It took hours to clean it all up, and even after that we found pools I had missed under the stove and sink.

It wasn’t until I finally stood up that I noticed the pictures from his wallet. The wooden breadboard had been pulled out slightly, and four photographs were spilled across it. “Now what?” I thought with annoyance. “What were the police looking for?”

But then it hit me. The police hadn’t done it. These snapshots — one of my mother, one of our dog and two of my brother and me — had been carefully set out in a row, by my father.

It was his penultimate act, just before he knelt on the floor, put the barrel of a .22 rifle in his mouth, and squeezed the trigger.

He was 46 years old. I was 21. This week marks the 20th anniversary of his death. And I am still cleaning up.
best  longform  depression 
9 weeks ago
🚀 100 Times Faster Natural Language Processing in Python
I also published a Jupyter notebook with the examples I describe in this post.
When we published our Python coreference resolution package✨ last year, we got amazing feedback from the community, and people started to use it for many applications 📚, some very different from our original dialog use-case 👥.

And we discovered that, while the speed was totally fine for dialog messages, it could be really slow 🐌 on larger news articles.

I decided to investigate this in detail, and the result is NeuralCoref v3.0, which is about 100 times faster 🚀 than the previous version (several thousand words per second) while retaining the same accuracy, ease of use, and ecosystem of a Python library.

In this post I wanted to share a few lessons learned on this project, and in particular:
python  NLP 
9 weeks ago
This Is America’s Richest Zip Code - Bloomberg
The richest zip code in America is just as exclusive and elite as the people who live there. Fisher Island, located just off the coast of Miami, is accessible only by ferry or water taxi and is a haven for the world’s richest.

The 216-acre island has diverse residents, representing over 50 nationalities and professions ranging from professional athletes and supermodels to executives and lawyers.

The average income in Fisher Island, zip code 33109, was $2.5 million in 2015, according to a Bloomberg analysis of 2015 Internal Revenue Service data. That’s $1 million more than the second-place spot, held by zip code 94027 in Silicon Valley, also known as the City of Atherton on the San Francisco Peninsula. The area’s neighbors include Stanford University and Menlo Park, home to Facebook and various tech companies. While the IRS data only provide the averages of tax returns, which can be skewed by outliers, Fisher Island is the only zip code in the Bloomberg analysis where more than half of all tax returns showed an income of over $200,000.
9 weeks ago
‘Anything would be better:’ Critics warn Ottawa’s family-reunification lottery is flawed, open to manipulation - The Globe and Mail
Excel’s method for generating random numbers is “very bad,” according to Université de Montréal computer-science professor Pierre L’Ecuyer, an expert in random-number generation. “It’s a very old generator, and it’s really not state-of-the-art.” Prof. L’Ecuyer’s research has shown that Excel’s random-number generator doesn’t pass certain statistical tests, meaning it’s less random than it appears. Under the current system, “it may be that not everybody has exactly the same chance,” Prof. L’Ecuyer said.

Excel uses pseudo-random number generators, a class of algorithms that rely on formulas to generate numbers. These generators have a key flaw – they rely on a “seed” number to kick off the mathematical process. In the case of Excel, this seed is generated automatically by the application. “If you know one number at one step,” Prof. L’Ecuyer explained, “you can compute all the numbers that will follow.”

This means the process could be exploited by someone with the right skills. It’s happened before: In 1994, IT consultant Daniel Corriveau discovered a pattern in a keno game – which uses a random numbering system – at the Casino de Montréal and won $620,000 in a single evening. An investigation later determined the game was using the same seed number at the start of each day.
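The keno incident is the seed flaw in miniature. A sketch using Python's random module (a Mersenne Twister here, standing in for Excel's generator; the principle is the same for any pseudo-random generator):

```python
import random

# A pseudo-random generator is a deterministic formula driven entirely
# by its seed. If the "casino" reuses a seed, an "attacker" who knows
# it can reproduce every draw in advance.
seed = 20180101
gen_casino = random.Random(seed)
gen_attacker = random.Random(seed)

# Ten keno-style draws (numbers 1-80) from each generator:
draws_casino = [gen_casino.randint(1, 80) for _ in range(10)]
draws_attacker = [gen_attacker.randint(1, 80) for _ in range(10)]

assert draws_casino == draws_attacker   # identical sequences, guaranteed
print(draws_casino)
```

This is why lotteries and casinos are supposed to use hardware entropy or cryptographic generators rather than a spreadsheet's convenience function.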
randomness  excel  spreadsheets 
9 weeks ago
An Investigative Arsenal: Power Chargers, Document Analysis Tools and More - The New York Times
I have a simple but low-tech trick for keeping track of documents. As I extract individual emails, other documents or audio files, I name them this way: “2017_04_24 Pruitt NMA Naples Fla Calendar Entry.” That date format means that if you have, say, 20 documents in a folder, they will automatically line up chronologically. It is a super fast way to have a timeline of all your primary source documents, and makes it easy to find them instantly in chronological order. Try it out. It is pretty cool.
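The trick works because a zero-padded year-month-day prefix makes lexicographic order identical to chronological order, so any plain sort lines the documents up in time. A quick sketch (the filenames other than the source's example are invented):

```python
# Date-first, zero-padded names: an ordinary string sort is a timeline.
files = [
    "2017_11_02 Pruitt Travel Voucher",               # invented example
    "2017_04_24 Pruitt NMA Naples Fla Calendar Entry",
    "2016_12_09 Confirmation Email",                  # invented example
]

for name in sorted(files):
    print(name)
```

Note that the padding matters: `2017_4_24` would sort after `2017_11_02`, which is why the month and day each get two digits.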
ISO8601  time 
9 weeks ago
Open Data Policing
Open Data Policing is a project of the Southern Coalition for Social Justice. The site’s North Carolina platform was launched in December 2015. The North Carolina development team consisted of attorney Ian Mance of the Southern Coalition and volunteer developers Colin Copeland, Andy Shapiro, and Dylan Young, all of Durham, NC. The Maryland and Illinois platforms were launched in October 2016 and were developed by Southern Coalition and Caktus Group, with generous support from the Open Society Foundations’ Democracy Fund.
Traffic  open-data  policing 
9 weeks ago
washingtonpost/data-homicides: The Washington Post collected data on more than 52,000 criminal homicides over the past decade in 50 of the largest American cities.
datasets  mapping 
9 weeks ago
Washington sues Facebook and Google over failure to disclose political ad spending | TechCrunch
(Note that these don’t add up to the totals mentioned above; these are the numbers filed with the state’s Public Disclosure Committee. 2018 amounts are listed but are necessarily incomplete, so I omitted them.)

At least some of the many payments making up these results are not properly documented, and from the looks of it, this could amount to willful negligence. If a company is operating in a state and taking millions for political ads, it really can’t be unaware of that state’s disclosure laws. Yet according to the lawsuits, even basic data like names and addresses of advertisers and the amounts paid were not collected systematically, let alone made available publicly.
lobbying  facebook  google  politics 
10 weeks ago
The Mouse Vs. The Python | Python Programming from the Frontlines
My name is Mike Driscoll. I am a computer programmer by trade and use Python almost exclusively to make my living. I’m on the wxPython mailing list and their IRC channel (#wxpython) on freenode a lot, so if you’d like to find me, you can do so there.
python  resource 
10 weeks ago
“Behave More Sexually:” How Big Pharma Used Strippers, Guns, and Cash to Push Opioids – Mother Jones
Around 2015, just before overdoses sweeping the country started making national news, a pharmaceutical sales representative in New Jersey faced a dilemma: She wanted to increase her sales but worried that the opioid painkiller she was selling was addictive and dangerous. The medication was called Subsys, and its key ingredient, fentanyl, is a synthetic opioid 100 times stronger than morphine.

When the rep, who requested to go by her initials, M.S., voiced her concerns to her manager, she was told that Subsys patients were “already addicts and their prospects were therefore essentially rock-bottom,” according to a recently unsealed whistleblower lawsuit that M.S. filed after leaving Insys in 2016. To boost her numbers, the manager allegedly advised M.S. to “behave more sexually toward pain-management physicians, to stroke their hands while literally begging for prescriptions,” and to ask for the prescriptions as a “favor.”
data-journalism  pharmalot  best 
10 weeks ago
Study purged voters and felons
Florida’s purge matched voters to felons using only an 80 percent match on people’s names
10 weeks ago
Hispanics missing from voter purge list - News - Sarasota Herald-Tribune - Sarasota, FL
A data quirk in the state’s controversial effort to purge convicted felons from the voter rolls appears to have excluded Hispanics in greater numbers than other races.

Only 61 of the 47,763 names on the potential purge list are classified as Hispanic. Hispanics make up 17 percent of the state population, but a little more than one-tenth of 1 percent of the names on the list.

The missing Hispanics could feed into the Democratic Party’s contention that the purge is Jeb Bush’s plan to help his brother win Florida in the November presidential election.

All but one of the state’s Hispanic legislators are Republicans. And Cubans, who make up the largest single segment of the state’s Hispanic population, have traditionally supported the GOP.

“It’s sloppy work to say the least,” said Allie Merzer, spokeswoman for the Florida Democratic Party. “Is it intent? I don’t know. But something doesn’t smell right.”
census  demographics  voting  joins  bad-data 
10 weeks ago
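A toy illustration of the kind of join quirk described above (the records here are invented): if a match requires race codes to agree, and one database has no “Hispanic” category and codes those people as “White” instead, Hispanic voters can silently fall off the match list.

```python
# Hypothetical records, for illustration only.
voters = [
    {"name": "Ana Diaz",  "race": "Hispanic"},
    {"name": "Bob Smith", "race": "White"},
]
felons = [
    {"name": "Ana Diaz",  "race": "White"},   # this database lacks a Hispanic code
    {"name": "Bob Smith", "race": "White"},
]

# An exact join on name AND race drops Ana Diaz entirely.
matches = [
    v["name"]
    for v in voters
    for f in felons
    if v["name"] == f["name"] and v["race"] == f["race"]
]
print(matches)  # ['Bob Smith']
```

The same mismatch that excludes someone from a purge list here could, with the logic inverted, wrongly include someone elsewhere; either way the join criteria, not the underlying people, drive the result.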
Handling Data about Race and Ethnicity - Learning - Source: An OpenNews project
Here’s what happened. The state—led by Republican superstar Jeb Bush—decided it should purge felons who were legally ineligible to vote from the voter rolls. I could spend the rest of this case study going into the nuances here, but the tl;dr is this: The state developed a list of 47,000 people it said were felons and local elections officials should remove them. Because the list leaned heavily to the left, Democrats cried politics and disenfranchisement. We reporters set out to find out what was what. Were people improperly being stripped of their voting rights? Were felons illegally voting? And remember: the 2000 presidential election in Florida was determined by 537 votes, so removing 47,000 voters could tip an election.

So we have a big list of names and the fate of the democracy in the balance (cough). No problem. This is what data journalism is all about! Take dataset A, compare to dataset B and voila! Magic happens.

Being a competitive guy at a Florida news org, I wanted to do this big. I wanted to show how accurate or not accurate this felons list was, with statistical validity. I wanted to use actual social science to investigate it. A couple of anecdotes and a bunch of quotes wasn’t good enough for the state’s largest newspaper, the St. Petersburg Times (which is now the Tampa Bay Times). So I devised a method that would give us percentages of accuracy with a margin of error. In short, we were going to take a representative sample of names on the list—359 of them—and background check them, all in a day. Each background check cost the paper between $50 and $100, depending on how much information we needed to verify. At a minimum, we needed full names, dates of birth, previous addresses, and a criminal history from the state. I had an army of incredibly talented news researchers working with me, and by the end of the day, we found that 59 percent of the list was easily correct, 37 percent were murky, and four percent, or 220 people, were falsely being targeted for purging. We even talked to a man who faced losing his voting rights because he had the same first and last name and date of birth as another man with a Florida criminal conviction. With a massive amount of work and in less than a day, we proved the state’s list released that day was flawed.
data  racial  statistics  data-anecdote  bad-data  census  demographics  data-journalism  best 
10 weeks ago
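The sampling approach above—check 359 of 47,763 names and report percentages with a margin of error—can be sketched with the standard formula for the margin of error of a proportion. This is a generic illustration, not the paper’s actual calculation:

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    # Margin of error for a sample proportion p with sample size n,
    # at 95% confidence (z = 1.96), ignoring finite-population correction.
    return z * math.sqrt(p * (1 - p) / n)

# A sample of 359 names finding 59% "easily correct":
moe = margin_of_error(0.59, 359)
print(f"59% +/- {moe * 100:.1f} points")  # roughly +/- 5 points
```

With n = 359, even a 50/50 split carries only about a five-point margin of error—enough precision to say with confidence that the state’s list contained real errors.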
Drawing Conclusions from Data - Learning - Source: An OpenNews project
Data doesn’t just come from thin air. It’s collected by specific people—or machines—for a specific purpose. There may also be people who have a financial or political interest in the numbers. For example, a police department wants to see crime statistics go down and this may affect how crimes are recorded. You must understand the data generation process, and the types of errors it’s likely to introduce. Many data journalists call this process “interviewing the data.” Here are some questions you can ask:
statistics  data 
10 weeks ago