Linus is stubborn, persistent, and unyielding to what he sees as bullshit. These... | Hacker News
bookmarking this thread as an example for when anti-sjw folks whine about how their anti-sjw comments are downvoted en masse without rebuttal.

https://news.ycombinator.com/item?id=18001906
hacker-news  meta 
3 days ago
Look What You Made Me Do, Chrome – Amy Nguyen – Medium
How to use Chrome Developer Tools to get tickets to Taylor Swift’s next concert
For her upcoming concert, Taylor Swift partnered with Ticketmaster to ensure that only legitimate fans can buy tickets. I’d like to say that I’m a true fan who will do the honest work to get a ticket… but I am also a woman with a computer and I like a challenge.

I ended up having a lot of fun exploring Chrome Developer Tools and I wanted to share what I learned. Here’s what we’ll cover in this post:

How to send code through the Console tab
How to use the Network tab to find relevant activity
XHR breakpoints
Putting this all together to create fake user activity
chrome  web-dev  devtools 
4 days ago
How a Times Software Engineer Ended Up Covering Miss America - The New York Times
There was no database, but the best databases aren’t handed to reporters. They’re handmade. I opened a blank spreadsheet and started digging through years of old news clips and pageant websites. I tracked who was competing where, what titles they were winning, and how.
spreadsheets  data-journalism  best 
7 days ago
Highest Median Household Income on Record?
Without adjusting for the change in the income questions, 2017 has the highest median household income on record (since 1967). When you adjust for the change, median household income in previous years was just as high.
census  dirty-data 
8 days ago
How We Mapped 1.3m Data Points Using Mapbox - Features - Source: An OpenNews project
A fact of life at the Financial Times is the sheer wealth of cartographic talent here: data visualization editor Alan Smith studied and began his career making maps; visual journalist Chris Campbell crafts some of the FT’s most sumptuous cartographic creations; and interactive design editor Steve Bernard is renowned throughout the interwebs for his encyclopaedic knowledge of QGIS. Accordingly, it was with a combination of excitement and very real dread that, when Alan approached us in late February 2018 to ask if we’d be interested in mapping Britain’s broadband speeds, we said: “Sure! Sounds… great?”
mapping  gis  mapbox  reactjs  data-visualization 
8 days ago
SQLite: Infinite loop due to the ‘order by limit’ optimization
https://news.ycombinator.com/item?id=17964243

A confluence of multiple defects in the code generator associated with the ORDER BY LIMIT optimization causes the prepared statement to enter an infinite loop while running the final query in the following SQL:
debugging  sqlite  sql 
8 days ago
How the U.S. Government Misleads the Public on Afghanistan - The New York Times
More than 2,200 Americans have been killed in the Afghan conflict, and the United States has spent more than $840 billion fighting the Taliban insurgency and paying for relief and reconstruction. The war has become more expensive, in current dollars, than the Marshall Plan, which helped to rebuild Europe after World War II. That investment has created intense pressure for Americans to show the Taliban are losing and the country is improving.
data-visualization 
11 days ago
Select Star SQL
https://news.ycombinator.com/item?id=17905666

This is an interactive book which aims to be the best place on the internet for learning SQL. It is free of charge, free of ads and doesn't require registration or downloads. It helps you learn by running queries against a real-world dataset to complete projects of consequence. It is not a mere reference page — it conveys a mental model for writing SQL.
sql  book  tutorial  database 
17 days ago
How Football Fed Timothy McVeigh’s Despair - POLITICO Magazine
The Oklahoma City bomber was once a promising young Gulf War veteran. His slide into isolation and extremism happened to dovetail with the fate of his beloved Buffalo Bills.
wtf 
18 days ago
Lost in the Storm - The New York Times
Houston thought it was prepared for a major hurricane like Harvey. But as the flooding overwhelmed the city’s emergency systems, families like the Daileys found out too late that they were on their own.
death-data  longform 
20 days ago
He Got Out of Jail Because of Missing Paperwork. Then, Police Say, He Raped and Killed. - The New York Times
Court clerks with clearances are now able to check criminal records themselves through the state’s electronic portal. But because of the law, judges rely on the local police instead, said Lucian Chalfen, a spokesman for the Office of Court Administration.

In Mr. Drayton’s case, state officials said his comprehensive records were sent to the police, the district attorney’s office and the court in Nassau County on July 1, four days before the judge’s bail decision, through an electronic portal called eJustice. But the judge and the prosecutor handling the case never checked the portal, court officials and the Nassau County District Attorney’s office said.
databases  justice  joins 
29 days ago
Data Access - Compressed Mortality File
The Compressed Mortality File (CMF) is comprised of a county-level national mortality file and a corresponding county-level national population file. The mortality file of the CMF contains a select subset of the variables contained in the detailed annual mortality files. Currently, the CMF spans the years 1968-2016 and is divided into four parts: 1968-78, 1979-88, 1989-98, and 1999-2016. The first two parts are public use files and are available on a CD-ROM (CMF 1968-88 Series 20 No. 2A). The other two parts can be made available on CD-ROMs to researchers under Part II Use Agreements (CMF 1989-98 Series 20 No. 2E and CMF 1999-2016 Series 20 No. 2V). The CMF is also available on CDC WONDER as an online interactive query data base (see Interactive Data Bases and Tables). The CMF is a relatively compact file as it contains only a select set of variables.
sdss  datasets 
4 weeks ago
The Chicago Police Files
The Invisible Institute, in partnership with The Intercept, investigates the corruption, racism, and violence of the Chicago Police Department.
investigation  data-journalism 
5 weeks ago
Trump Literally Doesn't Understand Time Zones
It seems that there is no way to overestimate our president’s ignorance. Yet it’s still hard to comprehend the fact that, according to a new Politico article, he has no idea how time zones work.

Politico reports that Trump’s time zone confusion came up on a “constant basis.” For example, he’d ask to call Shinzo Abe, the Prime Minister of Japan, during the afternoon in Washington—the middle of the night Japan time.

“He wasn’t great with recognizing that the leader of a country might be 80 or 85 years old and isn’t going to be awake or in the right place at 10:30 or 11 p.m. their time,” a former Trump NSC official told Politico. “When he wants to call someone, he wants to call someone. He’s more impulsive that way. He doesn’t think about what time it is or who it is,” another source reported.

“He’s the president of the United States. He’s not stopping to add up [time differences],” a source told Politico. “I don’t think anybody would expect him or Obama or Bush or Clinton or anybody to do that. That’s the whole reason you have a staff to say ‘Yes, we’ll set it up,’ and then they find a time that makes most sense.” Ok sure, but I’m going to guess that Obama and Clinton were at least aware of the fact that only one side of Earth faces towards the sun at any given time.
time 
5 weeks ago
Peter Campbell · Why does it take so long to mend an escalator? · LRB 7 March 2002
Stepping onto an escalator is an act of faith. From time to time you see people poised at the top, advised by instinct not to launch themselves onto the river of treads. Riding the moving stairs is an adventure for the toddling young and a challenge to the tottering old. Natural hesitancy puts a limit on throughput. London Underground escalators carry passengers at a top speed of 145 feet per minute – close to the maximum allowed under the British Standard specification. There is little temptation to run the machines faster, as trials show that above 160 feet per minute so many people pause timidly that fewer are carried. In the early days they had to be persuaded to get on at all. A one-legged man, ‘Bumper’ Harris, was hired to ride for a whole day on the first installation – it was at Earls Court – to show how easy it was. Some people were sceptical (how had he lost his leg?) but others broke their journey there just to ride up and down.
6 weeks ago
A Spectre is Haunting Unicode
In 1978 Japan's Ministry of Economy, Trade and Industry established the encoding that would later be known as JIS X 0208, which still serves as an important reference for all Japanese encodings. However, after the JIS standard was released people noticed something strange - several of the added characters had no obvious sources, and nobody could tell what they meant or how they should be pronounced. Nobody was sure where they came from. These are what came to be known as the ghost characters (幽霊文字).
unicode 
7 weeks ago
'Biking while black': Chicago minority areas see the most bike tickets
As Chicago police ramp up their ticketing of bicyclists, more than twice as many citations are being written in African-American communities than in white or Latino areas, a Tribune review of police statistics has found.

The top 10 community areas for bike tickets from 2008 to Sept. 22, 2016, include seven that are majority African-American and three that are majority Latino. From the areas with the most tickets written to the least, they are Austin, North Lawndale, Humboldt Park, South Lawndale, Chicago Lawn, West Englewood, Roseland, West Garfield Park, New City and South Chicago.

Not a single majority-white area ranked in the top 10, despite biking's popularity in white areas such as West Town and Lincoln Park.

African-American cyclist Patric McCoy, 70, said he's experienced the heightened enforcement firsthand.
sdss  policing  mapping 
7 weeks ago
A Song of Ice and Databases: A Game of Thrones Data Model
Relevant to my Wire spreadsheet:

Starting from episode one, the storyline was intense, dynamic, and full of twists. George R.R. Martin did a great job of writing A Song of Ice and Fire, the multi-book series on which Game of Thrones is based. Only five of the projected seven books in the series are currently completed, and the TV series’ storyline is now ahead of the published books.

We can’t find out what will happen in the next season. Until then, let’s try something completely different. Let’s create the Game of Thrones database model.
sdss  data-modeling  spreadsheets 
7 weeks ago
FRANK'S COMPULSIVE GUIDE TO POSTAL ADDRESSES
https://news.ycombinator.com/item?id=17577127

Good comment: We're used to simple things in the United States like "123 Maple Lane" but even those addresses can be awfully complex. And then you get oddities like Portland's 0234 SW Bancroft St; the leading 0 is significant. Hawaii addresses are like "96-3208 Maile St, Pahala"
reference  geocoding  geospatial  data-munging 
8 weeks ago
Download the Gang Databases We Got From Illinois State… — ProPublica
There’s info that’s unverified, subjective and simply wrong, yet government officials can access and use it, with potentially troubling consequences.
policing  databases  publicrecords  foia  sdss 
8 weeks ago
What data on 20 million traffic stops can tell us about ‘driving while black’
The book is based on data on 20 million traffic stops in North Carolina. Where did those data come from and what kinds of information do they contain?

In the late-1990s, the concept of “driving while black” began getting national attention. North Carolina became the first state to mandate the collection of traffic stops data in 1999, thanks in large part to efforts by black representatives in the state legislature.

The database includes information on why the driver was pulled over, the outcome of the stop and demographic information about the driver. It also has an anonymous identification number for each officer as well as the time of the stop and the police agency that conducted it.

The initial law focused only on the State Highway Patrol, but it was expanded two years later to cover almost every police agency in the state. As a result, we have a record of virtually every traffic stop in the state since 2002.
policing  data-journalism  sdss 
8 weeks ago
The Absurdly Underestimated Dangers of CSV Injection
I’ve been doing the local usergroup circuit with this lately and have been asked to write it up.

In some ways this is old news, but in other ways…well, I think few realize how absolutely devastating and omnipresent this vulnerability can be. It is an attack vector available in every application I’ve ever seen that takes user input and allows administrators to bulk export to CSV.

That is just about every application.

Edit: Credit where due, I’ve been pointed to this article from 2014 by an actual security pro which discusses some of these vectors. And another one.

So let’s set the scene - imagine a time or ticket tracking app. Users enter their time (or tickets) but cannot view those of other users. A site administrator then comes along and exports entries to a csv file, opening it up in a spreadsheet application. Pretty standard stuff.
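The export step in that scenario can be sketched in Python. This is a minimal illustration of one commonly recommended mitigation (prefixing formula-like cells with an apostrophe so a spreadsheet reads them as text), not the linked article's own code. The names `neutralize_cell` and `export_rows` and the sample entries are hypothetical.

```python
import csv
import io

# Leading characters that spreadsheet applications commonly treat as the start of a formula.
FORMULA_PREFIXES = ("=", "+", "-", "@", "\t", "\r")


def neutralize_cell(value):
    """Prefix formula-like cells with an apostrophe so they are read as plain text."""
    text = str(value)
    if text.startswith(FORMULA_PREFIXES):
        return "'" + text
    return text


def export_rows(rows):
    """Return CSV text for the given rows, neutralizing every cell first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
    for row in rows:
        writer.writerow([neutralize_cell(cell) for cell in row])
    return buffer.getvalue()


# Hypothetical time-tracking entries; the second one carries a formula payload.
entries = [
    ["alice", "Fixed login bug", "2.5"],
    ["mallory", '=HYPERLINK("http://example.invalid","click")', "1"],
]
csv_text = export_rows(entries)
print(csv_text)
```

One trade-off of this sketch: legitimate values that start with `-` or `+`, such as negative numbers, are also prefixed, so a real exporter would likely need per-column rules.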
security  excel  spreadsheets  netsec 
9 weeks ago
Training for manipulating all kinds of things: Using Multi-byte Characters To Nullify SQL Injection Sanitizing
There are a number of hazards that using multiple character sets and multi-byte character sets can expose web applications to. This article will examine the normal method of sanitizing strings in SQL statements, research into multi-byte character sets, and the hazards they can introduce.

SQL Injection and Sanitizing
Web applications sanitize the apostrophe (') character in strings coming from user input being passed to SQL statements using an escape (\) character. The hex code for the escape character is 0x5c. When an attacker puts an apostrophe into a user input, the ' is turned into \' during the sanitizing process. The DBMS does not treat \' as a string delimiter, and thus the attacker (in normal circumstances) is prevented from terminating the string and injecting malicious SQL into the statement.
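The excerpt stops before the multi-byte trick itself. A widely cited instance uses the GBK character set, where the inserted 0x5c can be absorbed as the second byte of a two-byte character. The sketch below assumes a hypothetical byte-level sanitizer (`naive_escape`) and a database that interprets the result as GBK. It only demonstrates the decoding effect and does not talk to a database.

```python
ESCAPE = b"\x5c"      # backslash, the escape character
APOSTROPHE = b"\x27"  # the string delimiter being sanitized


def naive_escape(raw: bytes) -> bytes:
    """Hypothetical sanitizer: put a backslash before every apostrophe byte."""
    return raw.replace(APOSTROPHE, ESCAPE + APOSTROPHE)


# Attacker input: a GBK lead byte (0xbf) placed directly before an apostrophe.
payload = b"\xbf\x27 OR 1=1 --"
escaped = naive_escape(payload)

# Byte-wise, the apostrophe now looks escaped: bf 5c 27.
print(escaped[:3].hex(" "))

# Decoded as GBK, bf 5c is consumed as one two-byte character,
# so the apostrophe that follows is no longer preceded by a backslash.
decoded = escaped.decode("gbk")
print(repr(decoded[1]), "\\" in decoded)
```

Parameterized queries sidestep this class of problem, because user input is never spliced into the SQL text.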
sql  unicode  databases  hacking 
9 weeks ago
TSA Third Party Prescreening - Federal Business Opportunities: Opportunities
https://twitter.com/sarambsimon/status/1017557803010068481

TSA PRECHECK—
pay $$$
& hand over personal info
for a _chance_
at a faster security line.
i paid,
i handed over,
i got my boarding pass!
i did not get that chance,
which, bah but whatever—
it makes me wonder—
who wrote the algorithm
that gets to decide??
government  algorithms  compciv 
10 weeks ago
Neural networks, explained – Physics World
Users of neural networks also have to make sure their algorithm has actually solved the correct problem. Otherwise, undetected biases in the input datasets may produce unintended results. For example, Roberto Novoa, a clinical dermatologist at Stanford University in the US, has described a time when he and his colleagues designed an algorithm to recognize skin cancer – only to discover that they’d accidentally designed a ruler detector instead, because the largest tumours had been photographed with rulers next to them for scale. Another group, this time at the University of Washington, demonstrated a deliberately bad algorithm that was, in theory, supposed to classify husky dogs and wolves, but actually functioned as a snow detector: they’d trained their algorithm with a dataset in which most of the wolf pictures had snowy backgrounds.
neural-networks  AI  machine-learning 
10 weeks ago
CIA archives document Agency’s decades of ASCII woes • MuckRock
In the ‘60s, the US federal government saw a need for a unified standard for digitally encoding information. Lyndon Johnson’s 1968 executive order on computer standards directed federal agencies to convert all of their databases to the new character encoding standard: the American Standard Code for Information Interchange, or ASCII.

Although more powerful and flexible standards have since appeared - most notably Unicode, created to enable people to use computers in any language - ASCII became ubiquitous, and remains foundational to computing. It was the most popular encoding on the web until 2007.

The new requirement applied to all federal agencies, including the Central Intelligence Agency. At first the Agency had no objections. In a November 1965 letter to the Secretary of Commerce uncovered in CREST, Director William Raborn signalled the CIA’s support of the standardization effort.
unicode  text  foia 
10 weeks ago
Python 3 at Facebook [LWN.net]
Fried started working at Facebook in 2013 and he quickly found that he needed to teach himself Python because it was much easier to get code reviewed if it was in Python. At some point later, he found that he was the driving force behind Python 3 adoption at Facebook. He never had a plan to do that, it just came about as he worked with Python more and more.
facebook  python 
12 weeks ago
How Florida ignited the heroin epidemic: A Palm Beach Post investigation
to NICAR-L

Palm Beach Post reporter Pat Peall, who presented at the IRE conference on using health data, has a new project out showing what you can do with it:

https://heroin.palmbeachpost.com

Florida provided the spark that ignited the heroin epidemic, Pat found. Her analysis of CDC data on fatal overdoses shows relationships – other opioid deaths dropping, and heroin deaths rising. And those relationships get stronger closer to Florida, and are tied to when Florida began cracking down on its pill mills.

Pat used DEA reports and court records to show how, after other states had implemented prescription drug monitoring programs, Florida’s pill mills were supplying most states east of the Mississippi. An unprecedented and admittedly illegal marketing campaign from Purdue Pharma helped stoke demand for opioid pills.

And when Florida finally cracked down on the pill mills of South Florida, El Chapo was ready with a heroin supply. Reporter Lawrence Mower was told that Florida’s pills stopped one day in West Virginia, and the next day heroin was on the streets.

Pat wrote something like 40,000 words, about half a book. There are wonderful explanatory stories.

If you’re more into a graphical presentation, take a look at data reporter Mahima Singh’s approach:
investigations  sdss 
12 weeks ago
Timeless Debugging of Complex Software | Root Cause Analysis of a Non-Deterministic JavaScriptCore Bug Ret2 Systems Blog
In software security, root cause analysis (RCA) is the process used to “remove the mystery” from irregular software execution and measure the security impact of such asymmetries. This process will often involve some form of user controlled input (a Proof-of-Concept) that causes a target application to crash or misbehave otherwise.

This post documents the process of performing root cause analysis against a non-deterministic bug we discovered while fuzzing JavaScriptCore for Pwn2Own 2018. Utilizing advanced record-replay debugging technology from Mozilla, we will identify the underlying bug and use our understanding of the issue to speculate on its exploitability.
debugging  security 
12 weeks ago
Selecting comma separated data as multiple rows with SQLite
A while back I needed to split data stored in one column as a comma separated string into multiple rows in a SQL query from a SQLite database.
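One way to do this kind of split is a recursive common table expression. The sketch below runs it through Python's `sqlite3` module. It is not necessarily the linked post's query, and the `bookmarks` table, its columns, and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookmarks (id INTEGER PRIMARY KEY, tags TEXT)")
conn.executemany(
    "INSERT INTO bookmarks (id, tags) VALUES (?, ?)",
    [(1, "sqlite,sql,snippets"), (2, "debugging,security")],
)

SPLIT_QUERY = """
WITH RECURSIVE split(id, tag, rest) AS (
    -- Seed: nothing extracted yet; a trailing comma makes every item end with one.
    SELECT id, '', tags || ',' FROM bookmarks
    UNION ALL
    -- Step: peel off the text before the first comma and keep the remainder.
    SELECT id,
           substr(rest, 1, instr(rest, ',') - 1),
           substr(rest, instr(rest, ',') + 1)
    FROM split
    WHERE rest <> ''
)
SELECT id, tag FROM split WHERE tag <> '' ORDER BY id, tag
"""

rows = conn.execute(SPLIT_QUERY).fetchall()
for row in rows:
    print(row)
```

Each bookmark row yields one output row per tag. The seed rows, which carry an empty `tag`, are filtered out by the final `WHERE`.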
sqlite  sql  snippets 
12 weeks ago
Why journalists should cover local jails | Poynter
While the nation's attention is focused on immigration detention centers along the U.S. border, more than 11 million people will spend time in local jails. They are caught in a complex and expensive system that treats poor people and minorities more severely. Most people in American jails have not been convicted of a crime. Many cannot afford even a few hundred dollars bail to get out awaiting trial.
journalism  justice  crime 
june 2018
A Beginner's Guide to Firewalling with pf
This guide is written for the person very new to firewalling. Please realize that the sample firewall we build should not be considered appropriate for actual use. I just try to cover a few basics that took me a while to grasp from the better known (and more detailed) documentation referenced below.

It's my hope that this guide will not only get you started, but give you enough of a grasp of using pf so that you will then be able to go to those more advanced guides and perfect your firewalling skills.

The pf packet filter was developed for OpenBSD but is now included in FreeBSD, which is where I've used it. Having it run at boot and the like is covered in the various documents, however I'll quickly run through the steps for FreeBSD.
security  netsec  linux  guide 
june 2018
I discovered a browser bug - JakeArchibald.com
I accidentally discovered a huge browser bug a few months ago and I'm pretty excited about it. Security engineers always seem like the "cool kids" to me, so I'm hoping that now I can be part of the club, and y'know, get into the special parties or whatever.

I've noticed that a lot of these security disclosure things are only available as PDFs. Personally, I prefer the web, but if you're a SecOps PDF addict, check out the PDF version of this post.

Oh, I guess the vulnerability needs an extremely tenuous name and logo right? Here goes:
security  http  chrome 
june 2018
Twitter as Data
The rise of the internet and mobile telecommunications has created the possibility of using large datasets to understand behavior at unprecedented levels of temporal and geographic resolution. Online social networks attract the most users, though users of these new technologies provide their data through multiple sources, e.g. call detail records, blog posts, web forums, and content aggregation sites. These data allow scholars to adjudicate between competing theories as well as develop new ones, much as the microscope facilitated the development of the germ theory of disease. Of those networks, Twitter presents an ideal combination of size, international reach, and data accessibility that make it the preferred platform in academic studies. Acquiring, cleaning, and analyzing these data, however, require new tools and processes. This Element introduces these methods to social scientists and provides scripts and examples for downloading, processing, and analyzing Twitter data. All data and code for this Element is available at www.cambridge.org/twitter-as-data
book  twitter  data-mining 
june 2018
David Eads
Hi, I'm David Eads. My work connects journalism, data, and social issues. I build and teach simple, direct solutions that help journalists effectively tell their stories on the web. I contribute to and organize projects that strive for democracy, diversity, and sustainability.

I make Internet journalism, most recently for ProPublica Illinois. I speak and teach about technology. I developed the Tarbell publishing platform. When I lived in Chicago, I organized a community data journalism workshop, and helped start and build FreeGeek Chicago.
portfolio 
june 2018
Walt Hickey
I’m down to work with groups big and small about all sorts of topics related to my work, whether it’s walking undergrads in a stats course through how an article was written with the very techniques they’re learning or speaking in a corporate setting about how to effectively communicate compelling numbers.
portfolio 
june 2018
Most Maps of the New Ebola Outbreak Are Wrong - The Atlantic
On Thursday, the World Health Organization released a map showing parts of the Democratic Republic of the Congo that are currently being affected by Ebola. The map showed four cases in Wangata, one of three “health zones” in the large city of Mbandaka. Wangata, according to the map, lies north of the main city, in a forested area on the other side of a river.

That is not where Wangata is.

#DRC #Ebola cases per Health Zone in Equateur province as of 15 May 2018 https://t.co/Rvh3QCso7J pic.twitter.com/zl88TqG53i

— Peter Salama (@PeteSalama) May 17, 2018
“It’s actually here, in the middle of Mbandaka city,” says Cyrus Sinai, indicating a region about 8 miles farther south, on a screen that he shares with me over Skype.

Almost all the maps of the outbreak zone that have thus far been released contain mistakes of this kind. Different health organizations all seem to use their own maps, most of which contain significant discrepancies. Things are roughly in the right place, but their exact positions can be off by miles, as can the boundaries between different regions.
mapping  maps  compciv  messy-data 
june 2018
Lost in Migration: The American Chinese Menu
This essay is an analysis of 693 restaurant menus in seven American Chinatowns, of what the words “Chinese food” really mean and represent.

For Chinatown, food is complicated. Historically, Chinese restaurants were at first considered “pest holes” by white America, plagued with disease and rats. Slowly, however, dining establishments became fascinating to non-Chinese Americans, especially as they began touring Chinese settlements. Slowly, ethnic dishes grew to be central to food businesses that support Chinatown’s economy.
data-journalism 
june 2018
Methods of Comparison, Compared / Observable
I know it’s disappointing, but: none of them. No method is better universally, and none of them is “the best” even in the context of the dataset. (There are also a number of methods I did not cover, such as the relative difference.) What’s best depends on what you are trying to show. I’d favor absolute difference here as the simplest option, but log ratio might work if you want to show rate of growth.

There’s another important variable here which we’re ignoring, but which might influence our understanding of the data: population counts. This data is per capita (deaths per 100,000 people per year), which is helpful for understanding how likely any individual is to die, but not the number of people affected. Populations vary widely from county to county, and populations move over time. This makes it especially hard to understand trends that vary both geographically and temporally.
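The three comparison methods named above can be written as small Python functions. The two pairs of per-capita rates are invented for illustration and are not taken from the article's dataset.

```python
import math


def absolute_difference(before, after):
    """Change in the rate itself, in the same units (deaths per 100,000)."""
    return after - before


def relative_difference(before, after):
    """Change as a fraction of the starting rate."""
    return (after - before) / before


def log_ratio(before, after):
    """Base-2 log of the ratio: +1 means doubling, -1 means halving."""
    return math.log2(after / before)


# Made-up rates (deaths per 100,000 per year) for two hypothetical counties.
low_county = (2.0, 4.0)     # small base rate that doubles
high_county = (20.0, 22.0)  # large base rate that rises by the same amount

for name, (before, after) in [("low", low_county), ("high", high_county)]:
    print(
        name,
        absolute_difference(before, after),
        round(relative_difference(before, after), 3),
        round(log_ratio(before, after), 3),
    )
```

Both counties show the same absolute difference, but the relative difference and log ratio are far larger for the low-rate county, which is why the choice of method depends on what you want to show.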
maps  visualizations 
june 2018
[1805.12002] Why Is My Classifier Discriminatory?
Recent attempts to achieve fairness in predictive models focus on the balance between fairness and accuracy. In sensitive applications such as healthcare or criminal justice, this trade-off is often undesirable as any increase in prediction error could have devastating consequences. In this work, we argue that the fairness of predictions should be evaluated in context of the data, and that unfairness induced by inadequate samples sizes or unmeasured predictive variables should be addressed through data collection, rather than by constraining the model. We decompose cost-based metrics of discrimination into bias, variance, and noise, and propose actions aimed at estimating and reducing each term. Finally, we perform case-studies on prediction of income, mortality, and review ratings, confirming the value of this analysis. We find that data collection is often a means to reduce discrimination without sacrificing accuracy.
machine-learning 
june 2018
Invisible asymptotes — Remains of the Day
Great ideas are only obvious in retrospect. Amazon Prime – the subscription that ensures you don’t pay for shipping – was a stroke of marketing genius. Former employee Eugene Wei’s blog has the story of how it came about.

https://www.theguardian.com/commentisfree/2018/jun/03/theranos-elizabeth-holmes-media-emperors-new-startup

My first job at Amazon was as the first analyst in strategic planning, the forward-looking counterpart to accounting, which records what already happened. We maintained several time horizons for our forward forecasts, from granular monthly forecasts to quarterly and annual forecasts to even five and ten year forecasts for the purposes of fund-raising and, well, strategic planning.
amazon  business 
june 2018
[JDK-8203360] Release Note: Japanese New Era Implementation - Java Bug System
https://twitter.com/tagir_valeev/status/1007419414260486144

Emperor of Japan is the only governor in the world whose enthronement requires changes in #Java core library.

https://www.japantimes.co.jp/news/2018/05/17/national/japan-likely-announce-name-next-imperial-era-around-april-1-2019-suga/

The government is likely to announce the name of the next Imperial era around April 1, 2019, a month before Crown Prince Naruhito becomes the next emperor, Chief Cabinet Secretary Yoshihide Suga said Thursday.

The government will begin preparations for the change of gengō (era name) on the assumption that the new one will be announced about a month ahead of Naruhito’s ascension to the Chrysanthemum Throne on May 1, according to Suga.

“It takes roughly one month to adjust information systems to the new name in the public and private sectors,” Suga said, adding that they are working under an assumed timeline, and that the government has not decided the date when the name will be released.
datetime  naming-things  java 
june 2018
Predicting Gender Using Historical Data
A common problem for researchers who work with data, especially historians, is that a dataset has a list of people with names but does not identify the gender of the person. Since first names often indicate gender, it should be possible to predict gender using names. However, the gender associated with names can change over time. To illustrate, take the names Madison, Hillary, Jordan, and Monroe. For babies born in the United States, the predominant gender associated with those names has changed over time.
data  statistics 
june 2018
Giorgia Lupi: How we can find ourselves in data | TED Talk
Giorgia Lupi uses data to tell human stories, adding nuance to numbers. In this charming talk, she shares how we can bring personality to data, visualizing even the mundane details of our daily lives and transforming the abstract and uncountable into something that can be seen, felt and directly reconnected to our lives.
data-journalism 
june 2018
TensorFlow.js — a practical guide – YellowAnt
Recently, Google introduced its most popular machine learning library, TensorFlow, in JavaScript. With the help of TensorFlow.js one can train and deploy ML models in the browser.

Goodbye to spending eons on complicated steps…
Before you start, I would recommend going through the docs of TensorFlow.js, to get a basic understanding of the context required for this article.
tensorflow  machine-learning 
june 2018
Ahmed BESBES - Data Science Portfolio – Overview and benchmark of traditional and deep learning models in text classification
This article is an extension of a previous one I wrote when I was experimenting with sentiment analysis on Twitter data. Back then, I explored a simple model: a two-layer feed-forward neural network trained with Keras. The input tweets were represented as document vectors resulting from a weighted average of the embeddings of the words composing the tweet.

The embedding I used was a word2vec model I trained from scratch on the corpus using gensim. The task was a binary classification and I was able with this setting to achieve 79% accuracy.

The goal of this post is to explore other NLP models trained on the same dataset and then benchmark their respective performance on a given test set.

We'll go through different models, from simple ones relying on a bag-of-words representation to heavy machinery deploying convolutional/recurrent networks. We'll see if we can score more than 79% accuracy!
NLP  text-mining  deep-learning 
june 2018