jm + machine-learning   68

'What’s your ML Test Score? A rubric for ML production systems'
'Using machine learning in real-world production systems is complicated by a host of issues not found in small toy examples or even large offline research experiments. Testing and monitoring are key considerations for assessing the production-readiness of an ML system. But how much testing and monitoring is enough? We present an ML Test Score rubric based on a set of actionable tests to help quantify these issues.'

Google paper on testable machine learning systems.
machine-learning  testing  ml  papers  google 
yesterday by jm
Build a Better Monster: Morality, Machine Learning, and Mass Surveillance

We built the commercial internet by mastering techniques of persuasion and surveillance that we’ve extended to billions of people, including essentially the entire population of the Western democracies. But admitting that this tool of social control might be conducive to authoritarianism is not something we’re ready to face. After all, we're good people. We like freedom. How could we have built tools that subvert it?

As Upton Sinclair said, “It is difficult to get a man to understand something, when his salary depends on his not understanding it.”

I contend that there are structural reasons to worry about the role of the tech industry in American political life, and that we have only a brief window of time in which to fix this.
advertising  facebook  google  internet  politics  surveillance  democracy  maciej-ceglowski  talks  morality  machine-learning 
9 days ago by jm
'Mathwashing,' Facebook and the zeitgeist of data worship
Fred Benenson: Mathwashing can be thought of using math terms (algorithm, model, etc.) to paper over a more subjective reality. For example, a lot of people believed Facebook was using an unbiased algorithm to determine its trending topics, even if Facebook had previously admitted that humans were involved in the process.
maths  math  mathwashing  data  big-data  algorithms  machine-learning  bias  facebook  fred-benenson 
9 days ago by jm
Zeynep Tufekci: Machine intelligence makes human morals more important | TED Talk | TED.com
Machine intelligence is here, and we're already using it to make subjective decisions. But the complex way AI grows and improves makes it hard to understand and even harder to control. In this cautionary talk, techno-sociologist Zeynep Tufekci explains how intelligent machines can fail in ways that don't fit human error patterns — and in ways we won't expect or be prepared for. "We cannot outsource our responsibilities to machines," she says. "We must hold on ever tighter to human values and human ethics."


More relevant now that Nvidia are trialling ML-based self-driving cars in the US...
nvidia  ai  ml  machine-learning  scary  zeynep-tufekci  via:maciej  technology  ted-talks 
9 days ago by jm
Research Blog: Federated Learning: Collaborative Machine Learning without Centralized Training Data
Great stuff from Google -- a really nifty approach for large-scale privacy-preserving machine learning:

It works like this: your device downloads the current model, improves it by learning from data on your phone, and then summarizes the changes as a small focused update. Only this update to the model is sent to the cloud, using encrypted communication, where it is immediately averaged with other user updates to improve the shared model. All the training data remains on your device, and no individual updates are stored in the cloud.

Federated Learning allows for smarter models, lower latency, and less power consumption, all while ensuring privacy. And this approach has another immediate benefit: in addition to providing an update to the shared model, the improved model on your phone can also be used immediately, powering experiences personalized by the way you use your phone.
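The averaging step is simple enough to sketch. Here's a toy version with a linear model standing in for the shared model -- clients compute a local update, and only the deltas get averaged into the global weights (this is an illustrative sketch, nothing like Google's actual implementation):

```python
import numpy as np

def local_update(global_weights, local_data, lr=0.1):
    """One client: improve the model on local data via a gradient step,
    and return only the delta -- the 'small focused update'."""
    X, y = local_data
    grad = X.T @ (X @ global_weights - y) / len(y)
    new_weights = global_weights - lr * grad
    return new_weights - global_weights

def federated_round(global_weights, clients):
    """Server: average the client deltas; raw data never leaves the clients."""
    deltas = [local_update(global_weights, data) for data in clients]
    return global_weights + np.mean(deltas, axis=0)

# Five fake clients, each holding private data from the same true model.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(5):
    X = rng.normal(size=(20, 2))
    clients.append((X, X @ true_w))

w = np.zeros(2)
for _ in range(200):
    w = federated_round(w, clients)
# w converges to true_w without the server ever seeing the clients' data
```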

Papers:
https://arxiv.org/pdf/1602.05629.pdf , https://arxiv.org/pdf/1610.05492.pdf
google  ml  machine-learning  training  federated-learning  gboard  models  privacy  data-privacy  data-protection 
22 days ago by jm
[1606.08813] European Union regulations on algorithmic decision-making and a "right to explanation"
We summarize the potential impact that the European Union's new General Data Protection Regulation will have on the routine use of machine learning algorithms. Slated to take effect as law across the EU in 2018, it will restrict automated individual decision-making (that is, algorithms that make decisions based on user-level predictors) which "significantly affect" users. The law will also effectively create a "right to explanation," whereby a user can ask for an explanation of an algorithmic decision that was made about them. We argue that while this law will pose large challenges for industry, it highlights opportunities for computer scientists to take the lead in designing algorithms and evaluation frameworks which avoid discrimination and enable explanation.


oh this'll be tricky.
algorithms  accountability  eu  gdpr  ml  machine-learning  via:daveb  europe  data-protection  right-to-explanation 
6 weeks ago by jm
When DNNs go wrong – adversarial examples and what we can learn from them
Excellent paper.
[The] results suggest that classifiers based on modern machine learning techniques, even those that obtain excellent performance on the test set, are not learning the true underlying concepts that determine the correct output label. Instead, these algorithms have built a Potemkin village that works well on naturally occurring data, but is exposed as a fake when one visits points in space that do not have high probability in the data distribution.
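The core trick is easy to demonstrate on a linear classifier (hypothetical weights; the paper's examples use deep nets, but the FGSM-style perturbation is the same idea -- nudge every feature a little, in exactly the direction the model is most sensitive to):

```python
import numpy as np

# Toy linear classifier: score = w.x + b, predict 1 if score > 0.
w = np.array([1.0, -2.0, 0.5])
b = 0.1

def predict(x):
    return int(x @ w + b > 0)

x = np.array([0.3, -0.2, 0.4])   # correctly classified as 1 (score = 1.0)

# FGSM-style adversarial perturbation: for a linear model the gradient of
# the score w.r.t. the input is just w, so step each feature against it.
eps = 0.5
x_adv = x - eps * np.sign(w)
# x_adv is within 0.5 of x in every coordinate, yet predict(x_adv) flips to 0
```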
ai  deep-learning  dnns  neural-networks  adversarial-classification  classification  classifiers  machine-learning  papers 
8 weeks ago by jm
'Rules of Machine Learning: Best Practices for ML Engineering' from Martin Zinkevich
'This document is intended to help those with a basic knowledge of machine learning get the benefit of best practices in machine learning from around Google. It presents a style for machine learning, similar to the Google C++ Style Guide and other popular guides to practical programming. If you have taken a class in machine learning, or built or worked on a machine-learned model, then you have the necessary background to read this document.'

Full of good tips, if you wind up using ML in a production service.
machine-learning  ml  google  production  coding  best-practices 
january 2017 by jm
How a Machine Learns Prejudice - Scientific American
Agreed, this is a big issue.
If artificial intelligence takes over our lives, it probably won’t involve humans battling an army of robots that relentlessly apply Spock-like logic as they physically enslave us. Instead, the machine-learning algorithms that already let AI programs recommend a movie you’d like or recognize your friend’s face in a photo will likely be the same ones that one day deny you a loan, lead the police to your neighborhood or tell your doctor you need to go on a diet. And since humans create these algorithms, they're just as prone to biases that could lead to bad decisions—and worse outcomes.
These biases create some immediate concerns about our increasing reliance on artificially intelligent technology, as any AI system designed by humans to be absolutely "neutral" could still reinforce humans’ prejudicial thinking instead of seeing through it.
prejudice  bias  machine-learning  ml  data  training  race  racism  google  facebook 
january 2017 by jm
Here's Why Facebook's Trending Algorithm Keeps Promoting Fake News - BuzzFeed News
Kalina Bontcheva leads the EU-funded PHEME project working to compute the veracity of social media content. She said reducing the amount of human oversight for Trending heightens the likelihood of failures, and of the algorithm being fooled by people trying to game it.
“I think people are always going to try and outsmart these algorithms — we’ve seen this with search engine optimization,” she said. “I’m sure that once in a while there is going to be a very high-profile failure.”
Less human oversight means more reliance on the algorithm, which creates a new set of concerns, according to Kate Starbird, an assistant professor at the University of Washington who has been using machine learning and other technology to evaluate the accuracy of rumors and information during events such as the Boston bombings.
“[Facebook is] making an assumption that we’re more comfortable with a machine being biased than with a human being biased, because people don’t understand machines as well,” she said.
facebook  news  gaming  adversarial-classification  pheme  truth  social-media  algorithms  ml  machine-learning  media 
october 2016 by jm
Founder of Google X has no concept of how machine learning as policing tool risks reinforcing implicit bias
This is shocking:
At the end of the panel on artificial intelligence, a young black woman asked [Sebastian Thrun, CEO of the education startup Udacity, who is best known for founding Google X] whether bias in machine learning “could perpetuate structural inequality at a velocity much greater than perhaps humans can.” She offered the example of criminal justice, where “you have a machine learning tool that can identify criminals, and criminals may disproportionately be black because of other issues that have nothing to do with the intrinsic nature of these people, so the machine learns that black people are criminals, and that’s not necessarily the outcome that I think we want.”
In his reply, Thrun made it sound like her concern was one about political correctness, not unconscious bias. “Statistically what the machines do pick up are patterns and sometimes we don’t like these patterns. Sometimes they’re not politically correct,” Thrun said. “When we apply machine learning methods sometimes the truth we learn really surprises us, to be honest, and I think it’s good to have a dialogue about this.”


"the truth"! Jesus. We are fucked
google  googlex  bias  racism  implicit-bias  machine-learning  ml  sebastian-thrun  udacity  inequality  policing  crime 
october 2016 by jm
Remarks at the SASE Panel On The Moral Economy of Tech
Excellent talk. I love this analogy for ML applied to real-world data which affects people:
Treating the world as software promotes fantasies of control. And the best kind of control is control without responsibility. Our unique position as authors of software used by millions gives us power, but we don't accept that this should make us accountable. We're programmers—who else is going to write the software that runs the world? To put it plainly, we are surprised that people seem to get mad at us for trying to help. Fortunately we are smart people and have found a way out of this predicament. Instead of relying on algorithms, which we can be accused of manipulating for our benefit, we have turned to machine learning, an ingenious way of disclaiming responsibility for anything. Machine learning is like money laundering for bias. It's a clean, mathematical apparatus that gives the status quo the aura of logical inevitability. The numbers don't lie.


Particularly apposite today given Y Combinator's revelation that they use an AI bot to help 'sift admission applications', and don't know what criteria it's using: https://twitter.com/aprjoy/status/783032128653107200
culture  ethics  privacy  technology  surveillance  ml  machine-learning  bias  algorithms  software  control 
october 2016 by jm
How a Japanese cucumber farmer is using deep learning and TensorFlow
Unfortunately the usual ML problem arises at the end:
One of the current challenges with deep learning is that you need to have a large number of training datasets. To train the model, Makoto spent about three months taking 7,000 pictures of cucumbers sorted by his mother, but it’s probably not enough. "When I did a validation with the test images, the recognition accuracy exceeded 95%. But if you apply the system with real use cases, the accuracy drops down to about 70%. I suspect the neural network model has the issue of "overfitting" (the phenomenon in neural network where the model is trained to fit only to the small training dataset) because of the insufficient number of training images."


In other words, as with ML since we were using it in SpamAssassin, maintaining the training corpus becomes a really big problem. :(
google  machine-learning  tensorflow  cucumbers  deep-learning  ml 
september 2016 by jm
Hey Microsoft, the Internet Made My Bot Racist, Too
All machine learning algorithms strive to exaggerate and perpetuate the past. That is, after all, what they are learning from. The fundamental assumption of every machine learning algorithm is that the past is correct, and anything coming in the future will be, and should be, like the past. This is a fine assumption to make when you are Netflix trying to predict what movie you’ll like, but is immoral when applied to many other situations. For bots like mine and Microsoft’s, built for entertainment purposes, it can lead to embarrassment. But AI has started to be used in much more meaningful ways: predictive policing in Chicago, for example, has already led to widespread accusations of racial profiling.
This isn’t a little problem. This is a huge problem, and it demands a lot more attention than it’s getting now, particularly in the community of scientists and engineers who design and apply these algorithms. It’s one thing to get cursed out by an AI, but wholly another when one puts you in jail, denies you a mortgage, or decides to audit you.
machine-learning  ml  algorithms  future  society  microsoft 
march 2016 by jm
DeepMind founder Demis Hassabis on how AI will shape the future | The Verge
Good interview with Demis Hassabis on DeepMind, AlphaGo and AI:
I’d like to see AI-assisted science where you have effectively AI research assistants that do a lot of the drudgery work and surface interesting articles, find structure in vast amounts of data, and then surface that to the human experts and scientists who can make quicker breakthroughs. I was giving a talk at CERN a few months ago; obviously they create more data than pretty much anyone on the planet, and for all we know there could be new particles sitting on their massive hard drives somewhere and no-one’s got around to analyzing that because there’s just so much data. So I think it’d be cool if one day an AI was involved in finding a new particle.
ai  deepmind  google  alphago  demis-hassabis  cern  future  machine-learning 
march 2016 by jm
The NSA’s SKYNET program may be killing thousands of innocent people
Death by Random Forest: this project is a horrible misapplication of machine learning. Truly appalling, when a false positive means death:

The NSA evaluates the SKYNET program using a subset of 100,000 randomly selected people (identified by their MSIDN/MSI pairs of their mobile phones), and a known group of seven terrorists. The NSA then trained the learning algorithm by feeding it six of the terrorists and tasking SKYNET to find the seventh. This data provides the percentages for false positives in the slide above.

"First, there are very few 'known terrorists' to use to train and test the model," Ball said. "If they are using the same records to train the model as they are using to test the model, their assessment of the fit is completely bullshit. The usual practice is to hold some of the data out of the training process so that the test includes records the model has never seen before. Without this step, their classification fit assessment is ridiculously optimistic."

The reason is that the 100,000 citizens were selected at random, while the seven terrorists are from a known cluster. Under the random selection of a tiny subset of less than 0.1 percent of the total population, the density of the social graph of the citizens is massively reduced, while the "terrorist" cluster remains strongly interconnected. Scientifically-sound statistical analysis would have required the NSA to mix the terrorists into the population set before random selection of a subset—but this is not practical due to their tiny number.

This may sound like a mere academic problem, but, Ball said, is in fact highly damaging to the quality of the results, and thus ultimately to the accuracy of the classification and assassination of people as "terrorists." A quality evaluation is especially important in this case, as the random forest method is known to overfit its training sets, producing results that are overly optimistic. The NSA's analysis thus does not provide a good indicator of the quality of the method.
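Ball's point about testing on the training records is easy to demonstrate: a memorising classifier looks perfect on its own training set even when the labels are pure noise. A quick sketch, with 1-NN standing in for the random forest:

```python
import numpy as np

def nn_predict(train_X, train_y, X):
    """1-nearest-neighbour: a classifier that memorises its training set."""
    preds = []
    for x in X:
        i = np.argmin(((train_X - x) ** 2).sum(axis=1))
        preds.append(train_y[i])
    return np.array(preds)

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)   # labels are pure noise: nothing to learn

train_X, train_y = X[:100], y[:100]
test_X, test_y = X[100:], y[100:]

train_acc = (nn_predict(train_X, train_y, train_X) == train_y).mean()
test_acc = (nn_predict(train_X, train_y, test_X) == test_y).mean()
# train_acc is a perfect 1.0; test_acc hovers around chance (0.5) --
# exactly the "ridiculously optimistic" fit assessment Ball describes.
```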
terrorism  surveillance  nsa  security  ai  machine-learning  random-forests  horror  false-positives  classification  statistics 
february 2016 by jm
Fast Forward Labs: Fashion Goes Deep: Data Science at Lyst
this is more than just data science really -- this is proper machine learning, with deep learning and a convolutional neural network. serious business
lyst  machine-learning  data-science  ml  neural-networks  supervised-learning  unsupervised-learning  deep-learning 
december 2015 by jm
"Hidden Technical Debt in Machine-Learning Systems" [pdf]
Another great paper from Google, talking about the tradeoffs that must be considered in practice over the long term with running a complex ML system in production.
technical-debt  ml  machine-learning  ops  software  production  papers  pdf  google 
december 2015 by jm
Control theory meets machine learning
'DB: Is there a difference between how control theorists and machine learning researchers think about robustness and error?

BR: In machine learning, we almost always model our errors as being random rather than worst-case. In some sense, random errors are actually much more benign than worst-case errors. [...] In machine learning, by assuming average-case performance, rather than worst-case, we can design predictive algorithms by averaging out the errors over large data sets. We want to be robust to fluctuations in the data, but only on average. This is much less restrictive than the worst-case restrictions in controls.

DB: So control theory is model-based and concerned with worst case. Machine learning is data based and concerned with average case. Is there a middle ground?

BR: I think there is! And I think there's an exciting opportunity here to understand how to combine robust control and reinforcement learning. Being able to build systems from data alone simplifies the engineering process, and has had several recent promising results. Guaranteeing that these systems won't behave catastrophically will enable us to actually deploy machine learning systems in a variety of applications with major impacts on our lives. It might enable safe autonomous vehicles that can navigate complex terrains. Or could assist us in diagnostics and treatments in health care. There are a lot of exciting possibilities, and that's why I'm excited about how to find a bridge between these two viewpoints.'
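The average-case/worst-case distinction in a few lines of numpy -- zero-mean random errors wash out when averaged, while the worst single disturbance stays large:

```python
import numpy as np

rng = np.random.default_rng(1)
errors = rng.normal(0, 1.0, size=10_000)   # zero-mean random disturbances

avg_error = abs(errors.mean())       # the ML view: errors average out (~0)
worst_error = np.abs(errors).max()   # the control view: plan for the worst (~4)
```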
control-theory  interviews  machine-learning  ml  worst-case  self-driving-cars  cs 
november 2015 by jm
Tesla Autopilot mode is learning
This is really impressive, but also a little scary. Drivers driving the Tesla Model S are "phoning home" training data as they drive:
A Model S owner by the username Khatsalano kept a count of how many times he had to “rescue” (meaning taking control after an alert) his Model S while using the Autopilot on his daily commute. He counted 6 “rescues” on his first day; by the fourth day of using the system on his 23.5-mile commute, he only had to take over once. Musk said that Model S owners could add ~1 million miles of new data every day, which is helping the company create “high precision maps”.


Wonder if the data protection/privacy implications have been considered for EU use.
autopilot  tesla  maps  mapping  training  machine-learning  eu  privacy  data-protection 
november 2015 by jm
Analysing user behaviour - from histograms to random forests (PyData) at PyCon Ireland 2015 | Lanyrd
Swrve's own Dave Brodigan on game user-data analysis techniques:
The goal is to give the audience a roadmap for analysing user data using python friendly tools.

I will touch on many aspects of the data science pipeline from data cleansing to building predictive data products at scale.

I will start gently with pandas and dataframes and then discuss some machine learning techniques like kmeans and random forests in scikitlearn and then introduce Spark for doing it at scale.

I will focus more on the use cases rather than detailed implementation.

The talk will be informed by my experience and focus on user behaviour in games and mobile apps.
swrve  talks  user-data  big-data  spark  hadoop  machine-learning  data-science 
october 2015 by jm
Schneier on Automatic Face Recognition and Surveillance
When we talk about surveillance, we tend to concentrate on the problems of data collection: CCTV cameras, tagged photos, purchasing habits, our writings on sites like Facebook and Twitter. We think much less about data analysis. But effective and pervasive surveillance is just as much about analysis. It's sustained by a combination of cheap and ubiquitous cameras, tagged photo databases, commercial databases of our actions that reveal our habits and personalities, and -- most of all -- fast and accurate face recognition software.

Don't expect to have access to this technology for yourself anytime soon. This is not facial recognition for all. It's just for those who can either demand or pay for access to the required technologies -- most importantly, the tagged photo databases. And while we can easily imagine how this might be misused in a totalitarian country, there are dangers in free societies as well. Without meaningful regulation, we're moving into a world where governments and corporations will be able to identify people both in real time and backwards in time, remotely and in secret, without consent or recourse.

Despite protests from industry, we need to regulate this budding industry. We need limitations on how our images can be collected without our knowledge or consent, and on how they can be used. The technologies aren't going away, and we can't uninvent these capabilities. But we can ensure that they're used ethically and responsibly, and not just as a mechanism to increase police and corporate power over us.
privacy  regulation  surveillance  bruce-schneier  faces  face-recognition  machine-learning  ai  cctv  photos 
october 2015 by jm
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Extremely authoritative slide deck on building a recommendation system, from Xavier Amatriain, Research/Engineering Manager at Netflix
netflix  recommendations  recommenders  ml  machine-learning  cmu  clustering  algorithms 
august 2015 by jm
The reusable holdout: Preserving validity in adaptive data analysis
Useful stats hack from Google: "We show how to safely reuse a holdout data set many times to validate the results of adaptively chosen analyses."
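The core of the trick (Thresholdout) fits in a few lines: answer queries from the training set unless they disagree with the holdout, and add noise so the analyst can't overfit to the holdout either. A sketch with made-up threshold/noise parameters -- see the paper for the real guarantees:

```python
import numpy as np

def thresholdout(train_vals, holdout_vals, threshold=0.04, sigma=0.01,
                 rng=np.random.default_rng(0)):
    """One Thresholdout query (a sketch of the reusable-holdout mechanism):
    if the training estimate agrees with the holdout, answer from training
    data; otherwise answer from the holdout, with Laplace noise added."""
    t, h = np.mean(train_vals), np.mean(holdout_vals)
    if abs(t - h) > threshold + rng.laplace(0, sigma):
        return h + rng.laplace(0, sigma)
    return t

# Agreement: the (un-noised) training estimate is returned.
agree = thresholdout([1.0, 1.0, 1.0], [1.0, 1.0, 1.0])
# Disagreement (overfit training stat): the noisy holdout answer wins.
overfit = thresholdout([0.9] * 100, [0.5] * 100)
```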
statistics  google  reusable-holdout  training  ml  machine-learning  data-analysis  holdout  corpus  sampling 
august 2015 by jm
Outlier Detection at Netflix | Hacker News
Excellent HN thread re automated anomaly detection in production, Q&A with the dev team
machine-learning  ml  remediation  anomaly-detection  netflix  ops  time-series  clustering 
july 2015 by jm
Inceptionism: Going Deeper into Neural Networks
This is amazing, and a little scary.
If we choose higher-level layers, which identify more sophisticated features in images, complex features or even whole objects tend to emerge. Again, we just start with an existing image and give it to our neural net. We ask the network: “Whatever you see there, I want more of it!” This creates a feedback loop: if a cloud looks a little bit like a bird, the network will make it look more like a bird. This in turn will make the network recognize the bird even more strongly on the next pass and so forth, until a highly detailed bird appears, seemingly out of nowhere.

An enlightening comment from the G+ thread:

This is the most fun we've had in the office in a while. We've even made some of those 'Inceptionistic' art pieces into giant posters. Beyond the eye candy, there is actually something deeply interesting in this line of work: neural networks have a bad reputation for being strange black boxes that are opaque to inspection. I have never understood those charges: any other model (GMM, SVM, Random Forests) of any sufficient complexity for a real task is completely opaque for very fundamental reasons: their non-linear structure makes it hard to project back the function they represent into their input space and make sense of it. Not so with backprop, as this blog post shows eloquently: you can query the model and ask what it believes it is seeing or 'wants' to see simply by following gradients. This 'guided hallucination' technique is very powerful and the gorgeous visualizations it generates are very evocative of what's really going on in the network.
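The "whatever you see there, I want more of it!" loop is just gradient ascent on the *input* rather than the weights. In miniature, with a single linear "neuron" standing in for a whole layer of the network:

```python
import numpy as np

# A toy "unit deep in the network": its activation is a dot product
# with a fixed feature vector (a stand-in for a learned filter).
feature = np.array([1.0, -2.0, 3.0])

def activation(img):
    return float(img @ feature)

# Gradient ascent on the input: for a dot product the gradient of the
# activation w.r.t. the image is just `feature`, so follow it uphill.
img = np.zeros(3)
before = activation(img)
for _ in range(100):
    img += 0.1 * feature
after = activation(img)
# `img` has been hallucinated into whatever this unit "wants" to see
```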
art  machine-learning  algorithm  inceptionism  research  google  neural-networks  learning  dreams  feedback  graphics 
june 2015 by jm
Top 10 data mining algorithms in plain English
This is a phenomenally useful ML/data-mining resource post -- 'the top 10 most influential data mining algorithms as voted on by 3 separate panels in [ICDM '06's] survey paper', but with a nice clear intro and description for each one. Here's the algorithms covered:
1. C4.5
2. k-means
3. Support vector machines
4. Apriori
5. EM
6. PageRank
7. AdaBoost
8. kNN
9. Naive Bayes
10. CART
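For a flavour of how simple some of these are, here's k-means (#2 on the list) in a few lines of numpy. Deterministic init from two data points, for brevity -- real implementations use random restarts or k-means++:

```python
import numpy as np

def kmeans(X, k, init_idx, iters=20):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points; repeat."""
    centroids = X[init_idx].astype(float)
    for _ in range(iters):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Two well-separated blobs; k-means recovers their centres.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
centroids, labels = kmeans(X, 2, init_idx=[0, -1])
```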
svm  k-means  c4.5  apriori  em  pagerank  adaboost  knn  naive-bayes  cart  ml  data-mining  machine-learning  papers  algorithms  unsupervised  supervised 
may 2015 by jm
How to do named entity recognition: machine learning oversimplified
Good explanation of this NLP tokenization/feature-extraction technique. Example result: "Jimi/B-PER Hendrix/I-PER played/O at/O Woodstock/B-LOC ./O"
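Decoding that B/I/O tagging back into entity spans is a nice little exercise; a minimal (and slightly forgiving -- it ignores type mismatches on I- tags) decoder:

```python
def extract_entities(tagged):
    """Pull (entity, type) spans out of BIO-tagged tokens like the
    'Jimi/B-PER Hendrix/I-PER played/O ...' example above."""
    entities, current, etype = [], [], None
    for pair in tagged.split():
        token, tag = pair.rsplit("/", 1)
        if tag.startswith("B-"):          # begin a new entity
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)         # continue the open entity
        else:                             # "O" (or orphan I-): close it
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities
```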
named-entities  feature-extraction  tokenization  nlp  ml  algorithms  machine-learning 
may 2015 by jm
Five Takeaways on the State of Natural Language Processing
Good overview of the state of the art in NLP nowadays. I found the word2vec material particularly interesting:
Embedding words as real-numbered vectors using a skip-gram, negative-sampling model (word2vec code) was mentioned in nearly every talk I attended. Either companies are using various word2vec implementations directly or they are building diffs off of the basic framework. Trained on large corpora, the vector representations encode concepts in a large dimensional space (usually 200-300 dim).


Quite similar to some tokenization approaches we experimented with in SpamAssassin, so I don't find this too surprising....
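The skip-gram part itself is nothing exotic -- it's just (centre, context) pairs from a sliding window, which then get fed to the embedding trainer (negative sampling and the actual training loop are omitted here):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (centre, context) training pairs as consumed by a
    word2vec-style skip-gram model."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

pairs = skipgram_pairs("the cat sat on the mat".split(), window=1)
```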
word2vec  nlp  tokenization  machine-learning  language  parsing  doc2vec  skip-grams  data-structures  feature-extraction  via:lemonodor 
may 2015 by jm
How the NSA Converts Spoken Words Into Searchable Text - The Intercept
This hits the nail on the head, IMO:
To Phillip Rogaway, a professor of computer science at the University of California, Davis, keyword-search is probably the “least of our problems.” In an email to The Intercept, Rogaway warned that “When the NSA identifies someone as ‘interesting’ based on contemporary NLP methods, it might be that there is no human-understandable explanation as to why beyond: ‘his corpus of discourse resembles those of others whom we thought interesting'; or the conceptual opposite: ‘his discourse looks or sounds different from most people’s.' If the algorithms NSA computers use to identify threats are too complex for humans to understand, it will be impossible to understand the contours of the surveillance apparatus by which one is judged. All that people will be able to do is try their best to behave just like everyone else.”
privacy  security  gchq  nsa  surveillance  machine-learning  liberty  future  speech  nlp  pattern-analysis  cs 
may 2015 by jm
Amazon Machine Learning
Upsides of this new AWS service:

* great UI and visualisations.

* solid choice of metric to evaluate the results. Maybe things moved on since I was working on it, but the use of AUC, false positives and false negatives was pretty new when I was working on it. (er, 10 years ago!)

Downsides:

* it could do with more support for unsupervised learning algorithms. Supervised learning means you need to provide training data, which in itself can be hard work. My experience with logistic regression in the past is that it requires very accurate training data, too -- its tolerance for misclassified training examples is poor.

* Also, in my experience, 80% of the hard work of using ML algorithms is writing good tokenisation and feature extraction algorithms. I don't see any help for that here unfortunately. (probably not that surprising as it requires really detailed knowledge of the input data to know what classes can be abbreviated into a single class, etc.)
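For the record, that unglamorous 80% starts as simply as this -- lowercase, strip punctuation, count. Real pipelines layer on n-grams, stemming, and the domain-specific class-collapsing mentioned above:

```python
import re
from collections import Counter

def tokenize(text):
    """Normalise case and split on anything that isn't a word character."""
    return re.findall(r"[a-z0-9']+", text.lower())

def features(text):
    """Bag-of-words counts: the simplest feature extraction you'd feed
    into something like logistic regression."""
    return Counter(tokenize(text))

f = features("FREE!!! Click here, click NOW")
```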
amazon  aws  ml  machine-learning  auc  data-science 
april 2015 by jm
President's message gets lost in (automated) translation
In a series of bizarre translations, YouTube’s automated translation service took artistic licence with the [President's] words of warmth.

When the head of state sent St Patrick’s Day greetings to viewers, the video sharing site said US comedian Tina Fey was being “particular with me head”. As President Higgins spoke of his admiration for Irish emigrants starting new communities abroad, YouTube said the President referenced blackjack and how he “just couldn’t put the new iPhone” down. And, in perhaps the most unusual moment, as he talked of people whose hearts have sympathy, the President “explained” he was once on a show “that will bar a gift card”.


(via Daragh O'Brien)
lol  president  ireland  michael-d-higgins  automation  translation  machine-learning  via:daraghobrien  funny  blackjack  iphone  tina-fey  st-patrick  fail 
march 2015 by jm
Automating Tinder with Eigenfaces
While my friends were getting sucked into "swiping" all day on their phones with Tinder, I eventually got fed up and designed a piece of software that automates everything on Tinder.


This is awesome. (via waxy)
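"Eigenfaces" is just PCA on flattened face images; the bot then does k-nearest-neighbour matching in the projected space. A numpy sketch with fake data standing in for real face crops:

```python
import numpy as np

def eigenfaces(faces, n_components=2):
    """PCA on flattened face images: the top right-singular vectors of
    the mean-centred data matrix are the 'eigenfaces'."""
    mean = faces.mean(axis=0)
    _, _, vt = np.linalg.svd(faces - mean, full_matrices=False)
    return mean, vt[:n_components]

rng = np.random.default_rng(0)
faces = rng.normal(size=(20, 64))          # 20 fake 8x8 "faces", flattened
mean, components = eigenfaces(faces)
embedding = (faces - mean) @ components.T  # 20 faces -> 20 x 2 coordinates
# a k-NN "like/dislike" classifier then runs on `embedding`
```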
via:waxy  tinder  eigenfaces  machine-learning  k-nearest-neighbour  algorithms  automation  ai 
february 2015 by jm
"Man vs Machine: Practical Adversarial Detection of Malicious Crowdsourcing Workers" [paper]
"traditional ML techniques are accurate (95%–99%) in detection but can be highly vulnerable to adversarial attacks". ain't that the truth
security  adversarial-attacks  machine-learning  paper  crowdsourcing  via:kragen 
february 2015 by jm
'Uncertain<T>: A First-Order Type for Uncertain Data' [paper, PDF]
'Emerging applications increasingly use estimates such as sensor data (GPS), probabilistic models, machine learning, big data, and human data. Unfortunately, representing this uncertain data with discrete types (floats, integers, and booleans) encourages developers to pretend it is not probabilistic, which causes three types of uncertainty bugs. (1) Using estimates as facts ignores random error in estimates. (2) Computation compounds that error. (3) Boolean questions on probabilistic data induce false positives and negatives.

This paper introduces Uncertain<T>, a new programming language abstraction for uncertain data. We implement a Bayesian network semantics for computation and conditionals that improves program correctness. The runtime uses sampling and hypothesis tests to evaluate computation and conditionals lazily and efficiently. We illustrate with sensor and machine learning applications that Uncertain<T> improves expressiveness and accuracy.'

(via Tony Finch)
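The core idea translates to a few lines in any language: wrap a sampler, propagate it through arithmetic, and turn boolean questions into probability estimates. A Python sketch (not the paper's actual implementation, which targets C#):

```python
import random

class Uncertain:
    """Sketch of the paper's idea: an uncertain value is a sampler, and
    comparisons become estimates over samples rather than booleans."""
    def __init__(self, sampler):
        self.sample = sampler

    def __add__(self, other):
        # computation propagates the uncertainty instead of discarding it
        return Uncertain(lambda: self.sample() + other.sample())

    def prob_greater(self, threshold, n=10_000):
        hits = sum(self.sample() > threshold for _ in range(n))
        return hits / n

random.seed(0)
gps_speed = Uncertain(lambda: random.gauss(55.0, 5.0))  # noisy GPS estimate
# Don't ask "is speed > 60?"; ask "how *likely* is speed > 60?"
p = gps_speed.prob_greater(60.0)   # about 0.16, not a false-positive True
```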
via:fanf  uncertainty  estimation  types  strong-typing  coding  probability  statistics  machine-learning  sampling 
december 2014 by jm
'Machine Learning: The High-Interest Credit Card of Technical Debt' [PDF]
Oh god yes. This is absolutely spot on, as you would expect from a Google paper -- at this stage they probably have accumulated more real-world ML-at-scale experience than anywhere else.

'Machine learning offers a fantastically powerful toolkit for building complex systems quickly. This paper argues that it is dangerous to think of these quick wins as coming for free. Using the framework of technical debt, we note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning. The goal of this paper is to highlight several machine learning specific risk factors and design patterns to be avoided or refactored where possible. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns.

[....]

'In this paper, we focus on the system-level interaction between machine learning code and larger systems as an area where hidden technical debt may rapidly accumulate. At a system level, a machine learning model may subtly erode abstraction boundaries. It may be tempting to re-use input signals in ways that create unintended tight coupling of otherwise disjoint systems. Machine learning packages may often be treated as black boxes, resulting in large masses of “glue code” or calibration layers that can lock in assumptions. Changes in the external world may make models or input signals change behavior in unintended ways, ratcheting up maintenance cost and the burden of any debt. Even monitoring that the system as a whole is operating as intended may be difficult without careful design.

Indeed, a remarkable portion of real-world “machine learning” work is devoted to tackling issues of this form. Paying down technical debt may initially appear less glamorous than research results usually reported in academic ML conferences. But it is critical for long-term system health and enables algorithmic advances and other cutting-edge improvements.'
machine-learning  ml  systems  ops  tech-debt  maintainance  google  papers  hidden-costs  development 
december 2014 by jm
'Histogram-based Outlier Score (HBOS): A fast Unsupervised Anomaly Detection Algorithm' [PDF]
'Unsupervised anomaly detection is the process of finding outliers in data sets without prior training. In this paper, a histogram-based outlier detection (HBOS) algorithm is presented, which scores records in linear time. It assumes independence of the features, making it much faster than multivariate approaches at the cost of less precision. A comparative evaluation on three UCI data sets and 10 standard algorithms shows that it can detect global outliers as reliably as state-of-the-art algorithms, but it performs poorly on local outlier problems. HBOS is in our experiments up to 5 times faster than clustering-based algorithms and up to 7 times faster than nearest-neighbor based methods.'
histograms  anomaly-detection  anomalies  machine-learning  algorithms  via:paperswelove  outliers  unsupervised-learning  hbos 
november 2014 by jm
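The algorithm itself fits in a page: per feature, build a histogram, normalise so the tallest bin has height 1, and score each record as the sum of log-inverse bin heights. A minimal sketch (fixed-width bins only; the paper also describes a dynamic-width variant):

```python
import numpy as np

def hbos_scores(X, bins=10):
    """Histogram-Based Outlier Score: assumes feature independence, runs in
    linear time. Higher score = more anomalous."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        hist, edges = np.histogram(X[:, j], bins=bins)
        # normalise so the tallest bin has height 1, as in the paper
        heights = hist / hist.max()
        # map each sample to a valid bin index
        idx = np.clip(np.digitize(X[:, j], edges) - 1, 0, bins - 1)
        # small floor avoids log(0) on edge-of-bin rounding
        scores += np.log(1.0 / np.maximum(heights[idx], 1e-6))
    return scores

rng = np.random.default_rng(1)
normal = rng.normal(0, 1, size=(200, 2))
outlier = np.array([[8.0, 8.0]])   # a clear global outlier
X = np.vstack([normal, outlier])
scores = hbos_scores(X)
```

As the abstract warns, this happily scores *global* outliers but has no notion of local density, so a point that is normal globally but odd for its neighbourhood gets missed.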
Logentries Announces Machine Learning Analytics for IT Ops Monitoring and Real-time Alerting
This sounds pretty neat:
With Logentries Anomaly Detection, users can:

- Set up real-time alerting based on deviations from important patterns and log events.
- Easily customize anomaly thresholds and compare different time periods.

With Logentries Inactivity Alerting, users can:

- Monitor standard, incoming events such as an application heartbeat.
- Receive real-time alerts based on log inactivity (i.e. receive alerts when something does not occur).
logging  syslog  logentries  anomaly-detection  ops  machine-learning  inactivity  alarms  alerting  heartbeats 
august 2014 by jm
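The "alert on what *didn't* happen" idea is simple enough to sketch — track the last-seen timestamp per source and flag any source whose silence exceeds a threshold. (A hypothetical API; nothing to do with Logentries' actual implementation.)

```python
from datetime import datetime, timedelta

class InactivityMonitor:
    """Fires on log sources that have gone quiet for longer than a threshold."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.last_seen = {}

    def record(self, source, timestamp):
        """Note that an event arrived from this source."""
        self.last_seen[source] = timestamp

    def check(self, now):
        """Return the sources whose most recent event is older than the threshold."""
        return sorted(src for src, ts in self.last_seen.items()
                      if now - ts > self.threshold)

mon = InactivityMonitor(threshold=timedelta(minutes=5))
t0 = datetime(2014, 8, 1, 12, 0, 0)
mon.record("app-heartbeat", t0)
mon.record("payment-service", t0)
mon.record("app-heartbeat", t0 + timedelta(minutes=6))  # heartbeat kept beating
quiet = mon.check(t0 + timedelta(minutes=7))            # payment-service went silent
```

The hard parts in production are elsewhere: choosing per-source thresholds, and making sure the monitor itself doesn't go quiet unnoticed.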
Google's Influential Papers for 2013
Googlers across the company actively engage with the scientific community by publishing technical papers, contributing open-source packages, working on standards, introducing new APIs and tools, giving talks and presentations, participating in ongoing technical debates, and much more. Our publications offer technical and algorithmic advances, feature aspects we learn as we develop novel products and services, and shed light on some of the technical challenges we face at Google. Below are some of the especially influential papers co-authored by Googlers in 2013.
google  papers  toread  reading  2013  scalability  machine-learning  algorithms 
july 2014 by jm
Spark Streaming
an extension of the core Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ or plain old TCP sockets and be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark's built-in machine learning algorithms, and graph processing algorithms, on data streams.
spark  streams  stream-processing  cep  scalability  apache  machine-learning  graphs 
may 2014 by jm
Unsupervised machine learning
aka. "zero-shot learning". ok starting point
machine-learning  zero-shot  unsupervised  algorithms  ml 
may 2014 by jm
How the search for flight AF447 used Bayesian inference
Via jgc, the search for the downed Air France flight was optimized using this technique:

'Metron’s approach to this search planning problem is rooted in classical Bayesian inference, which allows organization of available data with associated uncertainties and computation of the Probability Distribution Function (PDF) for target location given these data. In following this approach, the first step was to gather the available information about the location of the impact site of the aircraft. This information was sometimes contradictory and filled with ambiguities and uncertainties. Using a Bayesian approach we organized this material into consistent scenarios, quantified the uncertainties with probability distributions, weighted the relative likelihood of each scenario, and performed a simulation to produce a prior PDF for the location of the wreck.'
metron  bayes  bayesian-inference  machine-learning  statistics  via:jgc  air-france  disasters  probability  inference  searching 
march 2014 by jm
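The Bayesian search-theory update at the heart of this is tiny: after an *unsuccessful* search of a cell, that cell's probability is multiplied by the chance the sensor would have missed a wreck that was really there, then the whole distribution is renormalised. A toy sketch on a made-up 5-cell grid (numbers invented for illustration):

```python
import numpy as np

def bayes_update_after_miss(prior, searched_cell, p_detect):
    """Posterior over wreck location after an unsuccessful search of one cell.
    An imperfect search (p_detect < 1) lowers but never zeroes that cell."""
    posterior = prior.copy()
    posterior[searched_cell] *= (1.0 - p_detect)
    return posterior / posterior.sum()

# Prior built by weighting crash scenarios, as Metron describe
prior = np.array([0.05, 0.15, 0.40, 0.30, 0.10])
p_detect = 0.9   # sonar finds the wreck 90% of the time if it is really there

# Search the most probable cell first, and find nothing
posterior = bayes_update_after_miss(prior, searched_cell=2, p_detect=p_detect)
best_next = int(np.argmax(posterior))   # probability mass shifts to cell 3
```

Iterating this — search the current best cell, update on the miss, repeat — is what kept the AF447 search efficient; notably, the wreck was eventually found in an area an earlier (unsuccessful) search had already downweighted.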
Welcome to Algorithmic Prison - Bill Davidow - The Atlantic
"Computer says no", taken to the next level.
Even if an algorithmic prisoner knows he is in a prison, he may not know who his jailer is. Is he unable to get a loan because of a corrupted file at Experian or Equifax? Or could it be TransUnion? His bank could even have its own algorithms to determine a consumer’s creditworthiness. Just think of the needle-in-a-haystack effort consumers must undertake if they are forced to investigate dozens of consumer-reporting companies, looking for the one that threw them behind algorithmic bars. Now imagine a future that contains hundreds of such companies. A prisoner might not have any idea as to what type of behavior got him sentenced to a jail term. Is he on an enhanced screening list at an airport because of a trip he made to an unstable country, a post on his Facebook page, or a phone call to a friend who has a suspected terrorist friend?
privacy  data  big-data  algorithms  machine-learning  equifax  experian  consumer  society  bill-davidow 
february 2014 by jm
SAMOA, an open source platform for mining big data streams
Yahoo!'s streaming machine learning platform, built on Storm, implementing:

As a library, SAMOA contains state-of-the-art implementations of algorithms for distributed machine learning on streams. The first alpha release allows classification and clustering. For classification, we implemented a Vertical Hoeffding Tree (VHT), a distributed streaming version of decision trees tailored for sparse data (e.g., text). For clustering, we included a distributed algorithm based on CluStream. The library also includes meta-algorithms such as bagging.
storm  streaming  big-data  realtime  samoa  yahoo  machine-learning  ml  decision-trees  clustering  bagging  classification 
november 2013 by jm
Find a separating hyperplane with this One Weird Kernel Trick
Terrible internet ad-spam recast as machine-learning spam
'37-year-old patriot discovers "weird" trick to end slavery to the Bayesian monopoly. Discover the underground trick she used to slash her empirical risk by 75% in less than 30 days... before they shut her down. Click here to watch the shocking video! Get the Shocking Free Report!'
funny  via:hmason  machine-learning  spam  wtf  svms  bayesian 
november 2013 by jm
Probabalistic Scraping of Plain Text Tables
a nifty hack.
Recently I have been banging my head trying to import a ton of OCR acquired data expressed in tabular form. I think I have come up with a neat approach using probabilistic reasoning combined with mixed integer programming. The method is pretty robust to all sorts of real world issues. In particular, the method leverages topological understanding of tables, encodes it declaratively into a mixed integer/linear program, and integrates weak probabilistic signals to classify the whole table in one go (at sub second speeds). This method can be used for any kind of classification where you have strong logical constraints but noisy data.


(via proggit)
scraping  tables  ocr  probabilistic  linear-programming  optimization  machine-learning  via:proggit 
september 2013 by jm
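The "strong logical constraints plus weak probabilistic signals" combination is easy to illustrate without a real MIP solver — here brute-force enumeration stands in for the integer program, with invented per-column confidence scores:

```python
from itertools import product

# Noisy per-column signals: probability each column is numeric vs text
# (hypothetical values standing in for OCR confidence scores)
col_signals = [
    {"number": 0.2, "text": 0.8},   # column 0 looks textual
    {"number": 0.6, "text": 0.4},   # column 1 is ambiguous on its own
    {"number": 0.9, "text": 0.1},   # column 2 looks numeric
]

def best_labelling(signals, constraint):
    """Score every labelling (a brute-force stand-in for the mixed integer
    program) and keep the most probable one satisfying the hard constraint."""
    best, best_score = None, -1.0
    for labels in product(["number", "text"], repeat=len(signals)):
        if not constraint(labels):
            continue  # hard logical constraints prune impossible labellings
        score = 1.0
        for sig, lab in zip(signals, labels):
            score *= sig[lab]
        if score > best_score:
            best, best_score = labels, score
    return best

# Hard constraint from table topology: exactly one text column (the row labels)
labels = best_labelling(col_signals, lambda ls: ls.count("text") == 1)
```

Note how the ambiguous column 1 gets resolved *jointly*: the constraint forces exactly one text column, and column 0's strong signal claims that slot, so column 1 falls out as numeric. That joint classification is the point of encoding it as one optimisation rather than labelling columns independently.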
Forecast Blog
Forecast.io are doing such a great job of applying modern machine-learning to traditional weather data. "Quicksilver" is their neural-net-adjusted global temperature geodata, and here's how it's built
quicksilver  forecast  forecast.io  neural-networks  ai  machine-learning  algorithms  weather  geodata  earth  temperature 
august 2013 by jm
Machine Learning Speeds TCP
Cool. A machine-learning-generated TCP congestion control algorithm which handily beats sfqCoDel, Vegas, Reno et al. But:
"Although the [computer-generated congestion control algorithms] appear to work well on networks whose parameters fall within or near the limits of what they were prepared for -- even beating in-network schemes at their own game and even when the design range spans an order of magnitude variation in network parameters -- we do not yet understand clearly why they work, other than the observation that they seem to optimize their intended objective well.

We have attempted to make algorithms ourselves that surpass the generated RemyCCs, without success. That suggests to us that Remy may have accomplished something substantive. But digging through the dozens of rules in a RemyCC and figuring out their purpose and function is a challenging job in reverse-engineering. RemyCCs designed for broader classes of networks will likely be even more complex, compounding the problem."

So are network engineers willing to trust an algorithm that seems to work but has no explanation as to why it works other than optimizing a specific objective function? As AI becomes increasingly successful the question could also be asked in a wider context.  


(via Bill de hOra)
via-dehora  machine-learning  tcp  networking  hmm  mit  algorithms  remycc  congestion 
july 2013 by jm
Google Translate of "Lorem ipsum"
The perils of unsupervised machine learning... here's what GTranslate reckons "lorem ipsum" translates to:
We will be sure to post a comment. Add tomato sauce, no tank or a traditional or online. Until outdoor environment, and not just any competition, reduce overall pain. Cisco Security, they set up in the throat develop the market beds of Cura; Employment silently churn-class by our union, very beginner himenaeos. Monday gate information. How long before any meaningful development. Until mandatory functional requirements to developers. But across the country in the spotlight in the notebook. The show was shot. Funny lion always feasible, innovative policies hatred assured. Information that is no corporate Japan
lorem-ipsum  boilerplate  machine-learning  translation  google  translate  probabilistic  tomato-sauce  cisco  funny 
june 2013 by jm
Paper: "Root Cause Detection in a Service-Oriented Architecture" [pdf]
LinkedIn have implemented an automated root-cause detection system:

This paper introduces MonitorRank, an algorithm that can reduce the time, domain knowledge, and human effort required to find the root causes of anomalies in such service-oriented architectures. In the event of an anomaly, MonitorRank provides a ranked order list of possible root causes for monitoring teams to investigate. MonitorRank uses the historical and current time-series metrics of each sensor as its input, along with the call graph generated between sensors to build an unsupervised model for ranking. Experiments on real production outage data from LinkedIn, one of the largest online social networks, shows a 26% to 51% improvement in mean average precision in finding root causes compared to baseline and current state-of-the-art methods.


This is a topic close to my heart after working on something similar for 3 years in Amazon!

Looks interesting, although (a) I would have liked to see more case studies and examples of "real world" outages it helped with; and (b) it's very much a machine-learning paper rather than a systems one, and there is no discussion of fault tolerance in the design of the detection system, which would leave me worried that in the case of a large-scale outage event, the system itself will disappear when its help is most vital. (This was a major design influence on our team's work.)

Overall, particularly given those 2 issues, I suspect it's not in production yet. Ours certainly was ;)
linkedin  soa  root-cause  alarming  correlation  service-metrics  machine-learning  graphs  monitoring 
june 2013 by jm
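MonitorRank proper does a random walk over the call graph weighted by metric similarity; a much-simplified stand-in for the intuition — rank a frontend's dependencies by how strongly their time series co-move with the anomalous frontend metric — looks like this (synthetic data, hypothetical service names):

```python
import numpy as np

def rank_root_causes(frontend_metric, candidate_metrics):
    """Rank candidate upstream services by absolute correlation of their
    time series with the anomalous frontend metric. (A crude stand-in for
    MonitorRank's call-graph random-walk scoring.)"""
    scores = {name: abs(np.corrcoef(frontend_metric, series)[0, 1])
              for name, series in candidate_metrics.items()}
    return sorted(scores, key=scores.get, reverse=True)

t = np.arange(100)
spike = np.where(t > 70, 50.0, 5.0)   # latency jumps at t=71
frontend = spike + np.random.default_rng(2).normal(0, 1, 100)

candidates = {
    "db":    spike + np.random.default_rng(3).normal(0, 1, 100),  # the culprit
    "cache": np.random.default_rng(4).normal(5, 1, 100),          # healthy
    "auth":  np.random.default_rng(5).normal(5, 1, 100),          # healthy
}
ranked = rank_root_causes(frontend, candidates)   # "db" comes out on top
```

Correlation alone confuses cause with co-victim, which is exactly why the paper walks the call graph rather than ranking all sensors flat — and why, per the caveat above, the ranker itself needs to survive the outage it's diagnosing.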
Abusing hash kernels for wildly unprincipled machine learning
what, is this the first time our spam filtering approach of hashing a giant feature space is hitting mainstream machine learning? that can't be right!
ai  machine-learning  python  data  hashing  features  feature-selection  anti-spam  spamassassin 
april 2013 by jm
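The trick in question — the spam-filter-style "hashing trick" — maps an unbounded token space into a fixed-size vector with no stored vocabulary; collisions just add, and linear models tolerate the noise. A minimal sketch (md5 used only as a deterministic stand-in for a fast non-cryptographic hash like MurmurHash):

```python
import hashlib

def hash_features(tokens, n_buckets=2**20):
    """Map tokens to counts in a fixed number of buckets via hashing.
    No vocabulary dict to build, store, or keep in sync."""
    vec = {}
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        idx = h % n_buckets
        vec[idx] = vec.get(idx, 0) + 1   # collisions simply sum
    return vec

v = hash_features("buy cheap meds buy now".split())
```

With 2^20 buckets, collisions between a handful of tokens are vanishingly rare, and a learner downstream mostly shrugs them off even when they do happen — which is why the approach has quietly run in anti-spam systems for a decade.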
Clairvoyant Squirrel: Large Scale Malicious Domain Classification
Storm-based service to detect malicious DNS domain usage from streaming pcap data in near-real-time. Uses string features in the DNS domain, along with randomness metrics using Markov analysis, combined with a Random Forest classifier, to achieve 98% precision at 10,000 matches/sec
storm  distributed  distcomp  random-forest  classifiers  machine-learning  anti-spam  slides 
february 2013 by jm
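The "randomness metric using Markov analysis" part is sketchable: train character-bigram transition probabilities on known-good domains, then score a new domain by its mean log transition probability — algorithmically generated gibberish hits many unseen transitions and scores low. (A toy model; the real system combines this with many other string features and a Random Forest.)

```python
import math
from collections import defaultdict

def train_bigrams(domains):
    """Character-bigram transition counts from known-good domain names."""
    counts = defaultdict(lambda: defaultdict(int))
    for d in domains:
        for a, b in zip(d, d[1:]):
            counts[a][b] += 1
    return counts

def randomness_score(domain, counts, alpha=0.1):
    """Mean log-probability of each character transition under the model,
    with add-alpha smoothing (27 ~ alphabet size, an assumption here).
    Gibberish produces many unseen transitions and scores low."""
    total, steps = 0.0, 0
    for a, b in zip(domain, domain[1:]):
        row = counts[a]
        denom = sum(row.values()) + alpha * 27
        total += math.log((row[b] + alpha) / denom)
        steps += 1
    return total / max(steps, 1)

benign = ["google", "facebook", "amazon", "wikipedia", "twitter",
          "youtube", "linkedin", "instagram", "netflix", "reddit"]
model = train_bigrams(benign)
normal = randomness_score("bookface", model)     # plausible transitions
gibberish = randomness_score("xqzjwvkp", model)  # DGA-style junk
```

Thresholding this score (or feeding it to a classifier alongside length, entropy, TLD, etc.) is a cheap first cut at spotting algorithmically generated domains in a pcap stream.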
Authentication is machine learning
This may be the most insightful writing about authentication in years:
From my brief time at Google, my internship at Yahoo!, and conversations with other companies doing web authentication at scale, I’ve observed that as authentication systems develop they gradually merge with other abuse-fighting systems dealing with various forms of spam (email, account creation, link, etc.) and phishing. Authentication eventually loses its binary nature and becomes a fuzzy classification problem.

This is not a new observation. It’s generally accepted for banking authentication and some researchers like Dinei Florêncio and Cormac Herley have made it for web passwords. Still, much of the security research community thinks of password authentication in a binary way [..]. Spam and phishing provide insightful examples: technical solutions (like Hashcash, DKIM signing, or EV certificates) have generally failed, but in practice machine learning has greatly reduced these problems. The theory has largely held up that with enough data we can train reasonably effective classifiers to solve seemingly intractable problems.


(via Tony Finch.)
passwords  authentication  big-data  machine-learning  google  abuse  antispam  dkim  via:fanf 
december 2012 by jm
Practical machine learning tricks from the KDD 2011 best industry paper
Wow, this is a fantastic paper. It's a Google paper on detecting scam/spam ads using machine learning -- but not just that, it's how to build out such a classifier to production scale, and make it operationally resilient, and, indeed, operable.

I've come across a few of these ideas before, and I'm happy to say I might have reinvented a few (particularly around the feature space), but all of them together make extremely good sense. If I wind up working on large-scale classification again, this is the first paper I'll go back to. Great info! (via Toby diPasquale.)
classification  via:codeslinger  training  machine-learning  google  ops  kdd  best-practices  anti-spam  classifiers  ensemble  map-reduce 
july 2012 by jm
'Poisoning Attacks against Support Vector Machines', Battista Biggio, Blaine Nelson, Pavel Laskov
The perils of auto-training SVMs on unvetted input.
We investigate a family of poisoning attacks against Support Vector Machines (SVM). Such attacks inject specially crafted training data that increases the SVM's test error. Central to the motivation for these attacks is the fact that most learning algorithms assume that their training data comes from a natural or well-behaved distribution. However, this assumption does not generally hold in security-sensitive settings. As we demonstrate, an intelligent adversary can, to some extent, predict the change of the SVM's decision function due to malicious input and use this ability to construct malicious data. The proposed attack uses a gradient ascent strategy in which the gradient is computed based on properties of the SVM's optimal solution. This method can be kernelized and enables the attack to be constructed in the input space even for non-linear kernels. We experimentally demonstrate that our gradient ascent procedure reliably identifies good local maxima of the non-convex validation error surface, which significantly increases the classifier's test error.

Via Alexandre Dulaunoy
papers  svm  machine-learning  poisoning  auto-learning  security  via:adulau 
july 2012 by jm
"Machine Learning That Matters" [paper, PDF]
Great paper. This point particularly resonates: "It is easy to sit in your office and run a Weka algorithm on a data set you downloaded from the web. It is very hard to identify a problem for which machine learning may offer a solution, determine what data should be collected, select or extract relevant features, choose an appropriate learning method, select an evaluation method, interpret the results, involve domain experts, publicize the results to the relevant scientific community, persuade users to adopt the technique, and (only then) to truly have made a difference (see Figure 1). An ML researcher might well feel fatigued or daunted just contemplating this list of activities. However, each one is a necessary component of any research program that seeks to have a real impact on the world outside of machine learning."
machine-learning  ml  software  data  real-world  algorithms 
june 2012 by jm
_Building High-level Features Using Large Scale Unsupervised Learning_ [paper, PDF]
"We consider the problem of building high-level, class-specific feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images? To answer this, we train a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (the model has 1 billion connections, the dataset has 10 million 200x200 pixel images downloaded from the Internet). We train this network using model parallelism and asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days. Contrary to what appears to be a widely-held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not. Control experiments show that this feature detector is robust not only to translation but also to scaling and out-of-plane rotation. We also find that the same network is sensitive to other high-level concepts such as cat faces and human bodies. Starting with these learned features, we trained our network to obtain 15.8% accuracy in recognizing 20,000 object categories from ImageNet, a leap of 70% relative improvement over the previous state-of-the-art."
algorithms  machine-learning  neural-networks  sgd  labelling  training  unlabelled-learning  google  research  papers  pdf 
june 2012 by jm
HN on "What it takes to build great machine learning products"
TBH, I think this discussion thread is more useful than the article itself. It's still remarkably difficult to successfully apply ML techniques to real-world problems :(
machine-learning  hacker-news  discussion  commentary  ai  algorithms 
april 2012 by jm
Operations, machine learning and premature babies - O'Reilly Radar
good post about applying ML techniques to ops data. 'At a recent meetup about finance, Abhi Mehta encouraged people to capture and save "everything." He was talking about financial data, but the same applies here. We'd need to build Hadoop clusters to monitor our server farms; we'd need Hadoop clusters to monitor our Hadoop clusters. It's a big investment of time and resources. If we could make that investment, what would we find out? I bet that we'd be surprised.' Let's just say that if you like the sound of that, our SDE team in Amazon's Dublin office is hiring ;)
ops  big-data  machine-learning  hadoop  ibm 
april 2012 by jm
The first Irish case on defamation via autocomplete
Google Instant has picked up people searching for 'Ballymascanlon hotel receivership' and is now offering this as an autocomplete option -- cue defamation lawsuit. Defamation via machine learning
machine-learning  defamation  google  google-instant  search  ballymascanlon  hotels  autocomplete  law-enforcement 
june 2011 by jm
Technology to track trad
TunePal -- "Shazam for trad". play it a live traditional Irish, Scots, Welsh, Breton, Old Time American, Canadian or Appalachian trad tune on the iPhone, and it'll link to the tune's name, history, discography, and where it's been played, based on melodic similarity with a 93% accuracy
trad  irish  via:klillington  music  recognition  machine-learning  from delicious
july 2010 by jm
Chatroulette Working On Genital Recognition Algorithm
just *male* genitalia, mind. I dread to think of what the training corpus looks like
chatroulette  algorithms  machine-learning  genitalia  nsfw  slashdot  from delicious
june 2010 by jm
Google Translate fail
Google reckons that the English translation of "Amhran na bhFiann" -- the Irish national anthem -- is "Save The Queen". ie. part of the *English* national anthem. the perils of machine learning (via Adam Maguire)
via:AdamMaguire  funny  fail  google  translation  machine-learning  from delicious
january 2010 by jm
