jm + data-science   11

pachyderm
'Containerized Data Analytics':
There are two bold new ideas in Pachyderm:

Containers as the core processing primitive
Version Control for data

These ideas lead directly to a system that's much more powerful, flexible and easy to use.

To process data, you simply create a containerized program which reads and writes to the local filesystem. You can use any tools you want because it's all just going in a container! Pachyderm will take your container and inject data into it. We'll then automatically replicate your container, showing each copy a different chunk of data. With this technique, Pachyderm can scale any code you write to process up to petabytes of data (Example: distributed grep).
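
(A minimal sketch of what such a containerized program could look like, assuming Pachyderm's convention of mounting input data under /pfs/<repo> and collecting anything written to /pfs/out; the "logs" repo name is hypothetical:)

    # distributed grep: the same code runs on one chunk or many.
    # Pachyderm mounts the input repo read-only under /pfs/<repo>
    # and gathers whatever is written to /pfs/out.
    import os
    import re

    PATTERN = re.compile(r"ERROR")              # what we're grepping for
    IN_DIR, OUT_DIR = "/pfs/logs", "/pfs/out"   # "logs" is a hypothetical repo name

    for name in os.listdir(IN_DIR):
        with open(os.path.join(IN_DIR, name)) as src, \
             open(os.path.join(OUT_DIR, name), "w") as dst:
            for line in src:
                if PATTERN.search(line):
                    dst.write(line)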

Pachyderm also version controls all data using a commit-based distributed filesystem (PFS), similar to what git does with code. Version control for data has far reaching consequences in a distributed filesystem. You get the full history of your data, can track changes and diffs, collaborate with teammates, and if anything goes wrong you can revert the entire cluster with one click!

Version control is also very synergistic with our containerized processing engine. Pachyderm understands how your data changes and thus, as new data is ingested, can run your workload on the diff of the data rather than the whole thing. This means that there's no difference between a batch job and a streaming job; the same code will work for both!
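
(To make the batch-equals-streaming point concrete, a toy sketch -- not Pachyderm's API, just the shape of the idea; the filenames are hypothetical placeholders:)

    # The same per-file function handles both cases: a full (batch) run
    # sees every file, an incremental (streaming) run sees only the
    # files added by the latest commit.
    def word_count(paths):
        counts = {}
        for path in paths:
            with open(path) as f:
                for word in f.read().split():
                    counts[word] = counts.get(word, 0) + 1
        return counts

    all_files = ["day1.log", "day2.log"]   # batch: the whole dataset
    new_files = ["day2.log"]               # streaming: just the diff
    # word_count(all_files) and word_count(new_files) run identical code.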
analytics  data  containers  golang  pachyderm  tools  data-science  docker  version-control 
february 2017 by jm
Reproducible research: Stripe’s approach to data science
This is intriguing -- using Jupyter notebooks to embody data analysis work and ensure it's reproducible, bringing rigour in much the same way that unit tests do for code. I must try this.
Reproducibility makes data science at Stripe feel like working on GitHub, where anyone can obtain and extend others’ work. Instead of islands of analysis, we share our research in a central repository of knowledge. This makes it dramatically easier for anyone on our team to work with our data science research, encouraging independent exploration.

We approach our analyses with the same rigor we apply to production code: our reports feel more like finished products, research is fleshed out and easy to understand, and there are clear programmatic steps from start to finish for every analysis.
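
(One way to smoke-test that a notebook really is reproducible -- not Stripe's actual tooling, just a sketch using nbconvert's Python API; the filename is hypothetical:)

    import nbformat
    from nbconvert.preprocessors import ExecutePreprocessor

    # Re-run a notebook top-to-bottom from a fresh kernel; if it
    # executes cleanly, the analysis is at least mechanically
    # reproducible. "analysis.ipynb" is a placeholder filename.
    nb = nbformat.read("analysis.ipynb", as_version=4)
    ExecutePreprocessor(timeout=600, kernel_name="python3").preprocess(
        nb, {"metadata": {"path": "."}})
    nbformat.write(nb, "analysis.executed.ipynb")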
stripe  coding  data-science  reproducability  science  jupyter  notebooks  analysis  data  experiments 
november 2016 by jm
The Fall of BIG DATA – arg min blog
Strongly agreed with this -- particularly the second of the three major failures, specifically:
Our community has developed remarkably effective tools to microtarget advertisements. But if you use ad models to deliver news, that’s propaganda. And just because we didn’t intend to spread rampant misinformation doesn’t mean we are not responsible.
big-data  analytics  data-science  statistics  us-politics  trump  data  science  propaganda  facebook  silicon-valley 
november 2016 by jm
Fast Forward Labs: Fashion Goes Deep: Data Science at Lyst
this is more than just data science really -- this is proper machine learning, with deep learning and a convolutional neural network. serious business
lyst  machine-learning  data-science  ml  neural-networks  supervised-learning  unsupervised-learning  deep-learning 
december 2015 by jm
Analysing user behaviour - from histograms to random forests (PyData) at PyCon Ireland 2015 | Lanyrd
Swrve's own Dave Brodigan on game user-data analysis techniques:
The goal is to give the audience a roadmap for analysing user data using Python-friendly tools.

I will touch on many aspects of the data science pipeline from data cleansing to building predictive data products at scale.

I will start gently with pandas and dataframes, discuss some machine learning techniques like k-means and random forests in scikit-learn, and then introduce Spark for doing it at scale.

I will focus more on the use cases rather than detailed implementation.

The talk will be informed by my experience and focus on user behaviour in games and mobile apps.
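
(The kind of pipeline the abstract describes, sketched in a few lines -- not Dave's actual code, and the file and column names are invented:)

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.ensemble import RandomForestClassifier

    # "events.csv" is a hypothetical stand-in for game telemetry:
    # one row per user with behavioural aggregates.
    df = pd.read_csv("events.csv")
    features = df[["sessions", "minutes_played", "purchases"]]

    # unsupervised step: segment users into behavioural clusters
    df["segment"] = KMeans(n_clusters=4, n_init=10).fit_predict(features)

    # supervised step: predict churn from the same features
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(features, df["churned"])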
swrve  talks  user-data  big-data  spark  hadoop  machine-learning  data-science 
october 2015 by jm
Amazon Machine Learning
Upsides of this new AWS service:

* great UI and visualisations.

* solid choice of metrics to evaluate the results. Maybe things have moved on since I worked in this area (er, 10 years ago!), but the use of AUC, false positives and false negatives was pretty new back then -- see the sketch below.
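
(A minimal sketch of computing those metrics with scikit-learn -- nothing to do with the AWS service itself, and the toy labels and scores are invented:)

    from sklearn.metrics import roc_auc_score, confusion_matrix

    # y_true: actual labels; y_score: model probabilities;
    # y_pred: predictions after thresholding. All toy values.
    y_true = [0, 0, 1, 1]
    y_score = [0.1, 0.4, 0.35, 0.8]
    y_pred = [1 if s >= 0.5 else 0 for s in y_score]

    auc = roc_auc_score(y_true, y_score)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"AUC={auc:.2f}, false positives={fp}, false negatives={fn}")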

Downsides:

* it could do with more support for unsupervised learning algorithms. Supervised learning means you need to provide training data, which in itself can be hard work. My experience with logistic regression in the past is that it requires very accurate training data, too -- its tolerance for misclassified training examples is poor.

* Also, in my experience, 80% of the hard work of using ML algorithms is writing good tokenisation and feature extraction algorithms. I don't see any help for that here unfortunately. (probably not that surprising as it requires really detailed knowledge of the input data to know what classes can be abbreviated into a single class, etc.)
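
(To illustrate the kind of feature-extraction work that comment describes, a toy tokeniser that collapses several surface forms into a single class -- the rules here are invented:)

    import re

    # Domain-specific normalisation: map tokens that should count as
    # one class onto a single feature, so the learner isn't swamped
    # by near-duplicate features.
    def features(text):
        tokens = re.findall(r"[\w.@-]+", text.lower())
        out = []
        for t in tokens:
            if re.fullmatch(r"\d+(\.\d+)?", t):
                out.append("<NUM>")        # all numbers -> one class
            elif "@" in t:
                out.append("<EMAIL>")      # all addresses -> one class
            else:
                out.append(t)
        return out

    print(features("Contact bob@example.com, invoice 1234 due 5.50"))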
amazon  aws  ml  machine-learning  auc  data-science 
april 2015 by jm
Analyzing Citibike Usage
Abe Stanway crunches the stats on Citibike usage in NYC, comparing them against weather data from Wunderground.
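
(The gist of that kind of analysis in pandas -- the filenames and column names are hypothetical:)

    import pandas as pd

    # Hypothetical files: daily Citibike trip counts and Wunderground
    # daily weather for NYC, joined on date to check the correlation.
    trips = pd.read_csv("citibike_daily.csv", parse_dates=["date"])
    weather = pd.read_csv("wunderground_nyc.csv", parse_dates=["date"])
    merged = trips.merge(weather, on="date")
    print(merged["trip_count"].corr(merged["mean_temp_f"]))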
data  correlation  statistics  citibike  cycling  nyc  data-science  weather 
march 2014 by jm
How Kaggle Is Changing How We Work - Thomas Goetz - The Atlantic

Founded in 2010, Kaggle is an online platform for data-mining and predictive-modeling competitions. A company arranges with Kaggle to post a dump of data with a proposed problem, and the site's community of computer scientists and mathematicians -- known these days as data scientists -- take on the task, posting proposed solutions.

[...] On one level, of course, Kaggle is just another spin on crowdsourcing, tapping the global brain to solve a big problem. That stuff has been around for a decade or more, at least back to Wikipedia (or farther back, Linux, etc). And companies like TaskRabbit and oDesk have thrown jobs to the crowd for several years. But I think Kaggle, and other online labor markets, represent more than that, and I'll offer two arguments. First, Kaggle doesn't incorporate work from all levels of proficiency, professionals to amateurs. Participants are experts, and they aren't working for benevolent reasons alone: they want to win, and they want to get better to improve their chances of winning next time. Second, Kaggle doesn't just create the incidental work product, it creates a new marketplace for work, a deeper disruption in a professional field. Unlike traditional temp labor, these aren't bottom of the totem pole jobs. Kagglers are on top. And that disruption is what will kill Joy's Law.

Because here's the thing: the Kaggle ranking has become an essential metric in the world of data science. Employers like American Express and the New York Times have begun listing a Kaggle rank as an essential qualification in their help-wanted ads for data scientists. It's not just a merit badge for the coders; it's a more significant, more valuable indicator of capability than our traditional benchmarks for proficiency or expertise. In other words, your Ivy League diploma and IBM resume don't matter so much as my Kaggle score. It's flipping the resume: your work is measurable and metricized, and your value in the marketplace counts for more than the place you work.
academia  datamining  economics  data  kaggle  data-science  ranking  work  competition  crowdsourcing  contracting 
april 2013 by jm
What can data scientists learn from DevOps?
Interesting.

'Rather than continuing to pretend analysis is a one-time, ad hoc action, automate it. [...] you need to maintain the automation machinery, but a cost-benefit analysis will show that the effort rapidly pays off — particularly for complex actions such as analysis that are nontrivial to get right.' (via @fintanr)
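
(A minimal sketch of the idea: the analysis wrapped as a single callable job that a scheduler can re-run, rather than an ad hoc session; file and column names are placeholders:)

    import pandas as pd

    # The whole analysis as one function: load, transform, report.
    # Re-runnable on a schedule, so it never rots into a one-off.
    def run_analysis(src="events.csv", dest="report.csv"):
        df = pd.read_csv(src)
        summary = df.groupby("user_id")["revenue"].sum().describe()
        summary.to_csv(dest)
        return summary

    if __name__ == "__main__":
        run_analysis()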
via:fintanr  data-science  data  automation  devops  analytics  analysis 
november 2012 by jm
