computation 1767
The Listening Machine
2 days ago by ahk
Generates music streams whose parameters are driven by longitudinal Twitter sentiment analysis of 500 users. Extremely similar to my generated music to data idea.
music
computation
machineLearning
2 days ago by ahk
Our approach to replication in computational science
26 days ago by Vnoel
I'm pretty proud of our most recently posted paper, which is on a
sequence analysis concept we call digital normalization. I think the paper is
pretty kick-ass, but so is the way in which we're approaching
replication. This blog post is about the latter.
(Quick note re "replication" vs "reproduction": The distinction
between replication and reproducibility is, from what I understand,
that "replicable" means "other people get exactly the same results
when doing exactly the same thing", while "reproducible" means
"something similar happens in other people's hands". The latter is
far stronger, in general, because it indicates that your results are
not merely some quirk of your setup and may actually be right.)
So what did we do to make this paper extra super replicable?
If you go to the paper Web site, you'll find:
a link to the paper itself, in preprint form, stored at the arXiv
site;
a tutorial for running the software on a Linux machine hosted in
the Amazon cloud;
a git repository for the software itself (hosted on github);
a git repository for the LaTeX paper and analysis scripts (also
hosted on github), including an ipython notebook for generating the
figures (more about that in my next blog post);
instructions on how to start up an EC2 cloud instance, install the
software and paper pipeline, and build most of the analyses and all
of the figures from scratch;
the data necessary to run the pipeline;
some of the output data discussed in the paper.
(Whew, it makes me a little tired just to type all that...)
What this means is that you can regenerate substantial amounts (but
not all) of the data and analyses underlying the paper from scratch,
all on your own, on a machine that you can rent for something like 50
cents an hour. (It'll cost you about $4 -- 8 hours of CPU -- to
re-run everything, plus some incidental costs for things like downloads.)
Not only can you do this, but if you try it, it will actually work.
I've done my best to make sure the darn thing works, and this is the
actual pipeline we ourselves ran to produce the figures in the paper.
All the data is there, and all of the code used to process the data,
analyze the results, and produce the figures is also there. In
version control.
When you combine that with the ability to run this on a specific EC2
instance -- a combination of a frozen virtual machine installation and
a specific set of hardware -- I feel pretty confident that at least
this component of our paper is something that can be replicated.
A few thoughts on replicability, and effort
Why did I go to all this trouble??
Wasn't it a lot of work?
Well, interestingly enough, it wasn't that much work. I already
use version control for everything, including paper text; posting it
all to github was a matter of about three commands.
Writing the code, analysis scripts, and paper was an immense amount of
work. But I had to do that anyway.
The most extra effort I put in was making sure that the big data files
were available.  I didn't want to add the 2 GB E. coli resequencing
data set to git, for example.  So I ended up tarballing those files
and sticking them on S3.
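A minimal sketch of the tarballing step, assuming Python's standard library. The function name `bundle_data` and the file names are hypothetical, not taken from the paper's pipeline, and the upload itself is left out. Recording a checksum next to the archive lets anyone who downloads it later confirm they have the same bytes.

```python
import hashlib
import tarfile
from pathlib import Path


def bundle_data(files, archive_path):
    """Pack files into a gzipped tarball and return its SHA-256 hex digest.

    Hypothetical helper for illustration; not part of the original pipeline.
    """
    archive_path = Path(archive_path)
    with tarfile.open(archive_path, "w:gz") as tar:
        for f in files:
            f = Path(f)
            # Store only the base name so the archive unpacks flat.
            tar.add(f, arcname=f.name)

    # Hash the finished archive in 1 MiB chunks to keep memory use low.
    digest = hashlib.sha256()
    with open(archive_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `bundle_data(["ecoli_reads.fa"], "data.tar.gz")` (file names are placeholders) writes the archive and returns a digest that can be published alongside the download link.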
The Makefile and analysis scripts are ugly, but suffice to remake
everything from scratch; they were already needed to make the paper,
so in order to post them all I had to do was put in a teensy bit of
effort to remove some unintentional dependencies.
The ipython notebook used to generate the figures (again -- next blog
post) was probably the most effort, because I had to learn how to use
it, which took about 20 minutes. But it was one of the smoothest
transitions into using a new tool I've ever experienced in my ~25 years
of coding.
Overall, it wasn't that much extra effort on my part.
Why bother in the first place??
The first and shortest answer is, because I could, and because I
believe in replication and reproducibility, and wanted to see how
tough it was to actually do something like this. (It's a good deal
above and beyond what most bioinformaticians do.)
Perhaps the strongest reason is that our group has been bitten a lot
in recent months by irreplicable results. I won't name names, but
several Science and PNAS and PLoS One papers of interest to us turned
out to be basically impossible for us to replicate. And, since we are
engaged in developing new computational methods that must be compared
to previous work, an inability to
regenerate exactly the results in those other papers meant we had to
work harder than we should have, simply to reproduce what they'd done.
A number of these problems came from people discarding large data sets
after publishing, under the mistaken belief that their submission to
the Short Read Archive could be used to regenerate their results.
(Often SRA submissions are unfiltered, and no one keeps the filtering
parameters around...right?) In some cases, I got the right data sets
from the authors and could replicate (kudos to Brian Haas of Trinity
for this!), but in most cases, ixnay on the eplicationre.
Then there were the cases where authors clearly were simply being bad
computational scientists. My favorite example is a very high profile
paper (coauthored by someone I admire greatly), in which the script
they sent to us -- a script necessary for the initial analyses -- had
a syntax error in it. In that case, we were fairly sure that the
authors weren't sending us the script they'd actually used... (It was
Perl, so admittedly it's hard to tell a syntax error from legitimate
code, but even the Perl interpreter was choking on this.)
(A few replication problems came from people using closed or
unpublished software, or being hand-wavy about the parameters they
used, or using version X of some Web-hosted pipeline for which only
version Y was now available. Clearly these are long-term issues that
need to be discussed with respect to replication in comp. bio., but
that's another topic.)
Thus, my group has wasted a lot of time replicating other people's
work. I wanted to avoid making other people go through that.
A third reason is that I really, really, really want to make it easy
for people to pick up this tool and use it. Digital normalization
is super ultra awesome and I want as little as possible to stand in
the way of others using it. So there's a strong element of
self-interest in doing things this way, and I hope it makes diginorm
more useful. (I know about a dozen people that have already tried it
out in the week or so since I made the paper available, which is
pretty cool. But citations will tell.)
What use is replication?
Way back when, Jim Graham politely schooled me in the true meaning of
reproducibility, as opposed to replication. He was about 2/3 right,
but then he went a bit too far and said:
"But let's drop the idea that I'm going to take your data and your
code and "reproduce" your result. I'm not. First, I've got my own
work to do. More importantly, the odds are that nobody will be any
wiser when I'm done."
Well, let's take a look at that concern, shall we?
With the benefit of about two years of further practice, I can tell
you this is a dangerously wrong way to think, at least in the field of
bioinformatics. My objections hinge on a few points:
First, based on our experiences so far, I'd be surprised if the
authors themselves could replicate their own computational results --
too many files and parameters are missing. We call that "bad
science".
Second, odds are, the senior professor has little or no detailed
understanding of what bioinformatic steps were taken in processing the
data, and moreover is uninterested in the details; that's why they're
not in the Methods. Why is that a problem? Because the odds are
quite good that many biological analyses hinge critically on such
points. So the peer reviewers and the community at large need to be
able to evaluate them (see this RNA editing kerfuffle for an
excellent example of reviewer fail). Yet most bioinformatic pipelines
are so terribly described that even with some WAG I can't figure out
what, roughly speaking, is going on. I certainly couldn't replicate
it, and generating specific critiques is quite difficult in that kind
of circumstance.
Parenthetically, Graham does refer to the climate sciences' struggles
with reproducibility and replication.  If only they put the same effort
into replication and data archiving that they did into arguing with
climate change deniers...
Third, Graham may be guilty of physics chauvinism (just like I'm
almost certainly guilty of bioinformatics chauvinism...) Physics and
biology are quite different: in physics, you often have a theoretical
framework to go by, and results should at least roughly adhere to that
or else they are considered guilty until proven innocent. In biology,
we usually have no good idea of what we're expecting to see, and often
we're looking at a system for the very first time. In that
environment, I think it's important to make the underlying computation
WAY more solid than you would demand in physics (see RNA editing above).
As Narayan Desai pointed out to me (following which I then put it in
my PyCon talk (slide 5)),
physics and biology are quite different in the way data is generated
and analyzed.  There are fewer sources of data generation in physics,
there's more of a computational culture, and there's more theory.
Having worked with physicists for much of my scientific life (and
having published a number of papers with physicists) I can tell you
that replication is certainly a big problem over there, but the
consequences don't seem as big -- eventually the differences between
theory and computation will be worked out, because they're far more
noticeable when you have theory, like in physics. Not so in biology.
Fourth, a renewed emphasis on computational methods (and therefore on
replicability of computational results) is a natural part of the
transition to Big Data biology. The quality of
analysis methods matters A LOT when you[…]
computation
sciencecode
26 days ago by Vnoel
DCPU-16 spec
27 days ago by brendn
Notch's DCPU-16 spec for that new game...
computation
hardware
processor
esoteric
1010
27 days ago by brendn
Why the New Aesthetic isn’t about 8bit retro, the Robot Readable World, computer vision and pirates
5 weeks ago by jamesmnw
"3D Wireframes were around 30 years ago, solid & textured 3D shortly after and still all done in software. 20 years ago some of these calculations moved onto GPUs on dedicated 3D graphics cards. Computer vision it’s all still done in software, and we’re roughly up-to depth, joints, colour & shading detection, if the evolution was on par with graphics we’d start to see the first few dedicated vision cards appearing on the market for consumer use. Or put another way, current computer vision can probably “see” computer graphics from around 20-30 years ago. Which in turn means to design for machine eyes we need to be at the level of computer graphics from the 8bit era, and so we have QR codes all over the place."
new_aesthetic
design
robotics
computation
vision
graphics
pirates
3D
art
tech_art
technology
from delicious
5 weeks ago by jamesmnw
Fractal Clockwork
5 weeks ago by mtchl
Laser-cut fractal clockwork computer, via @ponoko. Also some interesting musings on the fundamentals of computation.
fabrication
lasercut
computation
materiality
fractal
complexity
from twitter
5 weeks ago by mtchl
Automatic Differentiation: The most criminally underused tool in the potential machine learning toolbox? « Justin Domke’s Weblog
6 weeks ago by classy.dk
This is awesome. Need to research autodiff tools for my fav languages.....
machinelearning
machine-learning
algorithms
computation
mathematics
6 weeks ago by classy.dk
VIFF, the Virtual Ideal Functionality Framework
7 weeks ago by Boinside
VIFF is a framework which allows you to specify secure multi-party computations in a clean and easy way.
distributed
computation
secure
python
framework
7 weeks ago by Boinside