Type :setfiletype (with a space afterwards), then press Ctrl-d.
Three best practices for building successful data pipelines - O'Reilly Media
Drawing on their experiences and my own, I’ve identified three key areas that are often overlooked in data pipelines: making your analysis
1. Reproducible
2. Consistent
3. Productionizable


Science that cannot be reproduced by an external third party is just not science — and this does apply to data science. One of the benefits of working in data science is the ability to apply the existing tools from software engineering. These tools let you isolate all the dependencies of your analyses and make them reproducible.

Dependencies fall into three categories:
1. Analysis code ...
2. Data sources ...
3. Algorithmic randomness ...
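
For the third category, algorithmic randomness, one common approach is to pass an explicit seed into every step that draws random numbers. The sketch below is an illustration only: the function name `sample_rows` and the seed value are hypothetical, not taken from the article.

```python
import random


def sample_rows(rows, k, seed):
    """Return k rows chosen with a private, seeded generator."""
    # A local generator keeps the result independent of any other code
    # that touches the global random state.
    rng = random.Random(seed)
    return rng.sample(rows, k)


rows = list(range(100))
first = sample_rows(rows, 5, seed=42)
second = sample_rows(rows, 5, seed=42)
assert first == second  # same seed, same sample
```

Recording the seed alongside the analysis code lets a third party rerun the step and get the same sample.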


Establishing consistency in data

There are generally two ways of establishing the consistency of data sources. The first is to check all code and data into a single revision control repository. The second is to reserve source control for code and build a pipeline that explicitly depends on external data being in a stable, consistent format and location.

Checking data into version control is generally considered verboten for production software engineers, but it has a place in data analysis. For one thing, it makes your analysis very portable by isolating all dependencies into source control. Here are some conditions under which it makes sense to have both code and data in source control:
Small data sets ...
Regular analytics ...
Fixed source ...
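
When the data stays outside source control, the pipeline can still check that an external file is the one the analysis was written against. A minimal sketch, assuming a SHA-256 digest of the file is pinned in the repository next to the code; the names `sha256_of` and `require_unchanged` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large inputs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def require_unchanged(path, expected_sha256):
    """Raise if the external file differs from the digest pinned in source control."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise ValueError(f"{path}: expected {expected_sha256}, got {actual}")
    return Path(path)
```

Calling `require_unchanged` as the first pipeline step turns a silent change in the source data into an immediate, visible failure.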

Productionizability: Developing a common ETL

1. Common data format ...
2. Isolating library dependencies ...

Rigorously enforce the idempotency constraint
For efficiency, seek to load data incrementally
Always ensure that you can efficiently process historic data
Partition ingested data at the destination
Rest data between tasks
Pool resources for efficiency
Store all metadata together in one place
Manage login details in one place
Specify configuration details once
Parameterize sub flows and dynamically run tasks where possible
Execute conditionally
Develop your own workflow framework and reuse workflow components
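
Several of the points above (idempotency, incremental loading, partitioning at the destination) can be sketched together. This is an illustration under stated assumptions, not a method from the list's source: the `day=<date>/part.json` layout and the function name `load_partition` are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def load_partition(dest_root, day, records):
    """Write one day's records to its own partition, replacing any earlier run."""
    partition = Path(dest_root) / f"day={day}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / "part.json"
    # Write to a temporary file first, then swap it in, so a crashed run
    # never leaves a half-written partition behind.
    fd, tmp_name = tempfile.mkstemp(dir=partition, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(records, handle)
    os.replace(tmp_name, target)
    return target
```

Running the load twice for the same day leaves exactly one file with the same contents, which is the idempotency constraint. Each run touches only its own partition, so historic days can be reprocessed one at a time.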

more focused on details of specific technologies:

The Setup / Gary Bernhardt
In the summer of 2013, I became afraid of RSI and preventatively switched to an Evoluent vertical mouse, which I've been pretty happy with (though I wish I'd gotten the wired version). I also switched both my keyboard geometry and my keyboard layout, which is a much more extreme change.

My keyboard is a full-hand ErgoDox. It looks roughly like the one on the Massdrop assembly page except that my case is longer, extending down from the bottom to form a built-in wrist rest. It has the notoriously clicky Cherry blue switches. I assembled it myself, which required a couple hundred solder joints. You can buy them pre-assembled now, I think, but I enjoyed the process (and I'm now confident that I can repair any problem with it).


Backup is a little complicated. I back up to Amazon S3/Glacier using Arq and to a local Time Capsule using Time Machine. Both of those run hourly and store backup history.

I also make two clones of my full drive: one to a bootable USB drive, and another to the Time Capsule (separate from the Time Machine history). Both are done using SuperDuper. Then, just for good measure, I clone the entire 1.5 TB Time Capsule to another USB drive via rsync. The whole SuperDuper/rsync process happens every two weeks.

The "why" of all of that is a long story, but that's roughly the minimum configuration that I consider fairly safe and easily recoverable after catastrophic failure. The two hourly backup systems involved -- Arq and Time Machine -- have failed completely multiple times, losing or, in Time Machine's case, corrupting all of my backups without alerting me. The causes of those failures remain uncorrected, so they will surely happen again. SuperDuper hasn't failed, but it's also not a storage system itself and its backups have no history.

The Setup / Russ Cox
I swear by the small Apple keyboard (in stores they have one that size with a USB cable too) and the Evoluent mouse.


I run acme full screen as my day to day work environment. It serves the role of editor, terminal, and window system. It's hard to get a feel for it without using it, but this video helps a little.

Rob Pike's sam editor deserves special mention too. From a UI standpoint, it's a graphical version of ed, which you either love or hate, but it does two things better than any other editor I know. First, it is a true multi-file editor. I have used it to edit thousands of files at a time, interactively. Second, and even more important, it works insanely well over low-bandwidth, high-latency connections. I can run sam in Boston to edit files in Sydney over ssh connections where the round trip time would make vi or emacs unusable. Sam runs as two halves: the UI half runs locally and knows about the sections of the file that are on or near the screen, the back end half runs near the files, and the two halves communicate using a well-engineered custom protocol. The original target environment was 1200 bps modem lines in the early 1980s, so it's a little surprising how relevant the design remains, but in fact, it's the same basic design used by any significant JavaScript application on the web today. Finally, sam is the editor of choice for both Ken Thompson and Bjarne Stroustrup. If you can satisfy both of them, you're doing something right.


I use Unison to sync files between my various computers. Dropbox seems to be the hot new thing, but I like that Unison doesn't ever store my files on someone else's computers.


I want to be working on my home desktop, realize what time it is, run out the door to catch my train, open my laptop on the train, continue right where I left off, close the laptop, hop off the train, sit down at work, and have all my state sitting there on the monitor on my desk, all without even thinking about it.
