nhaliday + config   23

syntax highlighting - List known filetypes - Vi and Vim Stack Exchange
Type :setfiletype (with a space afterwards), then press Ctrl-d.
q-n-a  stackex  editors  howto  list  pls  config 
8 weeks ago by nhaliday
Three best practices for building successful data pipelines - O'Reilly Media
Drawn from their experiences and my own, I’ve identified three key areas that are often overlooked in data pipelines; all three come down to making your analysis:
1. Reproducible
2. Consistent
3. Productionizable

...

Science that cannot be reproduced by an external third party is just not science — and this does apply to data science. One of the benefits of working in data science is the ability to apply the existing tools from software engineering. These tools let you isolate all the dependencies of your analyses and make them reproducible.

Dependencies fall into three categories (a sketch pinning all three follows this list):
1. Analysis code ...
2. Data sources ...
3. Algorithmic randomness ...
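
A minimal sketch of what pinning all three dependency categories can look like in a Python analysis; the file path, seed, and manifest layout are illustrative, not from the article:

# Illustrative reproducibility sketch. Pins the three dependency categories
# above: analysis code (git revision), data sources (content hash), and
# algorithmic randomness (fixed seed). DATA_PATH is hypothetical.
import hashlib
import random
import subprocess

DATA_PATH = "data/events.csv"  # hypothetical input file

def git_revision() -> str:
    """Record the exact version of the analysis code."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()

def data_fingerprint(path: str) -> str:
    """Hash the input so a changed data source is detected, not silently used."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

SEED = 42  # fix algorithmic randomness
random.seed(SEED)

manifest = {
    "code": git_revision(),
    "data": {DATA_PATH: data_fingerprint(DATA_PATH)},
    "seed": SEED,
}
print(manifest)  # store alongside the results so the run can be reproduced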

...

Establishing consistency in data
...

There are generally two ways of establishing the consistency of data sources. The first is by checking all code and data into a single revision control repository. The second method is to reserve source control for code and build a pipeline that explicitly depends on external data being in a stable, consistent format and location (a sketch of this check follows the list below).

Checking data into version control is generally considered verboten for production software engineers, but it has a place in data analysis. For one thing, it makes your analysis very portable by isolating all dependencies into source control. Here are some conditions under which it makes sense to have both code and data in source control:
Small data sets ...
Regular analytics ...
Fixed source ...
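
What the second approach can look like in practice, as a minimal Python sketch; the warehouse path and the checksum value are hypothetical placeholders:

# Illustrative check for the second approach: code lives in source control,
# while the pipeline asserts that its external data dependency is where it
# should be and unchanged before anything runs.
import hashlib
import os
import sys

EXPECTED = {
    # hypothetical stable location -> checksum recorded when it was vetted
    "/data/warehouse/customers.csv":
        "9f2c1e0a0b7d4c55e7a1f3b2c8d90e6f5a4b3c2d1e0f9a8b7c6d5e4f3a2b1c0d",
}

def verify_inputs() -> None:
    for path, want in EXPECTED.items():
        if not os.path.exists(path):
            sys.exit(f"missing external data dependency: {path}")
        with open(path, "rb") as f:
            got = hashlib.sha256(f.read()).hexdigest()
        if got != want:
            sys.exit(f"external data changed under us: {path}")

verify_inputs()  # fail fast before any analysis runs on inconsistent data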

Productionizability: Developing a common ETL
...

1. Common data format ... (see the sketch after this list)
2. Isolating library dependencies ...
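
A toy Python sketch of both points; all names (Record, extract_csv, the input file) are invented for illustration. Each source-specific extractor emits one common record format, and its library dependency stays isolated inside it, so the shared transform and load steps are written once:

# Sketch of a common ETL entry point (all names hypothetical): every
# source-specific extractor is adapted to one common record format.
from typing import Dict, Iterable, List

Record = Dict[str, str]  # the common data format, kept deliberately simple

def extract_csv(path: str) -> Iterable[Record]:
    import csv  # source-specific dependency isolated inside the extractor
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(records: Iterable[Record]) -> List[Record]:
    # shared, source-agnostic cleaning step
    return [{k: v.strip() for k, v in r.items()} for r in records]

def load(records: List[Record]) -> None:
    for r in records:
        print(r)  # stand-in for the real destination

load(transform(extract_csv("input.csv")))  # hypothetical input file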

https://blog.koresoftware.com/blog/etl-principles
Rigorously enforce the idempotency constraint (see the sketch after this list)
For efficiency, seek to load data incrementally
Always ensure that you can efficiently process historic data
Partition ingested data at the destination
Rest data between tasks
Pool resources for efficiency
Store all metadata together in one place
Manage login details in one place
Specify configuration details once
Parameterize sub flows and dynamically run tasks where possible
Execute conditionally
Develop your own workflow framework and reuse workflow components
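
To make the first two principles concrete, a minimal Python sketch, assuming a sqlite-style DB-API connection; the events table and its columns are hypothetical. Each run owns one date partition and delete-then-inserts it, so re-running is harmless (idempotent) and only the new slice is touched (incremental):

# Minimal sketch combining idempotent and incremental loading. All names
# are hypothetical; `conn` is assumed to be a sqlite-style DB-API connection.
def load_partition(conn, day: str, rows) -> None:
    with conn:  # one transaction: a rerun never double-counts
        cur = conn.cursor()
        # idempotency: wipe exactly the slice this run owns...
        cur.execute("DELETE FROM events WHERE event_date = ?", (day,))
        # ...then reload it; incremental because only `day` is touched
        cur.executemany(
            "INSERT INTO events (event_date, payload) VALUES (?, ?)",
            [(day, r) for r in rows],
        )

Efficient reprocessing of historic data (the third principle) then falls out for free: loop load_partition over the old partitions.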

more focused on details of specific technologies:
https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-i-4227c5c457d7

https://www.cloudera.com/documentation/director/cloud/topics/cloud_de_best_practices.html
techtariat  org:com  best-practices  engineering  code-organizing  machine-learning  data-science  yak-shaving  nitty-gritty  workflow  config  vcs  replication  homo-hetero  multi  org:med  design  system-design  links  shipping  minimalism  volo-avolo  causation  random  invariance  structure  arrows  protocol-metadata 
10 weeks ago by nhaliday
The Setup / Gary Bernhardt
In the summer of 2013, I became afraid of RSI and preventatively switched to an Evoluent vertical mouse, which I've been pretty happy with (though I wish I'd gotten the wired version). I also switched both my keyboard geometry and my keyboard layout, which is a much more extreme change.

My keyboard is a full-hand ErgoDox. It looks roughly like the one on the Massdrop assembly page except that my case is longer, extending down from the bottom to form a built-in wrist rest. It has the notoriously clicky Cherry blue switches. I assembled it myself, which required a couple hundred solder joints. You can buy them pre-assembled now, I think, but I enjoyed the process (and I'm now confident that I can repair any problem with it).

...

Backup is a little complicated. I back up to Amazon S3/Glacier using Arq and to a local Time Capsule using Time Machine. Both of those run hourly and store backup history.

I also make two clones of my full drive: one to a bootable USB drive, and another to the Time Capsule (separate from the Time Machine history). Both are done using SuperDuper. Then, just for good measure, I clone the entire 1.5 TB Time Capsule to another USB drive via rsync. The whole SuperDuper/rsync process happens every two weeks.

The "why" of all of that is a long story, but that's roughly the minimum configuration that I consider fairly safe and easily recoverable after catastrophic failure. The two hourly backup systems involved -- Arq and Time Machine -- have failed completely multiple times, losing or, in Time Machine's cases, corrupting all of my backups without alerting me. The causes of those failures remain uncorrected, so they will surely happen again. SuperDuper hasn't failed, but it's also not a storage system itself and its backups have no history.
software  techtariat  devtools  interview  tools  list  engineering  profile  people  summer-2014  programming  app  gtd  desktop  osx  terminal  security  opsec  recommendations  email  backup  python  pls  ergo  best-practices  config  diogenes 
august 2014 by nhaliday
The Setup / Russ Cox
I swear by the small Apple keyboard (in stores they have one that size with a USB cable too) and the Evoluent mouse.

...

I run acme full screen as my day to day work environment. It serves the role of editor, terminal, and window system. It's hard to get a feel for it without using it, but this video helps a little.

Rob Pike's sam editor deserves special mention too. From a UI standpoint, it's a graphical version of ed, which you either love or hate, but it does two things better than any other editor I know. First, it is a true multi-file editor. I have used it to edit thousands of files at a time, interactively. Second, and even more important, it works insanely well over low-bandwidth, high-latency connections. I can run sam in Boston to edit files in Sydney over ssh connections where the round trip time would make vi or emacs unusable. Sam runs as two halves: the UI half runs locally and knows about the sections of the file that are on or near the screen; the back-end half runs near the files; and the two halves communicate using a well-engineered custom protocol. The original target environment was 1200 bps modem lines in the early 1980s, so it's a little surprising how relevant the design remains, but in fact, it's the same basic design used by any significant JavaScript application on the web today. Finally, sam is the editor of choice for both Ken Thompson and Bjarne Stroustrup. If you can satisfy both of them, you're doing something right.

...

I use Unison to sync files between my various computers. Dropbox seems to be the hot new thing, but I like that Unison doesn't ever store my files on someone else's computers.

...

I want to be working on my home desktop, realize what time it is, run out the door to catch my train, open my laptop on the train, continue right where I left off, close the laptop, hop off the train, sit down at work, and have all my state sitting there on the monitor on my desk, all without even thinking about it.
programming  hardware  plan9  rsc  software  recommendations  techtariat  devtools  worse-is-better/the-right-thing  nostalgia  summer-2014  interview  ergo  osx  linux  desktop  consumerism  people  editors  tools  list  google  cloud  os  profile  summary  c(pp)  networking  performance  distributed  config  cracker-prog  heavyweights  unix 
july 2014 by nhaliday
