nhaliday + org:com   69

Three best practices for building successful data pipelines - O'Reilly Media
Drawn from their experiences and my own, I’ve identified three key areas that are often overlooked in data pipelines, and those are making your analysis:
1. Reproducible
2. Consistent
3. Productionizable

...

Science that cannot be reproduced by an external third party is just not science — and this does apply to data science. One of the benefits of working in data science is the ability to apply the existing tools from software engineering. These tools let you isolate all the dependencies of your analyses and make them reproducible.

Dependencies fall into three categories:
1. Analysis code ...
2. Data sources ...
3. Algorithmic randomness ...
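A minimal sketch of pinning down the third category, algorithmic randomness. The helper name `sample_rows` and the seed value are made up for illustration and do not come from the article; the idea is only that an explicitly seeded, local generator makes a random step repeatable.

```python
import random


def sample_rows(rows, k, seed):
    """Return k rows chosen by a private, explicitly seeded generator.

    `sample_rows` is a hypothetical helper, not something from the article.
    """
    rng = random.Random(seed)  # local generator, so no hidden global state
    return rng.sample(rows, k)


rows = list(range(100))
first = sample_rows(rows, 5, seed=42)
second = sample_rows(rows, 5, seed=42)
print(first == second)  # same seed and same input give the same sample
```

Passing the seed in as a parameter (rather than calling `random.seed` globally) keeps the dependency visible at the call site.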

...

Establishing consistency in data
...

There are generally two ways of establishing the consistency of data sources. The first is by checking all code and data into a single revision control repository. The second method is to reserve source control for code and build a pipeline that explicitly depends on external data being in a stable, consistent format and location.

Checking data into version control is generally considered verboten for production software engineers, but it has a place in data analysis. For one thing, it makes your analysis very portable by isolating all dependencies into source control. Here are some conditions under which it makes sense to have both code and data in source control:
Small data sets ...
Regular analytics ...
Fixed source ...

Productionizability: Developing a common ETL
...

1. Common data format ...
2. Isolating library dependencies ...

https://blog.koresoftware.com/blog/etl-principles
Rigorously enforce the idempotency constraint
For efficiency, seek to load data incrementally
Always ensure that you can efficiently process historic data
Partition ingested data at the destination
Rest data between tasks
Pool resources for efficiency
Store all metadata together in one place
Manage login details in one place
Specify configuration details once
Parameterize sub flows and dynamically run tasks where possible
Execute conditionally
Develop your own workflow framework and reuse workflow components
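A rough sketch of how the idempotency, incremental-load, historic-data and partitioning points above can fit together. Everything here is an assumption for illustration: the dict standing in for a destination table, the `load_partition` helper, and the date strings are not from the linked post.

```python
def load_partition(warehouse, day, rows):
    """Idempotent load: replace the whole partition for `day`.

    `warehouse` is a plain dict standing in for a partitioned destination
    table (hypothetical). Overwriting instead of appending means a rerun
    cannot duplicate rows.
    """
    warehouse[day] = list(rows)


warehouse = {}

# Incremental run for one day, then an accidental rerun of the same task.
load_partition(warehouse, "2019-01-01", [1, 2, 3])
load_partition(warehouse, "2019-01-01", [1, 2, 3])

# Backfilling historic data reuses the same task, one partition at a time.
history = {"2018-12-30": [7], "2018-12-31": [8, 9]}
for day, rows in history.items():
    load_partition(warehouse, day, rows)

print(warehouse)
```

Because each run owns exactly one partition, reruns and backfills go through the same code path as the daily load.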

more focused on details of specific technologies:
https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-i-4227c5c457d7

https://www.cloudera.com/documentation/director/cloud/topics/cloud_de_best_practices.html
techtariat  org:com  best-practices  engineering  code-organizing  machine-learning  data-science  yak-shaving  nitty-gritty  workflow  config  vcs  replication  homo-hetero  multi  org:med  design  system-design  links  shipping  minimalism  volo-avolo  causation  random  invariance  structure  arrows  protocol-metadata 
6 weeks ago by nhaliday
The Law of Leaky Abstractions – Joel on Software
[TCP/IP example]

All non-trivial abstractions, to some degree, are leaky.

...

- Something as simple as iterating over a large two-dimensional array can have radically different performance if you do it horizontally rather than vertically, depending on the “grain of the wood” — one direction may result in vastly more page faults than the other direction, and page faults are slow. Even assembly programmers are supposed to be allowed to pretend that they have a big flat address space, but virtual memory means it’s really just an abstraction, which leaks when there’s a page fault and certain memory fetches take way more nanoseconds than other memory fetches.

- The SQL language is meant to abstract away the procedural steps that are needed to query a database, instead allowing you to define merely what you want and let the database figure out the procedural steps to query it. But in some cases, certain SQL queries are thousands of times slower than other logically equivalent queries. A famous example of this is that some SQL servers are dramatically faster if you specify “where a=b and b=c and a=c” than if you only specify “where a=b and b=c” even though the result set is the same. You’re not supposed to have to care about the procedure, only the specification. But sometimes the abstraction leaks and causes horrible performance and you have to break out the query plan analyzer and study what it did wrong, and figure out how to make your query run faster.

...

- C++ string classes are supposed to let you pretend that strings are first-class data. They try to abstract away the fact that strings are hard and let you act as if they were as easy as integers. Almost all C++ string classes overload the + operator so you can write s + “bar” to concatenate. But you know what? No matter how hard they try, there is no C++ string class on Earth that will let you type “foo” + “bar”, because string literals in C++ are always char*’s, never strings. The abstraction has sprung a leak that the language doesn’t let you plug. (Amusingly, the history of the evolution of C++ over time can be described as a history of trying to plug the leaks in the string abstraction. Why they couldn’t just add a native string class to the language itself eludes me at the moment.)

- And you can’t drive as fast when it’s raining, even though your car has windshield wipers and headlights and a roof and a heater, all of which protect you from caring about the fact that it’s raining (they abstract away the weather), but lo, you have to worry about hydroplaning (or aquaplaning in England) and sometimes the rain is so strong you can’t see very far ahead so you go slower in the rain, because the weather can never be completely abstracted away, because of the law of leaky abstractions.
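A small sketch of the SQL bullet above, using SQLite through Python's standard `sqlite3` module. SQLite, the table `t`, and the sample rows are my own choices for illustration; the essay does not name a particular server, and nothing here claims SQLite shows the slowdown it describes. The sketch only confirms that the two WHERE clauses return the same rows, and pulls the query plan for each, which is where such a leak would be investigated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 1, 1), (1, 1, 2), (2, 3, 3), (4, 4, 4), (5, 6, 5)],
)


def run(where):
    """Return (rows, query plan) for a SELECT with the given WHERE clause."""
    sql = f"SELECT a, b, c FROM t WHERE {where} ORDER BY a"
    rows = conn.execute(sql).fetchall()
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return rows, plan


short_rows, short_plan = run("a = b AND b = c")
redundant_rows, redundant_plan = run("a = b AND b = c AND a = c")

print(short_rows == redundant_rows)  # logically equivalent queries
print(short_plan)
print(redundant_plan)
```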

One reason the law of leaky abstractions is problematic is that it means that abstractions do not really simplify our lives as much as they were meant to. When I’m training someone to be a C++ programmer, it would be nice if I never had to teach them about char*’s and pointer arithmetic. It would be nice if I could go straight to STL strings. But one day they’ll write the code “foo” + “bar”, and truly bizarre things will happen, and then I’ll have to stop and teach them all about char*’s anyway.

...

The law of leaky abstractions means that whenever somebody comes up with a wizzy new code-generation tool that is supposed to make us all ever-so-efficient, you hear a lot of people saying “learn how to do it manually first, then use the wizzy tool to save time.” Code generation tools which pretend to abstract out something, like all abstractions, leak, and the only way to deal with the leaks competently is to learn about how the abstractions work and what they are abstracting. So the abstractions save us time working, but they don’t save us time learning.
techtariat  org:com  working-stiff  essay  programming  cs  software  abstraction  worrydream  thinking  intricacy  degrees-of-freedom  networking  examples  traces  no-go  volo-avolo  tradeoffs  c(pp)  pls  strings  dbs  transportation  driving  analogy  aphorism  learning  paradox  systems  elegance  nitty-gritty  concrete  cracker-prog  metal-to-virtual  protocol-metadata 
10 weeks ago by nhaliday
Cleaner, more elegant, and harder to recognize | The Old New Thing
Really easy
Writing bad error-code-based code
Writing bad exception-based code

Hard
Writing good error-code-based code

Really hard
Writing good exception-based code

--

Really easy
Recognizing that error-code-based code is badly-written
Recognizing the difference between bad error-code-based code and not-bad error-code-based code.

Hard
Recognizing that error-code-based code is not badly-written

Really hard
Recognizing that exception-based code is badly-written
Recognizing that exception-based code is not badly-written
Recognizing the difference between bad exception-based code and not-bad exception-based code
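A toy illustration of why the exception-based cases land in "really hard". This is my own example, not code from the post; `Account`, `withdraw`, and both `transfer_*` functions are hypothetical. The two transfer functions differ only in statement order, and nothing at the call sites signals which line can throw.

```python
class Account:
    def __init__(self, balance):
        self.balance = balance


def withdraw(account, amount):
    """Raise instead of returning an error code."""
    if amount > account.balance:
        raise ValueError("insufficient funds")
    account.balance -= amount


def transfer_bad(src, dst, amount):
    dst.balance += amount   # state is changed first...
    withdraw(src, amount)   # ...then this may raise, leaving dst credited


def transfer_ok(src, dst, amount):
    withdraw(src, amount)   # the step that can fail runs first
    dst.balance += amount   # only reached if withdraw succeeded


def attempt(transfer):
    """Run a failing transfer and report the balances left behind."""
    src, dst = Account(10), Account(0)
    try:
        transfer(src, dst, 50)
    except ValueError:
        pass
    return src.balance, dst.balance


print(attempt(transfer_bad))  # destination keeps money that never left the source
print(attempt(transfer_ok))   # both balances unchanged
```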

https://ra3s.com/wordpress/dysfunctional-programming/2009/07/15/return-code-vs-exception-handling/
https://nedbatchelder.com/blog/200501/more_exception_handling_debate.html
techtariat  org:com  microsoft  working-stiff  pragmatic  carmack  error  error-handling  programming  rhetoric  debate  critique  pls  search  structure  cost-benefit  comparison  summary  intricacy  certificates-recognition  commentary  multi  contrarianism  correctness  quality  code-dive  cracker-prog 
10 weeks ago by nhaliday
From Java 8 to Java 11 - Quick Guide - Codete blog
notable:
jshell, Optional methods, var (type inference), {List,Set,Map}.copyOf, `java $PROGRAM.java` execution
programming  cheatsheet  reference  comparison  jvm  pls  oly-programming  gotchas  summary  flux-stasis  marginal  org:com 
11 weeks ago by nhaliday
Stack Overflow Developer Survey 2018
Rust, Python, Go among the most loved
F#/OCaml highest-paying globally, Erlang/Scala/OCaml in the US (F# still in top 10)
ML specialists high-paid
editor usage: VSCode > VS > Sublime > Vim > Intellij >> Emacs
ranking  list  top-n  time-series  data  database  programming  engineering  pls  trends  stackex  poll  career  exploratory  network-structure  ubiquity  ocaml-sml  rust  golang  python  dotnet  money  jobs  compensation  erlang  scala  jvm  ai  ai-control  risk  futurism  ethical-algorithms  data-science  machine-learning  editors  devtools  tools  pro-rata  org:com 
december 2018 by nhaliday
My Heroic and Lazy Stand Against IFTTT (Pinboard Blog)
Imagine if your sewer pipe started demanding that you make major changes in your diet.

Now imagine that it got a lawyer and started asking you to sign things.

You would feel surprised.
diogenes  lol  pinboard  tech  sv  techtariat  integration-extension  org:com 
july 2016 by nhaliday
The Next Generation of Software Stacks | StackShare
most interesting part to me:
GECS have a clear bias towards certain types of applications and services as well. These preferences are particularly apparent in the analytics stack. Tools typically aimed primarily at marketing teams—tools like Crazy Egg, Optimizely, and Google Analytics, the most popular tool on Stackshare—are extremely unpopular among GECS. These services are being replaced by tools that are aimed at serving both marketing and analytics teams. Segment, Mixpanel, Heap, and Amplitude, which provide flexible access to raw data, are well-represented among GECS, suggesting that these companies are looking to understand user behavior beyond clicks and page views.
data  analysis  business  startups  tech  planning  techtariat  org:com  ecosystem  software  saas  network-structure  integration-extension  cloud  github  oss  vcs  amazon  communication  trends  pro-rata  crosstab  visualization  sv  programming  pls  web  javascript  frontend  marketing 
april 2016 by nhaliday
YC's 2015 Reading List · The Macro
The Lost Art of Finding Our Way is my favorite suggestion
list  books  startups  yc  recommendations  spatial  🖥  top-n  2015  techtariat  barons  org:com  fiction  play 
december 2015 by nhaliday

