Live Free or Dichotomize - Using AWK and R to parse 25tb

How to read this post: I sincerely apologize for how long and rambling the following text is. To make it easier to skim for those who have better things to do with their time, I have started most sections with a “Lesson learned” blurb that boils the takeaway of that section down to a sentence or two.
Every generation, these techniques are rediscovered. Note how Apache Spark choked out!
A story of going from Spark, 8 minutes, and $20 per AWS query to mostly R + AWK, 0.1 seconds, and $0.0001 per query.

Pre-processing a massive (25tb) amount of DNA(?) data into a format easily queryable on AWS.

