jm + version-control   12

'Containerized Data Analytics':
There are two bold new ideas in Pachyderm:

Containers as the core processing primitive
Version Control for data

These ideas lead directly to a system that's much more powerful, flexible and easy to use.

To process data, you simply create a containerized program which reads and writes to the local filesystem. You can use any tools you want because it's all just going in a container! Pachyderm will take your container and inject data into it. We'll then automatically replicate your container, showing each copy a different chunk of data. With this technique, Pachyderm can scale any code you write to process up to petabytes of data (Example: distributed grep).

Pachyderm also version controls all data using a commit-based distributed filesystem (PFS), similar to what git does with code. Version control for data has far reaching consequences in a distributed filesystem. You get the full history of your data, can track changes and diffs, collaborate with teammates, and if anything goes wrong you can revert the entire cluster with one click!

Version control is also very synergistic with our containerized processing engine. Pachyderm understands how your data changes and thus, as new data is ingested, can run your workload on the diff of the data rather than the whole thing. This means that there's no difference between a batched job and a streaming job, the same code will work for both!
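
A toy sketch of the "run your workload on the diff" idea described above. This is not Pachyderm's API: `snapshot`, `changed_paths`, `run_job` and the sample files are hypothetical names made up for illustration, and a "commit" here is just a dict of path to bytes.

```python
import hashlib

def snapshot(files):
    """Map each path to a content hash; stands in for one commit (assumption)."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}

def changed_paths(old, new):
    """Paths that are new, or whose content hash differs, between two snapshots.
    Deleted paths are ignored in this sketch."""
    return sorted(p for p, h in new.items() if old.get(p) != h)

def run_job(files, paths, job):
    """Run the job only on the given paths."""
    return {p: job(files[p]) for p in paths}

def count_lines(data):
    """Example workload: count newline-terminated lines."""
    return data.count(b"\n")

commit1 = {"a.log": b"x\ny\n", "b.log": b"z\n"}
commit2 = {"a.log": b"x\ny\n", "b.log": b"z\nw\n", "c.log": b"q\n"}

# "Batch" run: everything is new relative to an empty snapshot.
results = run_job(commit1, changed_paths({}, snapshot(commit1)), count_lines)

# "Streaming" run: the same job, applied only to what changed in commit2.
delta = changed_paths(snapshot(commit1), snapshot(commit2))
results.update(run_job(commit2, delta, count_lines))
```

The batch run and the incremental run call the same `count_lines` function, which is the point the excerpt makes about batched and streaming jobs sharing code.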
analytics  data  containers  golang  pachyderm  tools  data-science  docker  version-control 
4 weeks ago by jm
git for Cloud Storage. Create distributed, decentralized and versioned repositories that scale infinitely to 100s of millions of files and PBs of storage. Huge repos can be cloned on your local SSD for making changes, committing and pushing back. Oh yeah, and it dedupes too due to BLAKE2 Tree hashing.
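
Roughly how hash-based dedupe can work, as a minimal sketch. Assumptions: this uses flat fixed-size chunks keyed by a plain BLAKE2b digest, not the tree hashing the tool describes, and `store_file`, `load_file` and the tiny `CHUNK_SIZE` are hypothetical names chosen for the example.

```python
import hashlib

CHUNK_SIZE = 4  # unrealistically small, so the example is easy to follow

def store_file(data, store):
    """Split data into fixed-size chunks, keep each chunk once under its
    BLAKE2b digest, and return the list of digests (the file's 'recipe')."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        key = hashlib.blake2b(chunk, digest_size=32).hexdigest()
        store.setdefault(key, chunk)  # identical chunks are stored only once
        recipe.append(key)
    return recipe

def load_file(recipe, store):
    """Rebuild the file contents from its recipe."""
    return b"".join(store[key] for key in recipe)

store = {}
r1 = store_file(b"aaaabbbbaaaa", store)
r2 = store_file(b"bbbbcccc", store)
```

Five chunks go in across the two files, but `store` ends up holding only the three distinct ones.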
git  ops  storage  cloud  s3  disk  aws  version-control  blake2 
april 2016 by jm
git integrity - Google Groups
It seems git's default behavior in many situations is -- despite communicating objectID by content-addressable hashes which should be sufficient to assure some integrity -- it may not actually bother to *check* them.  Yes, even when receiving objects from other repos.  So, enabling these configuration parameters may "slow down" your git operations.  The return is actually noticing if someone ships you a bogus object.  Everyone should enable these.
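
For context, a sketch of what "checking" a content-addressable ID means for a blob in a SHA-1 repository: recompute the hash over git's `blob <size>\0` header plus the content and compare it with the claimed ID. The function names are hypothetical; the entry does not name the configuration parameters, though they are most likely git's `transfer.fsckObjects`, `fetch.fsckObjects` and `receive.fsckObjects` options.

```python
import hashlib

def git_blob_id(content):
    """Object ID git assigns to a blob in a SHA-1 repository:
    SHA-1 over the header 'blob <size>' + NUL, followed by the raw content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

def verify_blob(claimed_id, content):
    """True only if the content really hashes to the ID it was shipped under."""
    return git_blob_id(content) == claimed_id

good = verify_blob(git_blob_id(b"hello world\n"), b"hello world\n")
bogus = verify_blob(git_blob_id(b"hello world\n"), b"hello w0rld\n")
```

This only covers the hash comparison for blobs; git's own object checks do more than this.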
git  security  integrity  error-checking  dvcs  version-control  coding 
february 2016 by jm
'The multiple repository tool'. How Google kludged around the split-repo problem when you don't have a monorepo.
kludges  git  monorepo  monorepi  google  android  aosp  repo  coding  version-control  dvcs 
may 2015 by jm
'Continuous Deployment: The Dirty Details'
Good slide deck from Etsy's Mike Brittain regarding their CD setup. Some interesting little-known details:

Slide 41: database schema changes are not CD'd -- they go out on "Schema change Thursdays".

Slide 44: only the webapp is CD'd -- PHP, Apache, memcache components (, support and back-office tools, developer API, gearman async worker queues). The external "services" are not -- databases, Solr/JVM search (rolling restarts), photo storage (filters, proxy cache, S3), payments (PCI-DSS, controlled access).

They avoid schema changes and breaking changes using an approach they call "non-breaking expansions" -- expose new version in a service interface; support multiple versions in the consumer. Example from slides 50-63, based around a database schema migration.
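
A guess at what "support multiple versions in the consumer" looks like in practice, as a minimal sketch. The column names and `read_display_name` are hypothetical and not taken from the slides.

```python
def read_display_name(row):
    """Consumer that accepts both the old and the expanded schema, so the
    schema change and the code deploy can ship independently."""
    if "display_name" in row:  # new column, present once the expansion is live
        return row["display_name"]
    # Old schema: derive the value from the columns that already existed.
    return f'{row["first_name"]} {row["last_name"]}'

old_row = {"first_name": "Ada", "last_name": "Lovelace"}
new_row = {"first_name": "Ada", "last_name": "Lovelace", "display_name": "Ada L."}
```

Once every row carries the new column, the fallback branch can be deleted in a later deploy.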

Slide 66: "dev flags" (rollout oriented) are promoted to "feature flags" (long lived degradation control).

Slide 71: some architectural philosophies: deploying is cheap; releasing is cheap; gathering data should be cheap too; treat first iterations as experiments.

Slide 102: "Canary pools". They have multiple pools of users for testing in production -- the staff pool, users who have opted in to see prototypes/beta stuff, 0-100% gradual phased rollout.
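
One common way to implement a 0-100% phased rollout alongside staff and opt-in pools, sketched under assumptions: the slides do not say how Etsy assigns users, so the hashing scheme and the names `bucket` and `sees_feature` are made up for illustration.

```python
import hashlib

def bucket(user_id, flag):
    """Deterministically map a (flag, user) pair to a bucket in 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100

def sees_feature(user_id, flag, percent, staff=frozenset(), opted_in=frozenset()):
    """Staff and opted-in users always see the feature; everyone else sees it
    once the rollout percentage passes their bucket."""
    if user_id in staff or user_id in opted_in:
        return True
    return bucket(user_id, flag) < percent

users = [f"user{i}" for i in range(200)]
at_10 = {u for u in users if sees_feature(u, "new-checkout", 10)}
at_50 = {u for u in users if sees_feature(u, "new-checkout", 50)}
```

Because each user's bucket is fixed per flag, raising the percentage only ever adds users; nobody who saw the feature at 10% loses it at 50%.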
cd  deploy  etsy  slides  migrations  database  schema  ops  ci  version-control  feature-flags 
april 2015 by jm
Bug Prediction at Google
LOL. grepping commit logs for /bug|fix/ does the job, apparently:
In the literature, Rahman et al. found that a very cheap algorithm actually performs almost as well as some very expensive bug-prediction algorithms. They found that simply ranking files by the number of times they've been changed with a bug-fixing commit (i.e. a commit which fixes a bug) will find the hot spots in a code base. Simple! This matches our intuition: if a file keeps requiring bug-fixes, it must be a hot spot because developers are clearly struggling with it.
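
The cheap ranking described above, as a minimal sketch. Assumptions: commits arrive as hypothetical `(message, files)` pairs rather than from a real repository, and a commit counts as bug-fixing if its message matches the crude /bug|fix/ pattern (which will also match words like "prefix").

```python
import re
from collections import Counter

# Crude bug-fix detector: any commit message containing "bug" or "fix".
BUGFIX = re.compile(r"bug|fix", re.IGNORECASE)

def rank_hot_spots(commits):
    """Rank files by how many bug-fixing commits touched them.
    Ties are broken alphabetically so the output is deterministic."""
    counts = Counter()
    for message, files in commits:
        if BUGFIX.search(message):
            counts.update(set(files))  # count each file once per commit
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

commits = [
    ("Fix crash in parser", ["parser.py", "util.py"]),
    ("Add feature", ["parser.py"]),
    ("bug: off by one", ["parser.py"]),
    ("fix typo in docs", ["README"]),
]
hot_spots = rank_hot_spots(commits)
```

Here `parser.py` comes out on top with two bug-fixing commits; the "Add feature" commit touches it too but is not counted.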
bugs  rahman-algorithm  heuristics  source-code-analysis  coding  algorithms  google  static-code-analysis  version-control 
march 2015 by jm
Git is not scalable with too many refs/*
Mailing list thread from 2011; git starts to keel over if you tag too much
git  tags  coding  version-control  bugs  scaling  refs 
february 2014 by jm
GitHub Archive
a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis. GitHub provides 18 event types, which range from new commits and fork events, to opening new tickets, commenting, and adding members to a project. The activity is aggregated in hourly [gzipped JSON] archives, which you can access with any HTTP client.
github  data  git  history  version-control  oss  archival 
march 2013 by jm
Large file management with git-annex
'uses Git to manage files that are larger than Git can easily handle—without checking them into the repository. But git-annex provides ways to track those files using much of the same infrastructure as Git, so that moving or deleting those files can all be tracked in much the same way as committed files. In addition, git-annex allows for branches and distributed clones of its trees.' I may investigate using this to sync my MP3s instead of SVN
git  git-annex  version-control 
december 2011 by jm
CPAN and BackPAN, as a set of git repositories; essentially a read-only view of all CPAN releases, ever. Good plan; I like the way git is useful as a kind of general-purpose distributed archive system.
git  gitpan  cpan  backpan  perl  releases  archives  history  version-control  from delicious
june 2010 by jm
