jm + version-control 12
pachyderm
february 2017 by jm
'Containerized Data Analytics':
analytics
data
containers
golang
pachyderm
tools
data-science
docker
version-control
There are two bold new ideas in Pachyderm:
Containers as the core processing primitive
Version Control for data
These ideas lead directly to a system that's much more powerful, flexible and easy to use.
To process data, you simply create a containerized program which reads and writes to the local filesystem. You can use any tools you want because it's all just going in a container! Pachyderm will take your container and inject data into it. We'll then automatically replicate your container, showing each copy a different chunk of data. With this technique, Pachyderm can scale any code you write to process up to petabytes of data (Example: distributed grep).
Pachyderm also version controls all data using a commit-based distributed filesystem (PFS), similar to what git does with code. Version control for data has far reaching consequences in a distributed filesystem. You get the full history of your data, can track changes and diffs, collaborate with teammates, and if anything goes wrong you can revert the entire cluster with one click!
Version control is also very synergistic with our containerized processing engine. Pachyderm understands how your data changes and thus, as new data is ingested, can run your workload on the diff of the data rather than the whole thing. This means that there's no difference between a batched job and a streaming job, the same code will work for both!
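As a rough sketch of the kind of program the description has in mind (its "distributed grep" example), here is a script that only reads from one local directory and writes to another, leaving data injection and replication to the platform. The function name `grep_dir` and the directory arguments are hypothetical; this is not Pachyderm's own example code.

```python
import re
from pathlib import Path


def grep_dir(pattern: str, in_dir: str, out_dir: str) -> int:
    """Copy lines matching `pattern` from each file in in_dir to a
    same-named file in out_dir. Returns the total number of matching lines."""
    regex = re.compile(pattern)
    src, dst = Path(in_dir), Path(out_dir)
    dst.mkdir(parents=True, exist_ok=True)
    total = 0
    for path in sorted(src.iterdir()):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8", errors="replace") as f:
            matches = [line for line in f if regex.search(line)]
        if matches:
            # Only files with at least one match produce an output file.
            (dst / path.name).write_text("".join(matches), encoding="utf-8")
            total += len(matches)
    return total
```

Because the script knows nothing about the cluster, the same code could in principle be handed one chunk of the data, or only the newly committed files, without changes.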
february 2017 by jm
s3git
git
ops
storage
cloud
s3
disk
aws
version-control
blake2
april 2016 by jm
git for Cloud Storage. Create distributed, decentralized and versioned repositories that scale infinitely to 100s of millions of files and PBs of storage. Huge repos can be cloned on your local SSD for making changes, committing and pushing back. Oh yeah, and it dedupes too due to BLAKE2 Tree hashing. http://s3git.org
april 2016 by jm
git integrity - Google Groups
git
security
integrity
error-checking
dvcs
version-control
coding
february 2016 by jm
It seems git's default behavior in many situations is -- despite communicating objectID by content-addressable hashes which should be sufficient to assure some integrity -- it may not actually bother to *check* them. Yes, even when receiving objects from other repos. So, enabling these configuration parameters may "slow down" your git operations. The return is actually noticing if someone ships you a bogus object. Everyone should enable these.
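The excerpt does not name the configuration parameters. Assuming it means git's `fsckObjects` settings (`transfer.fsckObjects`, `fetch.fsckObjects`, `receive.fsckObjects`), a small helper to switch them on might look like this. The function names are hypothetical, and the choice of `--global` scope is just a default for the sketch.

```python
import subprocess

# Assumption: these are the settings the post refers to; the quoted text
# does not list them.
FSCK_SETTINGS = (
    "transfer.fsckObjects",
    "fetch.fsckObjects",
    "receive.fsckObjects",
)


def fsck_config_commands(scope: str = "--global"):
    """Build the `git config` commands without running them."""
    return [["git", "config", scope, key, "true"] for key in FSCK_SETTINGS]


def enable_fsck(scope: str = "--global") -> None:
    """Run the commands; raises CalledProcessError if git reports failure."""
    for cmd in fsck_config_commands(scope):
        subprocess.run(cmd, check=True)
```

Keeping the command construction separate from execution makes it easy to print the commands and review them before touching any config file.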
february 2016 by jm
'Continuous Deployment: The Dirty Details'
april 2015 by jm
Good slide deck from Etsy's Mike Brittain regarding their CD setup. Some interesting little-known details:
Slide 41: database schema changes are not CD'd -- they go out on "Schema change Thursdays".
Slide 44: only the webapp is CD'd -- PHP, Apache, memcache components (Etsy.com, support and back-office tools, developer API, gearman async worker queues). The external "services" are not -- databases, Solr/JVM search (rolling restarts), photo storage (filters, proxy cache, S3), payments (PCI-DSS, controlled access).
They avoid schema changes and breaking changes using an approach they call "non-breaking expansions" -- expose new version in a service interface; support multiple versions in the consumer. Example from slides 50-63, based around a database schema migration.
Slide 66: "dev flags" (rollout oriented) are promoted to "feature flags" (long lived degradation control).
Slide 71: some architectural philosophies: deploying is cheap; releasing is cheap; gathering data should be cheap too; treat first iterations as experiments.
Slide 102: "Canary pools". They have multiple pools of users for testing in production -- the staff pool, users who have opted in to see prototypes/beta stuff, 0-100% gradual phased rollout.
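The canary pools on slide 102 could be sketched roughly as below: staff and opted-in users see the feature, and everyone else is admitted by a stable per-user percentage. This is a guess at the general shape, not Etsy's implementation; `rollout_bucket`, `feature_enabled` and all parameters are hypothetical names.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Stable bucket in [0, 100) derived from the feature and user names."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def feature_enabled(user_id, feature, *, is_staff=False, opted_in=False, percent=0):
    """Staff and opt-in pools always see the feature; others by percentage."""
    if is_staff or opted_in:
        return True
    # percent=0 admits nobody, percent=100 admits everybody.
    return rollout_bucket(user_id, feature) < percent
```

Hashing the user together with the feature name means a user stays in or out of a given rollout as the percentage grows, while different features get different slices of users.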
cd
deploy
etsy
slides
migrations
database
schema
ops
ci
version-control
feature-flags
april 2015 by jm
Bug Prediction at Google
march 2015 by jm
LOL. grepping commit logs for /bug|fix/ does the job, apparently:
bugs
rahman-algorithm
heuristics
source-code-analysis
coding
algorithms
google
static-code-analysis
version-control
In the literature, Rahman et al. found that a very cheap algorithm actually performs almost as well as some very expensive bug-prediction algorithms. They found that simply ranking files by the number of times they've been changed with a bug-fixing commit (i.e. a commit which fixes a bug) will find the hot spots in a code base. Simple! This matches our intuition: if a file keeps requiring bug-fixes, it must be a hot spot because developers are clearly struggling with it.
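A minimal sketch of the ranking described above, using the crude /bug|fix/ match on commit messages mentioned earlier. The input shape (a list of message and file-list pairs) and the name `rank_hot_spots` are assumptions for illustration; a real run would first need to extract that data from the commit log.

```python
import re
from collections import Counter

# Crude on purpose: this also matches words such as "prefix" or "debug".
BUGFIX_RE = re.compile(r"bug|fix", re.IGNORECASE)


def rank_hot_spots(commits):
    """commits: iterable of (message, [paths]) pairs.

    Returns (path, count) pairs, most frequently bug-fixed files first.
    """
    counts = Counter()
    for message, paths in commits:
        if BUGFIX_RE.search(message):
            # Count each file at most once per bug-fixing commit.
            counts.update(set(paths))
    return counts.most_common()
```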
march 2015 by jm
Git is not scalable with too many refs/*
february 2014 by jm
Mailing list thread from 2011; git starts to keel over if you tag too much
git
tags
coding
version-control
bugs
scaling
refs
february 2014 by jm
GitHub Archive
github
data
git
history
version-control
oss
archival
march 2013 by jm
a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis. GitHub provides 18 event types, which range from new commits and fork events, to opening new tickets, commenting, and adding members to a project. The activity is aggregated in hourly [gzipped JSON] archives, which you can access with any HTTP client.
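Once an hourly archive has been downloaded, tallying it by event type takes only a few lines. This sketch assumes each archive holds one JSON object per line and that each event carries a `type` field; both are assumptions about the file format, and the download itself is left out because the excerpt gives no URL. `count_event_types` is a hypothetical name.

```python
import gzip
import json
from collections import Counter


def count_event_types(gz_bytes: bytes) -> Counter:
    """Count events by their "type" field in one gzipped archive."""
    counts = Counter()
    text = gzip.decompress(gz_bytes).decode("utf-8")
    # Split on "\n" only, so unusual separators inside JSON strings are left alone.
    for line in text.split("\n"):
        if line.strip():
            counts[json.loads(line).get("type", "unknown")] += 1
    return counts
```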
march 2013 by jm
How to revert a faulty merge in git
february 2013 by jm
omgwtf, this is pretty horrific.
merging
git
merge
omgwtf
version-control
branching
february 2013 by jm
Large file management with git-annex
december 2011 by jm
'uses Git to manage files that are larger than Git can easily handle—without checking them into the repository. But git-annex provides ways to track those files using much of the same infrastructure as Git, so that moving or deleting those files can all be tracked in much the same way as committed files. In addition, git-annex allows for branches and distributed clones of its trees.' I may investigate using this to sync my MP3s instead of SVN
git
git-annex
version-control
december 2011 by jm
gitPAN
june 2010 by jm
CPAN and BackPAN, as a set of git repositories; essentially a read-only view of all CPAN releases, ever. good plan; I like the way git is useful as a kind of general-purpose distributed archive system
git
gitpan
cpan
backpan
perl
releases
archives
history
version-control
from delicious
june 2010 by jm
Code: Flickr Developer Blog » Flipping Out
december 2009 by jm
Flickr don't use branches. mental
branching
integration
branch
version-control
coding
flickr
sysadmin
wtf
deployment
from delicious
december 2009 by jm