jm + graph   6

Are you better off running your big-data batch system off your laptop?
Heh, nice trolling.
Here are two helpful guidelines (for largely disjoint populations):

If you are going to use a big data system for yourself, see if it is faster than your laptop.
If you are going to build a big data system for others, see that it is faster than my laptop. [...]

We think everyone should have to do this, because it leads to better systems and better research.
graph  coding  hadoop  spark  giraph  graph-processing  hardware  scalability  big-data  batch  algorithms  pagerank 
january 2015 by jm
LinkBench: A database benchmark for the social graph
However, the gold standard for database benchmarking is to test the performance of a system on the real production workload, since synthetic benchmarks often don't exercise systems in the same way. When making decisions about a significant component of Facebook's infrastructure, we need to understand how a database system will really perform in Facebook's production workload. [....] LinkBench addresses these needs by replicating the data model, graph structure, and request mix of our MySQL social graph workload.

Mentioned in a presentation from Peter Bailis,
graph  databases  mysql  facebook  performance  testing  benchmarks  workloads 
october 2013 by jm
Pinterest's follower graph store, built on Redis
This is a good, high-availability Redis configuration; sharded by userid across 8192 shards, with a Redis master/slave pair of instances for each set of N shards. I like their use of two redundancy systems -- hot slave and backup snapshots:
We run our cluster in a Redis master-slave configuration, and the slaves act as hot backups. Upon a master failure, we failover the slave as the new master and either bring up a new slave or reuse the old master as the new slave. We rely on ZooKeeper to make this as quick as possible.

Each master Redis instance (and slave instance) is configured to write to AOF on Amazon EBS. This ensures that if the Redis instances terminate unexpectedly then the loss of data is limited to 1 second of updates. The slave Redis instances also perform BGsave hourly which is then loaded to a more permanent store (Amazon S3). This copy is also used by Map Reduce jobs for analytics.

As a production system, we need many failure modes to guard ourselves. As mentioned, if the master host is down, we will manually failover to slave. If a single master Redis instance reboots, monit restart restores from AOF, implying a 1 second window of data loss on the shards on that instance. If the slave host goes down, we bring up a replacement. If a single slave Redis instance goes down, we rely on monit to restart using the AOF data. Because we may encounter AOF or BGsave file corruption, we BGSave and copy hourly backups to S3. Note that large file sizes can cause BGsave induced delays but in our cluster this is mitigated by smaller Redis data due to the sharding scheme.
graph  redis  architecture  ha  high-availability  design  redundancy  sharding 
july 2013 by jm
"Finding the k shortest paths", D. Eppstein, 1994 [paper]
35th IEEE Symp. Foundations of Comp. Sci., Santa Fe, 1994, pp. 154-165.

This paper presents an algorithm that finds multiple short paths connecting two terminals in a graph (allowing repeated vertices and edges in the paths) in constant time per path after a preprocessing stage dominated by a single-source shortest path computation. The paths it finds are the k shortest in the graph, where k is a parameter given as input to the algorithm.

Time complexity: O(E*log V + L*k*log k) time (L is path length)
k-shortest-paths  graph  algorithms  sparse-graphs 
june 2012 by jm
PeteSearch: How to split up the US
wow. fascinating results from social-network cluster analysis of Facebook, splitting up the entire USA into 7 clusters
clusters  facebook  data  statistics  maps  culture  analytics  datamining  demographics  socialnetworking  graph  dataviz  from delicious
february 2010 by jm

Copy this bookmark: