jm + sorting   13

Sorting out graph processing
Some nice real-world experimentation around large-scale data processing in differential dataflow:
If you wanted to do an iterative graph computation like PageRank, it would literally be faster to sort the edges from scratch each and every iteration, than to use unsorted edges. If you want to do graph computation, please sort your edges.

Actually, you know what: if you want to do any big data computation, please sort your records. Stop talking sass about how Hadoop sorts things it doesn't need to, read some papers, run some tests, and then sort your damned data. Or at least run faster than me when I sort your data for you.
algorithms  graphs  coding  data-processing  big-data  differential-dataflow  radix-sort  sorting  x-stream  counting-sort  pagerank 
august 2015 by jm
Proving that Android’s, Java’s and Python’s sorting algorithm is broken (and showing how to fix it)
Wow, this is excellent work. A formal verification of Tim Peters' TimSort failed, resulting in a bugfix:
While attempting to verify TimSort, we failed to establish its instance invariant. Analysing the reason, we discovered a bug in TimSort’s implementation leading to an ArrayOutOfBoundsException for certain inputs. We suggested a proper fix for the culprit method (without losing measurable performance) and we have formally proven that the fix actually is correct and that this bug no longer persists.
timsort  algorithms  android  java  python  sorting  formal-methods  proofs  openjdk 
february 2015 by jm
Spark 1.2 released
This is the version with the superfast petabyte-sort record:
Spark 1.2 includes several cross-cutting optimizations focused on performance for large scale workloads. Two new features Databricks developed for our world record petabyte sort with Spark are turned on by default in Spark 1.2. The first is a re-architected network transfer subsystem that exploits Netty 4’s zero-copy IO and off heap buffer management. The second is Spark’s sort based shuffle implementation, which we’ve now made the default after significant testing in Spark 1.1. Together, we’ve seen these features give as much as 5X performance improvement for workloads with very large shuffles.
spark  sorting  hadoop  map-reduce  batch  databricks  apache  netty 
december 2014 by jm
A Linear-Time, One-Pass Majority Vote Algorithm
This algorithm, which Bob Boyer and I invented in 1980, decides which element of a sequence is in the majority, provided there is such an element.
algorithms  one-pass  o(1)  coding  majority  top-k  sorting 
september 2014 by jm
Frogsort as an exam question (via qwghlm)
via:qwghlm  frogsort  sorting  big-o  algorithms  funny  comics  smbc 
august 2014 by jm
Beautiful algorithm visualisations from Mike Bostock
This is a few days old, but unmissable. I swear, the 'Wilson's algorithm transformed into a tidy tree layout' viz brought tears to my eyes ;)
dataviz  algorithms  visualization  visualisation  mazes  trees  sorting  animation  mike-bostock 
june 2014 by jm
Faster BAM Sorting with SAMtools and RocksDB
Now this is really really clever. Heap-merging a heavyweight genomics format, using RocksDB to speed it up.
There’s a problem with the single-pass merge described above when the number of intermediate files, N/R, is large. Merging the sorted intermediate files in limited memory requires constantly reading little bits from all those files, incurring a lot of disk seeks on rotating drives. In fact, at some point, samtools sort performance becomes effectively bound to disk seeking. [...] In this scenario, samtools rocksort can sort the same data in much less time, using no more memory, by invoking RocksDB’s background compaction capabilities. With a few extra lines of code we configure RocksDB so that, while we’re still in the process of loading the BAM data, it runs additional background threads to merge batches of existing sorted temporary files into fewer, larger, sorted files. Just like the final merge, each background compaction requires only a modest amount of working memory.

(via the RocksDB facebook group)
rocksdb  algorithms  sorting  leveldb  bam  samtools  merging  heaps  compaction 
may 2014 by jm
stereopsis : graphics : radix tricks
some nice super-optimized Radix Sort code which handles floating point values. See also for more info on the histogramming/counter concept
sorting  programming  coding  algorithms  radix-sort  optimization  floating-point 
december 2013 by jm
Lectures in Advanced Data Structures (6.851)
Good lecture notes on the current state of the art in data structure research.
Data structures play a central role in modern computer science. You interact with data structures even more often than with algorithms (think Google, your mail server, and even your network routers). In addition, data structures are essential building blocks in obtaining efficient algorithms. This course covers major results and current directions of research in data structures:

TIME TRAVEL We can remember the past efficiently (a technique called persistence), but in general it's difficult to change the past and see the outcomes on the present (retroactivity). So alas, Back To The Future isn't really possible.
GEOMETRY When data has more than one dimension (e.g. maps, database tables).
DYNAMIC OPTIMALITY Is there one binary search tree that's as good as all others? We still don't know, but we're close.
MEMORY HIERARCHY Real computers have multiple levels of caches. We can optimize the number of cache misses, often without even knowing the size of the cache.
HASHING Hashing is the most used data structure in computer science. And it's still an active area of research.
INTEGERS Logarithmic time is too easy. By careful analysis of the information you're dealing with, you can often reduce the operation times substantially, sometimes even to constant. We will also cover lower bounds that illustrate when this is not possible.
DYNAMIC GRAPHS A network link went down, or you just added or deleted a friend in a social network. We can still maintain essential information about the connectivity as it changes.
STRINGS Searching for phrases in giant text (think Google or DNA).
SUCCINCT Most “linear size” data structures you know are much larger than they need to be, often by an order of magnitude. Some data structures require almost no space beyond the raw data but are still fast (think heaps, but much cooler).

(via Tim Freeman)
data-structures  lectures  mit  video  data  algorithms  coding  csail  strings  integers  hashing  sorting  bst  memory 
april 2013 by jm
Reddit’s ranking algorithms
so Reddit uses the Wilson score confidence interval approach, it turns out; more details here (via Toby diPasquale)
ranking  rating  algorithms  popularity  python  wilson-score-interval  sorting  statistics  confidence-sort 
january 2013 by jm
Introduction to parallel & distributed algorithms
really interesting parallel algorithm concepts. I'd seen parallel merge sort before from the map-reduce world, but some others are new to me and worth thinking about (via Hacker News)
via:hackernews  algorithms  distributed  parallel  map-reduce  merge-sort  sorting  from delicious
august 2010 by jm
jwz - What different sorting algorithms sound like
in the style of BBC Radiophonics Workshop, with copious flange -- my favourite is heap sort. this is brilliant (via jwz)
via:jwz  sound  music  sorting  algorithms  from delicious
august 2010 by jm

Copy this bookmark: