jm + bioinformatics   3

"Use trees. Not too deep. Mostly ensembles."
snarky summary of 'Data-driven Advice for Applying Machine Learning to Bioinformatics Problems', a recent paper analysing how ML algorithms perform
algorithms  machine-learning  bioinformatics  funny  advice  classification 
4 weeks ago by jm
FM-index: a compressed full-text substring index based on the Burrows-Wheeler transform, with some similarities to the suffix array. It was created by Paolo Ferragina and Giovanni Manzini, who describe it as an opportunistic data structure, as it allows compression of the input text while still permitting fast substring queries. The name stands for 'Full-text index in Minute space'. It can be used to efficiently find the number of occurrences of a pattern within the compressed text, as well as locate the position of each occurrence. Both the query time and storage space requirements are sublinear with respect to the size of the input data.
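
A rough sketch of the counting and locating queries described above, in Python. This is a toy, not the Ferragina/Manzini structure as published: nothing is compressed, the suffix array is built by naive sorting, and the occurrence table is stored in full, where a real index would sample both. The names `build_fm_index` and `find` are made up for illustration, and the text is assumed not to contain `$`.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy, uncompressed FM-index: (suffix array, C table, Occ table)."""
    if "$" in text:
        raise ValueError("text must not contain the sentinel '$'")
    text += "$"  # sentinel; assumed to sort before every other character
    # Naive suffix array: fine for a demo, far too slow for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT = the character preceding each sorted suffix (wraps around for i == 0).
    bwt = "".join(text[i - 1] for i in sa)
    # first[c] = number of characters in the text strictly smaller than c.
    counts = Counter(text)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, first, occ


def find(index, pattern):
    """Return sorted start positions of pattern, using backward search."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # current half-open range of matching suffixes
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


index = build_fm_index("banana")
print(find(index, "ana"))  # [1, 3]
```

The number of occurrences is just `hi - lo` after the loop, so counting needs one step per pattern character and never touches the suffix array; only locating the positions does.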

kragen notes 'gene sequencing is using [them] in production'.
sequencing  bioinformatics  algorithms  bowtie  fm-index  indexing  compression  search  burrows-wheeler  bwt  full-text-search 
march 2014 by jm
CloudBurst: 'Highly Sensitive Short Read Mapping with MapReduce'. Current state of the art in DNA sequence read-mapping algorithms.
CloudBurst uses well-known seed-and-extend algorithms to map reads to a reference genome. It can map reads with any number of differences or mismatches. [..] Given an exact seed, CloudBurst attempts to extend the alignment into an end-to-end alignment with at most k mismatches or differences by either counting mismatches of the two sequences, or with a dynamic programming algorithm to allow for gaps. CloudBurst uses [Hadoop] to catalog and extend the seeds. In the map phase, the map function emits all length-s k-mers from the reference sequences, and all non-overlapping length-s k-mers from the reads. In the shuffle phase, read and reference k-mers are brought together. In the reduce phase, the seeds are extended into end-to-end alignments. The power of MapReduce and CloudBurst is that the map and reduce functions run in parallel over dozens or hundreds of processors.
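
The three phases above can be mimicked in a single Python process, which may help in seeing the data flow. This is only a sketch under assumptions, not CloudBurst's code: there is no Hadoop, the extension step only counts mismatches (no dynamic programming for gaps), all reads are assumed to have the same length, and the seed length is taken as read length divided by k+1 so that a read with at most k mismatches keeps at least one exact seed. All function names are hypothetical.

```python
from collections import defaultdict


def map_phase(reference, reads, s):
    """Emit (seed, record) pairs: every length-s k-mer of the reference,
    and the non-overlapping length-s k-mers of each read."""
    for pos in range(len(reference) - s + 1):
        yield reference[pos:pos + s], ("ref", pos)
    for read_id, read in enumerate(reads):
        for off in range(0, len(read) - s + 1, s):
            yield read[off:off + s], ("read", read_id, off)


def shuffle_phase(pairs):
    """Bring together reference and read records that share a seed."""
    groups = defaultdict(list)
    for seed, record in pairs:
        groups[seed].append(record)
    return groups


def reduce_phase(groups, reference, reads, k):
    """Extend each shared seed to an end-to-end alignment, keeping those
    with at most k mismatches (mismatch counting only, no gaps)."""
    hits = set()
    for records in groups.values():
        ref_positions = [r[1] for r in records if r[0] == "ref"]
        read_seeds = [r[1:] for r in records if r[0] == "read"]
        for read_id, off in read_seeds:
            read = reads[read_id]
            for pos in ref_positions:
                start = pos - off  # where the whole read would begin
                if start < 0 or start + len(read) > len(reference):
                    continue
                window = reference[start:start + len(read)]
                mismatches = sum(a != b for a, b in zip(read, window))
                if mismatches <= k:
                    hits.add((read_id, start, mismatches))
    return sorted(hits)


def map_reads(reference, reads, k):
    """Return sorted (read_id, reference_start, mismatches) tuples."""
    s = len(reads[0]) // (k + 1)  # assumed seed length, see lead-in
    if s == 0:
        raise ValueError("reads are too short for this many mismatches")
    pairs = map_phase(reference, reads, s)
    return reduce_phase(shuffle_phase(pairs), reference, reads, k)


reference = "ACGTACGTTAGCCGATAGGCTA"
reads = ["TTAGCCGA", "TTAGACGA"]
print(map_reads(reference, reads, k=1))  # [(0, 7, 0), (1, 7, 1)]
```

In the real system each seed group would go to a separate reducer, which is where the parallelism comes from; here the groups are simply looped over in turn.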

JM_SOUGHT -- the next generation ;)
bioinformatics  mapreduce  hadoop  read-alignment  dna  sequencing  sought  antispam  algorithms 
july 2012 by jm
