data-analysis   2121


Orange – Data Mining Fruitful & Fun
Open source machine learning and data visualization for novice and expert.
data-mining  data-analysis  machine-learning  python 
8 days ago by HighCharisma
Genome graphs and the evolution of genome inference
The human reference genome is part of the foundation of modern human biology and a monumental scientific achievement. However, because it excludes a great deal of common human variation, it introduces a pervasive reference bias into the field of human genomics. To reduce this bias, it makes sense to draw on representative collections of human genomes, brought together into reference cohorts. There are a number of techniques to represent and organize data gleaned from these cohorts, many using ideas implicitly or explicitly borrowed from graph-based models. Here, we survey various projects underway to build and apply these graph-based structures—which we collectively refer to as genome graphs—and discuss the improvements in read mapping, variant calling, and haplotype determination that genome graphs are expected to produce.
via:arthegall  review  bioinformatics  clustering  visualization  data-analysis  rather-interesting  consider:nonbiological-genomes 
9 days ago by Vaguery
[1812.05225] Finding the origin of noise transients in LIGO data with machine learning
Quality improvement of interferometric data collected by gravitational-wave detectors such as Advanced LIGO and Virgo is mission critical for the success of gravitational-wave astrophysics. Gravitational-wave detectors are sensitive to a variety of disturbances of non-astrophysical origin with characteristic frequencies in the instrument band of sensitivity. Removing non-astrophysical artifacts that corrupt the data stream is crucial for increasing the number and statistical significance of gravitational-wave detections and enabling refined astrophysical interpretations of the data. Machine learning has proved to be a powerful tool for analysis of massive quantities of complex data in astronomy and related fields of study. We present two machine learning methods, based on random forest and genetic programming algorithms, that can be used to determine the origin of non-astrophysical transients in the LIGO detectors. We use two classes of transients with known instrumental origin that were identified during the first observing run of Advanced LIGO to show that the algorithms can successfully identify the origin of non-astrophysical transients in real interferometric data and thus assist in the mitigation of instrumental and environmental disturbances in gravitational-wave searches. While the data sets described in this paper are specific to LIGO, and the exact procedures employed were unique to the same, the random forest and genetic programming code bases and means by which they were applied as a dual machine learning approach are completely portable to any number of instruments in which noise is believed to be generated through mechanical couplings, the source of which is not yet discovered.
genetic-programming  hey-I-know-this-guy  astrophysics  data-analysis  data-mining  to-understand  feature-construction  classification 
26 days ago by Vaguery
[1612.07545] A Revisit of Hashing Algorithms for Approximate Nearest Neighbor Search
Approximate Nearest Neighbor Search (ANNS) is a fundamental problem in many areas of machine learning and data mining. During the past decade, numerous hashing algorithms have been proposed to solve this problem. Every proposed algorithm claims to outperform other state-of-the-art hashing methods. However, the evaluation in these hashing papers was not thorough enough, and those claims should be re-examined. The ultimate goal of an ANNS method is to return the most accurate answers (nearest neighbors) in the shortest time. If implemented correctly, almost all the hashing methods will have their performance improved as the code length increases. However, many existing hashing papers only report performance for code lengths shorter than 128. In this paper, we carefully revisit the problem of search with a hash index, and analyze the pros and cons of two popular hash index search procedures. We then propose a very simple but effective two-level index structure and make a thorough comparison of eleven popular hashing algorithms. Surprisingly, random-projection-based Locality Sensitive Hashing (LSH) is the best-performing algorithm, which contradicts the claims in all the other ten hashing papers. Despite the extreme simplicity of random-projection-based LSH, our results show that the capability of this algorithm has been far underestimated. For the sake of reproducibility, all the code used in the paper is released on GitHub, where it can be used as a testing platform for a fair comparison between various hashing algorithms.
hashing  algorithms  approximation  dimension-reduction  representation  data-analysis  feature-extraction  nudge-targets  consider:looking-to-see  to-write-about 
6 weeks ago by Vaguery
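
For readers unfamiliar with the random-projection-based LSH mentioned in the abstract above, here is a minimal sketch of the basic idea: each bit of the hash code records which side of a random hyperplane a vector falls on, and candidates are ranked by Hamming distance to the query's code. This is not the paper's two-level index or its released code; the function names (make_hyperplanes, hash_code, hamming, search) and the linear scan over all codes are assumptions made purely for illustration.

```python
import random


def make_hyperplanes(dim, code_length, seed=0):
    """Draw `code_length` random Gaussian hyperplane normals of dimension `dim`."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(code_length)]


def hash_code(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def search(query, codes, hyperplanes, k=1):
    """Return indices of the k stored codes closest to the query in Hamming distance.

    This is a plain linear scan (an illustrative assumption), not an index structure.
    """
    q = hash_code(query, hyperplanes)
    ranked = sorted(range(len(codes)), key=lambda i: hamming(q, codes[i]))
    return ranked[:k]


if __name__ == "__main__":
    planes = make_hyperplanes(dim=4, code_length=16, seed=1)
    data = [[1.0, -2.0, 0.5, 3.0], [0.1, 0.2, 0.3, 0.4], [-1.0, 2.0, -0.5, -3.0]]
    codes = [hash_code(v, planes) for v in data]
    print(search(data[0], codes, planes, k=2))
```

Longer codes use more hyperplanes and so separate vectors more finely, which is consistent with the abstract's remark that performance tends to improve as code length increases.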


