data-analysis   2107


[1612.07545] A Revisit of Hashing Algorithms for Approximate Nearest Neighbor Search
Approximate Nearest Neighbor Search (ANNS) is a fundamental problem in many areas of machine learning and data mining. Over the past decade, numerous hashing algorithms have been proposed to solve this problem, and every proposed algorithm claims to outperform the other state-of-the-art hashing methods. However, the evaluation in these hashing papers was not thorough enough, and those claims should be re-examined. The ultimate goal of an ANNS method is to return the most accurate answers (nearest neighbors) in the shortest time. If implemented correctly, almost all hashing methods improve in performance as the code length increases; however, many existing hashing papers only report performance at code lengths shorter than 128. In this paper, we carefully revisit the problem of search with a hash index and analyze the pros and cons of two popular hash-index search procedures. We then propose a very simple but effective two-level index structure and make a thorough comparison of eleven popular hashing algorithms. Surprisingly, random-projection-based Locality Sensitive Hashing (LSH) is the best-performing algorithm, in contradiction to the claims in all ten of the other hashing papers. Despite the extreme simplicity of random-projection-based LSH, our results show that the capability of this algorithm has been far underestimated. For the sake of reproducibility, all the code used in the paper is released on GitHub, where it can serve as a testing platform for a fair comparison between various hashing algorithms.
hashing  algorithms  approximation  dimension-reduction  representation  data-analysis  feature-extraction  nudge-targets  consider:looking-to-see  to-write-about 
18 days ago by Vaguery
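The random-projection LSH scheme the abstract singles out is simple enough to sketch in a few lines. A minimal NumPy version (my own illustrative names; a real hash index would add multi-table lookup and candidate re-ranking on top of these codes):

```python
import numpy as np

def lsh_hash(X, n_bits, seed=0):
    """Random-projection LSH: each bit is the sign of a dot product
    with a random hyperplane, so nearby points tend to share bits."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_bits))  # one hyperplane per bit
    return (X @ planes > 0).astype(np.uint8)            # codes, shape (n, n_bits)

def hamming_distance(codes, query_code):
    """Bitwise disagreement between every stored code and one query code."""
    return np.count_nonzero(codes != query_code, axis=1)
```

Candidates whose codes lie within a small Hamming radius of the query's code are then re-ranked by exact distance; the paper's point is that this trivial scheme, given long enough codes, is hard to beat.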
Are Pop Lyrics Getting More Repetitive?
In 1977, the great computer scientist Donald Knuth published a paper called The Complexity of Songs, which is basically one long joke about the repetitive lyrics of newfangled music (example quote: "the advent of modern drugs has led to demands for still less memory, and the ultimate improvement of Theorem 1 has consequently just been announced").

I'm going to try to test this hypothesis with data. I'll be analyzing the repetitiveness of a dataset of 15,000 songs that charted on the Billboard Hot 100 between 1958 and 2017.
visualization  graphic-design  data-analysis  essay  looking-to-see  javascript  rather-interesting  via:cdzombak 
4 weeks ago by Vaguery
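"Repetitiveness" here is operationalized by how much a song's lyrics shrink under a Lempel-Ziv-style compressor. A rough stand-in using Python's zlib (my own simplification, not necessarily the essay's exact metric):

```python
import zlib

def repetitiveness(lyrics: str) -> float:
    """Share of the text a DEFLATE compressor can squeeze out:
    near 0.0 for unrepetitive text, near 1.0 for a looping chorus."""
    raw = lyrics.encode("utf-8")
    return 1.0 - len(zlib.compress(raw, 9)) / len(raw)
```

Very short strings can score at or below zero (compressor overhead dominates), so in practice the metric is only meaningful on full lyric sheets.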
Why Data Is Never Raw - The New Atlantis
A curious fact about our data-obsessed era is that we’re often not entirely sure what we even mean by “data”: Elementary particles of knowledge? Digital records? Pure information? Sometimes when we refer to “the data,” we mean the results of an analysis or the evidence concerning a certain question. On other occasions we intend “data” to signify something like “reliable evidence,” as in the saying “The plural of anecdote is not data.”

In everyday usage, the term “data” is associated with a jumble of notions about information, science, and knowledge. Countless reports marvel at the astonishing volumes of data being produced and manipulated, the efficiencies and new opportunities this has made possible, and the myriad ways in which society is changing as a result. We speak of “raw” data and laud it for its independence from human judgment. On this basis, “data-driven” (or “evidence-based”) decision-making is widely endorsed. Yet data’s purported freedom from human subjectivity also seems to allow us to invest it with agency: “Let the data speak for itself,” for “The data doesn’t lie.”
data-analysis  bias  politics  argument  data  big-data  read-later 
4 weeks ago by kmt
Another Clever Proxy for Quantitative History - Peter Turchin
…countries did not start growing immediately after the cessation of plague outbreaks. In our book Secular Cycles we discuss the possible reasons for late medieval England and France, and come to the conclusion that the factor that held back populat…
history  data-analysis  esoteric 
5 weeks ago by kmt
How many landmarks are enough to characterize shape and size variation?
Accurate characterization of morphological variation is crucial for generating reliable results and conclusions concerning changes and differences in form. Despite the prevalence of landmark-based geometric morphometric (GM) data in the scientific literature, a formal treatment of whether sampled landmarks adequately capture shape variation has remained elusive. Here, I introduce LaSEC (Landmark Sampling Evaluation Curve), a computational tool to assess the fidelity of morphological characterization by landmarks. This task is achieved by calculating how subsampled data converge to the pattern of shape variation in the full dataset as landmark sampling is increased incrementally. While the number of landmarks needed for adequate shape variation is dependent on individual datasets, LaSEC helps the user (1) identify under- and oversampling of landmarks; (2) assess robustness of morphological characterization; and (3) determine the number of landmarks that can be removed without compromising shape information. In practice, this knowledge could reduce time and cost associated with data collection, maintain statistical power in certain analyses, and enable the incorporation of incomplete, but important, specimens to the dataset. Results based on simulated shape data also reveal general properties of landmark data, including statistical consistency where sampling additional landmarks has the tendency to asymptotically improve the accuracy of morphological characterization. As landmark-based GM data become more widely adopted, LaSEC provides a systematic approach to evaluate and refine the collection of shape data––a goal paramount for accumulation and analysis of accurate morphological information.
inference  data-analysis  looking-to-see  rather-interesting  training-data  data-balancing  to-write-about 
6 weeks ago by Vaguery
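The convergence idea behind LaSEC can be imitated crudely: subsample landmarks at increasing counts and check how closely the inter-specimen distances they yield track the full-data pattern. A toy analogue (my own sketch, not the published LaSEC implementation, which works on Procrustes-aligned geometric morphometric data):

```python
import numpy as np

def sampling_curve(landmarks, ks, n_reps=20, seed=0):
    """landmarks: (n_specimens, n_landmarks, n_dims) array.
    For each subsample size k, correlate inter-specimen distances
    computed from k random landmarks with those from the full set."""
    rng = np.random.default_rng(seed)
    n, p, _ = landmarks.shape

    def pairwise(sub):
        X = sub.reshape(n, -1)
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        return D[np.triu_indices(n, 1)]  # condensed distance vector

    full = pairwise(landmarks)
    curve = []
    for k in ks:
        rs = [np.corrcoef(pairwise(landmarks[:, rng.choice(p, k, replace=False), :]),
                          full)[0, 1]
              for _ in range(n_reps)]
        curve.append(float(np.mean(rs)))
    return curve
```

The resulting curve rises toward 1.0 as landmarks are added; where it plateaus tells you roughly how many landmarks can be dropped without losing the pattern of variation, which is the abstract's point about cutting collection cost.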
[1710.00992] DimReader: Axis lines that explain non-linear projections
Non-linear dimensionality reduction (NDR) methods such as LLE and t-SNE are popular with visualization researchers and experienced data analysts, but present serious problems of interpretation. In this paper, we present DimReader, a technique that recovers readable axes from such techniques. DimReader is based on analyzing infinitesimal perturbations of the dataset with respect to variables of interest. The perturbations define exactly how we want to change each point in the original dataset and we measure the effect that these changes have on the projection. The recovered axes are in direct analogy with the axis lines (grid lines) of traditional scatterplots. We also present methods for discovering perturbations on the input data that change the projection the most. The calculation of the perturbations is efficient and easily integrated into programs written in modern programming languages. We present results of DimReader on a variety of NDR methods and datasets both synthetic and real-life, and show how it can be used to compare different NDR methods. Finally, we discuss limitations of our proposal and situations where further research is needed.
user-interface  visualization  dimension-reduction  rather-interesting  data-analysis  explanation  the-mangle-in-practice  to-write-about  to-do 
6 weeks ago by Vaguery
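The perturbation idea is easy to imitate with finite differences when the projection is a deterministic function (DimReader itself uses automatic differentiation for the infinitesimal version; this sketch, with a hypothetical `project` callable, is my own simplification):

```python
import numpy as np

def perturbation_field(X, project, var, eps=1e-4):
    """Nudge each point along input variable `var` and record how its
    projected position moves: a finite-difference stand-in for
    DimReader's infinitesimal perturbations."""
    base = project(X)
    field = np.empty_like(base)
    for i in range(X.shape[0]):
        Xp = X.copy()
        Xp[i, var] += eps          # perturb one point in one input variable
        field[i] = (project(Xp)[i] - base[i]) / eps
    return field
```

For a fixed linear map the recovered field is exactly the corresponding row of the projection matrix; for nonlinear methods like t-SNE the projection must be made deterministic (fixed seed and initialization) before the differences mean anything, which is part of why the paper works with derivatives instead.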


