[1712.05630] Sparse principal component analysis via random projections
We introduce a new method for sparse principal component analysis, based on the aggregation of eigenvector information from carefully-selected random projections of the sample covariance matrix. Unlike most alternative approaches, our algorithm is non-iterative, so is not vulnerable to a bad choice of initialisation. Our theory provides great detail on the statistical and computational trade-off in our procedure, revealing a subtle interplay between the effective sample size and the number of random projections that are required to achieve the minimax optimal rate. Numerical studies provide further insight into the procedure and confirm its highly competitive finite-sample performance.
dimension-reduction  statistics  data-analysis  algorithms  performance-measure  consider:lexicase  sparseness 
11 days ago by Vaguery
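A rough sketch of the idea in the sparse PCA abstract above: draw groups of random axis-aligned projections of the sample covariance matrix, keep the best projection in each group, and aggregate the eigenvector information into per-coordinate scores. This is not the authors' implementation. The function name, the parameters `d`, `ell`, `A` and `B`, and the eigengap weighting of the scores are all assumptions made for illustration.

```python
import numpy as np

def spca_random_projections(X, d, ell, A=100, B=50, seed=0):
    """Illustrative sparse leading-eigenvector estimate (hypothetical helper).

    X: (n, p) data matrix; d: projection dimension (>= 2);
    ell: number of coordinates to keep; A: number of groups;
    B: random projections per group. All names are assumptions.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n                      # sample covariance matrix (p x p)

    scores = np.zeros(p)
    for _ in range(A):
        best = None
        for _ in range(B):
            # Axis-aligned projection: a random subset of d coordinates.
            idx = rng.choice(p, size=d, replace=False)
            vals, vecs = np.linalg.eigh(S[np.ix_(idx, idx)])  # ascending order
            # Within the group, keep the projection with the largest
            # leading eigenvalue.
            if best is None or vals[-1] > best[0][-1]:
                best = (vals, vecs, idx)
        vals, vecs, idx = best
        # Assumed aggregation: squared loadings weighted by the eigengap.
        scores[idx] += (vals[-1] - vals[-2]) * vecs[:, -1] ** 2

    # Keep the ell highest-scoring coordinates, then take the leading
    # eigenvector of the covariance restricted to them.
    support = np.sort(np.argsort(scores)[-ell:])
    vals, vecs = np.linalg.eigh(S[np.ix_(support, support)])
    v_hat = np.zeros(p)
    v_hat[support] = vecs[:, -1]
    return v_hat
```

The procedure has no iterative refinement step, so there is no starting point to choose; the trade-off sits in `A` and `B`, which control how many projections are drawn.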
Oracle Labs PGX: Parallel Graph AnalytiX Overview
PGX is a toolkit for graph analysis: it runs algorithms such as PageRank against graphs and performs SQL-like pattern matching against graphs, using the results of algorithmic analysis. Algorithms are parallelized for extreme performance. The PGX toolkit includes both a single-node in-memory engine and a distributed engine for extremely large graphs. Graphs can be loaded from a variety of sources, including flat files, SQL and NoSQL databases, Apache Spark, and Hadoop; incremental updates are supported.
data-analysis  big-data  esoteric  SQL  oracle 
13 days ago by kmt
API and JSON in R
Tutorial from Paul Bradshaw
R  Rstudio  data-analysis  JSON  API 
18 days ago by wanulfa
Introduction to data cleaning using Pandas
I had been using Excel for data cleaning until I discovered how powerful pandas is for data analysis and data cleaning. In this article I want to go over the basics of how to use pandas for cleaning data in Excel files.
Pandas  data  data-cleaning  data-analysis  python 
19 days ago by wanulfa
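For the pandas entry above, a minimal sketch of what a first cleaning pass over spreadsheet data might look like. It is not taken from the linked article. The helper `clean_frame`, its `numeric_cols` parameter, and the column names in the usage note are hypothetical. In practice the frame would usually come from `pd.read_excel`, which needs an Excel engine such as openpyxl installed.

```python
import pandas as pd

def clean_frame(df, numeric_cols=()):
    """Basic cleaning pass (hypothetical helper, for illustration only)."""
    out = df.copy()
    # Normalise headers: trim, lower-case, spaces to underscores.
    out.columns = [str(c).strip().lower().replace(" ", "_") for c in out.columns]
    # Drop rows that are entirely empty (common in exported sheets).
    out = out.dropna(how="all")
    # Trim stray whitespace in string cells; leave other values untouched.
    out = out.apply(
        lambda col: col.map(lambda v: v.strip() if isinstance(v, str) else v)
    )
    # Remove exact duplicate rows and renumber the index.
    out = out.drop_duplicates().reset_index(drop=True)
    # Convert the chosen columns to numbers; unparseable entries become NaN.
    for col in numeric_cols:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out
```

A call such as `clean_frame(raw, numeric_cols=["amount"])` assumes the cleaned header `amount` exists; `numeric_cols` refers to column names after normalisation.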
R, RStudio, and the tidyverse for data analysis
A tutorial on using R for data analysis; very useful for journalists.
R  Rstudio  data-analysis 
19 days ago by wanulfa
A non-spatial account of place and grid cells based on clustering models of concept learning | bioRxiv
One view is that conceptual knowledge is organized as a "cognitive map" in the brain, using the circuitry in the medial temporal lobe (MTL) that supports spatial navigation. In contrast, we find that a domain-general learning algorithm explains key findings in both spatial and conceptual domains. When the clustering model is applied to spatial navigation tasks, so-called place and grid cells emerge because of the relatively uniform sampling of possible inputs in these tasks. The same mechanism applied to conceptual tasks, where the overall space can be higher-dimensional and sampling sparser, leads to representations more aligned with human conceptual knowledge. Although the types of memory supported by the MTL are superficially dissimilar, the information processing steps appear shared.
models-and-modes  emergence  data-analysis  rather-interesting  to-write-about  consider:the-mangle 
25 days ago by Vaguery
The Dictatorship of Data - MIT Technology Review
McNamara was a numbers guy. Appointed the U.S. secretary of defense when tensions in Vietnam rose in the early 1960s, he insisted on getting data on everything he could. Only by applying statistical rigor, he believed, could decision makers understand a complex situation and make the right choices. The world in his view was a mass of unruly information that—if delineated, denoted, demarcated, and quantified—could be tamed by human hand and fall under human will. McNamara sought Truth, and that Truth could be found in data. Among the numbers that came back to him was the “body count.”
oh yeah, this is classic:
"McNamara rose swiftly up the ranks, trotting out a data point for every situation. Harried factory managers produced the figures he demanded—whether they were correct or not. When an edict came down that all inventory from one car model must be used before a new model could begin production, exasperated line managers simply dumped excess parts into a nearby river. The joke at the factory was that a fellow could walk on water—atop rusted pieces of 1950 and 1951 cars."
big-data  data-analysis  stats  bias  history  methodology  argument 
4 weeks ago by kmt
Jupyter notebooks as Markdown documents, Julia, Python or R scripts. Supports round-trip conversion.
jupyter  data-analysis  workflow  collaboration 
5 weeks ago by mjlassila
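To make "round-trip conversion" in the Jupyter entry above concrete, here is a toy sketch that turns a notebook's cells into a Python script and back. It is not the bookmarked tool's code or its exact file format. The `# %%` cell markers, the function names, and the minimal notebook dictionary layout are assumptions, and a code cell that itself contains a line equal to a marker would not survive the trip.

```python
CODE_MARKER = "# %%"
MD_MARKER = "# %% [markdown]"

def _text(source):
    # Notebook cell sources may be a string or a list of strings.
    return source if isinstance(source, str) else "".join(source)

def notebook_to_script(nb):
    """Render markdown and code cells as one marker-delimited script."""
    parts = []
    for cell in nb["cells"]:
        body = _text(cell["source"])
        if cell["cell_type"] == "markdown":
            # Markdown becomes comment lines so the script stays runnable.
            lines = ["# " + ln if ln else "#" for ln in body.split("\n")]
            parts.append(MD_MARKER + "\n" + "\n".join(lines))
        elif cell["cell_type"] == "code":
            parts.append(CODE_MARKER + "\n" + body)
    return "\n\n".join(parts) + "\n"

def _uncomment(line):
    if line.startswith("# "):
        return line[2:]
    return line[1:] if line.startswith("#") else line

def script_to_notebook(script):
    """Rebuild a minimal notebook dict from a marker-delimited script."""
    raw_cells = []                      # list of (cell_type, lines)
    for line in script.split("\n"):
        if line == MD_MARKER:
            raw_cells.append(("markdown", []))
        elif line == CODE_MARKER:
            raw_cells.append(("code", []))
        elif raw_cells:
            raw_cells[-1][1].append(line)
        elif line.strip():
            # Text before any marker is treated as a code cell.
            raw_cells.append(("code", [line]))
    cells = []
    for cell_type, lines in raw_cells:
        while lines and not lines[-1].strip():
            lines.pop()                 # drop trailing blank separator lines
        if cell_type == "markdown":
            lines = [_uncomment(ln) for ln in lines]
        cell = {"cell_type": cell_type, "metadata": {}, "source": "\n".join(lines)}
        if cell_type == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
        cells.append(cell)
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}
```

Outputs and execution counts are deliberately dropped on the way to the script, which is what makes the text form convenient for version control and review.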

