cshalizi + to_teach:data-mining   383

[1909.03681] Outlier Detection in High Dimensional Data
"High-dimensional data poses unique challenges in the outlier detection process. Most of the existing algorithms fail to properly address the issues stemming from a large number of features. In particular, outlier detection algorithms perform poorly on data sets of small size with a large number of features. In this paper, we propose a novel outlier detection algorithm based on principal component analysis and kernel density estimation. The proposed method is designed to address the challenges of dealing with high-dimensional data by projecting the original data onto a smaller space and using the innate structure of the data to calculate anomaly scores for each data point. Numerical experiments on synthetic and real-life data show that our method performs well on high-dimensional data. In particular, the proposed method outperforms the benchmark methods as measured by the F1-score. Our method also produces better-than-average execution times compared to the benchmark methods."

--- Seems OK but ad hoc. Might make a decent extension to the eigendresses assignment for data mining.
to:NB  anomaly_detection  density_estimation  principal_components  high-dimensional_statistics  statistics  to_teach:data-mining
2 days ago by cshalizi
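For the outlier-detection entry above, here is a rough sketch of the general idea of "project with PCA, then score with a kernel density estimate". It is not the paper's algorithm: the single retained component, the power-iteration PCA, the fixed bandwidth, the leave-one-out density and the toy data are all assumptions made for illustration.

```python
import math

def leading_direction(rows, iters=100):
    """Mean and leading principal direction of the rows, via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    v = [1.0 / math.sqrt(d)] * d  # arbitrary start; must not be orthogonal to the answer
    for _ in range(iters):
        # w = X^T (X v), which is proportional to (covariance matrix) times v
        proj = [sum(c[j] * v[j] for j in range(d)) for c in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def anomaly_scores(rows, bandwidth=1.0):
    """Negative log of a leave-one-out Gaussian KDE on each point's 1-D projection."""
    mean, v = leading_direction(rows)
    z = [sum((r[j] - mean[j]) * v[j] for j in range(len(v))) for r in rows]
    scores = []
    for i, zi in enumerate(z):
        kernel_sum = sum(math.exp(-0.5 * ((zi - zj) / bandwidth) ** 2)
                         for j, zj in enumerate(z) if j != i)
        density = kernel_sum / ((len(z) - 1) * bandwidth * math.sqrt(2 * math.pi))
        scores.append(-math.log(max(density, 1e-300)))  # floor avoids log(0)
    return scores

# Toy data (hypothetical): 21 points along the direction (1, 2, 0), plus one far-away point.
points = [[t / 10, 2 * t / 10, 0.0] for t in range(-10, 11)] + [[10.0, 20.0, 0.0]]
scores = anomaly_scores(points)
print(scores.index(max(scores)))  # index of the most anomalous point
```

A point that is unusual only in a direction orthogonal to the retained component would be missed by this sketch, which is one reason the number of components kept matters.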
[1908.09946] An empirical comparison between stochastic and deterministic centroid initialisation for K-Means variations
"K-Means is one of the most used algorithms for data clustering and the usual clustering method for benchmarking. Despite its wide application it is well-known that it suffers from a series of disadvantages, such as the positions of the initial clustering centres (centroids), which can greatly affect the clustering solution. Over the years many K-Means variations and initialisation techniques have been proposed with different degrees of complexity. In this study we focus on common K-Means variations and deterministic initialisation techniques and we first show that more sophisticated initialisation methods reduce or alleviate the need for complex K-Means clustering, and secondly, that deterministic methods can achieve equivalent or better performance than stochastic methods. These conclusions are obtained through extensive benchmarking using different model data sets from various studies as well as clustering data sets."
to:NB  clustering  k-means  data_mining  to_teach:data-mining
2 days ago by cshalizi
[1909.05495] Optimal choice of $k$ for $k$-nearest neighbor regression
"The k-nearest neighbor algorithm (k-NN) is a widely used non-parametric method for classification and regression. We study the mean squared error of the k-NN estimator when k is chosen by leave-one-out cross-validation (LOOCV). Although it was known that this choice of k is asymptotically consistent, it was not known previously that it is an optimal k. We show, with high probability, the mean squared error of this estimator is close to the minimum mean squared error using the k-NN estimate, where the minimum is over all choices of k."

--- Looks legit on first pass (and we know that LOOCV is generally _predictively_ good).
to:NB  regression  nearest_neighbors  statistics  cross-validation  to_teach:data-mining  have_skimmed
4 days ago by cshalizi
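For the k-NN entry above, a minimal sketch of what "choosing k by leave-one-out cross-validation" means in practice. The one-dimensional inputs, the toy data and the function names are assumptions for illustration; nothing here reproduces the paper's analysis.

```python
def knn_predict(train, x, k):
    """Average the responses of the k training points nearest to x (1-D inputs)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def loocv_mse(data, k):
    """Leave-one-out mean squared error: predict each point from all the others."""
    total = 0.0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        total += (knn_predict(rest, x, k) - y) ** 2
    return total / len(data)

def choose_k(data, candidates):
    """Pick the candidate k with the smallest leave-one-out error."""
    return min(candidates, key=lambda k: loocv_mse(data, k))

# Toy data (hypothetical): a quadratic trend with alternating +/- 0.3 "noise".
data = [(i / 10, (i / 10) ** 2 + (0.3 if i % 2 else -0.3)) for i in range(30)]
candidates = list(range(1, 15))
best_k = choose_k(data, candidates)
print(best_k, round(loocv_mse(data, best_k), 4))
```

Very small k chases the alternating noise and very large k smooths away the trend, so the leave-one-out error is what arbitrates between the two.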
[1909.04436] The Prevalence of Errors in Machine Learning Experiments
"Context: Conducting experiments is central to machine learning research to benchmark, evaluate and compare learning algorithms. Consequently it is important we conduct reliable, trustworthy experiments. Objective: We investigate the incidence of errors in a sample of machine learning experiments in the domain of software defect prediction. Our focus is simple arithmetical and statistical errors. Method: We analyse 49 papers describing 2456 individual experimental results from a previously undertaken systematic review comparing supervised and unsupervised defect prediction classifiers. We extract the confusion matrices and test for relevant constraints, e.g., the marginal probabilities must sum to one. We also check for multiple statistical significance testing errors. Results: We find that a total of 22 out of 49 papers contain demonstrable errors. Of these 7 were statistical and 16 related to confusion matrix inconsistency (one paper contained both classes of error). Conclusions: Whilst some errors may be of a relatively trivial nature, e.g., transcription errors, their presence does not engender confidence. We strongly urge researchers to follow open science principles so errors can be more easily detected and corrected, thus as a community reducing this worryingly high error rate with our computational experiments."
5 days ago by cshalizi
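For the errors-in-experiments entry above, a small sketch of the kind of constraint check the abstract mentions. The specific checks, the tolerance and the function names are assumptions for illustration, not the authors' procedure.

```python
def derived_metrics(tp, fp, fn, tn):
    """Recompute standard metrics from the four confusion-matrix cells."""
    def ratio(a, b):
        return a / b if b else None  # undefined when the denominator is zero
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }

def inconsistent_metrics(cells, reported, tol=0.005):
    """Names of reported metrics that disagree with the reported cells.

    Metrics that are undefined for these cells, or unknown to derived_metrics,
    are flagged as well.
    """
    derived = derived_metrics(**cells)
    return [name for name, value in reported.items()
            if derived.get(name) is None or abs(derived[name] - value) > tol]

def proportions_sum_to_one(proportions, tol=1e-6):
    """Cell proportions of a confusion matrix must be non-negative and sum to one."""
    return all(p >= 0 for p in proportions) and abs(sum(proportions) - 1.0) <= tol

cells = {"tp": 40, "fp": 10, "fn": 20, "tn": 30}
print(inconsistent_metrics(cells, {"precision": 0.80, "recall": 0.667, "f1": 0.80}))
# prints ['f1']: the cells imply F1 = 80/110, about 0.727, not 0.80
print(proportions_sum_to_one([0.4, 0.1, 0.2, 0.35]))  # prints False
```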
[1909.03093] Solving Interpretable Kernel Dimension Reduction
"Kernel dimensionality reduction (KDR) algorithms find a low dimensional representation of the original data by optimizing kernel dependency measures that are capable of capturing nonlinear relationships. The standard strategy is to first map the data into a high dimensional feature space using kernels prior to a projection onto a low dimensional space. While KDR methods can be easily solved by keeping the most dominant eigenvectors of the kernel matrix, its features are no longer easy to interpret. Alternatively, Interpretable KDR (IKDR) is different in that it projects onto a subspace \textit{before} the kernel feature mapping, therefore, the projection matrix can indicate how the original features linearly combine to form the new features. Unfortunately, the IKDR objective requires a non-convex manifold optimization that is difficult to solve and can no longer be solved by eigendecomposition. Recently, an efficient iterative spectral (eigendecomposition) method (ISM) has been proposed for this objective in the context of alternative clustering. However, ISM only provides theoretical guarantees for the Gaussian kernel. This greatly constrains ISM's usage since any kernel method using ISM is now limited to a single kernel. This work extends the theoretical guarantees of ISM to an entire family of kernels, thereby empowering ISM to solve any kernel method of the same objective. In identifying this family, we prove that each kernel within the family has a surrogate Φ matrix and the optimal projection is formed by its most dominant eigenvectors. With this extension, we establish how a wide range of IKDR applications across different learning paradigms can be solved by ISM. To support reproducible results, the source code is made publicly available on \url{this https URL}."

--- Last tag is dreamily aspirational.
to:NB  kernel_methods  dimension_reduction  principal_components  to_teach:data-mining
6 days ago by cshalizi
[1808.08619] Discriminative but Not Discriminatory: A Comparison of Fairness Definitions under Different Worldviews
"We mathematically compare three competing definitions of group-level nondiscrimination: demographic parity, equalized odds, and calibration. Using the theoretical framework of Friedler et al., we study the properties of each definition under various worldviews, which are assumptions about how, if at all, the observed data is biased. We argue that different worldviews call for different definitions of fairness, and we specify the worldviews that, when combined with the desire to avoid a criterion for discrimination that we call disparity amplification, motivate demographic parity and equalized odds. In addition, we show that calibration is insufficient for avoiding disparity amplification because it allows an arbitrarily large inter-group disparity. Finally, we define a worldview that is more realistic than the previously considered ones, and we introduce a new notion of fairness that corresponds to this worldview."
to:NB  prediction  data_mining  algorithmic_fairness  to_be_shot_after_a_fair_trial  to_teach:data-mining
8 days ago by cshalizi
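For the fairness-definitions entry above, a sketch of how two of the three criteria are usually measured for a binary classifier: demographic parity compares positive-prediction rates across groups, and equalized odds compares true-positive and false-positive rates across groups. Calibration is left out because it needs predicted probabilities rather than 0/1 predictions. The record layout and the toy data are assumptions for illustration.

```python
def rate(values):
    """Mean of a list of 0/1 values, or NaN if the list is empty."""
    return sum(values) / len(values) if values else float("nan")

def group_rates(records):
    """records: iterable of (group, y_true, y_pred) with 0/1 labels and predictions."""
    out = {}
    for g in sorted({r[0] for r in records}):
        rows = [r for r in records if r[0] == g]
        out[g] = {
            "positive_rate": rate([p for _, _, p in rows]),      # demographic parity
            "tpr": rate([p for _, y, p in rows if y == 1]),      # equalized odds, part 1
            "fpr": rate([p for _, y, p in rows if y == 0]),      # equalized odds, part 2
        }
    return out

def gap(rates, key):
    """Largest between-group difference in the named rate; 0 means the criterion holds."""
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)

# Toy records (hypothetical): (group, true label, predicted label).
records = [
    ("a", 1, 1), ("a", 1, 1), ("a", 0, 0), ("a", 0, 1),
    ("b", 1, 1), ("b", 1, 0), ("b", 0, 0), ("b", 0, 0),
]
rates = group_rates(records)
print({key: gap(rates, key) for key in ("positive_rate", "tpr", "fpr")})
```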
PsyArXiv Preprints | A “Need for Chaos” and the Sharing of Hostile Political Rumors in Advanced Democracies
"The circulation of hostile political rumors (including but not limited to false news and conspiracy theories) has gained prominence in public debates across advanced democracies. Here, we provide the first comprehensive assessment of the psychological syndrome that elicits motivations to share hostile political rumors among citizens of democratic societies. Against the notion that sharing occurs to help one mainstream political actor in the increasingly polarized electoral competition against other mainstream actors, we demonstrate that sharing motivations are associated with ‘chaotic’ motivations to “burn down” the entire established democratic ‘cosmos’. We show that this extreme discontent is associated with motivations to share hostile political rumors, not because such rumors are viewed to be true but because they are believed to mobilize the audience against disliked elites. We introduce an individual difference measure, the “Need for Chaos”, to measure these motivations and illuminate their social causes, linked to frustrated status-seeking. Finally, we show that chaotic motivations are surprisingly widespread within advanced democracies, having some hold in up to 40 percent of the American national population."
to:NB  to_be_shot_after_a_fair_trial  us_politics  principal_components  psychometrics  #include:my_usual_skepticism_about_this_kind_of_psychometrics  to_teach:data-mining  epidemiology_of_representations  social_media  natural_history_of_truthiness  re:actually-dr-internet-is-the-name-of-the-monsters-creator
10 days ago by cshalizi
The Ethical Algorithm - Michael Kearns; Aaron Roth - Oxford University Press
"Over the course of a generation, algorithms have gone from mathematical abstractions to powerful mediators of daily life. Algorithms have made our lives more efficient, more entertaining, and, sometimes, better informed. At the same time, complex algorithms are increasingly violating the basic rights of individual citizens. Allegedly anonymized datasets routinely leak our most sensitive personal information; statistical models for everything from mortgages to college admissions reflect racial and gender bias. Meanwhile, users manipulate algorithms to "game" search engines, spam filters, online reviewing services, and navigation apps.
"Understanding and improving the science behind the algorithms that run our lives is rapidly becoming one of the most pressing issues of this century. Traditional fixes, such as laws, regulations and watchdog groups, have proven woefully inadequate. Reporting from the cutting edge of scientific research, The Ethical Algorithm offers a new approach: a set of principled solutions based on the emerging and exciting science of socially aware algorithm design. Michael Kearns and Aaron Roth explain how we can better embed human principles into machine code - without halting the advance of data-driven scientific exploration. Weaving together innovative research with stories of citizens, scientists, and activists on the front lines, The Ethical Algorithm offers a compelling vision for a future, one in which we can better protect humans from the unintended impacts of algorithms while continuing to inspire wondrous advances in technology."
to:NB  books:noted  via:arsyed  data_mining  algorithmic_fairness  kearns.michael  to_teach:data-mining
14 days ago by cshalizi
7 Things Netflix’s ‘The Great Hack’ Gets Wrong About the Facebook–Cambridge Analytica Data Scandal - Truth on the Market
(Tangentially: if the effects of campaign advertising are 0 on average, why do campaigns spend so much on it? Alternately, how do those studies rule out the possibility that advertising with little or no opposition would be very effective, but it's always opposed?)
to_teach:data-mining  cambridge_analytica  debunking  track_down_references
19 days ago by cshalizi
[1908.09635] A Survey on Bias and Fairness in Machine Learning
"With the widespread use of AI systems and applications in our everyday lives, it is important to take fairness issues into consideration while designing and engineering these types of systems. Such systems can be used in many sensitive environments to make important and life-changing decisions; thus, it is crucial to ensure that the decisions do not reflect discriminatory behavior toward certain groups or populations. We have recently seen work in machine learning, natural language processing, and deep learning that addresses such challenges in different subdomains. With the commercialization of these systems, researchers are becoming aware of the biases that these applications can contain and have attempted to address them. In this survey we investigated different real-world applications that have shown biases in various ways, and we listed different sources of biases that can affect AI applications. We then created a taxonomy for fairness definitions that machine learning researchers have defined in order to avoid the existing bias in AI systems. In addition to that, we examined different domains and subdomains in AI showing what researchers have observed with regard to unfair outcomes in the state-of-the-art methods and how they have tried to address them. There are still many future directions and solutions that can be taken to mitigate the problem of bias in AI systems. We are hoping that this survey will motivate researchers to tackle these issues in the near future by observing existing work in their respective fields."
to:NB  to_read  algorithmic_fairness  prediction  machine_learning  lerman.kristina  galstyan.aram  to_teach:data-mining
21 days ago by cshalizi
[1908.08328] Measuring the Business Value of Recommender Systems
"Recommender Systems are nowadays successfully used by all major web sites (from e-commerce to social media) to filter content and make suggestions in a personalized way. Academic research largely focuses on the value of recommenders for consumers, e.g., in terms of reduced information overload. To what extent and in which ways recommender systems create business value is, however, much less clear, and the literature on the topic is scattered. In this research commentary, we review existing publications on field tests of recommender systems and report which business-related performance measures were used in such real-world deployments. We summarize common challenges of measuring the business value in practice and critically discuss the value of algorithmic improvements and offline experiments as commonly done in academic environments. Overall, our review indicates that various open questions remain both regarding the realistic quantification of the business effects of recommenders and the performance assessment of recommendation algorithms in academia."
to:NB  recommender_systems  to_teach:data-mining
21 days ago by cshalizi
Evaluating Probabilistic Forecasts with scoringRules | Jordan | Journal of Statistical Software
"Probabilistic forecasts in the form of probability distributions over future events have become popular in several fields including meteorology, hydrology, economics, and demography. In typical applications, many alternative statistical models and data sources can be used to produce probabilistic forecasts. Hence, evaluating and selecting among competing methods is an important task. The scoringRules package for R provides functionality for comparative evaluation of probabilistic models based on proper scoring rules, covering a wide range of situations in applied work. This paper discusses implementation and usage details, presents case studies from meteorology and economics, and points to the relevant background literature."
27 days ago by cshalizi
[1908.07031] Evaluating Hierarchies through A Partially Observable Markov Decision Processes Methodology
"Hierarchical clustering has been shown to be valuable in many scenarios, e.g. catalogues, biology research, image processing, and so on. Despite its usefulness to many situations, there is no agreed methodology on how to properly evaluate the hierarchies produced from different techniques, particularly in the case where ground-truth labels are unavailable. This motivates us to propose a framework for assessing the quality of hierarchical clustering allocations which covers the case of no ground-truth information. Such a quality measurement is useful, for example, to assess the hierarchical structures used by online retailer websites to display their product catalogues. Differently to all the previous measures and metrics, our framework tackles the evaluation from a decision theoretic perspective. We model the process as a bot searching stochastically for items in the hierarchy and establish a measure representing the degree to which the hierarchy supports this search. We employ the concept of Partially Observable Markov Decision Processes (POMDP) to model the uncertainty, the decision making, and the cognitive return for searchers in such a scenario. In this paper, we fully discuss the modeling details and demonstrate its application on some datasets."
to:NB  clustering  hierarchical_structure  information_retrieval  to_teach:data-mining
27 days ago by cshalizi
The Incompatible Incentives of Private Sector AI by Tom Slee :: SSRN
"Algorithms that sort people into categories are plagued by incompatible incentives. While more accurate algorithms may address problems of statistical bias and unfairness, they cannot solve the ethical challenges that arise from incompatible incentives.
"Subjects of algorithmic decisions seek to optimize their outcomes, but such efforts may degrade the accuracy of the algorithm. To maintain their accuracy, algorithms must be accompanied by supplementary rules: “guardrails” that dictate the limits of acceptable behaviour by subjects. Algorithm owners are drawn into taking on the tasks of governance, managing and validating the behaviour of those who interact with their systems.
"The governance role offers temptations to indulge in regulatory arbitrage. If governance is left to algorithm owners, it may lead to arbitrary and restrictive controls on individual behaviour. The goal of algorithmic governance by automated decision systems, social media recommender systems, and rating systems is a mirage, retreating into the distance whenever we seem to approach it."
to:NB  mechanism_design  prediction  data_mining  slee.tom  to_read  to_teach:data-mining
28 days ago by cshalizi
Association Between Surgical Skin Markings in Dermoscopic Images and Diagnostic Performance of a Deep Learning Convolutional Neural Network for Melanoma Recognition | Dermatology | JAMA Dermatology | JAMA Network
"Question  Are surgical skin markings in dermoscopic images associated with the diagnostic performance of a trained and validated deep learning convolutional neural network?
"Findings  In this cross-sectional study of 130 skin lesions, skin markings by standard surgical ink markers were associated with a significant reduction in the specificity of a convolutional neural network by increasing the melanoma probability scores, consequently increasing the false-positive rate of benign nevi by approximately 40%.
"Meaning  This study suggests that the use of surgical skin markers should be avoided in dermoscopic images intended for analysis by a convolutional neural network."
to:NB  classifiers  to_teach:data-mining  via:tslumley
28 days ago by cshalizi
[1908.06852] SIRUS: making random forests interpretable
"State-of-the-art learning algorithms, such as random forests or neural networks, are often qualified as "black-boxes" because of the high number and complexity of operations involved in their prediction mechanism. This lack of interpretability is a strong limitation for applications involving critical decisions, typically the analysis of production processes in the manufacturing industry. In such critical contexts, models have to be interpretable, i.e., simple, stable, and predictive. To address this issue, we design SIRUS (Stable and Interpretable RUle Set), a new classification algorithm based on random forests, which takes the form of a short list of rules. While simple models are usually unstable with respect to data perturbation, SIRUS achieves a remarkable stability improvement over cutting-edge methods. Furthermore, SIRUS inherits a predictive accuracy close to random forests, combined with the simplicity of decision trees. These properties are assessed both from a theoretical and empirical point of view, through extensive numerical experiments based on our R/C++ software implementation sirus."
to:NB  classifiers  ensemble_methods  random_forests  decision_trees  data_mining  statistics  to_teach:data-mining
28 days ago by cshalizi
[1908.06319] Locally Linear Embedding and fMRI feature selection in psychiatric classification
"Background: Functional magnetic resonance imaging (fMRI) provides non-invasive measures of neuronal activity using an endogenous Blood Oxygenation-Level Dependent (BOLD) contrast. This article introduces a nonlinear dimensionality reduction (Locally Linear Embedding) to extract informative measures of the underlying neuronal activity from BOLD time-series. The method is validated using the Leave-One-Out-Cross-Validation (LOOCV) accuracy of classifying psychiatric diagnoses using resting-state and task-related fMRI. Methods: Locally Linear Embedding of BOLD time-series (into each voxel's respective tensor) was used to optimise feature selection. This uses Gauß' Principle of Least Constraint to conserve quantities over both space and time. This conservation was assessed using LOOCV to greedily select time points in an incremental fashion on training data that was categorised in terms of psychiatric diagnoses. Findings: The embedded fMRI gave highly diagnostic performances (> 80%) on eleven publicly-available datasets containing healthy controls and patients with either Schizophrenia, Attention-Deficit Hyperactivity Disorder (ADHD), or Autism Spectrum Disorder (ASD). Furthermore, unlike the original fMRI data before or after using Principal Component Analysis (PCA) for artefact reduction, the embedded fMRI furnished significantly better than chance classification (defined as the majority class proportion) on ten of eleven datasets. Interpretation: Locally Linear Embedding appears to be a useful feature extraction procedure that retains important information about patterns of brain activity distinguishing among psychiatric cohorts."

--- Last tag is because I plan to teach LLE and this might make a good example or assignment, if I like how it was actually done.
to:NB  locally_linear_embedding  classifiers  fmri  dimension_reduction  to_teach:data-mining
28 days ago by cshalizi
[1908.06173] The History of Digital Spam
"Spam!: that's what Lorrie Faith Cranor and Brian LaMacchia exclaimed in the title of a popular call-to-action article that appeared twenty years ago in Communications of the ACM. And yet, despite the tremendous efforts of the research community over the last two decades to mitigate this problem, the sense of urgency remains unchanged, as emerging technologies have brought new dangerous forms of digital spam under the spotlight. Furthermore, when spam is carried out with the intent to deceive or influence at scale, it can alter the very fabric of society and our behavior. In this article, I will briefly review the history of digital spam: starting from its quintessential incarnation, spam emails, to modern-day forms of spam affecting the Web and social media, the survey will close by depicting future risks associated with spam and abuse of new technologies, including Artificial Intelligence (e.g., Digital Humans). After providing a taxonomy of spam, and its most popular applications that emerged throughout the last two decades, I will review technological and regulatory approaches proposed in the literature, and suggest some possible solutions to tackle this ubiquitous digital epidemic moving forward."
to:NB  spam  advertising  deceiving_us_has_become_an_industrial_process  history_of_technology  history_of_computing  networked_life  to_teach:data-mining
28 days ago by cshalizi
[1908.05818] Kernel Sketching yields Kernel JL
"The main contribution of the paper is to show that Gaussian sketching of a kernel-Gram matrix $K$ yields an operator whose counterpart in an RKHS $\mathcal{H}$ is a random projection operator---in the spirit of Johnson-Lindenstrauss (JL) lemma. To be precise, given a random matrix $Z$ with i.i.d. Gaussian entries, we show that a sketch $ZK$ corresponds to a particular random operator in (infinite-dimensional) Hilbert space $\mathcal{H}$ that maps functions $f \in \mathcal{H}$ to a low-dimensional space $\mathbb{R}^d$, while preserving a weighted RKHS inner-product of the form $\langle f, g \rangle_{\Sigma} \doteq \langle f, \Sigma^3 g \rangle$, where $\Sigma$ is the \emph{covariance} operator induced by the data distribution. In particular, under similar assumptions as in kernel PCA (KPCA), or kernel $k$-means (K-$k$-means), well-separated subsets of feature-space $\{K(\cdot, x) : x \in \mathcal{X}\}$ remain well-separated after such operation, which suggests similar benefits as in KPCA and/or K-$k$-means, albeit at the much cheaper cost of a random projection. In particular, our convergence rates suggest that, given a large dataset $\{X_i\}_{i=1}^{N}$ of size $N$, we can build the Gram matrix $K$ on a much smaller subsample of size $n \ll N$, so that the sketch $ZK$ is very cheap to obtain and subsequently apply as a projection operator on the original data $\{X_i\}_{i=1}^{N}$. We verify these insights empirically on synthetic data, and on real-world clustering applications."

--- The last tag is wildly ambitious for the undergrad class
to:NB  kernel_methods  hilbert_space  johnson-lindenstrauss  random_projections  data_mining  to_teach:data-mining
29 days ago by cshalizi
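For the kernel-sketching entry above, a sketch of the ordinary finite-dimensional Johnson-Lindenstrauss effect that the abstract alludes to: multiplying by a matrix of i.i.d. Gaussian entries roughly preserves pairwise distances. This is plain random projection of vectors, not the paper's kernel sketch $ZK$; the dimensions, seeds and function names are assumptions for illustration.

```python
import math
import random

def gaussian_sketch(points, k, seed=0):
    """Project d-dimensional points to k dimensions with an i.i.d. Gaussian matrix."""
    rng = random.Random(seed)
    d = len(points[0])
    # Entries are N(0, 1/k), so squared lengths are preserved in expectation.
    z = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(d)] for _ in range(k)]
    return [[sum(row[j] * p[j] for j in range(d)) for row in z] for p in points]

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Four hypothetical points in 500 dimensions, sketched down to 200.
rng = random.Random(1)
points = [[rng.uniform(-1, 1) for _ in range(500)] for _ in range(4)]
sketched = gaussian_sketch(points, k=200)
ratios = [distance(sketched[i], sketched[j]) / distance(points[i], points[j])
          for i in range(4) for j in range(i + 1, 4)]
print([round(r, 2) for r in ratios])  # each ratio should be near 1
```

The spread of the ratios around 1 shrinks roughly like one over the square root of the sketch dimension, which is why the target dimension can be chosen without reference to the original one.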
[1907.12652] How model accuracy and explanation fidelity influence user trust
"Machine learning systems have become popular in fields such as marketing, financing, or data mining. While they are highly accurate, complex machine learning systems pose challenges for engineers and users. Their inherent complexity makes it impossible to easily judge their fairness and the correctness of statistically learned relations between variables and classes. Explainable AI aims to solve this challenge by modelling explanations alongside the classifiers, potentially improving user trust and acceptance. However, users should not be fooled by persuasive, yet untruthful explanations. We therefore conduct a user study in which we investigate the effects of model accuracy and explanation fidelity, i.e. how truthfully the explanation represents the underlying model, on user trust. Our findings show that accuracy is more important for user trust than explainability. Adding an explanation for a classification result can potentially harm trust, e.g. when adding nonsensical explanations. We also found that users cannot be tricked by high-fidelity explanations into having trust for a bad classifier. Furthermore, we found a mismatch between observed (implicit) and self-reported (explicit) trust."
to:NB  machine_learning  data_mining  to_teach:data-mining
5 weeks ago by cshalizi
[1905.05134] What Clinicians Want: Contextualizing Explainable Machine Learning for Clinical End Use
"Translating machine learning (ML) models effectively to clinical practice requires establishing clinicians' trust. Explainability, or the ability of an ML model to justify its outcomes and assist clinicians in rationalizing the model prediction, has been generally understood to be critical to establishing trust. However, the field suffers from the lack of concrete definitions for usable explanations in different settings. To identify specific aspects of explainability that may catalyze building trust in ML models, we surveyed clinicians from two distinct acute care specialties (Intensive Care Unit and Emergency Department). We use their feedback to characterize when explainability helps to improve clinicians' trust in ML models. We further identify the classes of explanations that clinicians identified as most relevant and crucial for effective translation to clinical practice. Finally, we discern concrete metrics for rigorous evaluation of clinical explainability methods. By integrating perceptions of explainability between clinicians and ML researchers we hope to facilitate the endorsement and broader adoption and sustained use of ML systems in healthcare."
to:NB  machine_learning  data_mining  explanation  medicine  goldenberg.anna  to_teach:data-mining
5 weeks ago by cshalizi
[1908.02591] Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics
"Anti-money laundering (AML) regulations play a critical role in safeguarding financial systems, but bear high costs for institutions and drive financial exclusion for those on the socioeconomic and international margins. The advent of cryptocurrency has introduced an intriguing paradox: pseudonymity allows criminals to hide in plain sight, but open data gives more power to investigators and enables the crowdsourcing of forensic analysis. Meanwhile advances in learning algorithms show great promise for the AML toolkit. In this workshop tutorial, we motivate the opportunity to reconcile the cause of safety with that of financial inclusion. We contribute the Elliptic Data Set, a time series graph of over 200K Bitcoin transactions (nodes), 234K directed payment flows (edges), and 166 node features, including ones based on non-public data; to our knowledge, this is the largest labelled transaction data set publicly available in any cryptocurrency. We share results from a binary classification task predicting illicit transactions using variations of Logistic Regression (LR), Random Forest (RF), Multilayer Perceptrons (MLP), and Graph Convolutional Networks (GCN), with GCN being of special interest as an emergent new method for capturing relational information. The results show the superiority of Random Forest (RF), but also invite algorithmic work to combine the respective powers of RF and graph methods. Lastly, we consider visualization for analysis and explainability, which is difficult given the size and dynamism of real-world transaction graphs, and we offer a simple prototype capable of navigating the graph and observing model performance on illicit activity over time. With this tutorial and data set, we hope to a) invite feedback in support of our ongoing inquiry, and b) inspire others to work on this societally important challenge."
to:NB  bitcoin  network_data_analysis  classifiers  statistics  data_mining  crime  to_teach:data-mining
5 weeks ago by cshalizi
Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
"Context-predicting models (more commonly known as embeddings or neural language models) are the new kids on the distributional semantics block. Despite the buzz surrounding these models, the literature is still lacking a systematic comparison of the predictive models with classic, count-vector-based distributional semantic approaches. In this paper, we perform such an extensive evaluation, on a wide range of lexical semantics tasks and across many parameter settings. The results, to our own surprise, show that the buzz is fully justified, as the context-predicting models obtain a thorough and resounding victory against their count-based counterparts."
to:NB  have_read  natural_language_processing  text_mining  word2vec  data_mining  to_teach:data-mining
5 weeks ago by cshalizi
[1402.3722] word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method
"The word2vec software of Tomas Mikolov and colleagues (this https URL ) has gained a lot of traction lately, and provides state-of-the-art word embeddings. The learning models behind the software are described in two research papers. We found the description of the models in these papers to be somewhat cryptic and hard to follow. While the motivations and presentation may be obvious to the neural-networks language-modeling crowd, we had to struggle quite a bit to figure out the rationale behind the equations.
"This note is an attempt to explain equation (4) (negative sampling) in "Distributed Representations of Words and Phrases and their Compositionality" by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean."
to:NB  natural_language_processing  text_mining  statistics  neural_networks  data_mining  word2vec  have_read  to_teach:data-mining
5 weeks ago by cshalizi
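For reference when reading the entry above: the negative-sampling objective that the note sets out to explain is usually written as below. This is a sketch in the notation of Mikolov et al., not a quotation from either paper. The symbols are assumptions of this sketch: $\sigma$ is the logistic sigmoid, $v_{w_I}$ is the input vector of the centre word, $v'_{w}$ is the output vector of word $w$, $w_O$ is the observed context word, $k$ is the number of negative samples, and $P_n(w)$ is the noise distribution they are drawn from.

```latex
\log \sigma\!\left( {v'_{w_O}}^{\top} v_{w_I} \right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
    \left[ \log \sigma\!\left( -{v'_{w_i}}^{\top} v_{w_I} \right) \right]
```

The first term rewards a high score for the observed (word, context) pair; the sum penalises high scores for the $k$ sampled noise words.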
[1301.3781] Efficient Estimation of Word Representations in Vector Space
"We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities."

--- The last tag is added with an air of "do I really have to?"
to:NB  have_read  neural_networks  text_mining  word2vec  data_mining  to_teach:data-mining
5 weeks ago by cshalizi
[1907.12720] Exploring large scale public medical image datasets
"Rationale and Objectives: Medical artificial intelligence systems are dependent on well characterised large scale datasets. Recently released public datasets have been of great interest to the field, but pose specific challenges due to the disconnect they cause between data generation and data usage, potentially limiting the utility of these datasets.
"Materials and Methods: We visually explore two large public datasets, to determine how accurate the provided labels are and whether other subtle problems exist. The ChestXray14 dataset contains 112,120 frontal chest films, and the MURA dataset contains 40,561 upper limb radiographs. A subset of around 700 images from both datasets was reviewed by a board-certified radiologist, and the quality of the original labels was determined.
"Results: The ChestXray14 labels did not accurately reflect the visual content of the images, with positive predictive values mostly between 10% and 30% lower than the values presented in the original documentation. There were other significant problems, with examples of hidden stratification and label disambiguation failure. The MURA labels were more accurate, but the original normal/abnormal labels were inaccurate for the subset of cases with degenerative joint disease, with a sensitivity of 60% and a specificity of 82%.
"Conclusion: Visual inspection of images is a necessary component of understanding large image datasets. We recommend that teams producing public datasets should perform this important quality control procedure and include a thorough description of their findings, along with an explanation of the data generating procedures and labelling rules, in the documentation for their datasets."
to:NB  data_sets  data_mining  to_teach:data-mining  statistics
5 weeks ago by cshalizi
[1908.01251] Measuring the Algorithmic Convergence of Randomized Ensembles: The Regression Setting
"When randomized ensemble methods such as bagging and random forests are implemented, a basic question arises: Is the ensemble large enough? In particular, the practitioner desires a rigorous guarantee that a given ensemble will perform nearly as well as an ideal infinite ensemble (trained on the same data). The purpose of the current paper is to develop a bootstrap method for solving this problem in the context of regression --- which complements our companion paper in the context of classification (Lopes 2019). In contrast to the classification setting, the current paper shows that theoretical guarantees for the proposed bootstrap can be established under much weaker assumptions. In addition, we illustrate the flexibility of the method by showing how it can be adapted to measure algorithmic convergence for variable selection. Lastly, we provide numerical results demonstrating that the method works well in a range of situations."
to:NB  ensemble_methods  computational_statistics  statistics  to_teach:data-mining
6 weeks ago by cshalizi
[1810.02909] On the Art and Science of Machine Learning Explanations
"This text discusses several popular explanatory methods that go beyond the error measurements and plots traditionally used to assess machine learning models. Some of the explanatory methods are accepted tools of the trade while others are rigorously derived and backed by long-standing theory. The methods, decision tree surrogate models, individual conditional expectation (ICE) plots, local interpretable model-agnostic explanations (LIME), partial dependence plots, and Shapley explanations, vary in terms of scope, fidelity, and suitable application domain. Along with descriptions of these methods, this text presents real-world usage recommendations supported by a use case and public, in-depth software examples for reproducibility."
to:NB  data_mining  prediction  explanation  to_teach:data-mining
6 weeks ago by cshalizi
[1904.02101] The Landscape of R Packages for Automated Exploratory Data Analysis
"The increasing availability of large but noisy data sets with a large number of heterogeneous variables leads to the increasing interest in the automation of common tasks for data analysis. The most time-consuming part of this process is the Exploratory Data Analysis, crucial for better domain understanding, data cleaning, data validation, and feature engineering.
"There is a growing number of libraries that attempt to automate some of the typical Exploratory Data Analysis tasks to make the search for new insights easier and faster. In this paper, we present a systematic review of existing tools for Automated Exploratory Data Analysis (autoEDA). We explore the features of twelve popular R packages to identify the parts of analysis that can be effectively automated with the current tools and to point out new directions for further autoEDA development."
to:NB  R  exploratory_data_analysis  data_analysis  statistics  to_teach:data-mining
6 weeks ago by cshalizi
[1907.08742] Estimating the Algorithmic Variance of Randomized Ensembles via the Bootstrap
"Although the methods of bagging and random forests are some of the most widely used prediction methods, relatively little is known about their algorithmic convergence. In particular, there are not many theoretical guarantees for deciding when an ensemble is "large enough" --- so that its accuracy is close to that of an ideal infinite ensemble. Due to the fact that bagging and random forests are randomized algorithms, the choice of ensemble size is closely related to the notion of "algorithmic variance" (i.e. the variance of prediction error due only to the training algorithm). In the present work, we propose a bootstrap method to estimate this variance for bagging, random forests, and related methods in the context of classification. To be specific, suppose the training dataset is fixed, and let the random variable Err_t denote the prediction error of a randomized ensemble of size t. Working under a "first-order model" for randomized ensembles, we prove that the centered law of Err_t can be consistently approximated via the proposed method as t→∞. Meanwhile, the computational cost of the method is quite modest, by virtue of an extrapolation technique. As a consequence, the method offers a practical guideline for deciding when the algorithmic fluctuations of Err_t are negligible."
to:NB  ensemble_methods  computational_statistics  statistics  prediction  to_teach:data-mining
7 weeks ago by cshalizi
[1907.09013] Conscientious Classification: A Data Scientist's Guide to Discrimination-Aware Classification
"Recent research has helped to cultivate growing awareness that machine learning systems fueled by big data can create or exacerbate troubling disparities in society. Much of this research comes from outside of the practicing data science community, leaving its members with little concrete guidance to proactively address these concerns. This article introduces issues of discrimination to the data science community on its own terms. In it, we tour the familiar data mining process while providing a taxonomy of common practices that have the potential to produce unintended discrimination. We also survey how discrimination is commonly measured, and suggest how familiar development processes can be augmented to mitigate systems' discriminatory potential. We advocate that data scientists should be intentional about modeling and reducing discriminatory outcomes. Without doing so, their efforts will result in perpetuating any systemic discrimination that may exist, but under a misleading veil of data-driven objectivity."
to:NB  classifiers  algorithmic_fairness  prediction  to_teach:data-mining  o'neil.cathy
7 weeks ago by cshalizi
[1907.08679] Recommender Systems with Heterogeneous Side Information
"In modern recommender systems, both users and items are associated with rich side information, which can help understand users and items. Such information is typically heterogeneous and can be roughly categorized into flat and hierarchical side information. While side information has been proved to be valuable, the majority of existing systems have exploited either only flat side information or only hierarchical side information due to the challenges brought by the heterogeneity. In this paper, we investigate the problem of exploiting heterogeneous side information for recommendations. Specifically, we propose a novel framework jointly captures flat and hierarchical side information with mathematical coherence. We demonstrate the effectiveness of the proposed framework via extensive experiments on various real-world datasets. Empirical results show that our approach is able to lead a significant performance gain over the state-of-the-art methods."
to:NB  recommender_systems  prediction  to_teach:data-mining
7 weeks ago by cshalizi
[1907.07384] Feature Selection via Mutual Information: New Theoretical Insights
"Mutual information has been successfully adopted in filter feature-selection methods to assess both the relevancy of a subset of features in predicting the target variable and the redundancy with respect to other variables. However, existing algorithms are mostly heuristic and do not offer any guarantee on the proposed solution. In this paper, we provide novel theoretical results showing that conditional mutual information naturally arises when bounding the ideal regression/classification errors achieved by different subsets of features. Leveraging on these insights, we propose a novel stopping condition for backward and forward greedy methods which ensures that the ideal prediction error using the selected feature subset remains bounded by a user-specified threshold. We provide numerical simulations to support our theoretical claims and compare to common heuristic methods."
in_NB  variable_selection  information_theory  statistics  to_teach:data-mining
8 weeks ago by cshalizi
Life by Algorithms: How Roboprocesses Are Remaking Our World, Besteman, Gusterson
"Computerized processes are everywhere in our society. They are the automated phone messaging systems that businesses use to screen calls; the link between student standardized test scores and public schools’ access to resources; the algorithms that regulate patient diagnoses and reimbursements to doctors. The storage, sorting, and analysis of massive amounts of information have enabled the automation of decision-making at an unprecedented level. Meanwhile, computers have offered a model of cognition that increasingly shapes our approach to the world. The proliferation of “roboprocesses” is the result, as editors Catherine Besteman and Hugh Gusterson observe in this rich and wide-ranging volume, which features contributions from a distinguished cast of scholars in anthropology, communications, international studies, and political science.
"Although automatic processes are designed to be engines of rational systems, the stories in Life by Algorithms reveal how they can in fact produce absurd, inflexible, or even dangerous outcomes. Joining the call for “algorithmic transparency,” the contributors bring exceptional sensitivity to everyday sociality into their critique to better understand how the perils of modern technology affect finance, medicine, education, housing, the workplace, food production, public space, and emotions—not as separate problems but as linked manifestations of a deeper defect in the fundamental ordering of our society."
to:NB  books:noted  bureaucracy  data_mining  to_teach:data-mining
june 2019 by cshalizi
[1906.04711] ProPublica's COMPAS Data Revisited
"In this paper I re-examine the COMPAS recidivism score and criminal history data collected by ProPublica in 2016, which has fueled intense debate and research in the nascent field of 'algorithmic fairness' or 'fair machine learning' over the past three years. ProPublica's COMPAS data is used in an ever-increasing number of studies to test various definitions and methodologies of algorithmic fairness. This paper takes a closer look at the actual datasets put together by ProPublica. In particular, I examine the distribution of defendants across COMPAS screening dates and find that ProPublica made an important data processing mistake when it created some of the key datasets most often used by other researchers. Specifically, the datasets built to study the likelihood of recidivism within two years of the original COMPAS screening date. As I show in this paper, ProPublica made a mistake implementing the two-year sample cutoff rule for recidivists in such datasets (whereas it implemented an appropriate two-year sample cutoff rule for non-recidivists). As a result, ProPublica incorrectly kept a disproportionate share of recidivists. This data processing mistake leads to biased two-year recidivism datasets, with artificially high recidivism rates. This also affects the positive and negative predictive values. On the other hand, this data processing mistake does not impact some of the key statistical measures highlighted by ProPublica and other researchers, such as the false positive and false negative rates, nor the overall accuracy."
to:NB  data_sets  crime  prediction  to_teach:data-mining  algorithmic_fairness
june 2019 by cshalizi
[1808.06581] The Deconfounded Recommender: A Causal Inference Approach to Recommendation
"The goal of recommendation is to show users items that they will like. Though usually framed as a prediction, the spirit of recommendation is to answer an interventional question---for each user and movie, what would the rating be if we "forced" the user to watch the movie? To this end, we develop a causal approach to recommendation, one where watching a movie is a "treatment" and a user's rating is an "outcome." The problem is there may be unobserved confounders, variables that affect both which movies the users watch and how they rate them; unobserved confounders impede causal predictions with observational data. To solve this problem, we develop the deconfounded recommender, a way to use classical recommendation models for causal recommendation. Following Wang & Blei [23], the deconfounded recommender involves two probabilistic models. The first models which movies the users watch; it provides a substitute for the unobserved confounders. The second one models how each user rates each movie; it employs the substitute to help account for confounders. This two-stage approach removes bias due to confounding. It improves recommendation and enjoys stable performance against interventions on test sets."
causal_inference  collaborative_filtering  blei.david  in_NB  to_teach:data-mining  to_read  recommender_systems
may 2019 by cshalizi
Keeping Score: Predictive Analytics in Policing | Annual Review of Criminology
"Predictive analytics in policing is a data-driven approach to (a) characterizing crime patterns across time and space and (b) leveraging this knowledge for the prevention of crime and disorder. This article outlines the current state of the field, providing a review of forecasting tools that have been successfully applied by police to the task of crime prediction. We then discuss options for structured design and evaluation of a predictive policing program so that the benefits of proactive intervention efforts are maximized given fixed resource constraints. We highlight examples of predictive policing programs that have been implemented and evaluated by police agencies in the field. Finally, we discuss ethical issues related to predictive analytics in policing and suggest approaches for minimizing potential harm to vulnerable communities while providing an equitable distribution of the benefits of crime prevention across populations within police jurisdiction."
to:NB  police  crime  prediction  data_mining  to_teach:data-mining
may 2019 by cshalizi
Do ImageNet Classifiers Generalize to ImageNet?
"We build new test sets for the CIFAR-10 and ImageNet datasets. Both benchmarks have been the focus of intense research for almost a decade, raising the danger of overfitting to excessively re-used test sets. By closely following the original dataset creation processes, we test to what extent current classification models generalize to new data. We evaluate a broad range of models and find accuracy drops of 3% – 15% on CIFAR-10 and 11% – 14% on ImageNet. However, accuracy gains on the original test sets translate to larger gains on the new test sets. Our results suggest that the accuracy drops are not caused by adaptivity, but by the models’ inability to generalize to slightly “harder” images than those found in the original test sets."

--- The astonishing thing to me is the _linear_ relationship between accuracy on the old and new data-set versions. It's uncannily good. (Also: tiny changes in data-preparation make a big difference!)
to:NB  have_read  classifiers  neural_networks  data_sets  to_teach:data-mining
february 2019 by cshalizi
Recognising when you don’t know - Biased and Inefficient
(Some nice shade is thrown on the difference between machine learning and statistics --- excuse me, "data science".)
february 2019 by cshalizi
How a Feel-Good AI Story Went Wrong in Flint - The Atlantic
Interesting (and depressing) in so many ways. (The least of which is grist for my "AI is really ML, and ML is really regression" mill.)
classifiers  data_mining  our_decrepit_institutions  infrastructure  public_policy  have_read  via:?  to_teach:data-mining  to_teach:data_over_space_and_time
january 2019 by cshalizi
[1808.00023] The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning
"The nascent field of fair machine learning aims to ensure that decisions guided by algorithms are equitable. Over the last several years, three formal definitions of fairness have gained prominence: (1) anti-classification, meaning that protected attributes---like race, gender, and their proxies---are not explicitly used to make decisions; (2) classification parity, meaning that common measures of predictive performance (e.g., false positive and false negative rates) are equal across groups defined by the protected attributes; and (3) calibration, meaning that conditional on risk estimates, outcomes are independent of protected attributes. Here we show that all three of these fairness definitions suffer from significant statistical limitations. Requiring anti-classification or classification parity can, perversely, harm the very groups they were designed to protect; and calibration, though generally desirable, provides little guarantee that decisions are equitable. In contrast to these formal fairness criteria, we argue that it is often preferable to treat similarly risky people similarly, based on the most statistically accurate estimates of risk that one can produce. Such a strategy, while not universally applicable, often aligns well with policy objectives; notably, this strategy will typically violate both anti-classification and classification parity. In practice, it requires significant effort to construct suitable risk estimates. One must carefully define and measure the targets of prediction to avoid retrenching biases in the data. But, importantly, one cannot generally address these difficulties by requiring that algorithms satisfy popular mathematical formalizations of fairness. By highlighting these challenges in the foundation of fair machine learning, we hope to help researchers and practitioners productively advance the area."

--- ETA: This is a really good and convincing paper.
august 2018 by cshalizi
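For teaching purposes, the "classification parity" definition in the abstract above (equal false positive and false negative rates across groups) can be checked with a few lines of code. The following is a minimal sketch, not anything from the paper: the function name `error_rates_by_group` and the toy labels, predictions and group memberships are all made up for illustration.

```python
from collections import defaultdict


def error_rates_by_group(y_true, y_pred, group):
    """Return {group: (false_positive_rate, false_negative_rate)}.

    y_true and y_pred are sequences of 0/1 labels; group gives each
    individual's (hypothetical) protected-attribute value.
    """
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
    for yt, yp, g in zip(y_true, y_pred, group):
        c = counts[g]
        if yt == 1:
            c["pos"] += 1
            if yp == 0:
                c["fn"] += 1  # actual positive predicted negative
        else:
            c["neg"] += 1
            if yp == 1:
                c["fp"] += 1  # actual negative predicted positive
    rates = {}
    for g, c in counts.items():
        # A rate is undefined (NaN) if the group has no cases of that class.
        fpr = c["fp"] / c["neg"] if c["neg"] else float("nan")
        fnr = c["fn"] / c["pos"] if c["pos"] else float("nan")
        rates[g] = (fpr, fnr)
    return rates


# Toy data, invented for this example only.
y_true = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [0, 1, 1, 0, 0, 0, 1, 1]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = error_rates_by_group(y_true, y_pred, group)
print(rates)
```

Classification parity in this sense would require the two groups' rate pairs to match; in the toy data they do not, since group "a" has one false positive and one false negative while group "b" has none.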
Word embeddings quantify 100 years of gender and ethnic stereotypes | PNAS
"Word embeddings are a powerful machine-learning framework that represents each English word by a vector. The geometric relationship between these vectors captures meaningful semantic relationships between the corresponding words. In this paper, we develop a framework to demonstrate how the temporal dynamics of the embedding helps to quantify changes in stereotypes and attitudes toward women and ethnic minorities in the 20th and 21st centuries in the United States. We integrate word embeddings trained on 100 y of text data with the US Census to show that changes in the embedding track closely with demographic and occupation shifts over time. The embedding captures societal shifts—e.g., the women’s movement in the 1960s and Asian immigration into the United States—and also illuminates how specific adjectives and occupations became more closely associated with certain populations over time. Our framework for temporal analysis of word embedding opens up a fruitful intersection between machine learning and quantitative social science."
to:NB  text_mining  sociology  sexism  racism  history_of_ideas  time_series  to_teach:data-mining  to_teach:data_over_space_and_time
april 2018 by cshalizi
This is how Cambridge Analytica’s Facebook targeting model really worked — according to the person who built it » Nieman Journalism Lab
If this is accurate, they weren't even using the Big Five as a bottleneck for their predictions...

Evil thought: would this help to motivate the students the next time I teach principal components & factor analysis?
data_mining  us_politics  networked_life  factor_analysis  statistics  to_teach:data-mining  cambridge_analytica  facebook
april 2018 by cshalizi
[1802.03426] UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
"UMAP (Uniform Manifold Approximation and Projection) is a novel manifold learning technique for dimension reduction. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology. The result is a practical scalable algorithm that applies to real world data. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance. Furthermore, UMAP as described has no computational restrictions on embedding dimension, making it viable as a general purpose dimension reduction technique for machine learning"
to:NB  via:vaguery  manifold_learning  dimension_reduction  data_analysis  data_mining  to_teach:data-mining  re:ADAfaEPoV
march 2018 by cshalizi
Pretty good, reading "machine learning", or even "statistical modeling", for "artificial intelligence" throughout (as he more or less admits up front). Worth teaching in particular for the black-faces-as-gorillas disaster.
machine_learning  data_mining  to_teach:data-mining
february 2018 by cshalizi
Artificial Unintelligence: How Computers Misunderstand the World | The MIT Press
"In Artificial Unintelligence, Meredith Broussard argues that our collective enthusiasm for applying computer technology to every aspect of life has resulted in a tremendous amount of poorly designed systems. We are so eager to do everything digitally—hiring, driving, paying bills, even choosing romantic partners—that we have stopped demanding that our technology actually work. Broussard, a software developer and journalist, reminds us that there are fundamental limits to what we can (and should) do with technology. With this book, she offers a guide to understanding the inner workings and outer limits of technology—and issues a warning that we should never assume that computers always get things right.
"Making a case against technochauvinism—the belief that technology is always the solution—Broussard argues that it’s just not true that social problems would inevitably retreat before a digitally enabled Utopia. To prove her point, she undertakes a series of adventures in computer programming. She goes for an alarming ride in a driverless car, concluding “the cyborg future is not coming any time soon”; uses artificial intelligence to investigate why students can’t pass standardized tests; deploys machine learning to predict which passengers survived the Titanic disaster; and attempts to repair the U.S. campaign finance system by building AI software. If we understand the limits of what we can do with technology, Broussard tells us, we can make better choices about what we should do with it to make the world better for everyone."
in_NB  books:noted  machine_learning  artificial_intelligence  computers  data_analysis  to_teach:data-mining  from_library
december 2017 by cshalizi
Machine Learning for Data Streams | The MIT Press
"Today many information sources—including sensor networks, financial markets, social networks, and healthcare monitoring—are so-called data streams, arriving sequentially and at high speed. Analysis must take place in real time, with partial data and without the capacity to store the entire data set. This book presents algorithms and techniques used in data stream mining and real-time analytics. Taking a hands-on approach, the book demonstrates the techniques using MOA (Massive Online Analysis), a popular, freely available open-source software framework, allowing readers to try out the techniques after reading the explanations.
"The book first offers a brief introduction to the topic, covering big data mining, basic methodologies for mining data streams, and a simple example of MOA. More detailed discussions follow, with chapters on sketching techniques, change, classification, ensemble methods, regression, clustering, and frequent pattern mining. Most of these chapters include exercises, an MOA-based lab session, or both. Finally, the book discusses the MOA software, covering the MOA graphical user interface, the command line, use of its API, and the development of new methods within MOA. The book will be an essential reference for readers who want to use data stream mining as a tool, researchers in innovation or data stream mining, and programmers who want to create new algorithms for MOA."
to:NB  books:noted  data_mining  streaming_algorithms  to_teach:data-mining
december 2017 by cshalizi
Big Data Surveillance: The Case of Policing - American Sociological Review - Sarah Brayne, 2017
"This article examines the intersection of two structural developments: the growth of surveillance and the rise of “big data.” Drawing on observations and interviews conducted within the Los Angeles Police Department, I offer an empirical account of how the adoption of big data analytics does—and does not—transform police surveillance practices. I argue that the adoption of big data analytics facilitates amplifications of prior surveillance practices and fundamental transformations in surveillance activities. First, discretionary assessments of risk are supplemented and quantified using risk scores. Second, data are used for predictive, rather than reactive or explanatory, purposes. Third, the proliferation of automatic alert systems makes it possible to systematically surveil an unprecedentedly large number of people. Fourth, the threshold for inclusion in law enforcement databases is lower, now including individuals who have not had direct police contact. Fifth, previously separate data systems are merged, facilitating the spread of surveillance into a wide range of institutions. Based on these findings, I develop a theoretical model of big data surveillance that can be applied to institutional domains beyond the criminal justice system. Finally, I highlight the social consequences of big data surveillance for law and social inequality."
to:NB  police  surveillance  data_mining  to_read  to_teach:data-mining  via:kjhealy
september 2017 by cshalizi
Seeing Like a Market
"What do markets see when they look at people? Information dragnets increasingly yield huge quantities of individual-level data, which are analyzed to sort and slot people into categories of taste, riskiness or worth. These tools deepen the reach of the market and define new strategies of profit-making. We present a new theoretical framework for understanding their development. We argue that a) modern organizations follow an institutional data imperative to collect as much data as possible; b) as a result of the analysis and use of this data, individuals accrue a form of capital flowing from their positions as measured by various digital scoring and ranking methods; and c) the facticity of these scoring methods makes them organizational devices with potentially stratifying effects. They offer firms new opportunities to structure and price offerings to consumers. For individuals, they create classification situations that identify shared life-chances in product and service markets. We discuss the implications of these processes and argue that they tend toward a new economy of moral judgment, where outcomes are experienced as morally deserved positions based on prior good actions and good tastes, as measured and classified by this new infrastructure of data collection and analysis."
to:NB  economics  sociology  credit_ratings  re:g_paper  healy.kieran  to_teach:data-mining
august 2017 by cshalizi
[1707.00044] Learning Fair Classifiers: A Regularization-Inspired Approach
"We present a regularization-inspired approach for reducing bias in learned classifiers. In particular, we focus on binary classification tasks over individuals from two populations, where, as our criterion for fairness, we wish to achieve similar false positive rates in both populations, and similar false negative rates in both populations. As a proof of concept, we implement our approach and empirically evaluate its ability to achieve both fairness and accuracy, using the COMPAS scores data for prediction of recidivism."

--- Last tag is tentative, until I've read this.
july 2017 by cshalizi
Data Firm Says ‘Secret Sauce’ Aided Trump; Many Scoff - The New York Times
What's notable about this piece is that it provides almost no information about whether their much-touted product _works_. Everything is about whether sources liked it, or used it, or thought it was effective; not who's right.
deceiving_us_has_become_an_industrial_process  data_mining  networked_life  psychometrics  factor_analysis  to_teach:data-mining
march 2017 by cshalizi
The Humans Working Behind the AI Curtain
This is _not_ what we mean when we talk about using computers to expand human capacities.
artificial_intelligence  machine_learning  networked_life  to_teach:data-mining
january 2017 by cshalizi
Big Data: Does Size Matter?: Timandra Harkness: Bloomsbury Sigma
"What is Big Data, and why should you care?
"Big data knows where you've been and who your friends are. It knows what you like and what makes you angry. It can predict what you'll buy, where you'll be the victim of crime and when you'll have a heart attack. Big data knows you better than you know yourself, or so it claims.
"But how well do you know big data?
"You've probably seen the phrase in newspaper headlines, at work in a marketing meeting, or on a fitness-tracking gadget. But can you understand it without being a Silicon Valley nerd who writes computer programs for fun?
"Yes. Yes, you can.
"Timandra Harkness writes comedy, not computer code. The only programmes she makes are on the radio. If you can read a newspaper you can read this book.
"Starting with the basics – what IS data? And what makes it big? – Timandra takes you on a whirlwind tour of how people are using big data today: from science to smart cities, business to politics, self-quantification to the Internet of Things.
"Finally, she asks the big questions about where it's taking us; is it too big for its boots, or does it think too small? Are you a data point or a human being? Will this book be full of rhetorical questions?
"No. It also contains puns, asides, unlikely stories and engaging people, inspiring feats and thought-provoking dilemmas. Leaving you armed and ready to decide what you think about one of the decade's big ideas: big data."

--- As usual, the last tag is tentative.
to:NB  books:noted  data_mining  popular_social_science  via:?  to_teach:data-mining
december 2016 by cshalizi
[0906.4391] KNIFE: Kernel Iterative Feature Extraction
"Selecting important features in non-linear or kernel spaces is a difficult challenge in both classification and regression problems. When many of the features are irrelevant, kernel methods such as the support vector machine and kernel ridge regression can sometimes perform poorly. We propose weighting the features within a kernel with a sparse set of weights that are estimated in conjunction with the original classification or regression problem. The iterative algorithm, KNIFE, alternates between finding the coefficients of the original problem and finding the feature weights through kernel linearization. In addition, a slight modification of KNIFE yields an efficient algorithm for finding feature regularization paths, or the paths of each feature's weight. Simulation results demonstrate the utility of KNIFE for both kernel regression and support vector machines with a variety of kernels. Feature path realizations also reveal important non-linear correlations among features that prove useful in determining a subset of significant variables. Results on vowel recognition data, Parkinson's disease data, and microarray data are also given."
in_NB  statistics  regression  variable_selection  data_mining  to_teach:data-mining  kernel_methods  heard_the_talk
november 2016 by cshalizi
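The idea in the KNIFE abstract of "weighting the features within a kernel" can be illustrated with a feature-weighted Gaussian kernel. This is only a sketch of the weighted kernel, not the KNIFE algorithm itself: the weights are fixed by hand rather than estimated, and the function name, the `gamma` parameter, and the exact placement of the weights inside the squared distance are assumptions made for illustration.

```python
import math


def weighted_rbf_kernel(x, z, weights, gamma=1.0):
    """Gaussian kernel with a non-negative weight on each feature.

    A weight of zero removes that feature from the kernel entirely,
    which is how a sparse weight vector acts as feature selection.
    """
    sq_dist = sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, z))
    return math.exp(-gamma * sq_dist)


x = [0.0, 0.0]
z = [1.0, 5.0]

# Both features count: the large gap in the second feature dominates.
k_full = weighted_rbf_kernel(x, z, weights=[1.0, 1.0])

# Second feature switched off: only the first feature's gap matters.
k_sparse = weighted_rbf_kernel(x, z, weights=[1.0, 0.0])
```

With the second feature's weight set to zero, the two points look much more similar to the kernel, so an irrelevant feature no longer drags down the fit.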
Social Network Analysis in Predictive Policing | Mohammad A. Tayebi | Springer
"This book focuses on applications of social network analysis in predictive policing. Data science is used to identify potential criminal activity by analyzing the relationships between offenders to fully understand criminal collaboration patterns. Co-offending networks (networks of offenders who have committed crimes together) have long been recognized by law enforcement and intelligence agencies as a major factor in the design of crime prevention and intervention strategies. Despite the importance of co-offending network analysis for public safety, computational methods for analyzing large-scale criminal networks are rather premature. This book extensively and systematically studies co-offending network analysis as an effective tool for predictive policing. The formal representation of criminological concepts presented here allows computer scientists to think about algorithmic and computational solutions to problems long discussed in the criminology literature. For each of the studied problems, we start with well-founded concepts and theories in criminology, then propose a computational method and finally provide a thorough experimental evaluation, along with a discussion of the results. In this way, the reader will be able to study the complete process of solving real-world multidisciplinary problems."
to:NB  books:noted  social_networks  network_data_analysis  surveillance  intelligence_(spying)  national_surveillance_state  data_mining  to_teach:data-mining  to_teach:baby-nets  to_be_shot_after_a_fair_trial
november 2016 by cshalizi
Mastering Feature Engineering - O'Reilly Media
"Feature engineering is essential to applied machine learning, but using domain knowledge to strengthen your predictive models can be difficult and expensive. To help fill the information gap on feature engineering, this complete hands-on guide teaches beginning-to-intermediate data scientists how to work with this widely practiced but little discussed topic.
"Author Alice Zheng explains common practices and mathematical principles to help engineer features for new data and tasks. If you understand basic machine learning concepts like supervised and unsupervised learning, you’re ready to get started. Not only will you learn how to implement feature engineering in a systematic and principled way, you’ll also learn how to practice better data science."
to:NB  books:noted  data_mining  statistics  data_analysis  to_teach:data-mining  kith_and_kin  zheng.alice
june 2016 by cshalizi
Surfeit and surface | Big Data & Society
This is awesome. (But it's also completely compatible with causal inference!) Also, the cultural references will probably require footnotes in just 10 years.