Data Love - The Seduction and Betrayal of Digital Technologies | Columbia University Press
"Intelligence services, government administrations, businesses, and a growing majority of the population are hooked on the idea that big data can reveal patterns and correlations in everyday life. Initiated by software engineers and carried out through algorithms, the mining of big data has sparked a silent revolution. But algorithmic analysis and data mining are not simply byproducts of media development or the logical consequences of computation. They are the radicalization of the Enlightenment's quest for knowledge and progress. Data Love argues that the "cold civil war" of big data is taking place not among citizens or between the citizen and government but within each of us. Roberto Simanowski elaborates on the changes data love has brought to the human condition while exploring the entanglements of those who—out of stinginess, convenience, ignorance, narcissism, or passion—contribute to the amassing of ever more data about their lives, leading to the statistical evaluation and individual profiling of their selves. Writing from a philosophical standpoint, Simanowski illustrates the social implications of technological development and retrieves the concepts, events, and cultural artifacts of past centuries to help decode the programming of our present."
Home - OpenMinTeD
"OpenMinted sets out to create an open, service-oriented ep-Infrastructure for Text and Data Mining (TDM) of scientific and scholarly content. Researchers can collaboratively create, discover, share and re-use Knowledge from a wide range of text-based scientific related sources in a seamless way."
AMiner - Open Science Platform
"AMiner (aminer.org) aims to provide comprehensive search and mining services for researcher social networks. In this system, we focus on: (1) creating a semantic-based profile for each researcher by extracting information from the distributed Web; (2) integrating academic data (e.g., the bibliographic data and the researcher profiles) from multiple sources; (3) accurately searching the heterogeneous network; (4) analyzing and discovering interesting patterns from the built researcher social network."
Post, Mine, Repeat - Social Media Data Mining | Helen Kennedy | Palgrave Macmillan
"In this book, Helen Kennedy argues that as social media data mining becomes more and more ordinary, as we post, mine and repeat, new data relations emerge. These new data relations are characterised by a widespread desire for numbers and the troubling consequences of this desire, and also by the possibility of doing good with data and resisting data power, by new and old concerns, and by instability and contradiction. Drawing on action research with public sector organisations, interviews with commercial social insights companies and their clients, focus groups with social media users and other research, Kennedy provides a fascinating and detailed account of living with social media data mining inside the organisations that make up the fabric of everyday life."
rOpenSci - Open Tools for Open Science
"At rOpenSci we are creating packages that allow access to data repositories through the R statistical programming environment that is already a familiar part of the workflow of many scientists. Our tools not only facilitate drawing data into an environment where it can readily be manipulated, but also one in which those analyses and methods can be easily shared, replicated, and extended by other researchers. We develop open source R packages that provide programmatic access to a variety of scientific data, full-text of journal articles, and repositories that provide real-time metrics of scholarly impact. Visit our packages section for a full list of production and development versions of packages."
Unique in the Crowd: The privacy bounds of human mobility : Scientific Reports : Nature Publishing Group
"We study fifteen months of human mobility data for one and a half million individuals and find that human mobility traces are highly unique. In fact, in a dataset where the location of an individual is specified hourly, and with a spatial resolution equal to that given by the carrier's antennas, four spatio-temporal points are enough to uniquely identify 95% of the individuals. We coarsen the data spatially and temporally to find a formula for the uniqueness of human mobility traces given their resolution and the available outside information. This formula shows that the uniqueness of mobility traces decays approximately as the 1/10 power of their resolution. Hence, even coarse datasets provide little anonymity. These findings represent fundamental constraints to an individual's privacy and have important implications for the design of frameworks and institutions dedicated to protect the privacy of individuals."
Big Data
"Big Data, a highly innovative, open access peer-reviewed journal, provides a unique forum for world-class research exploring the challenges and opportunities in collecting, analyzing, and disseminating vast amounts of data, including data science, big data infrastructure and analytics, and pervasive computing."
The Database of Intentions | John Battelle's Search Blog
"The Database of Intentions is simply this: The aggregate results of every search ever entered, every result list ever tendered, and every path taken as a result. It lives in many places, but three or four places in particular hold a massive amount of this data (ie MSN, Google, and Yahoo). This information represents, in aggregate form, a place holder for the intentions of humankind – a massive database of desires, needs, wants, and likes that can be discovered, supoenaed, archived, tracked, and exploited to all sorts of ends. Such a beast has never before existed in the history of culture, but is almost guaranteed to grow exponentially from this day forward. This artifact can tell us extraordinary things about who we are and what we want as a culture. And it has the potential to be abused in equally extraordinary fashion. "
Big data is our generation’s civil rights issue, and we don’t know it - O'Reilly Radar
"Data doesn’t invade people’s lives. Lack of control over how it’s used does.

What’s really driving so-called big data isn’t the volume of information. It turns out big data doesn’t have to be all that big. Rather, it’s about a reconsideration of the fundamental economics of analyzing data."
"Livehoods offer a new way to conceptualize the dynamics, structure, and character of a city by analyzing the social media its residents generate. By looking at people's checkin patterns at places across the city, we create a mapping of the different dynamic areas that comprise it. Each Livehood tells a different story of the people and places that shape it. "
PAKDD 2012
The 16th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) is pleased to organize a data mining competition.
Why DH has no future. | The Stone and the Shell
Let me just say that any area of scholarship where, in 20-fucking-12, the idea of moving to open-access, online distribution of writing counts as some kind of radicalism deserves everything that's going to happen to it.
Peak Attention and the Colonization of Subcultures
"The question of how such coded language emerges, spreads and evolves is a big one. I am interested in a very specific question: how do members of an emerging subculture recognize each other in public, especially on the Internet, using more specialized coded language?

The question is interesting because the Web is making traditional subcultures — historically illegible to governance mechanisms, and therefore hotbeds of subversion — increasingly visible and open to cheap, large-scale economic and political exploitation. This exploitation takes the form of attention mining, and is the end-game on the path to what I called Peak Attention a while back.

Does this mean the subversive potential of the Internet is an illusion, and that it will ultimately be domesticated? Possibly." Annotated link http://www.diigo.com/bookmark/http://www.ribbonfarm.com/2012/01/27/peak-attention-and-the-colonization-of-subcultures
The privacy arc - O'Reilly Radar
Mike Loukides argues that privacy worries are the result of persisting attitudes from the 1950s atomization of modern society.
An ethical bargain - O'Reilly Radar
"Okay ... Let me just ask this: If you are involved in data capture, analytics, or customer marketing in your company, would you be embarrassed to admit to your neighbor what about them you capture, store and analyze? Would you be willing to send them a zip file with all of it to let them see it? If the answer is "no," why not? If I might hazard a guess at the answer, it would be because real relationships aren't built on asymmetry, and you know that. But rather than eliminate that awkward source of asymmetry, you hide it."
[1103.6038] Searching for comets on the World Wide Web: The orbit of 17P/Holmes from the behavior of photographers
"We performed an image search on Yahoo for "Comet Holmes" on 2010 April 1. Thousands of images were returned. We astrometrically calibrated---and therefore vetted---the images using the Astrometry.net system. The calibrated image pointings form a set of data points to which we can fit a test-particle orbit in the Solar System, marginalizing out image dates and catching outliers. The approach is Bayesian and the model is, in essence, a model of how comet astrophotographers point their instruments. We find very strong probabilistic constraints on the orbit, although slightly off the JPL ephemeris, probably because of limitations of the astronomer model. Hyper-parameters of the model constrain the reliability of date meta-data and where in the image astrophotographers place the comet
Astronomers Calculate Comet's Orbit Using Amateur Images From The Web - Technology Review
"This sudden brightening triggered a huge wave of interest from astrophotographers all over the world, many of whom posted their images on the web. To find out how many, Dustin Lang from Princeton University in New Jersey and David Hogg at the Max-Planck-Institut fur Astronomie in Heidelberg, Germany, searched the web. They found 2476 different shots of Holmes.

That's a significant astronomical database that represents a huge amount of work. But is it any use?

Today, Lang and Hogg use these images to work out an accurate orbit of Comet 17P/Holmes, a significant achievement given that the data is taken from an ordinary web search and its provenance is entirely unknown."
BioCaster Global Health Monitor
Based on a combination of text mining algorithms, BioCaster aims to provide an early warning monitoring station for epidemic and environmental diseases (human, animal and plant). It does this by aggregating online news reports, processing them automatically using human language technology and trying to spot unusual trends. For example, the trend spotting algorithm we use on the top page is CDC's Early Aberration Reporting System (EARS) C2 algorithm. Being able to spot unusual health events still requires skilled human analysts for risk assessment and verification. Automated methods like BioCaster try to make human tasks easier by providing intelligently filtered news.

BioCaster started in 2006 and provides a demonstration portal for public health workers, clinicians and researchers. The portal is currently under development at the National Institute of Informatics, Japan.
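A rough sketch of a C2-style aberration check, as the method is commonly described: today's count is compared with the mean and standard deviation of a short baseline window that ends a couple of days earlier. The 7-day baseline, 2-day lag, and threshold of 3 are assumptions here, not values taken from the BioCaster description, and the function names are hypothetical.

```python
from statistics import mean, stdev


def c2_statistic(counts, t, baseline=7, lag=2):
    """Standardized excess of counts[t] over a lagged baseline window."""
    start = t - lag - baseline
    if start < 0:
        raise ValueError("not enough history for the baseline window")
    # Baseline days end `lag` days before day t, so a slowly building
    # outbreak does not inflate its own baseline.
    window = counts[start:t - lag]
    mu = mean(window)
    sigma = stdev(window)
    if sigma == 0:
        sigma = 1.0  # assumption: avoid division by zero on a flat baseline
    return (counts[t] - mu) / sigma


def c2_alerts(counts, threshold=3.0, baseline=7, lag=2):
    """Return the day indices whose C2-style score exceeds the threshold."""
    first = baseline + lag
    return [
        t
        for t in range(first, len(counts))
        if c2_statistic(counts, t, baseline, lag) > threshold
    ]


if __name__ == "__main__":
    # Hypothetical daily counts of news reports mentioning one disease.
    daily_reports = [5, 6, 5, 4, 6, 5, 5, 5, 5, 30]
    print(c2_alerts(daily_reports))
```

A flagged day is only a candidate signal; as the description notes, skilled human analysts are still needed for risk assessment and verification.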
Statistical Data Mining Tutorials
The following links point to a set of tutorials on many aspects of statistical data mining, including the foundations of probability, the foundations of statistical data analysis, and most of the classic machine learning and data mining algorithms.
The Fourth Paradigm: Data-Intensive Scientific Discovery - Microsoft Research
Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets. The speed at which any given scientific discipline advances will depend on how well its researchers collaborate with one another, and with technologists, in areas of eScience such as databases, workflow management, visualization, and cloud computing technologies. In The Fourth Paradigm: Data-Intensive Scientific Discovery, the collection of essays expands on the vision of pioneering computer scientist Jim Gray for a new, fourth paradigm of discovery based on data-intensive science and offers insights into how it can be fully realized.
Personas | Metropath(ologies) | An installation by Aaron Zinman
Enter your name, and Personas scours the web for information and attempts to characterize the person - to fit them to a predetermined set of categories that an algorithmic process created from a massive corpus of data.
SDA: Survey Documentation
SDA is a set of programs for the documentation and Web-based analysis of survey data. There are also procedures for creating customized subsets of datasets. This set of programs is developed and maintained by the Computer-assisted Survey Methods Program (CSM) at the University of California, Berkeley.
Detecting influenza epidemics using search engine query data : Article : Nature
One way to improve early detection is to monitor health-seeking behaviour in the form of queries to online search engines, which are submitted by millions of users around the world each day. Here we present a method of analysing large numbers of Google search queries to track influenza-like illness in a population. Because the relative frequency of certain queries is highly correlated with the percentage of physician visits in which a patient presents with influenza-like symptoms, we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day. This approach may make it possible to use search queries to detect influenza epidemics in areas with a large population of web search users.
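A small sketch of the idea in the abstract: if the fraction of flu-related queries tracks the fraction of physician visits for influenza-like illness (ILI), a simple fitted relationship can turn one into an estimate of the other. The straight-line fit on log-odds, the function names, and the numbers are assumptions for illustration; the abstract itself only states that the two quantities are highly correlated.

```python
import math


def logit(p):
    """Log-odds of a fraction strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Inverse of logit: map log-odds back to a fraction."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def fit_ili_model(query_fractions, ili_fractions):
    """Fit logit(ILI fraction) against logit(query fraction)."""
    xs = [logit(q) for q in query_fractions]
    ys = [logit(p) for p in ili_fractions]
    return fit_line(xs, ys)


def estimate_ili(query_fraction, intercept, slope):
    """Estimate the ILI visit fraction from a query fraction."""
    return inv_logit(intercept + slope * logit(query_fraction))


if __name__ == "__main__":
    # Made-up weekly values, for illustration only.
    queries = [0.010, 0.015, 0.022, 0.030]
    visits = [0.012, 0.019, 0.027, 0.038]
    b0, b1 = fit_ili_model(queries, visits)
    print(estimate_ili(0.025, b0, b1))
```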
Public Data Sets on Amazon Web Services (AWS)
Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. AWS is hosting the public data sets at no charge for the community, and like all AWS services, users pay only for the compute and storage they use for their own applications.
Microsoft Research DataDepot - Home
Welcome to DataDepot, a site that lets you track, analyze, and share trend lines.
