jm + big-data   54

Why is this company tracking where you are on Thanksgiving?
Creepy:
To do this, they tapped a company called SafeGraph that provided them with 17 trillion location markers for 10 million smartphones.
The data wasn’t just staggering in sheer quantity. It also appears to be extremely granular. Researchers “used this data to identify individuals' home locations, which they defined as the places people were most often located between the hours of 1 and 4 a.m.,” wrote The Washington Post. [....]
This means SafeGraph is looking at an individual device and tracking where its owner is going throughout their day. A common defense from companies that creepily collect massive amounts of data is that the data is only analyzed in aggregate; for example, Google’s database BigQuery, which allows organizations to upload big data sets and then query them quickly, promises that all its public data sets are “fully anonymized” and “contain no personally-identifying information.” In multiple press releases from SafeGraph’s partners, the company’s location data is referred to as “anonymized,” but in this case they seem to be interpreting the concept of anonymity quite liberally given the specificity of the data.
Most people probably don’t realize that their Thanksgiving habits could end up being scrutinized by strangers.
It’s unclear if users realize that their data is being used this way, but all signs point to no. (SafeGraph and the researchers did not immediately respond to questions.) SafeGraph gets location data “from numerous smartphone apps,” according to the researchers.
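For a sense of how little "anonymized" means once per-device pings are this granular, here is a minimal pandas sketch of the home-inference heuristic the researchers describe (the place a device is seen most often between 1 and 4 a.m.). The column names and input schema are illustrative assumptions, not SafeGraph's actual data layout.

```python
import pandas as pd

def infer_home_locations(pings: pd.DataFrame) -> pd.DataFrame:
    """Guess a 'home' cell per device: most frequent location between 1-4 a.m.

    Assumes columns: device_id, timestamp (datetime), lat, lon -- a
    hypothetical schema for illustration only.
    """
    df = pings.copy()
    df["hour"] = df["timestamp"].dt.hour
    night = df[(df["hour"] >= 1) & (df["hour"] < 4)].copy()

    # Bucket coordinates to ~100m cells so repeated visits group together.
    night["cell"] = list(zip(night["lat"].round(3), night["lon"].round(3)))

    # For each device, take the night-time cell it was seen in most often.
    home = (night.groupby(["device_id", "cell"]).size()
                 .rename("pings")
                 .reset_index()
                 .sort_values("pings", ascending=False)
                 .drop_duplicates("device_id"))
    return home[["device_id", "cell", "pings"]]
```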
safegraph  apps  mobile  location  tracking  surveillance  android  iphone  ios  smartphones  big-data 
4 weeks ago by jm
Everybody lies: how Google search reveals our darkest secrets | Technology | The Guardian
What can we learn about ourselves from the things we ask online? US data scientist Seth Stephens‑Davidowitz analysed anonymous Google search results, uncovering disturbing truths about [America's] desires, beliefs and prejudices


Fascinating. I find it equally interesting how flawed the existing methodologies for polling and surveying are compared to Google's data, according to this.
science  big-data  google  lying  surveys  polling  secrets  data-science  america  racism  searching 
july 2017 by jm
'Mathwashing,' Facebook and the zeitgeist of data worship
Fred Benenson: Mathwashing can be thought of as using math terms (algorithm, model, etc.) to paper over a more subjective reality. For example, a lot of people believed Facebook was using an unbiased algorithm to determine its trending topics, even if Facebook had previously admitted that humans were involved in the process.
maths  math  mathwashing  data  big-data  algorithms  machine-learning  bias  facebook  fred-benenson 
april 2017 by jm
Artificial intelligence is ripe for abuse, tech researcher warns: 'a fascist's dream' | Technology | The Guardian
“We should always be suspicious when machine learning systems are described as free from bias if it’s been trained on human-generated data,” Crawford said. “Our biases are built into that training data.”

In the Chinese research it turned out that the faces of criminals were more unusual than those of law-abiding citizens. “People who had dissimilar faces were more likely to be seen as untrustworthy by police and judges. That’s encoding bias,” Crawford said. “This would be a terrifying system for an autocrat to get his hand on.” [...]

With AI this type of discrimination can be masked in a black box of algorithms, as appears to be the case with a company called Faceception, for instance, a firm that promises to profile people’s personalities based on their faces. In its own marketing material, the company suggests that Middle Eastern-looking people with beards are “terrorists”, while white looking women with trendy haircuts are “brand promoters”.
bias  ai  racism  politics  big-data  technology  fascism  crime  algorithms  faceception  discrimination  computer-says-no 
march 2017 by jm
The Rise of the Data Engineer
Interesting article proposing a new discipline, focused on the data warehouse, from Maxime Beauchemin (creator and main committer on Apache Airflow and Airbnb’s Superset)
data-engineering  engineering  coding  data  big-data  airbnb  maxime-beauchemin  data-warehouse 
january 2017 by jm
The Fall of BIG DATA – arg min blog
Strongly agreed with this -- particularly the second of the three major failures, specifically:
Our community has developed remarkably effective tools to microtarget advertisements. But if you use ad models to deliver news, that’s propaganda. And just because we didn’t intend to spread rampant misinformation doesn’t mean we are not responsible.
big-data  analytics  data-science  statistics  us-politics  trump  data  science  propaganda  facebook  silicon-valley 
november 2016 by jm
LinkedIn called me a white supremacist
Wow. Massive, massive algorithm fail.
On the morning of May 12, LinkedIn, the networking site devoted to making professionals “more productive and successful,” emailed scores of my contacts and told them I’m a professional racist. It was one of those updates that LinkedIn regularly sends its users, algorithmically assembled missives about their connections’ appearances in the media. This one had the innocent-sounding subject, “News About William Johnson,” but once my connections clicked in, they saw a small photo of my grinning face, right above the headline “Trump put white nationalist on list of delegates.” [.....] It turns out that when LinkedIn sends these update emails, people actually read them. So I was getting upset. Not only am I not a Nazi, I’m a Jewish socialist with family members who were imprisoned in concentration camps during World War II. Why was LinkedIn trolling me?
ethics  fail  algorithm  linkedin  big-data  racism  libel 
may 2016 by jm
_DataEngConf: Parquet at Datadog_
"How we use Parquet for tons of metrics data". good preso from Datadog on their S3/Parquet setup
datadog  parquet  storage  s3  databases  hadoop  map-reduce  big-data 
may 2016 by jm
Submitting User Applications with spark-submit - AWS Big Data Blog
looks reasonably usable, although EMR's crappy UI is still an issue
emr  big-data  spark  hadoop  yarn  map-reduce  batch 
february 2016 by jm
Analysing user behaviour - from histograms to random forests (PyData) at PyCon Ireland 2015 | Lanyrd
Swrve's own Dave Brodigan on game user-data analysis techniques:
The goal is to give the audience a roadmap for analysing user data using python friendly tools.

I will touch on many aspects of the data science pipeline from data cleansing to building predictive data products at scale.

I will start gently with pandas and dataframes, then discuss some machine learning techniques like k-means and random forests in scikit-learn, and then introduce Spark for doing it at scale.

I will focus more on the use cases rather than detailed implementation.

The talk will be informed by my experience and focus on user behaviour in games and mobile apps.
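A minimal sketch of the kind of pipeline the abstract describes: roll raw events up into per-user features with pandas, then cluster users into behavioural segments with k-means in scikit-learn. The column names, input file, and clustering choice are illustrative assumptions, not Swrve's actual pipeline.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-event log: user_id, session_length_s, purchases, level_reached
events = pd.read_csv("game_events.csv")  # assumed input file

# Roll events up into one feature row per user.
features = events.groupby("user_id").agg(
    sessions=("session_length_s", "count"),
    avg_session=("session_length_s", "mean"),
    total_spend=("purchases", "sum"),
    max_level=("level_reached", "max"),
)

# Standardise, then cluster users into behavioural segments.
X = StandardScaler().fit_transform(features)
features["segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(features.groupby("segment").mean())
```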
swrve  talks  user-data  big-data  spark  hadoop  machine-learning  data-science 
october 2015 by jm
Sorting out graph processing
Some nice real-world experimentation around large-scale data processing in differential dataflow:
If you wanted to do an iterative graph computation like PageRank, it would literally be faster to sort the edges from scratch each and every iteration, than to use unsorted edges. If you want to do graph computation, please sort your edges.

Actually, you know what: if you want to do any big data computation, please sort your records. Stop talking sass about how Hadoop sorts things it doesn't need to, read some papers, run some tests, and then sort your damned data. Or at least run faster than me when I sort your data for you.
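To make the "sort your edges" point concrete, here's a toy sketch (assumptions: an in-memory edge list of (src, dst) pairs, uniform teleport, a handful of iterations) of a PageRank-style iteration over edges sorted by source, which turns scattered random access into a sequential scan:

```python
from collections import defaultdict

def pagerank(edges, n_nodes, iters=20, d=0.85):
    """Toy PageRank over an edge list; sorting by source keeps each
    iteration a sequential, cache-friendly scan (the post's point)."""
    edges = sorted(edges)                      # sort once, by (src, dst)
    out_deg = defaultdict(int)
    for src, _ in edges:
        out_deg[src] += 1

    rank = [1.0 / n_nodes] * n_nodes
    for _ in range(iters):
        new = [(1.0 - d) / n_nodes] * n_nodes
        for src, dst in edges:                 # sequential scan of sorted edges
            new[dst] += d * rank[src] / out_deg[src]
        rank = new
    return rank

print(pagerank([(0, 1), (1, 2), (2, 0), (2, 1)], n_nodes=3))
```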
algorithms  graphs  coding  data-processing  big-data  differential-dataflow  radix-sort  sorting  x-stream  counting-sort  pagerank 
august 2015 by jm
The world beyond batch: Streaming 101 - O'Reilly Media
To summarize, in this post I’ve:

Clarified terminology, specifically narrowing the definition of “streaming” to apply to execution engines only, while using more descriptive terms like unbounded data and approximate/speculative results for distinct concepts often categorized under the “streaming” umbrella.

Assessed the relative capabilities of well-designed batch and streaming systems, positing that streaming is in fact a strict superset of batch, and that notions like the Lambda Architecture, which are predicated on streaming being inferior to batch, are destined for retirement as streaming systems mature.

Proposed two high-level concepts necessary for streaming systems to both catch up to and ultimately surpass batch, those being correctness and tools for reasoning about time, respectively.

Established the important differences between event time and processing time, characterized the difficulties those differences impose when analyzing data in the context of when they occurred, and proposed a shift in approach away from notions of completeness and toward simply adapting to changes in data over time.

Looked at the major data processing approaches in common use today for bounded and unbounded data, via both batch and streaming engines, roughly categorizing the unbounded approaches into: time-agnostic, approximation, windowing by processing time, and windowing by event time.
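A small sketch of the event-time vs processing-time distinction summarized above: the same records windowed two ways, with one late arrival. The record layout and window size are assumptions for illustration.

```python
from collections import defaultdict

# (event_time, processing_time, value) -- late data arrives out of order.
records = [
    (0, 1, "a"), (2, 3, "b"), (7, 8, "c"),
    (3, 12, "d"),   # happened early, arrived late
]
WINDOW = 5  # seconds

def window_by(records, key_index):
    windows = defaultdict(list)
    for rec in records:
        windows[rec[key_index] // WINDOW].append(rec[2])
    return dict(windows)

print("processing-time windows:", window_by(records, 1))
# {0: ['a', 'b'], 1: ['c'], 2: ['d']}  -- 'd' lands in the wrong bucket
print("event-time windows:     ", window_by(records, 0))
# {0: ['a', 'b', 'd'], 1: ['c']}       -- 'd' is attributed to when it occurred
```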
streaming  batch  big-data  lambda-architecture  dataflow  event-processing  cep  millwheel  data  data-processing 
august 2015 by jm
"last seen" sketch
a new sketch algorithm from Baron Schwartz and Preetam Jinka of VividCortex; similar to Count-Min but with last-seen timestamp instead of frequency.
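A minimal Python sketch of the idea as I read it (not VividCortex's actual code): a Count-Min-shaped table of timestamps, where updates take the max in each hashed cell and point queries take the min across cells, giving a conservative "seen no later than this" estimate.

```python
import time

class LastSeenSketch:
    """Count-Min-style sketch storing last-seen timestamps instead of counts."""

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0.0] * width for _ in range(depth)]

    def _cells(self, key):
        for row in range(self.depth):
            yield row, hash((row, key)) % self.width

    def touch(self, key, ts=None):
        ts = time.time() if ts is None else ts
        for row, col in self._cells(key):
            # Keep the newest timestamp seen in each cell.
            self.table[row][col] = max(self.table[row][col], ts)

    def last_seen(self, key):
        # Min over the cells bounds the true last-seen time from above:
        # collisions can only make individual cells too new, never too old.
        return min(self.table[row][col] for row, col in self._cells(key))

s = LastSeenSketch()
s.touch("host-42", ts=100.0)
print(s.last_seen("host-42"))   # 100.0
```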
sketch  algorithms  estimation  approximation  sampling  streams  big-data 
july 2015 by jm
Discretized Streams: Fault Tolerant Stream Computing at Scale
The paper describing the innards of Spark Streaming and its RDD-based recomputation algorithm:
we use a data structure called Resilient Distributed Datasets (RDDs), which keeps data in memory and can recover it without replication by tracking the lineage graph of operations that were used to build it. With RDDs, we show that we can attain sub-second end-to-end latencies. We believe that this is sufficient for many real-world big data applications, where the timescale of the events tracked (e.g., trends in social media) is much higher.
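For flavour, a minimal word count using Spark Streaming's DStream API, whose recovery model is what the paper describes; the socket source and one-second batch interval are placeholder choices.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-wordcount")
ssc = StreamingContext(sc, batchDuration=1)   # 1-second micro-batches
ssc.checkpoint("/tmp/dstream-checkpoint")     # checkpointing to bound lineage length

# Each batch becomes an RDD; transformations build a lineage graph that
# Spark can replay to recompute lost partitions without replication.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```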
rdd  spark  streaming  fault-tolerance  batch  distcomp  papers  big-data  scalability 
june 2015 by jm
Adrian Colyer reviews the Twitter Heron paper
ouch, really sounds like Storm didn't cut the mustard. 'It’s hard to imagine something more damaging to Apache Storm than this. Having read it through, I’m left with the impression that the paper might as well have been titled “Why Storm Sucks”, which coming from Twitter themselves is quite a statement.'

If I was to summarise the lessons learned, it sounds like: backpressure is required; and multi-tenant architectures suck.

Update: response from Storm dev ptgoetz here: http://blog.acolyer.org/2015/06/15/twitter-heron-stream-processing-at-scale/#comment-1738
storm  twitter  heron  big-data  streaming  realtime  backpressure 
june 2015 by jm
The Violence of Algorithms: Why Big Data Is Only as Smart as Those Who Generate It
The modern state system is built on a bargain between governments and citizens. States provide collective social goods, and in turn, via a system of norms, institutions, regulations, and ethics to hold this power accountable, citizens give states legitimacy. This bargain created order and stability out of what was an increasingly chaotic global system. If algorithms represent a new ungoverned space, a hidden and potentially ever-evolving unknowable public good, then they are an affront to our democratic system, one that requires transparency and accountability in order to function. A node of power that exists outside of these bounds is a threat to the notion of collective governance itself. This, at its core, is a profoundly undemocratic notion—one that states will have to engage with seriously if they are going to remain relevant and legitimate to their digital citizenry who give them their power.
palantir  algorithms  big-data  government  democracy  transparency  accountability  analytics  surveillance  war  privacy  protest  rights 
june 2015 by jm
Elements of Scale: Composing and Scaling Data Platforms
Great, encyclopedic blog post rounding up common architectural and algorithmic patterns used in scalable data platforms. Cut out and keep!
architecture  storage  databases  data  big-data  scaling  scalability  ben-stopford  cqrs  druid  parquet  columnar-stores  lambda-architecture 
may 2015 by jm
Pinball
Pinterest's Hadoop workflow manager; 'scalable, reliable, simple, extensible' apparently. Hopefully it allows upgrades of a workflow component without breaking an existing run in progress, like LinkedIn's Azkaban does :(
python  pinterest  hadoop  workflows  ops  pinball  big-data  scheduling 
april 2015 by jm
RADStack - an open source Lambda Architecture built on Druid, Kafka and Samza
'In this paper we presented the RADStack, a collection of complementary technologies that can be used together to power interactive analytic applications. The key pieces of the stack are Kafka, Samza, Hadoop, and Druid. Druid is designed for exploratory analytics and is optimized for low latency data exploration, aggregation, and ingestion, and is well suited for OLAP workflows. Samza and Hadoop complement Druid and add data processing functionality, and Kafka enables high throughput event delivery.'
druid  samza  kafka  streaming  cep  lambda-architecture  architecture  hadoop  big-data  olap 
april 2015 by jm
"Cuckoo Filter: Practically Better Than Bloom"
'We propose a new data structure called the cuckoo filter that can replace Bloom filters for approximate set membership tests. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than Bloom filters. For applications that store many items and target moderately low false positive rates, cuckoo filters have lower space overhead than space-optimized Bloom filters. Our experimental results also show that cuckoo filters outperform previous data structures that extend Bloom filters to support deletions substantially in both time and space.'
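A toy partial-key cuckoo filter in Python, just to show the mechanics the abstract refers to (short fingerprints, two candidate buckets linked by XOR, eviction on insert, and deletion); the parameters and hashing choices are arbitrary, and this is nowhere near the paper's optimized implementation.

```python
import random

class CuckooFilter:
    """Toy partial-key cuckoo filter: 8-bit fingerprints, 2 candidate buckets."""

    def __init__(self, n_buckets=1024, bucket_size=4, max_kicks=500):
        # n_buckets must be a power of two so the XOR alternate-index trick works.
        self.n, self.bsize, self.max_kicks = n_buckets, bucket_size, max_kicks
        self.buckets = [[] for _ in range(n_buckets)]

    def _fingerprint(self, item):
        return (hash(("fp", item)) & 0xFF) or 1          # 8 bits, never zero

    def _alt(self, index, fp):
        return (index ^ hash(("alt", fp))) % self.n      # partial-key cuckoo hashing

    def _indices(self, item, fp):
        i1 = hash(item) % self.n
        return i1, self._alt(i1, fp)

    def add(self, item):
        fp = self._fingerprint(item)
        i1, i2 = self._indices(item, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bsize:
                self.buckets[i].append(fp)
                return True
        i = random.choice((i1, i2))                      # both full: evict and relocate
        for _ in range(self.max_kicks):
            j = random.randrange(self.bsize)
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._alt(i, fp)
            if len(self.buckets[i]) < self.bsize:
                self.buckets[i].append(fp)
                return True
        return False                                     # filter is effectively full

    def contains(self, item):
        fp = self._fingerprint(item)
        i1, i2 = self._indices(item, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def remove(self, item):
        fp = self._fingerprint(item)
        for i in self._indices(item, fp):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False

cf = CuckooFilter()
cf.add("alice"); cf.add("bob")
print(cf.contains("alice"), cf.contains("carol"))        # True False (with high probability)
cf.remove("alice")
print(cf.contains("alice"))                              # False (with high probability)
```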
algorithms  paper  bloom-filters  cuckoo-filters  cuckoo-hashing  data-structures  false-positives  big-data  probabilistic  hashing  set-membership  approximation 
march 2015 by jm
Are you better off running your big-data batch system off your laptop?
Heh, nice trolling.
Here are two helpful guidelines (for largely disjoint populations):

If you are going to use a big data system for yourself, see if it is faster than your laptop.
If you are going to build a big data system for others, see that it is faster than my laptop. [...]

We think everyone should have to do this, because it leads to better systems and better research.
graph  coding  hadoop  spark  giraph  graph-processing  hardware  scalability  big-data  batch  algorithms  pagerank 
january 2015 by jm
Punished for Being Poor: Big Data in the Justice System
This is awful. Totally the wrong tool for the job -- a false positive rate which is minuscule for something like spam filtering could translate to a really horrible outcome for a human life.
Currently, over 20 states use data-crunching risk-assessment programs for sentencing decisions, usually consisting of proprietary software whose exact methods are unknown, to determine which individuals are most likely to re-offend. The Senate and House are also considering similar tools for federal sentencing. These data programs look at a variety of factors, many of them relatively static, like criminal and employment history, age, gender, education, finances, family background, and residence. Indiana, for example, uses the LSI-R, the legality of which was upheld by the state’s supreme court in 2010. Other states use a model called COMPAS, which uses many of the same variables as LSI-R and even includes high school grades. Others are currently considering the practice as a way to reduce the number of inmates and ensure public safety. (Many more states use or endorse similar assessments when sentencing sex offenders, and the programs have been used in parole hearings for years.) Even the American Law Institute has embraced the practice, adding it to the Model Penal Code, attesting to the tool’s legitimacy.



(via stroan)
via:stroan  statistics  false-positives  big-data  law  law-enforcement  penal-code  risk  sentencing 
august 2014 by jm
173 million 2013 NYC taxi rides shared on BigQuery : bigquery
Interesting! (a) there's a subreddit for Google BigQuery, with links to interesting data sets, like this one; (b) the entire 173-million-row dataset for NYC taxi rides in 2013 is available for querying; and (c) the tip percentage histogram is cool.
datasets  bigquery  sql  google  nyc  new-york  taxis  data  big-data  histograms  tipping 
july 2014 by jm
Google Replaces MapReduce With New Hyper-Scale Cloud Analytics System
MR no more:
“We don’t really use MapReduce anymore,” [Urs] Hölzle said in his keynote presentation at the Google I/O conference in San Francisco Wednesday. The company stopped using the system “years ago.”

Cloud Dataflow, which Google will also offer as a service for developers using its cloud platform, does not have the scaling restrictions of MapReduce. “Cloud Dataflow is the result of over a decade of experience in analytics,” Hölzle said. “It will run faster and scale better than pretty much any other system out there.”

Gossip on the mech-sympathy list says that 'seems that the new platform taking over is a combination of FlumeJava and MillWheel: http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf ,
http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41378.pdf'
map-reduce  google  hadoop  cloud-dataflow  scalability  big-data  urs-holzle  google-io 
june 2014 by jm
NYC generates hash-anonymised data dump, which gets reversed
There are about 1000*26**3 = 21952000 or 22M possible medallion numbers. So, by calculating the md5 hashes of all these numbers (only 24M!), one can completely deanonymise the entire data. Modern computers are fast: so fast that computing the 24M hashes took less than 2 minutes.


(via Bruce Schneier)

The better fix is an HMAC (see http://benlog.com/2008/06/19/dont-hash-secrets/ ), or just to assign opaque IDs instead of hashing.
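A sketch of both halves: brute-forcing unsalted MD5 hashes of a small ID space, and the keyed-hash fix. The medallion format here is a simplified stand-in (one digit, one letter, two digits), not the real NYC scheme.

```python
import hashlib, hmac, itertools, string

# --- The attack: the ID space is tiny, so hash every possible ID once. ---
def all_medallions():
    """Simplified stand-in format '9A99' -- the real NYC formats differ."""
    for d1, letter, d2, d3 in itertools.product(
            string.digits, string.ascii_uppercase, string.digits, string.digits):
        yield f"{d1}{letter}{d2}{d3}"

reverse = {hashlib.md5(m.encode()).hexdigest(): m for m in all_medallions()}

leaked_hash = hashlib.md5(b"7B42").hexdigest()   # pretend this came from the dump
print(reverse[leaked_hash])                      # -> '7B42': de-anonymised

# --- The fix: a keyed hash (HMAC) with a secret key, or just opaque IDs. ---
SECRET = b"keep-this-out-of-the-data-dump"
def pseudonymise(medallion: str) -> str:
    return hmac.new(SECRET, medallion.encode(), hashlib.sha256).hexdigest()

print(pseudonymise("7B42"))   # useless to an attacker without SECRET
```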
hashing  sha1  md5  bruce-schneier  anonymization  deanonymization  security  new-york  nyc  taxis  data  big-data  hmac  keyed-hashing  salting 
june 2014 by jm
Hydra Takes On Hadoop
The intuition behind Hydra is something like this, "I have a lot of data, and there are a lot of things I could try to learn about it -- so many that I'm not even sure what I want to know.” It's about the curse of dimensionality -- more dimensions means exponentially more cost for exhaustive analysis. Hydra tries to make it easy to reduce the number of dimensions, or the cost of watching them (via probabilistic data structures), to just the right point where everything runs quickly but can still answer almost any question you think you might care about.


Code: https://github.com/addthis/hydra

Getting Started blog post: https://www.addthis.com/blog/2014/02/18/getting-started-with-hydra/
hyrda  hadoop  data-processing  big-data  trees  clusters  analysis 
april 2014 by jm
Welcome to Algorithmic Prison - Bill Davidow - The Atlantic
"Computer says no", taken to the next level.
Even if an algorithmic prisoner knows he is in a prison, he may not know who his jailer is. Is he unable to get a loan because of a corrupted file at Experian or Equifax? Or could it be TransUnion? His bank could even have its own algorithms to determine a consumer’s creditworthiness. Just think of the needle-in-a-haystack effort consumers must undertake if they are forced to investigate dozens of consumer-reporting companies, looking for the one that threw them behind algorithmic bars. Now imagine a future that contains hundreds of such companies. A prisoner might not have any idea as to what type of behavior got him sentenced to a jail term. Is he on an enhanced screening list at an airport because of a trip he made to an unstable country, a post on his Facebook page, or a phone call to a friend who has a suspected terrorist friend?
privacy  data  big-data  algorithms  machine-learning  equifax  experian  consumer  society  bill-davidow 
february 2014 by jm
Big doubts on big data: Why I won't be sharing my medical data with anyone - yet
These problems can be circumvented, but they must be dealt with, publicly and soberly, if the NHS really does want to win public confidence. The NHS should approach selling the scheme to the public as if it was opt-in, not opt-out, then work to convince us to join it. Tell us how sharing our data can help, but tell us the risks too. Let us decide if that balance is worth it. If it's found wanting, the NHS must go back to the drawing board and retool the scheme until it is. It's just too important to get wrong.
nhs  uk  privacy  data-protection  data-privacy  via:mynosql  big-data  healthcare  insurance 
february 2014 by jm
Big, Small, Hot or Cold - Your Data Needs a Robust Pipeline
'(Examples [of big-data B-I crunching pipelines] from Stripe, Tapad, Etsy & Square)'
stripe  tapad  etsy  square  big-data  analytics  kafka  impala  hadoop  hdfs  parquet  thrift 
february 2014 by jm
SAMOA, an open source platform for mining big data streams
Yahoo!'s streaming machine learning platform, built on Storm, implementing:

As a library, SAMOA contains state-of-the-art implementations of algorithms for distributed machine learning on streams. The first alpha release allows classification and clustering. For classification, we implemented a Vertical Hoeffding Tree (VHT), a distributed streaming version of decision trees tailored for sparse data (e.g., text). For clustering, we included a distributed algorithm based on CluStream. The library also includes meta-algorithms such as bagging.
storm  streaming  big-data  realtime  samoa  yahoo  machine-learning  ml  decision-trees  clustering  bagging  classification 
november 2013 by jm
Don't use Hadoop - your data isn't that big
see also HN comments: https://news.ycombinator.com/item?id=6398650 , particularly davidmr's great one:

I suppose all of this is to say that the amount of required parallelization of a problem isn't necessarily related to the size of the problem set as is mentioned most in the article, but also the inherent CPU and IO characteristics of the problem. Some small problems are great for large-scale map-reduce clusters, some huge problems are horrible for even bigger-scale map-reduce clusters (think fluid dynamics or something that requires each subdivision of the problem space to communicate with its neighbors).
I've had a quote printed on my door for years: Supercomputers are an expensive tool for turning CPU-bound problems into IO-bound problems.


I love that quote!
hadoop  big-data  scaling  map-reduce 
september 2013 by jm
Big data is watching you
Some great street art from Brighton, via Darach Ennis
via:darachennis  street-art  graffiti  big-data  snooping  spies  gchq  nsa  art 
september 2013 by jm
Voldemort on Solid State Drives [paper]
'This paper and talk was given by the LinkedIn Voldemort Team at the Workshop on Big Data Benchmarking (WBDB May 2012).'

With SSD, we find that garbage collection will become a very significant bottleneck, especially for systems which have little control over the storage layer and rely on Java memory management. Big heapsizes make the cost of garbage collection expensive, especially the single threaded CMS Initial mark. We believe that data systems must revisit their caching strategies with SSDs. In this regard, SSD has provided an efficient solution for handling fragmentation and moving towards predictable multitenancy.
voldemort  storage  ssd  disk  linkedin  big-data  jvm  tuning  ops  gc 
september 2013 by jm
Streaming MapReduce with Summingbird
Before Summingbird at Twitter, users that wanted to write production streaming aggregations would typically write their logic using a Hadoop DSL like Pig or Scalding. These tools offered nice distributed system abstractions: Pig resembled familiar SQL, while Scalding, like Summingbird, mimics the Scala collections API. By running these jobs on some regular schedule (typically hourly or daily), users could build time series dashboards with very reliable error bounds at the unfortunate cost of high latency.

While using Hadoop for these types of loads is effective, Twitter is about real-time and we needed a general system to deliver data in seconds, not hours. Twitter’s release of Storm made it easy to process data with very low latencies by sacrificing Hadoop’s fault tolerant guarantees. However, we soon realized that running a fully real-time system on Storm was quite difficult for two main reasons:

Recomputation over months of historical logs must be coordinated with Hadoop or streamed through Storm with a custom log loading mechanism;
Storm is focused on message passing and random-write databases are harder to maintain.

The types of aggregations one can perform in Storm are very similar to what’s possible in Hadoop, but the system issues are very different. Summingbird began as an investigation into a hybrid system that could run a streaming aggregation in both Hadoop and Storm, as well as merge automatically without special consideration of the job author. The hybrid model allows most data to be processed by Hadoop and served out of a read-only store. Only data that Hadoop hasn’t yet been able to process (data that falls within the latency window) would be served out of a datastore populated in real-time by Storm. But the error of the real-time layer is bounded, as Hadoop will eventually get around to processing the same data and will smooth out any error introduced. This hybrid model is appealing because you get well understood, transactional behavior from Hadoop, and up to the second additions from Storm. Despite the appeal, the hybrid approach has the following practical problems:

Two sets of aggregation logic have to be kept in sync in two different systems;
Keys and values must be serialized consistently between each system and the client;
The client is responsible for reading from both datastores, performing a final aggregation and serving the combined results.

Summingbird was developed to provide a general solution to these problems.


Very interesting stuff. I'm particularly interested in the design constraints they've chosen to impose to achieve this -- data formats which require associative merging in particular.
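The design constraint that makes the hybrid work is that partial results must merge associatively (monoid-style), so the read path can combine the batch and realtime layers in any order. A tiny hedged sketch of that idea in Python, not Summingbird's actual Scala API:

```python
from collections import Counter

# Partial word counts computed independently by the two layers.
batch_counts = Counter({"spark": 120, "storm": 80})     # from Hadoop, hours old
realtime_counts = Counter({"storm": 5, "heron": 2})     # from Storm, last few minutes

def merge(*partials: Counter) -> Counter:
    """Counter addition is associative and commutative, so partial results
    can be combined in any grouping or order -- the property the hybrid relies on."""
    total = Counter()
    for p in partials:
        total += p
    return total

# The serving layer answers queries from the merged view.
print(merge(batch_counts, realtime_counts))
# Counter({'spark': 120, 'storm': 85, 'heron': 2})
```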
mapreduce  streaming  big-data  twitter  storm  summingbird  scala  pig  hadoop  aggregation  merging 
september 2013 by jm
hlld
a high-performance C server which is used to expose HyperLogLog sets and operations over them to networked clients. It uses a simple ASCII protocol which is human readable, and similar to memcached.

HyperLogLogs are a relatively new sketching data structure. They are used to estimate cardinality, i.e. the unique number of items in a set. They are based on the observation that any bit in a "good" hash function is independent of any other bit and that the probability of getting a string of N bits all set to the same value is 1/(2^N). There is a lot more in the math, but that is the basic intuition. What is even more incredible is that the storage required to do the counting is log(log(N)). So with a 6 bit register, we can count well into the trillions. For more information, it's best to read the papers referenced at the end. TL;DR: HyperLogLogs enable you to have a set with about 1.6% variance, using 3280 bytes, and estimate sizes in the trillions.


(via:cscotta)
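A stripped-down HyperLogLog in Python to make that intuition concrete. The register count, hash choice and bias constant are the textbook ones; real implementations like hlld add small/large-range corrections and denser encodings.

```python
import hashlib, math

class TinyHLL:
    def __init__(self, p=12):               # 2^12 = 4096 registers, ~1.6% std error
        self.p, self.m = p, 1 << p
        self.registers = [0] * self.m

    def add(self, item: str):
        x = int(hashlib.sha1(item.encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = x >> (64 - self.p)                          # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)
        # Position of the first set bit in the remaining bits (leading zeros + 1).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)             # standard HLL bias constant
        harmonic = sum(2.0 ** -r for r in self.registers)
        return alpha * self.m * self.m / harmonic

hll = TinyHLL()
for i in range(100_000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))   # roughly 100,000, give or take ~1.6%
```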
hyper-log-log  hlld  hll  data-structures  memcached  daemons  sketching  estimation  big-data  cardinality  algorithms  via:cscotta 
june 2013 by jm
Persuading David Simon (Pinboard Blog)
Maciej Ceglowski with a strongly-argued rebuttal of David Simon's post about the NSA's PRISM. This point in particular is key:
The point is, you don't need human investigators to find leads, you can have the algorithms do it [based on the call graph or network of who-calls-who]. They will find people of interest, assemble the watch lists, and flag whomever you like for further tracking. And since the number of actual terrorists is very, very, very small, the output of these algorithms will consist overwhelmingly of false positives.
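The base-rate arithmetic behind that point, as a quick back-of-the-envelope calculation (the population size, accuracy and false-positive numbers here are made up for illustration):

```python
population = 300_000_000      # people whose call graphs get scored (assumed)
terrorists = 300              # actual positives (assumed, generously high)
sensitivity = 0.99            # chance a real terrorist is flagged (assumed)
false_positive_rate = 0.001   # chance an innocent person is flagged (assumed)

true_hits = terrorists * sensitivity
false_hits = (population - terrorists) * false_positive_rate
precision = true_hits / (true_hits + false_hits)

print(f"flagged: {true_hits + false_hits:,.0f}")
print(f"fraction of flagged who are actually terrorists: {precision:.4%}")
# ~300,000 people flagged; roughly 0.1% of them are real positives.
```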
false-positives  maciej  privacy  security  nsa  prism  david-simon  accuracy  big-data  filtering  anti-spam 
june 2013 by jm
_Dynamic Histograms: Capturing Evolving Data Sets_ [pdf]

Currently, histograms are static structures: they are created from scratch periodically and their creation is based on looking at the entire data distribution as it exists each time. This creates problems, however, as data stored in DBMSs usually varies with time. If new data arrives at a high rate and old data is likewise deleted, a histogram’s accuracy may deteriorate fast as the histogram becomes older, and the optimizer’s effectiveness may be lost. Hence, how often a histogram is reconstructed becomes very critical, but choosing the right period is a hard problem, as the following trade-off exists: If the period is too long, histograms may become outdated. If the period is too short, updates of the histogram may incur a high overhead.

In this paper, we propose what we believe is the most elegant solution to the problem, i.e., maintaining dynamic histograms within given limits of memory space. Dynamic histograms are continuously updateable, closely tracking changes to the actual data. We consider two of the best static histograms proposed in the literature [9], namely V-Optimal and Compressed, and modify them. The new histograms are naturally called Dynamic V-Optimal (DVO) and Dynamic Compressed (DC). In addition, we modified V-Optimal’s partition constraint to create the Static Average-Deviation Optimal (SADO) and Dynamic Average-Deviation Optimal (DADO) histograms.


(via d2fn)
via:d2fn  histograms  streaming  big-data  data  dvo  dc  sado  dado  dynamic-histograms  papers  toread 
may 2013 by jm
Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing
Yahoo! are going big with Storm for their next-generation internal cloud platform:

'Yahoo! engineering teams are developing technologies to enable Storm applications and Hadoop applications to be hosted on a single cluster.

• We have enhanced Storm to support Hadoop style security mechanism (including Kerberos authentication), and thus enable Storm applications authorized to access Hadoop datasets on HDFS and HBase.
• Storm is being integrated into Hadoop YARN for resource management. Storm-on-YARN enables Storm applications to utilize the computation resources in our tens of thousands of Hadoop computation nodes. YARN is used to launch Storm application master (Nimbus) on demand, and enables Nimbus to request resources for Storm application slaves (Supervisors).'
yahoo  yarn  cloud-computing  private-clouds  big-data  latency  storm  hadoop  elastic-computing  hbase 
february 2013 by jm
Cassandra, Hive, and Hadoop: How We Picked Our Analytics Stack
reasonably good whole-stack performance testing and analysis; HBase, Riak, MongoDB, and Cassandra compared. Riak did pretty badly :(
riak  mongodb  cassandra  hbase  performance  analytics  hadoop  hive  big-data  storage  databases  nosql 
february 2013 by jm
Splout
'Splout is a scalable, open-source, easy-to-manage SQL big data view. Splout is to Hadoop + SQL what Voldemort or Elephant DB are to Hadoop + Key/Value. Splout serves a read-only, partitioned SQL view which is generated and indexed by Hadoop.'

Some FAQs: 'What's the difference between Splout SQL and Dremel-like solutions such as BigQuery, Impala or Apache Drill? Splout SQL is not a "fast analytics" Dremel-like engine. It is more thought to be used for serving datasets under web / mobile high-throughput, many lookups, low-latency applications. Splout SQL is more like a NoSQL database in the sense that it has been thought for answering queries under sub-second latencies. It has been thought for performing queries that impact a very small subset of the data, not queries that analyze the whole dataset at once.'
splout  sql  big-data  hadoop  read-only  scaling  queries  analytics 
february 2013 by jm
HBase Real-time Analytics & Rollbacks via Append-based Updates
Interesting concept for scaling up the write rate on massive key-value counter stores:
'Replace update (Get+Put) operations at write time with simple append-only writes and defer processing of updates to periodic jobs or perform aggregations on the fly if user asks for data earlier than individual additions are processed. The idea is simple and not necessarily novel, but given the specific qualities of HBase, namely fast range scans and high write throughput, this approach works very well.'
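The pattern itself is independent of HBase; a minimal Python sketch of the idea (append deltas cheaply at write time, fold them into the materialized counters lazily at read time or in a periodic compaction job):

```python
from collections import defaultdict

class AppendOnlyCounters:
    """Writes append deltas; reads (or a periodic job) fold them into totals."""

    def __init__(self):
        self.compacted = defaultdict(int)   # the 'materialized' view
        self.deltas = []                    # cheap append-only write path

    def increment(self, key, amount=1):
        self.deltas.append((key, amount))   # no read-modify-write at write time

    def compact(self):
        """Periodic job: fold pending deltas into the compacted totals."""
        for key, amount in self.deltas:
            self.compacted[key] += amount
        self.deltas.clear()

    def get(self, key):
        # Aggregate on the fly if asked before the deltas have been processed.
        pending = sum(a for k, a in self.deltas if k == key)
        return self.compacted[key] + pending

c = AppendOnlyCounters()
c.increment("page:/home"); c.increment("page:/home", 4)
print(c.get("page:/home"))   # 5, even before compaction
c.compact()
print(c.get("page:/home"))   # still 5, now served from the compacted view
```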
counters  analytics  hbase  append  sematext  aggregation  big-data 
december 2012 by jm
Authentication is machine learning
This may be the most insightful writing about authentication in years:
From my brief time at Google, my internship at Yahoo!, and conversations with other companies doing web authentication at scale, I’ve observed that as authentication systems develop they gradually merge with other abuse-fighting systems dealing with various forms of spam (email, account creation, link, etc.) and phishing. Authentication eventually loses its binary nature and becomes a fuzzy classification problem.

This is not a new observation. It’s generally accepted for banking authentication and some researchers like Dinei Florêncio and Cormac Herley have made it for web passwords. Still, much of the security research community thinks of password authentication in a binary way [..]. Spam and phishing provide insightful examples: technical solutions (like Hashcash, DKIM signing, or EV certificates), have generally failed but in practice machine learning has greatly reduced these problems. The theory has largely held up that with enough data we can train reasonably effective classifiers to solve seemingly intractable problems.


(via Tony Finch.)
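In that spirit, a hedged sketch of login scoring as a classification problem with scikit-learn: the features, labels and thresholds here are invented placeholders, not anyone's production abuse system.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical per-login features: [new_device, geo_distance_km, failed_attempts, odd_hour]
X = np.array([
    [0,    5, 0, 0],
    [0,   12, 1, 0],
    [1, 4200, 4, 1],
    [1, 3800, 6, 1],
    [0,    2, 0, 1],
    [1,   15, 0, 0],
])
y = np.array([0, 0, 1, 1, 0, 0])   # 1 = known account takeover

clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# Instead of a binary allow/deny, emit a risk score and pick an action band.
login = np.array([[1, 3900, 2, 1]])
risk = clf.predict_proba(login)[0, 1]
action = "deny" if risk > 0.9 else "step-up auth" if risk > 0.5 else "allow"
print(f"risk={risk:.2f} -> {action}")
```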
passwords  authentication  big-data  machine-learning  google  abuse  antispam  dkim  via:fanf 
december 2012 by jm
GraphChi
"big data, small machine" -- perform computation on very large graphs using an algorithm they're calling Parallel Sliding Windows. similar to Google's Pregel, apparently
graphs  graphchi  big-data  algorithms  parallel 
july 2012 by jm
Probabilistic Data Structures for Web Analytics and Data Mining « Highly Scalable Blog
Stream summary, count-min sketches, loglog counting, linear counters. Some nifty algorithms for probabilistic estimation of element frequencies and data-set cardinality (via proggit)
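One of the simplest of these, linear counting, fits in a few lines; the bitmap size and hash function here are arbitrary choices for illustration.

```python
import hashlib, math

class LinearCounter:
    """Estimate distinct items from the fraction of bitmap bits still zero."""

    def __init__(self, m=1 << 16):           # 64K-bit bitmap = 8 KB
        self.m = m
        self.bits = bytearray(m // 8)

    def add(self, item: str):
        h = int(hashlib.md5(item.encode()).hexdigest(), 16) % self.m
        self.bits[h // 8] |= 1 << (h % 8)

    def estimate(self) -> float:
        set_bits = sum(bin(b).count("1") for b in self.bits)
        zero_fraction = (self.m - set_bits) / self.m
        return -self.m * math.log(zero_fraction)   # n ~= -m * ln(V_zero)

lc = LinearCounter()
for i in range(20_000):
    lc.add(f"user-{i % 15_000}")    # 15,000 distinct values, with repeats
print(round(lc.estimate()))          # close to 15,000
```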
via:proggit  algorithms  probability  probabilistic  count-min  stream-summary  loglog-counting  linear-counting  estimation  big-data 
may 2012 by jm
Operations, machine learning and premature babies - O'Reilly Radar
good post about applying ML techniques to ops data. 'At a recent meetup about finance, Abhi Mehta encouraged people to capture and save "everything." He was talking about financial data, but the same applies here. We'd need to build Hadoop clusters to monitor our server farms; we'd need Hadoop clusters to monitor our Hadoop clusters. It's a big investment of time and resources. If we could make that investment, what would we find out? I bet that we'd be surprised.' Let's just say that if you like the sound of that, our SDE team in Amazon's Dublin office is hiring ;)
ops  big-data  machine-learning  hadoop  ibm 
april 2012 by jm
Occursions
'Our goal is to create the world's fastest extendable, non-transactional time series database for big data (you know, for kids)! Log file indexing is our initial focus. For example append only ASCII files produced by libraries like Log4J, or containing FIX messages or JSON objects. Occursions was built by a small team sick of creating hacks to remotely copy and/or grep through tons of large log files. We use it to index around a terabyte of new log data per day. Occursions asynchronously tails log files and indexes the individual lines in each log file as each line is written to disk so you don't even have to wait for a second after an event happens to search for it. Occursions uses custom disk backed data structures to create and search its indexes so it is very efficient at using CPU, memory and disk.'
logs  search  tsd  big-data  log4j  via:proggit 
march 2012 by jm
How to beat the CAP theorem
Nathan "Storm" Marz on building a dual realtime/batch stack. This lines up with something I've been building in work, so I'm happy ;)
nathan-marz  realtime  batch  hadoop  storm  big-data  cap 
october 2011 by jm
The Secrets of Building Realtime Big Data Systems
great slides, via HN. recommends a canonical Hadoop long-term store and a quick, realtime, separate datastore for "not yet processed by Hadoop" data
hadoop  big-data  data  scalability  datamining  realtime  slides  presentations 
may 2011 by jm

related tags

abuse  accountability  accuracy  aggregation  ai  airbnb  akka  algorithm  algorithms  america  analysis  analytics  android  anonymization  anti-spam  antispam  apache  append  approximation  apps  architecture  art  authentication  backpressure  bagging  batch  ben-stopford  bias  big-data  bigquery  bill-davidow  bloom-filters  bruce-schneier  cap  cardinality  cassandra  cep  classification  cloud-computing  cloud-dataflow  cloudera  clustering  clusters  coding  columnar-stores  computer-says-no  consumer  count-min  counters  counting  counting-sort  cqrs  crime  cuckoo-filters  cuckoo-hashing  dado  daemons  data  data-engineering  data-privacy  data-processing  data-protection  data-science  data-structures  data-warehouse  databases  datadog  dataflow  datamining  datasets  david-simon  dc  deanonymization  decision-trees  democracy  differential-dataflow  discrimination  disk  distcomp  dkim  druid  dvo  dynamic-histograms  elastic-computing  emr  engineering  equifax  estimation  ethics  etsy  event-processing  experian  facebook  faceception  fail  false-positives  fascism  fault-tolerance  filtering  fred-benenson  frequency  funny  gc  gchq  giraph  google  google-io  government  graffiti  graph  graph-processing  graphchi  graphs  hadoop  hardware  hashing  hbase  hdfs  healthcare  heron  histograms  hive  hll  hlld  hmac  hyper-log-log  hyrda  ibm  impala  insurance  ios  iphone  jobs  jvm  kafka  keyed-hashing  lambda-architecture  latency  law  law-enforcement  libel  linear-counting  linkedin  location  log4j  loglog-counting  logs  lying  machine-learning  maciej  map-reduce  mapreduce  math  maths  mathwashing  maxime-beauchemin  md5  memcached  merging  millwheel  minhash  ml  mobile  mongodb  nathan-marz  new-york  nhs  nosql  nsa  nyc  olap  ops  optimization  pagerank  palantir  paper  papers  parallel  parquet  passwords  penal-code  performance  pig  pinball  pinterest  pokemon  politics  polling  presentations  prism  privacy  private-clouds  probabilistic  probability  propaganda  protest  python  queries  quizzes  racism  radix-sort  rdd  read-only  realtime  riak  rights  risk  roles  s3  sado  safegraph  salting  samoa  sampling  samza  scala  scalability  scaling  scheduling  science  search  searching  secrets  security  sematext  sentencing  set-membership  sha1  silicon-valley  sketch  sketches  sketching  slides  smartphones  snooping  society  sorting  spark  spies  splout  sql  square  ssd  statistics  stereotypes  storage  storm  stream-summary  streaming  streams  street-art  stripe  summingbird  surveillance  surveys  swrve  talks  tapad  taxis  technology  thrift  tipping  tips  toread  tracking  transparency  trees  trump  tsd  tuning  twitter  uk  urs-holzle  us-politics  user-data  via:cscotta  via:d2fn  via:darachennis  via:fanf  via:mynosql  via:proggit  via:stroan  voldemort  war  workflows  x-stream  yahoo  yarn 
