jm + aggregation   11

ASAP: Automatic Smoothing for Attention Prioritization in Streaming Time Series Visualization
Peter Bailis strikes again.

'Time series visualization of streaming telemetry (i.e., charting of
key metrics such as server load over time) is increasingly prevalent
in recent application deployments. Existing systems simply plot the
raw data streams as they arrive, potentially obscuring large-scale
deviations due to local variance and noise. We propose an alternative:
to better prioritize attention in time series exploration and
monitoring visualizations, smooth the time series as much as possible
to remove noise while still retaining large-scale structure. We
develop a new technique for automatically smoothing streaming
time series that adaptively optimizes this trade-off between noise
reduction (i.e., variance) and outlier retention (i.e., kurtosis). We
introduce metrics to quantitatively assess the quality of the choice
of smoothing parameter and provide an efficient streaming analytics
operator, ASAP, that optimizes these metrics by combining techniques
from stream processing, user interface design, and signal
processing via a novel autocorrelation-based pruning strategy and
pixel-aware preaggregation. We demonstrate that ASAP is able to
improve users’ accuracy in identifying significant deviations in time
series by up to 38.4% while reducing response times by up to 44.3%.
Moreover, ASAP delivers these results several orders of magnitude
faster than alternative optimization strategies.'
dataviz  graphs  metrics  peter-bailis  asap  smoothing  aggregation  time-series  tsd 
8 days ago by jm
A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. The t-digest algorithm is also very parallel friendly making it useful in map-reduce and parallel streaming applications.

The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to product a data structure that is related to the Q-digest. This t-digest data structure can be used to estimate quantiles or compute other rank statistics. The advantage of the t-digest over the Q-digest is that the t-digest can handle floating point values while the Q-digest is limited to integers. With small changes, the t-digest can handle any values from any ordered set that has something akin to a mean. The accuracy of quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by Q-digests in spite of the fact that t-digests are more compact when stored on disk.

Super-nice feature is that it's mergeable, so amenable to parallel usage across multiple hosts if required. Java implementation, ASL licensing.
data-structures  algorithms  java  t-digest  statistics  quantiles  percentiles  aggregation  digests  estimation  ranking 
december 2016 by jm
New study shows Spain’s “Google tax” has been a disaster for publishers
A study commissioned by Spanish publishers has found that a new intellectual property law passed in Spain last year, which charges news aggregators like Google for showing snippets and linking to news stories, has done substantial damage to the Spanish news industry.

In the short-term, the study found, the law will cost publishers €10 million, or about $10.9 million, which would fall disproportionately on smaller publishers. Consumers would experience a smaller variety of content, and the law "impedes the ability of innovation to enter the market." The study concludes that there's no "theoretical or empirical justification" for the fee.
google  news  publishing  google-tax  spain  law  aggregation  snippets  economics 
august 2015 by jm
Ask the Decoder: Did I sign up for a global sleep study?
How meaningful is this corporate data science, anyway? Given the tech-savvy people in the Bay Area, Jawbone likely had a very dense sample of Jawbone wearers to draw from for its Napa earthquake analysis. That allowed it to look at proximity to the epicenter of the earthquake from location information.

Jawbone boasts its sample population of roughly “1 million Up wearers who track their sleep using Up by Jawbone.” But when looking into patterns county by county in the U.S., Jawbone states, it takes certain statistical liberties to show granularity while accounting for places where there may not be many Jawbone users.

So while Jawbone data can show us interesting things about sleep patterns across a very large population, we have to remember how selective that population is. Jawbone wearers are people who can afford a $129 wearable fitness gadget and the smartphone or computer to interact with the output from the device.

Jawbone is sharing what it learns with the public, but think of all the public health interests or other third parties that might be interested in other research questions from a large scale data set. Yet this data is not collected with scientific processes and controls and is not treated with the rigor and scrutiny that a scientific study requires.

Jawbone and other fitness trackers don’t give us the option to use their devices while opting out of contributing to the anonymous data sets they publish. Maybe that ought to change.
jawbone  privacy  data-protection  anonymization  aggregation  data  medicine  health  earthquakes  statistics  iot  wearables 
march 2015 by jm
When data gets creepy: the secrets we don’t realise we’re giving away | Technology | The Guardian
Very good article around the privacy implications of derived and inferred aggregate metadata from Ben Goldacre.
We are entering an age – which we should welcome with open arms – when patients will finally have access to their own full medical records online. So suddenly we have a new problem. One day, you log in to your medical records, and there’s a new entry on your file: “Likely to die in the next year.” We spend a lot of time teaching medical students to be skilful around breaking bad news. A box ticked on your medical records is not empathic communication. Would we hide the box? Is that ethical? Or are “derived variables” such as these, on a medical record, something doctors should share like anything else?
advertising  ethics  privacy  security  law  data  aggregation  metadata  ben-goldacre 
december 2014 by jm
Solving the Mystery of Link Imbalance: A Metastable Failure State at Scale | Engineering Blog | Facebook Code
Excellent real-world war story from Facebook -- a long-running mystery bug was eventually revealed to be a combination of edge-case behaviours across all the layers of the networking stack, from L2 link aggregation at the agg-router level, up to the L7 behaviour of the MySQL client connection pool.
Facebook collocates many of a user’s nodes and edges in the social graph. That means that when somebody logs in after a while and their data isn’t in the cache, we might suddenly perform 50 or 100 database queries to a single database to load their data. This starts a race among those queries. The queries that go over a congested link will lose the race reliably, even if only by a few milliseconds. That loss makes them the most recently used when they are put back in the pool. The effect is that during a query burst we stack the deck against ourselves, putting all of the congested connections at the top of the deck.
architecture  debugging  devops  facebook  layer-7  mysql  connection-pooling  aggregation  networking  tcp-stack 
november 2014 by jm
Twitter's TSAR
TSAR = "Time Series AggregatoR". Twitter's new event processor-style architecture for internal metrics. It's notable that now Twitter and Google are both apparently moving towards this idea of a model of code which is designed to run equally in realtime streaming and batch modes (Summingbird, Millwheel, Flume).
analytics  architecture  twitter  tsar  aggregation  event-processing  metrics  streaming  hadoop  batch 
june 2014 by jm
Streaming MapReduce with Summingbird
Before Summingbird at Twitter, users that wanted to write production streaming aggregations would typically write their logic using a Hadoop DSL like Pig or Scalding. These tools offered nice distributed system abstractions: Pig resembled familiar SQL, while Scalding, like Summingbird, mimics the Scala collections API. By running these jobs on some regular schedule (typically hourly or daily), users could build time series dashboards with very reliable error bounds at the unfortunate cost of high latency.

While using Hadoop for these types of loads is effective, Twitter is about real-time and we needed a general system to deliver data in seconds, not hours. Twitter’s release of Storm made it easy to process data with very low latencies by sacrificing Hadoop’s fault tolerant guarantees. However, we soon realized that running a fully real-time system on Storm was quite difficult for two main reasons:

Recomputation over months of historical logs must be coordinated with Hadoop or streamed through Storm with a custom log loading mechanism;
Storm is focused on message passing and random-write databases are harder to maintain.

The types of aggregations one can perform in Storm are very similar to what’s possible in Hadoop, but the system issues are very different. Summingbird began as an investigation into a hybrid system that could run a streaming aggregation in both Hadoop and Storm, as well as merge automatically without special consideration of the job author. The hybrid model allows most data to be processed by Hadoop and served out of a read-only store. Only data that Hadoop hasn’t yet been able to process (data that falls within the latency window) would be served out of a datastore populated in real-time by Storm. But the error of the real-time layer is bounded, as Hadoop will eventually get around to processing the same data and will smooth out any error introduced. This hybrid model is appealing because you get well understood, transactional behavior from Hadoop, and up to the second additions from Storm. Despite the appeal, the hybrid approach has the following practical problems:

Two sets of aggregation logic have to be kept in sync in two different systems;
Keys and values must be serialized consistently between each system and the client.

The client is responsible for reading from both datastores, performing a final aggregation and serving the combined results
Summingbird was developed to provide a general solution to these problems.

Very interesting stuff. I'm particularly interested in the design constraints they've chosen to impose to achieve this -- data formats which require associative merging in particular.
mapreduce  streaming  big-data  twitter  storm  summingbird  scala  pig  hadoop  aggregation  merging 
september 2013 by jm
Distributed Streams Algorithms for Sliding Windows [PDF]
'Massive data sets often arise as physically distributed, parallel data streams, and it is important to estimate various aggregates and statistics on the union of these streams. This paper presents algorithms for estimating aggregate
functions over a “sliding window” of the N most recent data items in one or more streams. [...] Our results are obtained using a novel family of synopsis data structures called waves.'
waves  papers  streaming  algorithms  percentiles  histogram  distcomp  distributed  aggregation  statistics  estimation  streams 
february 2013 by jm
HBase Real-time Analytics & Rollbacks via Append-based Updates
Interesting concept for scaling up the write rate on massive key-value counter stores:
'Replace update (Get+Put) operations at write time with simple append-only writes and defer processing of updates to periodic jobs or perform aggregations on the fly if user asks for data earlier than individual additions are processed. The idea is simple and not necessarily novel, but given the specific qualities of HBase, namely fast range scans and high write throughput, this approach works very well.'
counters  analytics  hbase  append  sematext  aggregation  big-data 
december 2012 by jm
'Ireland' Reddit
quite a busy news aggregator, it turns out -- 1,960 registered readers and a lot of traffic at this stage. worth bookmarking
ireland  reddit  local  news  aggregation  from delicious
october 2010 by jm

Copy this bookmark: