jm + analytics   31

The great British Brexit robbery: how our democracy was hijacked | Technology | The Guardian

A map shown to the Observer showing the many places in the world where SCL and Cambridge Analytica have worked includes Russia, Lithuania, Latvia, Ukraine, Iran and Moldova. Multiple Cambridge Analytica sources have revealed other links to Russia, including trips to the country, meetings with executives from Russian state-owned companies, and references by SCL employees to working for Russian entities.

Article 50 has been triggered. AggregateIQ is outside British jurisdiction. The Electoral Commission is powerless. And another election, with these same rules, is just a month away. It is not that the authorities don’t know there is cause for concern. The Observer has learned that the Crown Prosecution Service did appoint a special prosecutor to assess whether there was a case for a criminal investigation into whether campaign finance laws were broken. The CPS referred it back to the electoral commission. Someone close to the intelligence select committee tells me that “work is being done” on potential Russian interference in the referendum.

Gavin Millar, a QC and expert in electoral law, described the situation as “highly disturbing”. He believes the only way to find the truth would be to hold a public inquiry. But a government would need to call it. A government that has just triggered an election specifically to shore up its power base. An election designed to set us into permanent alignment with Trump’s America. [....]

This isn’t about Remain or Leave. It goes far beyond party politics. It’s about the first step into a brave, new, increasingly undemocratic world.
elections  brexit  trump  cambridge-analytica  aggregateiq  scary  analytics  data  targeting  scl  ukip  democracy  grim-meathook-future 
20 days ago by jm
pachyderm
'Containerized Data Analytics':
There are two bold new ideas in Pachyderm:

Containers as the core processing primitive
Version Control for data

These ideas lead directly to a system that's much more powerful, flexible and easy to use.

To process data, you simply create a containerized program which reads and writes to the local filesystem. You can use any tools you want because it's all just going in a container! Pachyderm will take your container and inject data into it. We'll then automatically replicate your container, showing each copy a different chunk of data. With this technique, Pachyderm can scale any code you write to process up to petabytes of data (Example: distributed grep).

Pachyderm also version controls all data using a commit-based distributed filesystem (PFS), similar to what git does with code. Version control for data has far reaching consequences in a distributed filesystem. You get the full history of your data, can track changes and diffs, collaborate with teammates, and if anything goes wrong you can revert the entire cluster with one click!

Version control is also very synergistic with our containerized processing engine. Pachyderm understands how your data changes and thus, as new data is ingested, can run your workload on the diff of the data rather than the whole thing. This means that there's no difference between a batched job and a streaming job, the same code will work for both!
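
For a feel of the "reads and writes to the local filesystem" model, here is a minimal sketch of the sort of program you might put in a container, loosely modelled on the distributed grep example mentioned above. The `grep_dir` helper and the directory layout are assumptions for illustration, not taken from the Pachyderm docs:

```python
import os
import re


def grep_dir(pattern, in_dir, out_dir):
    """Copy every line matching `pattern` from each file in in_dir
    to a file of the same name in out_dir. Returns the match count."""
    regex = re.compile(pattern)
    os.makedirs(out_dir, exist_ok=True)
    matched = 0
    for name in sorted(os.listdir(in_dir)):
        src = os.path.join(in_dir, name)
        if not os.path.isfile(src):
            continue  # this sketch only handles a flat directory
        dst = os.path.join(out_dir, name)
        with open(src, encoding="utf-8", errors="replace") as fin, \
                open(dst, "w", encoding="utf-8") as fout:
            for line in fin:
                if regex.search(line):
                    fout.write(line)
                    matched += 1
    return matched
```

Inside the container this would be called with whatever input and output directories the platform mounts, e.g. `grep_dir("ERROR", "/pfs/input", "/pfs/out")`; those two paths are assumed here and should be checked against the real docs. The code knows nothing about distribution: each replica just sees a different chunk of files in its input directory.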
analytics  data  containers  golang  pachyderm  tools  data-science  docker  version-control 
february 2017 by jm
The Fall of BIG DATA – arg min blog
Strongly agreed with this -- particularly the second of the three major failures, specifically:
Our community has developed remarkably effective tools to microtarget advertisements. But if you use ad models to deliver news, that’s propaganda. And just because we didn’t intend to spread rampant misinformation doesn’t mean we are not responsible.
big-data  analytics  data-science  statistics  us-politics  trump  data  science  propaganda  facebook  silicon-valley 
november 2016 by jm
ClickHouse — open-source distributed column-oriented DBMS
'ClickHouse manages extremely large volumes of data in a stable and sustainable manner. It currently powers Yandex.Metrica, world’s second largest web analytics platform, with over 13 trillion database records and over 20 billion events a day, generating customized reports on-the-fly, directly from non-aggregated data. This system was successfully implemented at CERN’s LHCb experiment to store and process metadata on 10bn events with over 1000 attributes per event registered in 2011.'

Yandex-tastic, but still looks really interesting
yandex  analytics  database  storage  sql  clickhouse 
june 2016 by jm
The Totally Managed Analytics Pipeline: Segment, Lambda, and Dynamo
notable mainly for the details of Terraform support for Lambda: that's a significant improvement to Lambda's production-readiness
aws  pipelines  data  streaming  lambda  dynamodb  analytics  terraform  ops 
october 2015 by jm
Scaling Analytics at Amplitude
Good blog post on Amplitude's lambda architecture setup, based on S3 and a custom "real-time set database" they wrote themselves.

antirez' comment from a Redis angle on the set database: http://antirez.com/news/92

HN thread: https://news.ycombinator.com/item?id=10118413
lambda-architecture  analytics  via:hn  redis  set-storage  storage  databases  architecture  s3  realtime 
august 2015 by jm
jgc on Cloudflare's log pipeline
Cloudflare are running a 40-machine, 50TB Kafka cluster, ingesting at 15 Gbps, for log processing. Also: Go producers/consumers, capnproto as wire format, and CitusDB/Postgres to store rolled-up analytics output. Also using Space Saver (top-k) and HLL (counting) estimation algorithms.
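
For reference, the top-k side of this can be sketched in a few lines. The following is a toy version of the Space-Saving algorithm, which I am assuming is what "Space Saver" refers to; it is not Cloudflare's code, and the class and method names are made up for illustration (HLL is not shown):

```python
class SpaceSaving:
    """Approximate top-k counting in bounded memory (Space-Saving sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}  # item -> estimated count (never below the true count)
        self.errors = {}  # item -> maximum possible overestimate

    def add(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # Evict the smallest counter; the newcomer inherits its count.
            # Linear scan for simplicity; real implementations keep the
            # counters ordered so this step is cheap.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[item] = floor + 1
            self.errors[item] = floor

    def top(self, n):
        """Return the n items with the largest estimated counts."""
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:n]
```

The trade-off is that memory stays fixed at `capacity` counters however many distinct keys appear in the stream, at the cost of overestimating the counts of rarer items by at most the recorded error.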
logs  cloudflare  kafka  go  capnproto  architecture  citusdb  postgres  analytics  streaming 
june 2015 by jm
Twitter ditches Storm
in favour of a proprietary ground-up rewrite called Heron. Reading between the lines it sounds like Storm had problems with latency, reliability, data loss, and supporting back pressure.
analytics  architecture  twitter  storm  heron  backpressure  streaming  realtime  queueing 
june 2015 by jm
The Violence of Algorithms: Why Big Data Is Only as Smart as Those Who Generate It
The modern state system is built on a bargain between governments and citizens. States provide collective social goods, and in turn, via a system of norms, institutions, regulations, and ethics to hold this power accountable, citizens give states legitimacy. This bargain created order and stability out of what was an increasingly chaotic global system. If algorithms represent a new ungoverned space, a hidden and potentially ever-evolving unknowable public good, then they are an affront to our democratic system, one that requires transparency and accountability in order to function. A node of power that exists outside of these bounds is a threat to the notion of collective governance itself. This, at its core, is a profoundly undemocratic notion—one that states will have to engage with seriously if they are going to remain relevant and legitimate to their digital citizenry who give them their power.
palantir  algorithms  big-data  government  democracy  transparency  accountability  analytics  surveillance  war  privacy  protest  rights 
june 2015 by jm
One year of InfluxDB and the road to 1.0
half of the [Monitorama] attendees were employees and entrepreneurs at monitoring, metrics, DevOps, and server analytics companies. Most of them had a story about how their metrics API was their key intellectual property that took them years to develop. The other half of the attendees were developers at larger organizations that were rolling their own DevOps stack from a collection of open source tools. Almost all of them were creating a “time series database” with a bunch of web services code on top of some other database or just using Graphite. When everyone is repeating the same work, it’s not key intellectual property or a differentiator, it’s a barrier to entry. Not only that, it’s something that is hindering innovation in this space since everyone has to spend their first year or two getting to the point where they can start building something real. It’s like building a web company in 1998. You have to spend millions of dollars and a year building infrastructure, racking servers, and getting everything ready before you could run the application. Monitoring and analytics applications should not be like this.
graphite  monitoring  metrics  tsd  time-series  analytics  influxdb  open-source 
february 2015 by jm
Twitter's TSAR
TSAR = "Time Series AggregatoR". Twitter's new event processor-style architecture for internal metrics. Notably, Twitter and Google both now appear to be moving towards a programming model in which the same code runs equally well in realtime streaming and batch modes (Summingbird, MillWheel, Flume).
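
The general trick behind "same code in batch and streaming" can be sketched as follows: express the aggregation as a per-event map plus an associative merge, then drive that pair from either a batch fold or an event-at-a-time loop. This is only an illustration of the idea; the event fields and function names are invented and are not TSAR's or Summingbird's API:

```python
from collections import Counter
from functools import reduce


def to_counts(event):
    """Map one event to a partial aggregate keyed by (minute, metric)."""
    return Counter({(event["minute"], event["metric"]): event["value"]})


def merge(a, b):
    """Associative, commutative combine step: add counts key by key."""
    out = Counter(a)
    out.update(b)
    return out


def run_batch(events):
    """Batch mode: fold the whole dataset in one pass."""
    return reduce(merge, map(to_counts, events), Counter())


class StreamingAggregator:
    """Streaming mode: the same two functions, applied one event at a time."""

    def __init__(self):
        self.state = Counter()

    def on_event(self, event):
        self.state = merge(self.state, to_counts(event))
```

Because `merge` is associative, partial results can be combined in any grouping, which is what lets a batch job and a streaming job agree on the answer.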
analytics  architecture  twitter  tsar  aggregation  event-processing  metrics  streaming  hadoop  batch 
june 2014 by jm
Only 0.15 percent of mobile gamers account for 50 percent of all in-game revenue
Nice bit of marketing from the day job:
The group of gamers responsible for half of all in-game revenue in mobile titles is frightening because it is so narrow, according to a survey by Swrve, an established analytics and app marketing firm. About 0.15 percent of mobile gamers contribute 50 percent of all of the in-app purchases generated in free-to-play games.

This means it may be even more important than game companies realized in the past to find and retain the users that fall into the category of big spenders, or “whales.” The vast majority of users never spend any money, despite the clever tactics that game publishers have developed to incentivize people to spend money in their favorite games.
swrve  whales  gaming  games  iap  money  mobile  analytics 
february 2014 by jm
Big, Small, Hot or Cold - Your Data Needs a Robust Pipeline
'(Examples [of big-data B-I crunching pipelines] from Stripe, Tapad, Etsy & Square)'
stripe  tapad  etsy  square  big-data  analytics  kafka  impala  hadoop  hdfs  parquet  thrift 
february 2014 by jm
Cloudera Impala 1.0: It’s Here, It’s Real, It’s Already the Standard for SQL on Hadoop
we are proud to announce the first production drop of Impala, which reflects feedback from across the user community based on multiple types of real-world workloads. Just as a refresher, the main design principle behind Impala is complete integration with the Hadoop platform (jointly utilizing a single pool of storage, metadata model, security framework, and set of system resources). This integration allows Impala users to take advantage of the time-tested cost, flexibility, and scale advantages of Hadoop for interactive SQL queries, and makes SQL a first-class Hadoop citizen alongside MapReduce and other frameworks. The net result is that all your data becomes available for interactive analysis simultaneously with all other types of processing, with no ETL delays needed.


Along with some great benchmark numbers against Hive. nifty stuff
cloudera  impala  sql  querying  etl  olap  hadoop  analytics  business-intelligence  reports 
may 2013 by jm
Metric Collection and Storage with Cassandra | DataStax
DataStax' documentation on how they store time-series (TSD) data in Cassandra. Pretty generic.
datastax  nosql  metrics  analytics  cassandra  tsd  time-series  storage 
march 2013 by jm
OscarGodson.js | What I Learned At Yammer
some pretty interesting lessons, it turns out: a 'take what you need' vacation policy means nobody takes vacations (unsurprising); Yammer actively work to avoid employee burnout (good idea); Yammer A/B test every feature; and Yammer mgmt try to let their devs work autonomously.
yammer  startups  testing  analytics  culture  work 
march 2013 by jm
Cassandra, Hive, and Hadoop: How We Picked Our Analytics Stack
reasonably good whole-stack performance testing and analysis; HBase, Riak, MongoDB, and Cassandra compared. Riak did pretty badly :(
riak  mongodb  cassandra  hbase  performance  analytics  hadoop  hive  big-data  storage  databases  nosql 
february 2013 by jm
Splout
'Splout is a scalable, open-source, easy-to-manage SQL big data view. Splout is to Hadoop + SQL what Voldemort or Elephant DB are to Hadoop + Key/Value. Splout serves a read-only, partitioned SQL view which is generated and indexed by Hadoop.'

Some FAQs: 'What's the difference between Splout SQL and Dremel-like solutions such as BigQuery, Impala or Apache Drill? Splout SQL is not a "fast analytics" Dremel-like engine. It is more thought to be used for serving datasets under web / mobile high-throughput, many lookups, low-latency applications. Splout SQL is more like a NoSQL database in the sense that it has been thought for answering queries under sub-second latencies. It has been thought for performing queries that impact a very small subset of the data, not queries that analyze the whole dataset at once.'
splout  sql  big-data  hadoop  read-only  scaling  queries  analytics 
february 2013 by jm
'The Unified Logging Infrastructure for Data Analytics at Twitter' [PDF]
A picture of how Twitter standardized their internal service event logging formats to allow batch analysis and analytics. They surface service metrics to dashboards from Pig jobs on a daily basis, which frankly doesn't sound too great...
twitter  analytics  event-logging  events  logging  metrics 
january 2013 by jm
Dan McKinley :: Whom the Gods Would Destroy, They First Give Real-time Analytics
'It's important to divorce the concepts of operational metrics and product analytics. [..] Funny business with timeframes can coerce most A/B tests into statistical significance.' 'The truth is that there are very few product decisions that can be made in real time.'

HN discussion: http://news.ycombinator.com/item?id=5032588
real-time  analytics  statistics  a-b-testing 
january 2013 by jm
Scaling Crashlytics: Building Analytics on Redis 2.6
How one analytics/metrics co is using Redis on the backend
analytics  redis  presentation  metrics 
january 2013 by jm
HBase Real-time Analytics & Rollbacks via Append-based Updates
Interesting concept for scaling up the write rate on massive key-value counter stores:
'Replace update (Get+Put) operations at write time with simple append-only writes and defer processing of updates to periodic jobs or perform aggregations on the fly if user asks for data earlier than individual additions are processed. The idea is simple and not necessarily novel, but given the specific qualities of HBase, namely fast range scans and high write throughput, this approach works very well.'
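
A toy in-memory sketch of that pattern, to make the write and read paths concrete. The class is hypothetical, and a Python list and dict stand in for what would be HBase rows and range scans in the real thing:

```python
from collections import defaultdict


class AppendOnlyCounters:
    """Counters where writes are appends and aggregation is deferred."""

    def __init__(self):
        self.totals = defaultdict(int)  # compacted per-key totals
        self.pending = []               # append-only log of (key, delta)

    def increment(self, key, delta=1):
        # Write path: a blind append, with no Get+Put round trip.
        self.pending.append((key, delta))

    def compact(self):
        # Periodic job: fold the pending appends into the totals.
        batch, self.pending = self.pending, []
        for key, delta in batch:
            self.totals[key] += delta
        return len(batch)

    def get(self, key):
        # Read path: aggregate on the fly over anything not yet compacted.
        unprocessed = sum(d for k, d in self.pending if k == key)
        return self.totals.get(key, 0) + unprocessed
```

Reads give the same answer before and after `compact()`; the deferred job only moves the aggregation cost off the write path.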
counters  analytics  hbase  append  sematext  aggregation  big-data 
december 2012 by jm
The innards of Evernote's new business analytics data warehouse
replacing a giant MySQL star-schema reporting server with a Hadoop/Hive/ParAccel cluster
horizontal-scaling  scalability  bi  analytics  reporting  evernote  via:highscalability  hive  hadoop  paraccel 
december 2012 by jm
What can data scientists learn from DevOps?
Interesting.

'Rather than continuing to pretend analysis is a one-time, ad hoc action, automate it. [...] you need to maintain the automation machinery, but a cost-benefit analysis will show that the effort rapidly pays off — particularly for complex actions such as analysis that are nontrivial to get right.' (via @fintanr)
via:fintanr  data-science  data  automation  devops  analytics  analysis 
november 2012 by jm
Facebook's New Realtime Analytics System: HBase to Process 20 Billion Events Per Day
Scribe logs events, "ptail" (parallel tail presumably) tails logs from Scribe stores, Puma batch-aggregates, writes to HBase.  Java and Thrift on the backend, PHP in front
facebook  hbase  scalability  performance  hadoop  scribe  events  analytics  architecture  tail  append  from delicious
march 2011 by jm
NoSQL at Twitter (NoSQL EU 2010) [PDF]
specifically, Hadoop and Pig for log/metrics analytics, Cassandra going forward; great preso, lots of detail and code examples. also, impressive number-crunching going on at Twitter
twitter  analytics  cassandra  databases  hadoop  pdf  logs  metrics  number-crunching  nosql  pig  presentation  slides  scribe  from delicious
april 2010 by jm
PeteSearch: How to split up the US
wow. fascinating results from social-network cluster analysis of Facebook, splitting up the entire USA into 7 clusters
clusters  facebook  data  statistics  maps  culture  analytics  datamining  demographics  socialnetworking  graph  dataviz  from delicious
february 2010 by jm
CloudSplit – Real Time Cloud Analytics
interesting idea from Joe -- track your cloud-hosting spend in real-time
cloudsplit  hosting  amazon  ec2  azure  joe-drumgoole  analytics  real-time  from delicious
september 2009 by jm
glTail.rb - realtime logfile visualization
'View real-time data and statistics from any logfile on any server with SSH, in an intuitive and entertaining way', supporting postfix/spamd/clamd logs among loads of others. very cool if a little silly
dataviz  visualization  tail  gltail  opengl  linux  apache  spamd  spamassassin  logs  statistics  sysadmin  analytics  animation  analysis  server  ruby  monitoring  logging  logfiles 
july 2009 by jm

