jm + papers   86

When DNNs go wrong – adversarial examples and what we can learn from them
Excellent paper.
[The] results suggest that classifiers based on modern machine learning techniques, even those that obtain excellent performance on the test set, are not learning the true underlying concepts that determine the correct output label. Instead, these algorithms have built a Potemkin village that works well on naturally occurring data, but is exposed as a fake when one visits points in space that do not have high probability in the data distribution.
ai  deep-learning  dnns  neural-networks  adversarial-classification  classification  classifiers  machine-learning  papers 
29 days ago by jm
'Software Engineering at Google'
20 pages of Google's software dev practices, with emphasis on the build system (since it was written by the guy behind Blaze). Naturally, some of these practices don't make a whole lot of sense outside of Google, but there's still some good stuff here
development  engineering  google  papers  software  coding  best-practices 
6 weeks ago by jm
Slicer: Auto-sharding for datacenter applications
Paper from Google describing one of their internal building block services:
A general purpose sharding service. I normally think of sharding as something that happens within a (typically data) service, not as a general purpose infrastructure service. What exactly is Slicer then? It has two key components: a data plane that acts as an affinity-aware load balancer, with affinity managed based on application-specified keys; and a control plane that monitors load and instructs application processes as to which keys they should be serving at any one point in time. In this way, the decisions regarding how to balance keys across application instances can be outsourced to the Slicer service rather than building this logic over and over again for each individual back-end service. Slicer is focused exclusively on the problem of balancing load across a given set of backend tasks; other systems are responsible for adding and removing tasks.


interesting.
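
To make the data-plane idea concrete, here's a rough sketch of my own -- not Google's code, and far simpler than the real thing (Slicer assigns contiguous hashed-key ranges, or "slices", and rebalances them based on observed load, rather than using a fixed modulo mapping):

import java.util.List;

// Hypothetical sketch only: hash application-specified keys onto the current set of
// backend tasks, so the same key always routes to the same task ("affinity"). In the
// real system a central control plane moves hashed-key ranges between tasks as load
// shifts; nothing here is Slicer's actual API.
class NaiveSlicerSketch {
    private final List<String> backendTasks;   // e.g. task addresses

    NaiveSlicerSketch(List<String> backendTasks) {
        this.backendTasks = backendTasks;
    }

    // Affinity: the same key always maps to the same task (until reassignment).
    String taskForKey(String applicationKey) {
        int bucket = Math.floorMod(applicationKey.hashCode(), backendTasks.size());
        return backendTasks.get(bucket);
    }
}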
google  sharding  slicer  architecture  papers 
december 2016 by jm
Simple testing can prevent most critical failures
Specifically, the following 3 classes of errors were implicated in 92% of the major production outages in this study and could have been caught with simple code review:
Error handlers that ignore errors (or just contain a log statement); error handlers with “TODO” or “FIXME” in the comment; and error handlers that catch an abstract exception type (e.g. Exception or Throwable in Java) and then take drastic action such as aborting the system.


(Interestingly, the latter was a particular favourite approach of some misplaced "fail fast"/"crash-only software design" dogma in Amazon. I wasn't a fan)
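
For illustration, the three antipatterns in Java. This is a hedged sketch of my own -- the class and method names are invented, and it's not code from the paper or from the studied systems:

import java.io.IOException;

public class ErrorHandlingAntiPatterns {

    // 1. An error handler that ignores the error entirely (or only logs it).
    void swallowedError() {
        try {
            flushToDisk();
        } catch (IOException e) {
            // ignored -- the failure silently goes unnoticed
        }
    }

    // 2. An error handler that was never finished.
    void unfinishedHandler() {
        try {
            flushToDisk();
        } catch (IOException e) {
            // TODO: handle this properly before shipping
        }
    }

    // 3. Catching an overly-abstract type and then taking drastic action.
    void drasticHandler() {
        try {
            flushToDisk();
        } catch (Throwable t) {
            System.exit(1);   // aborts the whole process on *any* error, even benign ones
        }
    }

    void flushToDisk() throws IOException {
        // stand-in for some fallible operation
    }
}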
fail-fast  crash-only-software  coding  design  bugs  code-review  review  outages  papers  logging  errors  exceptions 
october 2016 by jm
MRI software bugs could upend years of research - The Register
In their paper at PNAS, they write: “the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%. These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results.”

For example, a bug that's been sitting in a package called 3dClustSim for 15 years, fixed in May 2015, produced bad results (3dClustSim is part of the AFNI suite; the others are SPM and FSL). That's not a gentle nudge that some results might be overstated: it's more like making a bonfire of thousands of scientific papers.

Further: “Our results suggest that the principal cause of the invalid cluster inferences is spatial autocorrelation functions that do not follow the assumed Gaussian shape”.

The researchers used published fMRI results, and along the way they swipe the fMRI community for their “lamentable archiving and data-sharing practices” that prevent most of the discipline's body of work being re-analysed.
fmri  science  mri  statistics  cluster-inference  autocorrelation  data  papers  medicine  false-positives  fps  neuroimaging 
july 2016 by jm
Why do Selenium-style record/replay tests of web applications break?
good data! Mostly because of element locators, it seems....
selenium  testing  web  locators  papers  qa  tests 
may 2016 by jm
PLOS ONE: Tyrannobdella rex N. Gen. N. Sp. and the Evolutionary Origins of Mucosal Leech Infestations
Today in nose-leech news -- the paper!
Principal Findings: A new genus and species of leech from Perú was found feeding from the nasopharynx of humans. Unlike any other leech previously described, this new taxon has but a single jaw with very large teeth. Phylogenetic analyses of nuclear and mitochondrial genes using parsimony and Bayesian inference demonstrate that the new species belongs among a larger, global clade of leeches, all of which feed from the mucosal surfaces of mammals.

Conclusions: This new species, found feeding from the upper respiratory tract of humans in Perú, clarifies an expansion of the family Praobdellidae to include the new species Tyrannobdella rex n. gen. n.sp., along with others in the genera Dinobdella, Myxobdella, Praobdella and Pintobdella. Moreover, the results clarify a single evolutionary origin of a group of leeches that specializes on mucous membranes, thus, posing a distinct threat to human health.
leeches  nose-leech  papers  science  species  tyrannobdella-rex  horror 
may 2016 by jm
Public preferences for electronic health data storage, access, and sharing – evidence from a pan-European survey | Journal of the American Medical Informatics Association
Results: We obtained 20 882 survey responses (94 606 preferences) from 27 EU member countries. Respondents recognized the benefits of storing electronic health information, with 75.5%, 63.9%, and 58.9% agreeing that storage was important for improving treatment quality, preventing epidemics, and reducing delays, respectively. Concerns about different levels of access by third parties were expressed by 48.9% to 60.6% of respondents. On average, compared to devices or systems that only store basic health status information, respondents preferred devices that also store identification data (coefficient/relative preference 95% CI = 0.04 [0.00-0.08], P = 0.034) and information on lifelong health conditions (coefficient = 0.13 [0.08 to 0.18], P < 0.001), but there was no evidence of this for devices with information on sensitive health conditions such as mental and sexual health and addictions (coefficient = −0.03 [−0.09 to 0.02], P = 0.24). Respondents were averse to their immediate family (coefficient = −0.05 [−0.05 to −0.01], P = 0.011) and home care nurses (coefficient = −0.06 [−0.11 to −0.02], P = 0.004) viewing this data, and strongly averse to health insurance companies (coefficient = −0.43 [−0.52 to 0.34], P < 0.001), private sector pharmaceutical companies (coefficient = −0.82 [−0.99 to −0.64], P < 0.001), and academic researchers (coefficient = −0.53 [−0.66 to −0.40], P < 0.001) viewing the data.

Conclusions: Storing more detailed electronic health data was generally preferred, but respondents were averse to wider access to and sharing of this information. When developing frameworks for the use of electronic health data, policy makers should consider approaches that both highlight the benefits to the individual and minimize the perception of privacy risks.


Via Antoin.
privacy  data  medicine  health  healthcare  papers  via:antoin 
april 2016 by jm
The Rise of Pirate Libraries
The history of this is fascinating:
Today’s pirate libraries have their roots in the work of Russian academics to digitize texts in the 1990s. Scholars in that part of the world had long had a thriving practice of passing literature and scientific information underground, in opposition to government censorship—part of the samizdat culture, in which banned documents were copied and passed hand to hand through illicit channels. Those first digital collections were passed freely around, but when their creators started running into problems with copyright, their collections “retreated from the public view,” writes Balázs Bodó, a piracy researcher based at the University of Amsterdam. “The text collections were far too valuable to simply delete,” he writes, and instead migrated to “closed, membership-only FTP servers.” [....]

There’s always been osmosis within the academic community of copyrighted materials from people with access to scholars without. “Much of the life of a research academic in Kazakhstan or Iran or Malaysia involves this informal diffusion of materials across the gated walls of the top universities,” he says.
pirates  pirate-libraries  libraries  archival  history  russia  ussr  samizdat  samizdata  academia  papers 
april 2016 by jm
The science behind "don't drink when pregnant" is rubbish
As the economist Emily Oster pointed out in her 2013 book Expecting Better, there is also no “proven safe” level of Tylenol or caffeine, and yet both are fine in moderation during pregnancy. Oster pored through reams of research on alcohol and pregnancy for her book and concluded that there is simply no scientific evidence that light drinking during pregnancy impacts a baby’s health. (In one frequently cited 2001 study that suggested light drinking in pregnancy increases the chances of a child displaying aggressive behaviors, the drinkers were also significantly likelier to have taken cocaine during pregnancy.)


My wife also followed the paper trail on this issue in the past. In the papers from which these recommendations were derived, the level of drinking at which any effects were observed in babies was when women consumed at least *9 units every day* for the entire pregnancy. That's an entire bottle of wine, daily!
booze  alcohol  science  facts  papers  medicine  emily-oster  babies  pregnancy  pre-pregnant  research 
february 2016 by jm
Files Are Hard
This is basically terrifying. A catalog of race conditions and reliability horrors around the POSIX filesystem abstraction in Linux -- it's a wonder anything works.

'Where’s this documented? Oh, in some mailing list post 6-8 years ago (which makes it 12-14 years from today). The fs devs whose posts I’ve read are quite polite compared to LKML’s reputation, and they generously spend a lot of time responding to basic questions, but it’s hard for outsiders to troll [sic] through a decade and a half of mailing list postings to figure out which ones are still valid and which ones have been obsoleted! I don’t mean to pick on filesystem devs. In their OSDI 2014 talk, the authors of the paper we’re discussing noted that when they reported bugs they’d found, developers would often respond “POSIX doesn’t let filesystems do that”, without being able to point to any specific POSIX documentation to support their statement. If you’ve followed Kyle Kingsbury’s Jepsen work, this may sound familiar, except devs respond with “filesystems don’t do that” instead of “networks don’t do that”.I think this is understandable, given how much misinformation is out there. Not being a filesystem dev myself, I’d be a bit surprised if I don’t have at least one bug in this post.'
filesystems  linux  unix  files  operating-systems  posix  fsync  osdi  papers  reliability 
december 2015 by jm
"Hidden Technical Debt in Machine-Learning Systems" [pdf]
Another great paper from Google, talking about the tradeoffs that must be considered in practice over the long term with running a complex ML system in production.
technical-debt  ml  machine-learning  ops  software  production  papers  pdf  google 
december 2015 by jm
The impact of Docker containers on the performance of genomic pipelines [PeerJ]
In this paper, we have assessed the impact of Docker containers technology on the performance of genomic pipelines, showing that container “virtualization” has a negligible overhead on pipeline performance when it is composed of medium/long running tasks, which is the most common scenario in computational genomic pipelines.

Interestingly for these tasks the observed standard deviation is smaller when running with Docker. This suggests that the execution with containers is more “homogeneous,” presumably due to the isolation provided by the container environment.

The performance degradation is more significant for pipelines where most of the tasks have a fine or very fine granularity (a few seconds or milliseconds). In this case, the container instantiation time, though small, cannot be ignored and produces a perceptible loss of performance.
performance  docker  ops  genomics  papers 
november 2015 by jm
An Analysis of Reshipping Mule Scams
We observed that the vast majority of the re-shipped packages end up in the Moscow, Russia area, and that the goods purchased with stolen credit cards span multiple categories, from expensive electronics such as Apple products, to designer clothes, to DSLR cameras and even weapon accessories. Given the amount of goods shipped by the reshipping mule sites that we analysed, the annual revenue generated from such operations can span between 1.8 and 7.3 million US dollars. The overall losses are much higher though: the online merchant loses an expensive item from its inventory and typically has to refund the owner of the stolen credit card. In addition, the rogue goods typically travel labeled as “second hand goods” and therefore custom taxes are also evaded. Once the items purchased with stolen credit cards reach their destination they will be sold on the black market by cybercriminals. [...] When applying for the job, people are usually required to send the operator copies of their ID cards and passport. After they are hired, mules are promised to be paid at the end of their first month of employment. However, from our data it is clear that mules are usually never paid. After their first month expires, they are never contacted back by the operator, who just moves on and hires new mules. In other words, the mules become victims of this scam themselves, by never seeing a penny. Moreover, because they sent copies of their documents to the criminals, mules can potentially become victims of identity theft.
crime  law  cybercrime  mules  shipping-scams  identity-theft  russia  moscow  scams  papers 
november 2015 by jm
Existential Consistency: Measuring and Understanding Consistency at Facebook
The metric is termed φ(P)-consistency, and is actually very simple. A read for the same data is sent to all replicas in P, and φ(P)-consistency is defined as the frequency with which that read returns the same result from all replicas. φ(G)-consistency applies this metric globally, and φ(R)-consistency applies it within a region (cluster). Facebook have been tracking this metric in production since 2012.
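
It's simple enough to sketch. A minimal illustration of my own (not Facebook's code), assuming we can issue the same read to every replica in P and compare the responses:

import java.util.List;

class PhiConsistencyChecker {
    private long checks = 0;
    private long allAgreed = 0;

    // Record one check: the responses returned by each replica in P for the same read.
    void recordCheck(List<String> replicaResponses) {
        checks++;
        if (replicaResponses.stream().distinct().count() == 1) {
            allAgreed++;
        }
    }

    // phi(P)-consistency: the fraction of checks where every replica returned the same value.
    double phiConsistency() {
        return checks == 0 ? 1.0 : (double) allAgreed / checks;
    }
}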
facebook  eventual-consistency  consistency  metrics  papers  cap  distributed-computing 
october 2015 by jm
Using Samsung's Internet-Enabled Refrigerator for Man-in-the-Middle Attacks
Whilst the fridge implements SSL, it FAILS to validate SSL certificates, thereby enabling man-in-the-middle attacks against most connections. This includes those made to Google's servers to download Gmail calendar information for the on-screen display. So, MITM the victim's fridge from next door, or on the road outside and you can potentially steal their Google credentials.


The Internet of Insecure Things strikes again.
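
The usual root cause of this class of bug is a "trust everything" TrustManager (or an equivalent hostname-verification bypass) installed into the SSLContext. A hypothetical Java example of the antipattern -- not Samsung's actual code:

import javax.net.ssl.X509TrustManager;
import java.security.cert.X509Certificate;

// Accepts any certificate chain from any server, making the TLS connection
// trivially interceptable by a man in the middle. Never do this.
class TrustAllCertificates implements X509TrustManager {
    @Override
    public void checkClientTrusted(X509Certificate[] chain, String authType) {
        // accepts anything
    }

    @Override
    public void checkServerTrusted(X509Certificate[] chain, String authType) {
        // accepts anything -- no validation of the server's certificate chain
    }

    @Override
    public X509Certificate[] getAcceptedIssuers() {
        return new X509Certificate[0];
    }
}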
iot  security  fridges  samsung  fail  mitm  ssl  tls  google  papers  defcon 
september 2015 by jm
Mining High-Speed Data Streams: The Hoeffding Tree Algorithm
This paper proposes a decision tree learner for data streams, the Hoeffding Tree algorithm, which comes with the guarantee that the learned decision tree is asymptotically nearly identical to that of a non-incremental learner using infinitely many examples. This work constitutes a significant step in developing methodology suitable for modern ‘big data’ challenges and has initiated a lot of follow-up research. The Hoeffding Tree algorithm has been covered in various textbooks and is available in several public domain tools, including the WEKA Data Mining platform.
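
The key trick is the Hoeffding bound: it tells you how many examples a leaf must see before the best split attribute can be chosen with confidence 1 - delta. A minimal sketch of my own, not the paper's code or WEKA's implementation:

class HoeffdingBound {
    // With n examples seen at a leaf, a split metric with range R (e.g. log2(numClasses)
    // for information gain), and confidence 1 - delta, the true mean of the metric lies
    // within epsilon of the observed mean.
    static double epsilon(double range, double delta, long n) {
        return Math.sqrt((range * range * Math.log(1.0 / delta)) / (2.0 * n));
    }

    // Split once the observed advantage of the best attribute over the runner-up
    // exceeds epsilon -- the split is then very likely the same one a batch learner
    // with unlimited data would have chosen.
    static boolean shouldSplit(double bestGain, double secondBestGain,
                               double range, double delta, long n) {
        return (bestGain - secondBestGain) > epsilon(range, delta, n);
    }
}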
hoeffding-tree  algorithms  data-structures  streaming  streams  cep  decision-trees  ml  learning  papers 
august 2015 by jm
GSMem: Data Exfiltration from Air-Gapped Computers over GSM Frequencies
Holy shit.
Air-gapped networks are isolated, separated both logically and physically from public networks. Although the feasibility of invading such systems has been demonstrated in recent years, exfiltration of data from air-gapped networks is still a challenging task. In this paper we present GSMem, a malware that can exfiltrate data through an air-gap over cellular frequencies. Rogue software on an infected target computer modulates and transmits electromagnetic signals at cellular frequencies by invoking specific memory-related instructions and utilizing the multichannel memory architecture to amplify the transmission. Furthermore, we show that the transmitted signals can be received and demodulated by a rootkit placed in the baseband firmware of a nearby cellular phone.
gsmem  gsm  exfiltration  air-gaps  memory  radio  mobile-phones  security  papers 
august 2015 by jm
Somewhere Over the Rainbow: How to Make Effective Use of Colors in Meteorological Visualizations
Linked from the "Improving the Weather On Twitter" post -- choosing the "best" colour scheme for meteorological visualization. Great dataviz resource post
dataviz  colour  color  meteorological  weather  nws  papers  rgb  hcl 
august 2015 by jm
Men who harass women online are quite literally losers, new study finds
(1) players are anonymous, and the possibility of “policing individual behavior is almost impossible”; (2) they only encounter each other a few times in passing — it’s very possible to hurl an expletive at another player, and never “see” him or her again; and (3) finally, and perhaps predictably, the sex-ratio of players is biased pretty heavily toward men. (A 2014 survey of gender ratios on Reddit found that r/halo was over 95 percent male.) [....]

In each of these environments, Kasumovic suggests, a recent influx of female participants has disrupted a pre-existing social hierarchy. That’s okay for the guys at the top — but for the guys at the bottom, who stand to lose more status, that’s very threatening. (It’s also in keeping with the evolutionary framework on anti-lady hostility, which suggests sexism is a kind of Neanderthal defense mechanism for low-status, non-dominant men trying to maintain a shaky grip on their particular cave’s supply of women.)

“As men often rely on aggression to maintain their dominant social status,” Kasumovic writes, “the increase in hostility towards a woman by lower-status males may be an attempt to disregard a female’s performance and suppress her disturbance on the hierarchy to retain their social rank.”
losers  sexism  mysogyny  women  halo  gaming  gamergate  4chan  abuse  harrassment  papers  bullying  social-status 
july 2015 by jm
Discretized Streams: Fault Tolerant Stream Computing at Scale
The paper describing the innards of Spark Streaming and its RDD-based recomputation algorithm:
we use a data structure called Resilient Distributed Datasets (RDDs), which keeps data in memory and can recover it without replication by tracking the lineage graph of operations that were used to build it. With RDDs, we show that we can attain sub-second end-to-end latencies. We believe that this is sufficient for many real-world big data applications, where the timescale of the events tracked (e.g., trends in social media) is much higher.
rdd  spark  streaming  fault-tolerance  batch  distcomp  papers  big-data  scalability 
june 2015 by jm
Evidence-Based Software Engineering

Objective: Our objective is to describe how software engineering might benefit from an evidence-based approach and to identify the potential difficulties associated with the approach.
Method: We compared the organisation and technical infrastructure supporting evidence-based medicine (EBM) with the situation in software engineering. We considered the impact that factors peculiar to software engineering (i.e. the skill factor and the lifecycle factor) would have on our ability to practice evidence-based software engineering (EBSE).
Results: EBSE promises a number of benefits by encouraging integration of research results with a view to supporting the needs of many different stakeholder groups. However, we do not currently have the infrastructure needed for widespread adoption of EBSE. The skill factor means software engineering experiments are vulnerable to subject and experimenter bias. The lifecycle factor means it is difficult to determine how technologies will behave once deployed.
Conclusions: Software engineering would benefit from adopting what it can of the evidence approach provided that it deals with the specific problems that arise from the nature of software engineering.


(via Mark Dennehy)
papers  toread  via:markdennehy  software  coding  ebse  evidence-based-medicine  medicine  research 
june 2015 by jm
Hybrid Logical Clocks
neat substitute for physical-time clocks in synchronization and ordering in a distributed system, based on Lamport's Logical Clocks and Google's TrueTime.

'HLC captures the causality relationship like LC, and enables easy identification of consistent snapshots in distributed systems. Dually, HLC can be used in lieu of PT clocks since it maintains its logical clock to be always close to the PT clock.'
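
The update rules are short enough to sketch. This is my reading of the algorithm from the paper, condensed into Java and untested against the authors' implementation; timestamps are (l, c) pairs, where l tracks physical time and c is a logical counter used to break ties:

class HybridLogicalClock {
    private long l = 0;   // logical copy of physical time
    private long c = 0;   // counter within the same l

    // Called for a local event or just before sending a message.
    synchronized long[] sendOrLocalEvent(long physicalTimeMillis) {
        long lOld = l;
        l = Math.max(lOld, physicalTimeMillis);
        c = (l == lOld) ? c + 1 : 0;
        return new long[] { l, c };
    }

    // Called when a message carrying timestamp (lMsg, cMsg) is received.
    synchronized long[] receiveEvent(long lMsg, long cMsg, long physicalTimeMillis) {
        long lOld = l;
        l = Math.max(Math.max(lOld, lMsg), physicalTimeMillis);
        if (l == lOld && l == lMsg) {
            c = Math.max(c, cMsg) + 1;
        } else if (l == lOld) {
            c = c + 1;
        } else if (l == lMsg) {
            c = cMsg + 1;
        } else {
            c = 0;
        }
        return new long[] { l, c };
    }
}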
hlc  clocks  logical-clocks  time  synchronization  ordering  events  logs  papers  algorithms  truetime  distcomp 
june 2015 by jm
I Fooled Millions Into Thinking Chocolate Helps Weight Loss
“Slim by Chocolate!” the headlines blared. A team of German researchers had found that people on a low-carb diet lost weight 10 percent faster if they ate a chocolate bar every day. It made the front page of Bild, Europe’s largest daily newspaper, just beneath their update about the Germanwings crash. From there, it ricocheted around the internet and beyond, making news in more than 20 countries and half a dozen languages. It was discussed on television news shows. It appeared in glossy print, most recently in the June issue of Shape magazine (“Why You Must Eat Chocolate Daily”, page 128). Not only does chocolate accelerate weight loss, the study found, but it leads to healthier cholesterol levels and overall increased well-being. The Bild story quotes the study’s lead author, Johannes Bohannon, Ph.D., research director of the Institute of Diet and Health: “The best part is you can buy chocolate everywhere.”

I am Johannes Bohannon, Ph.D. Well, actually my name is John, and I’m a journalist. I do have a Ph.D., but it’s in the molecular biology of bacteria, not humans. The Institute of Diet and Health? That’s nothing more than a website. Other than those fibs, the study was 100 percent authentic. My colleagues and I recruited actual human subjects in Germany. We ran an actual clinical trial, with subjects randomly assigned to different diet regimes. And the statistically significant benefits of chocolate that we reported are based on the actual data. It was, in fact, a fairly typical study for the field of diet research. Which is to say: It was terrible science. The results are meaningless, and the health claims that the media blasted out to millions of people around the world are utterly unfounded.


Interesting bit: the online commenters commenting on the published stories quickly saw through the bullshit. Why can't the churnalising journos do that?
chocolate  journalism  science  diet  food  churnalism  pr  bild  health  clinical-trials  papers  peer-review  research 
may 2015 by jm
Top 10 data mining algorithms in plain English
This is a phenomenally useful ML/data-mining resource post -- 'the top 10 most influential data mining algorithms as voted on by 3 separate panels in [ICDM '06's] survey paper', but with a nice clear intro and description for each one. Here's the algorithms covered:
1. C4.5
2. k-means
3. Support vector machines
4. Apriori
5. EM
6. PageRank
7. AdaBoost
8. kNN
9. Naive Bayes
10. CART
svm  k-means  c4.5  apriori  em  pagerank  adaboost  knn  naive-bayes  cart  ml  data-mining  machine-learning  papers  algorithms  unsupervised  supervised 
may 2015 by jm
"Trash Day: Coordinating Garbage Collection in Distributed Systems"
Another GC-coordination strategy, similar to Blade (qv), with some real-world examples using Cassandra
blade  via:adriancolyer  papers  gc  distsys  algorithms  distributed  java  jvm  latency  spark  cassandra 
may 2015 by jm
_Blade: a Data Center Garbage Collector_
Essentially, add a central GC scheduler to improve tail latencies in a cluster, by taking instances out of the pool to perform slow GC activity instead of letting them impact live operations. I've been toying with this idea for a while; nice to see a solid paper about it
gc  latency  tail-latencies  papers  blade  go  java  scheduling  clustering  load-balancing  low-latency  performance 
april 2015 by jm
Closed access means people die
'We've paid 100 BILLION USD over the last 10 years to "publish" science and medicine. Ebola is a massive systems failure.' See also https://www.techdirt.com/articles/20150409/17514230608/dont-think-open-access-is-important-it-might-have-prevented-much-ebola-outbreak.shtml :

'The conventional wisdom among public health authorities is that the Ebola virus, which killed at least 10,000 people in Liberia, Sierra Leone and Guinea, was a new phenomenon, not seen in West Africa before 2013. [...]
But, as the team discovered, that "conventional wisdom" was wrong. In fact, they found a bunch of studies, buried behind research paywalls, that revealed that there was significant evidence of antibodies to the Ebola virus in Liberia and in other nearby nations. There was one from 1982 that noted: "medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be prepared to avoid nosocomial epidemics."
deaths  liberia  ebola  open-access  papers  elsevier  science  medicine  reprints 
april 2015 by jm
Large-scale cluster management at Google with Borg
Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines. It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior.
We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.


(via Conall)
via:conall  clustering  google  papers  scale  to-read  borg  cluster-management  deployment  packing  reliability  redundancy 
april 2015 by jm
Sirius: An open end-to-end voice and vision personal assistant and its implications for future warehouse scale computers
How to build an Intelligent Personal Assistant:

'Sirius is an open end-to-end standalone speech and vision based intelligent personal assistant (IPA) similar to Apple’s Siri, Google’s Google Now, Microsoft’s Cortana, and Amazon’s Echo. Sirius implements the core functionalities of an IPA including speech recognition, image matching, natural language processing and a question-and-answer system. Sirius is developed by Clarity Lab at the University of Michigan. Sirius is published at the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 2015.'
sirius  siri  cortana  google-now  echo  ok-google  ipa  assistants  search  video  audio  speech  papers  clarity  nlp  wikipedia 
april 2015 by jm
Combining static model checking with dynamic enforcement using the Statecall Policy Language
This looks quite nice -- a model-checker "for regular programmers". Example model for ping(1):

automaton ping (int max_count, int count, bool can_timeout) {
  Initialize;
  during {
    count = 0;
    do {
      Transmit_Ping;
      either {
        Receive_Ping;
      } or (can_timeout) {
        Timeout_Ping;
      };
      count = count + 1;
    } until (count >= max_count);
  } handle {
    SIGINFO;
    Print_Summary;
  };
}
ping  model-checking  models  formal-methods  verification  static  dynamic  coding  debugging  testing  distcomp  papers 
march 2015 by jm
Services Engineering Reading List
good list of papers/articles for fans of scalability etc.
architecture  papers  reading  reliability  scalability  articles  to-read 
march 2015 by jm
RIPQ: Advanced photo caching on flash for Facebook
Interesting priority-queue algorithm optimised for caching data on SSD
priority-queue  algorithms  facebook  ssd  flash  caching  ripq  papers 
february 2015 by jm
Having Your Cake and Eating It Too: Jointly Optimal Erasure Codes for I/O, Storage, and Network-bandwidth | USENIX
Erasure codes, such as Reed-Solomon (RS) codes, are increasingly being deployed as an alternative to data-replication for fault tolerance in distributed storage systems. While RS codes provide significant savings in storage space, they can impose a huge burden on the I/O and network resources when reconstructing failed or otherwise unavailable data. A recent class of erasure codes, called minimum-storage-regeneration (MSR) codes, has emerged as a superior alternative to the popular RS codes, in that it minimizes network transfers during reconstruction while also being optimal with respect to storage and reliability. However, existing practical MSR codes do not address the increasingly important problem of I/O overhead incurred during reconstructions, and are, in general, inferior to RS codes in this regard. In this paper, we design erasure codes that are simultaneously optimal in terms of I/O, storage, and network bandwidth. Our design builds on top of a class of powerful practical codes, called the product-matrix-MSR codes. Evaluations show that our proposed design results in a significant reduction in the number of I/Os consumed during reconstructions (a 5× reduction for typical parameters), while retaining optimality with respect to storage, reliability, and network bandwidth.
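
To see why reconstruction I/O matters at all, here's a toy single-parity code of my own -- nothing like RS or MSR, but it shows that rebuilding one lost block means reading every surviving block in the stripe, which is exactly the cost regenerating codes try to reduce:

class XorParityStripe {
    // Compute a single parity block as the XOR of all data blocks in the stripe.
    static byte[] parity(byte[][] dataBlocks) {
        byte[] p = new byte[dataBlocks[0].length];
        for (byte[] block : dataBlocks) {
            for (int i = 0; i < p.length; i++) {
                p[i] ^= block[i];
            }
        }
        return p;
    }

    // Reconstruct one missing block by XOR-ing the parity with every surviving block:
    // every surviving block in the stripe must be read to recover a single failure.
    static byte[] reconstruct(byte[][] survivingBlocks, byte[] parity) {
        byte[] missing = parity.clone();
        for (byte[] block : survivingBlocks) {
            for (int i = 0; i < missing.length; i++) {
                missing[i] ^= block[i];
            }
        }
        return missing;
    }
}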
erasure-coding  reed-solomon  compression  reliability  reconstruction  replication  fault-tolerance  storage  bandwidth  usenix  papers 
february 2015 by jm
Study: You Can't Change an Anti-Vaxxer's Mind
According to a major new study in the journal 'Pediatrics', trying to [persuade anti-vaxxers to vaccinate] may actually make the problem worse. The paper tested the effectiveness of four separate pro-vaccine messages, three of which were based very closely on how the Centers for Disease Control and Prevention (CDC) itself talks about vaccines. The results can only be called grim: Not a single one of the messages was successful when it came to increasing parents' professed intent to vaccinate their children. And in several cases the messages actually backfired, either increasing the ill-founded belief that vaccines cause autism or even, in one case, apparently reducing parents' intent to vaccinate.
vaccination  health  measles  mmr  autism  facts  via:mrneutron  stupidity  cdc  papers  vaccines 
february 2015 by jm
"Incremental Stream Processing using Computational Conflict-free Replicated Data Types" [paper]
'Unlike existing alternatives, such as stream processing, that favor the execution of arbitrary application code, we want to capture much of the processing logic as a set of known operations over specialized Computational CRDTs, with particular semantics and invariants, such as min/max/average/median registers, accumulators, top-N sets, sorted sets/maps, and so on. Keeping state also allows the system to decrease the amount of propagated information. Preliminary results obtained in a single example show that Titan has a higher throughput when compared with state of the art stream processing systems.'
crdt  distributed  stream-processing  replication  titan  papers 
january 2015 by jm
F1: A Distributed SQL Database That Scales
Beyond the interesting-enough stuff about scalability in a distributed SQL store, there's this really nifty point about avoiding the horrors of the SQL/ORM impedance mismatch:
At Google, Protocol Buffers are ubiquitous for data storage and interchange between applications. When we still had a MySQL schema, users often had to write tedious and error-prone transformations between database rows and in-memory data structures. Putting protocol buffers in the schema removes this impedance mismatch and gives users a universal data structure they can use both in the database and in application code…. Protocol Buffer columns are more natural and reduce semantic complexity for users, who can now read and write their logical business objects as atomic units, without having to think about materializing them using joins across several tables.


This is something that pretty much any store can already adopt. Go protobufs. (or Avro, etc.)

Also, I find this really neat, and I hope this idea is implemented elsewhere soon: asynchronous schema updates:

Schema changes are applied asynchronously on multiple F1 servers. Anomalies are prevented by the use of a schema leasing mechanism with support for only current and next schema versions; and by subdividing schema changes into multiple phases where consecutive pairs of changes are mutually compatible and cannot cause anomalies.
schema  sql  f1  google  papers  orm  protobuf 
january 2015 by jm
'Machine Learning: The High-Interest Credit Card of Technical Debt' [PDF]
Oh god yes. This is absolutely spot on, as you would expect from a Google paper -- at this stage they probably have accumulated more real-world ML-at-scale experience than anywhere else.

'Machine learning offers a fantastically powerful toolkit for building complex systems quickly. This paper argues that it is dangerous to think of these quick wins as coming for free. Using the framework of technical debt, we note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning. The goal of this paper is to highlight several machine learning specific risk factors and design patterns to be avoided or refactored where possible. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns.

[....]

'In this paper, we focus on the system-level interaction between machine learning code and larger systems as an area where hidden technical debt may rapidly accumulate. At a system-level, a machine learning model may subtly erode abstraction boundaries. It may be tempting to re-use input signals in ways that create unintended tight coupling of otherwise disjoint systems. Machine learning packages may often be treated as black boxes, resulting in large masses of “glue code” or calibration layers that can lock in assumptions. Changes in the external world may make models or input signals change behavior in unintended ways, ratcheting up maintenance cost and the burden of any debt. Even monitoring that the system as a whole is operating as intended may be difficult without careful design.

Indeed, a remarkable portion of real-world “machine learning” work is devoted to tackling issues of this form. Paying down technical debt may initially appear less glamorous than research results usually reported in academic ML conferences. But it is critical for long-term system health and enables algorithmic advances and other cutting-edge improvements.'
machine-learning  ml  systems  ops  tech-debt  maintainance  google  papers  hidden-costs  development 
december 2014 by jm
Great quote from Voldemort author Jay Kreps
"Reading papers: essential. Slavishly implementing ideas you read: not necessarily a good idea. Trust me, I wrote an Amazon Dynamo clone."

Later in the discussion, on complex conflict resolution logic (as used in Dynamo, Voldemort, and Riak):

"I reviewed 200 Voldemort stores, 190 used default lww conflict resolution. 10 had custom logic, all 10 of which had bugs." -- https://twitter.com/jaykreps/statuses/528292617784537088

(although IMO I'd prefer complex resolution to non-availability, when AP is required)
voldemort  jay-kreps  dynamo  cap-theorem  ap  riak  papers  lww  conflict-resolution  distcomp 
november 2014 by jm
"Quantiles on Streams" [paper, 2009]
'Chiranjeeb Buragohain and Subhash Suri: "Quantiles on Streams" in Encyclopedia of Database Systems, Springer, pp 2235–2240, 2009. ISBN: 978-0-387-35544-3', cited by Martin Kleppman in http://mail-archives.apache.org/mod_mbox/kafka-dev/201402.mbox/%3C131A7649-ED57-45CB-B4D6-F34063267664@linkedin.com%3E as a good, short literature survey re estimating percentiles with a small memory footprint.
latency  percentiles  coding  quantiles  streams  papers  algorithms 
october 2014 by jm
Internet Trolls Are Narcissists, Psychopaths, and Sadists | Psychology Today
The relationship between this Dark Tetrad [of narcissism, Machiavellianism, psychopathy, and sadism] and trolling is so significant, that the authors write the following in their paper:

"... the associations between sadism and GAIT (Global Assessment of Internet Trolling) scores were so strong that it might be said that online trolls are prototypical everyday sadists." [emphasis added]

Trolls truly enjoy making you feel bad. To quote the authors once more (because this is a truly quotable article):

"Both trolls and sadists feel sadistic glee at the distress of others. Sadists just want to have fun ... and the Internet is their playground!"


Bloody hell.
trolls  sadism  narcissism  psychopaths  online  trolling  psychology  papers 
september 2014 by jm
A gut microbe that stops food allergies
Actual scientific research showing that antibiotic use may be implicated in allergies:

'Nagler’s team first confirmed that mice given antibiotics early in life were far more susceptible to peanut sensitization, a model of human peanut allergy. Then, they introduced a solution containing Clostridia, a common class of bacteria that’s naturally found in the mammalian gut, into the rodents’ mouths and stomachs. The animals’ food allergen sensitization disappeared, the team reports online today in the Proceedings of the National Academy of Sciences. When the scientists instead introduced another common kind of healthy bacteria, called Bacteroides, into similarly allergy-prone mice, they didn’t see the same effect. Studying the rodents more carefully, the researchers determined that Clostridia were having a surprising effect on the mouse gut: Acting through certain immune cells, the bacteria helped keep peanut proteins that can cause allergic reactions out of the bloodstream. “The bacteria are maintaining the integrity of the [intestinal] barrier,” Nagler says.'
allergies  health  food  peanuts  science  research  clostridium  bacteria  gut  intestines  immune-system  mice  papers  pnas 
september 2014 by jm
"Perspectives On The CAP Theorem" [pdf]
"We cannot achieve [CAP theorem] consistency and availability in a partition-prone network."
papers  cap  distcomp  cap-theorem  consistency  availability  partitions  network  reliability 
september 2014 by jm
"Invertible Bloom Lookup Tables" [paper]
'We present a version of the Bloom filter data structure that supports not only the insertion, deletion, and lookup of key-value pairs, but also allows a complete listing of the pairs it contains with high probability, as long as the number of key-value pairs is below a designed threshold. Our structure allows the number of key-value pairs to greatly exceed this threshold during normal operation. Exceeding the threshold simply temporarily prevents content listing and reduces the probability of a successful lookup. If entries are later deleted to return the structure below the threshold, everything again functions appropriately. We also show that simple variations of our structure are robust to certain standard errors, such as the deletion of a key without a corresponding insertion or the insertion of two distinct values for a key. The properties of our structure make it suitable for several applications, including database and networking applications that we highlight.'
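
A toy version of my own to make the "complete listing" property concrete, written from the abstract above. A real IBLT adds a per-cell key checksum to guard against false peels, better hashing, and so on:

import java.util.HashMap;
import java.util.Map;

class ToyIBLT {
    private final int m;            // number of cells
    private final int k = 3;        // cells per key
    private final long[] count;
    private final long[] keySum;
    private final long[] valSum;

    ToyIBLT(int cells) {
        m = cells;
        count = new long[m];
        keySum = new long[m];
        valSum = new long[m];
    }

    // Cheap per-index hash of the key; a real implementation uses stronger hashing.
    private int cell(long key, int i) {
        long h = key * 0x9E3779B97F4A7C15L + i * 0xC2B2AE3D27D4EB4FL;
        h ^= (h >>> 31);
        return (int) Math.floorMod(h, (long) m);
    }

    void insert(long key, long value) { update(key, value, +1); }
    void delete(long key, long value) { update(key, value, -1); }

    private void update(long key, long value, int sign) {
        for (int i = 0; i < k; i++) {
            int c = cell(key, i);
            count[c] += sign;
            keySum[c] += sign * key;
            valSum[c] += sign * value;
        }
    }

    // Attempt a complete listing by repeatedly "peeling" cells that hold exactly one
    // entry; destructive (peeled entries are deleted), and may return only a partial
    // listing if the table is over its design threshold.
    Map<Long, Long> listEntries() {
        Map<Long, Long> out = new HashMap<>();
        boolean progress = true;
        while (progress) {
            progress = false;
            for (int c = 0; c < m; c++) {
                if (count[c] == 1) {
                    long key = keySum[c];
                    long value = valSum[c];
                    out.put(key, value);
                    delete(key, value);
                    progress = true;
                }
            }
        }
        return out;
    }
}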
iblt  bloom-filters  data-structures  performance  algorithms  coding  papers  probabilistic 
september 2014 by jm
'Addressing the rebalancing problem in bike-sharing systems' [paper]
Many of the bike-sharing systems introduced around the world in the past 15 years have the same problem: Riders tend to take some routes and not others. As a result, the bikes tend to collect in a few places, which is a drag for users and a costly problem for the operators, who "rebalance" the system using trucks that take bikes from full stations to empty ones. Now, scientists are coming up with special algorithms to improve this process. One of them, developed by scientists at the Vienna University of Technology and the Austrian Institute of Technology, is now being tested in Vienna's bike-sharing system; another, developed at Cornell University, is already in use in New York City.


Timely -- here's what Dublin Bikes looked like this morning: https://twitter.com/jmason/status/503828246086295552

(via Andrew Caines)
cycling  bike-sharing  borisbikes  dublinbikes  rebalancing  fleet  availability  optimization  maths  papers  toread  algorithms 
august 2014 by jm
"In Search of an Understandable Consensus Algorithm"
Diego Ongaro and John Ousterhout, USENIX ATC 2014 -- won best paper for this paper on the Raft algorithm. (via Eoin Brazil)
raft  consensus  algorithms  distcomp  john-ousterhout  via:eoinbrazil  usenix  atc  papers  paxos 
july 2014 by jm
Google's Influential Papers for 2013
Googlers across the company actively engage with the scientific community by publishing technical papers, contributing open-source packages, working on standards, introducing new APIs and tools, giving talks and presentations, participating in ongoing technical debates, and much more. Our publications offer technical and algorithmic advances, feature aspects we learn as we develop novel products and services, and shed light on some of the technical challenges we face at Google. Below are some of the especially influential papers co-authored by Googlers in 2013.
google  papers  toread  reading  2013  scalability  machine-learning  algorithms 
july 2014 by jm
'Robust De-anonymization of Large Sparse Datasets' [pdf]
paper by Arvind Narayanan and Vitaly Shmatikov, 2008.

'We present a new class of statistical de- anonymization attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records and so on. Our techniques are robust to perturbation in the data and tolerate some mistakes in the adversary's background knowledge. We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world's largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber's record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.'
anonymisation  anonymization  sanitisation  databases  data-dumps  privacy  security  papers 
june 2014 by jm
Facebook Doesn't Understand The Fuss About Its Emotion Manipulation Study
This is quite unethical, and I'm amazed it was published at all. Kashmir Hill at Forbes nails it:
While many users may already expect and be willing to have their behavior studied — and while that may be warranted with “research” being one of the 9,045 words in the data use policy — they don’t expect that Facebook will actively manipulate their environment in order to see how they react. That’s a new level of experimentation, turning Facebook from a fishbowl into a petri dish, and it’s why people are flipping out about this.


Shocking stuff. We need a new social publishing platform, built on ethical, open systems.
ethics  facebook  privacy  academia  depression  feelings  emotion  social-publishing  social  experimentation  papers 
june 2014 by jm
"Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" [PDF]
Google paper describing the infrastructure they've built for cross-service request tracing (ie. "tracer requests"). Features: low code changes required (since they've built it into the internal protobuf libs), low performance impact, sampling, deployment across the ~entire production fleet, output visibility in minutes, and has been live in production for over 2 years. Excellent read
dapper  tracing  http  services  soa  google  papers  request-tracing  tracers  protobuf  devops 
march 2014 by jm
"Understanding the Robustness of SSDs under Power Fault", FAST '13 [paper]
Horrific. SSDs (including "enterprise-class storage") storing sync'd writes in volatile RAM while claiming they were synced; one device losing 72.6GB, 30% of its data, after 8 injected power faults; and all SSDs tested displayed serious errors including random bit errors, metadata corruption, serialization errors and shorn writes. Don't trust lone unreplicated, unbacked-up SSDs!
pdf  papers  ssd  storage  reliability  safety  hardware  ops  usenix  serialization  shorn-writes  bit-errors  corruption  fsync 
january 2014 by jm
Dogs like to excrete in alignment with the Earth's magnetic field
Dogs preferred to excrete with the body being aligned along the North-south axis under calm magnetic field conditions.
dogs  poo  excrement  shit  magnetic-field  earth  zoology  papers 
january 2014 by jm
A Brief Tour of FLP Impossibility
One of the most important results in distributed systems theory was published in April 1985 by Fischer, Lynch and Paterson. Their short paper ‘Impossibility of Distributed Consensus with One Faulty Process’, which eventually won the Dijkstra award given to the most influential papers in distributed computing, definitively placed an upper bound on what it is possible to achieve with distributed processes in an asynchronous environment.

This particular result, known as the ‘FLP result’, settled a dispute that had been ongoing in distributed systems for the previous five to ten years. The problem of consensus – that is, getting a distributed network of processors to agree on a common value – was known to be solvable in a synchronous setting, where processes could proceed in simultaneous steps. In particular, the synchronous solution was resilient to faults, where processors crash and take no further part in the computation. Informally, synchronous models allow failures to be detected by waiting one entire step length for a reply from a processor, and presuming that it has crashed if no reply is received.

This kind of failure detection is impossible in an asynchronous setting, where there are no bounds on the amount of time a processor might take to complete its work and then respond with a message. Therefore it’s not possible to say whether a processor has crashed or is simply taking a long time to respond. The FLP result shows that in an asynchronous setting, where only one processor might crash, there is no distributed algorithm that solves the consensus problem.
distributed-systems  flp  consensus-algorithms  algorithms  distcomp  papers  proofs 
november 2013 by jm
"Effective Computation of Biased Quantiles over Data Streams" [paper]

Skew is prevalent in many data sources such as IP traffic streams. To continually summarize the distribution of such data, a high-biased set of quantiles (e.g., 50th, 90th and 99th percentiles) with finer error guarantees at higher ranks (e.g., errors of 5, 1 and 0.1 percent, respectively) is more useful than uniformly distributed quantiles (e.g., 25th, 50th and 75th percentiles) with uniform error guarantees. In this paper, we address the following two problems. First, can we compute quantiles with finer error guarantees for the higher ranks of the data distribution effectively, using less space and computation time than computing all quantiles uniformly at the finest error? Second, if specific quantiles and their error bounds are requested a priori, can the necessary space usage and computation time be reduced? We answer both questions in the affirmative by formalizing them as the “high-biased” quantiles and the “targeted” quantiles problems, respectively, and presenting algorithms with provable guarantees, that perform significantly better than previously known solutions for these problems. We implemented our algorithms in the Gigascope data stream management system, and evaluated alternate approaches for maintaining the relevant summary structures. Our experimental results on real and synthetic IP data streams complement our theoretical analyses, and highlight the importance of lightweight, non-blocking implementations when maintaining summary structures over high-speed data streams.


Implemented as a timer-histogram storage system in http://armon.github.io/statsite/ .
statistics  quantiles  percentiles  stream-processing  skew  papers  histograms  latency  algorithms 
november 2013 by jm
The Hole in Our Collective Memory: How Copyright Made Mid-Century Books Vanish - Rebecca J. Rosen - The Atlantic
A book published during the presidency of Chester A. Arthur has a greater chance of being in print today than one published during the time of Reagan.
This is not a gently sloping downward curve. Publishers seem unwilling to sell their books on Amazon for more than a few years after their initial publication. The data suggest that publishing business models make books disappear fairly shortly after their publication and long before they are scheduled to fall into the public domain. Copyright law then deters their reappearance as long as they are owned. On the left side of the graph before 1920, the decline presents a more gentle time-sensitive downward sloping curve.
business  books  legal  copyright  law  public-domain  reading  history  publishers  amazon  papers 
september 2013 by jm
Blocking The Pirate Bay appears to have 'no lasting net impact' on illegal downloading
In the fight against the unauthorised sharing of copyright protected material, aka piracy, Dutch Internet Service Providers have been summoned by courts to block their subscribers’ access to The Pirate Bay (TPB) and related sites. This paper studies the effectiveness of this approach towards online copyright enforcement, using both a consumer survey and a newly developed non-infringing technology for BitTorrent monitoring. While a small group of respondents download less from illegal sources or claim to have stopped, and a small but significant effect is found on the distribution of Dutch peers, no lasting net impact is found on the percentage of the Dutch population downloading from illegal sources.
fail  blocking  holland  pirate-bay  tpb  papers  via:tjmcintyre  internet  isps 
september 2013 by jm
_MillWheel: Fault-Tolerant Stream Processing at Internet Scale_ [paper, pdf]
from VLDB 2013:

MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework’s fault-tolerance guarantees.

This paper describes MillWheel’s programming model as well as its implementation. The case study of a continuous anomaly detector in use at Google serves to motivate how many of MillWheel’s features are used. MillWheel’s programming model provides a notion of logical time, making it simple to write time-based aggregations. MillWheel was designed from the outset with fault tolerance and scalability in mind. In practice, we find that MillWheel’s unique combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google.
millwheel  google  data-processing  cep  low-latency  fault-tolerance  scalability  papers  event-processing  stream-processing 
august 2013 by jm
"Scalable Eventually Consistent Counters over Unreliable Networks" [paper, pdf]

Counters are an important abstraction in distributed computing, and play a central role in large scale geo-replicated systems, counting events such as web page impressions or social network "likes". Classic distributed counters, strongly consistent, cannot be made both available and partition-tolerant, due to the CAP Theorem, being unsuitable to large scale scenarios.

This paper defines Eventually Consistent Distributed Counters (ECDC) and presents an implementation of the concept, Handoff Counters, that is scalable and works over unreliable networks. By giving up the sequencer aspect of classic distributed counters, ECDC implementations can be made AP in the CAP design space, while retaining the essence of counting. Handoff Counters are the first CRDT (Conflict-free Replicated Data Type) based mechanism that overcomes the identity explosion problem in naive CRDTs, such as G-Counters (where state size is linear in the number of independent actors that ever incremented the counter), by managing identities towards avoiding global propagation, and garbage collecting temporary entries. The approach used in Handoff Counters is not restricted to counters, being more generally applicable to other data types with associative and commutative operations.
pdf  papers  eventual-consistency  counters  distributed-systems  distcomp  cap-theorem  ecdc  handoff-counters  crdts  data-structures  g-counters 
august 2013 by jm
Randomly Failed! The State of Randomness in Current Java Implementations
This would appear to be the paper which sparked off the drama around BitCoin thefts from wallets generated on Android devices:

The SecureRandom PRNG is the primary source of randomness for Java and is used, e.g., by cryptographic operations. This underlines its importance regarding security. Some of the fallback solutions of the investigated implementations [are] revealed to be weak and predictable or capable of being influenced. Very alarming are the defects found in Apache Harmony, since it is partly used by Android.


More on the BitCoin drama: https://bitcointalk.org/index.php?topic=271486.40 , http://bitcoin.org/en/alert/2013-08-11-android
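
For context, the usage pattern at issue, sketched in Java. This is a hypothetical illustration, not code from the paper or from any wallet app; on the SHA1PRNG-style implementations studied, calling setSeed() before any output had been generated could leave the caller-supplied seed as the only entropy, so a weak seed meant predictable keys:

import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

class PrngUsage {
    // Risky: supplying your own seed before the PRNG has produced any output.
    // On the buggy implementations this made the output a pure function of the seed.
    static byte[] riskyKeyMaterial(byte[] lowEntropySeed) {
        SecureRandom sr = new SecureRandom();
        sr.setSeed(lowEntropySeed);
        byte[] key = new byte[32];
        sr.nextBytes(key);
        return key;
    }

    // Safer default: let the platform seed itself from its strongest entropy source.
    static byte[] saferKeyMaterial() throws NoSuchAlgorithmException {
        SecureRandom sr = SecureRandom.getInstanceStrong();   // Java 8+
        byte[] key = new byte[32];
        sr.nextBytes(key);
        return key;
    }
}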
android  java  prng  random  security  bugs  apache-harmony  apache  crypto  bitcoin  papers 
august 2013 by jm
packetdrill - network stack testing tool
[Google's] packetdrill scripting tool enables quick, precise tests for entire TCP/UDP/IPv4/IPv6 network stacks, from the system call layer down to the NIC hardware. packetdrill currently works on Linux, FreeBSD, OpenBSD, and NetBSD. It can test network stack behavior over physical NICs on a LAN, or on a single machine using a tun virtual network device.
testing  networking  tun  google  linux  papers  tcp  ip  udp  freebsd  openbsd  netbsd 
july 2013 by jm
Locally Repairable Codes
Facebook’s new erasure coding algorithm (via High Scalability).
Disk I/O and network traffic were reduced by half compared to RS codes.
The LRC required 14% more storage than RS (ie. 60% of data size).
Repair times were much lower thanks to the local repair codes.
Much greater reliability thanks to fast repairs.
Reduced network traffic makes them suitable for geographic distribution.
erasure-coding  facebook  redundancy  repair  algorithms  papers  via:highscalability  data  storage  fault-tolerance 
june 2013 by jm
_Dynamic Histograms: Capturing Evolving Data Sets_ [pdf]

Currently, histograms are static structures: they are created from scratch periodically and their creation is based on looking at the entire data distribution as it exists each time. This creates problems, however, as data stored in DBMSs usually varies with time. If new data arrives at a high rate and old data is likewise deleted, a histogram’s accuracy may deteriorate fast as the histogram becomes older, and the optimizer’s effectiveness may be lost. Hence, how often a histogram is reconstructed becomes very critical, but choosing the right period is a hard problem, as the following trade-off exists: If the period is too long, histograms may become outdated. If the period is too short, updates of the histogram may incur a high overhead.

In this paper, we propose what we believe is the most elegant solution to the problem, i.e., maintaining dynamic histograms within given limits of memory space. Dynamic histograms are continuously updateable, closely tracking changes to the actual data. We consider two of the best static histograms proposed in the literature [9], namely V-Optimal and Compressed, and modify them. The new histograms are naturally called Dynamic V-Optimal (DVO) and Dynamic Compressed (DC). In addition, we modified V-Optimal’s partition constraint to create the Static Average-Deviation Optimal (SADO) and Dynamic Average-Deviation Optimal (DADO) histograms.


(via d2fn)
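The DVO/DC algorithms themselves are more involved; for flavour, here's a hedged sketch of a much simpler continuously-updatable histogram -- bounded centroid count, closest pair merged on overflow -- which isn't from the paper but shows the "update as data arrives" idea.

import java.util.TreeMap;

// Bounded-memory streaming histogram sketch: every update is applied immediately,
// and when the bin budget is exceeded the two closest centroids are merged into
// their weighted mean. Not the DVO/DC algorithms from the paper.
public class StreamingHistogram {
    private final int maxBins;
    private final TreeMap<Double, Long> bins = new TreeMap<>();

    StreamingHistogram(int maxBins) { this.maxBins = maxBins; }

    void add(double value) {
        bins.merge(value, 1L, Long::sum);
        if (bins.size() > maxBins) mergeClosestPair();
    }

    // Find the two adjacent centroids with the smallest gap and merge them.
    private void mergeClosestPair() {
        Double prev = null, left = null, right = null;
        double bestGap = Double.MAX_VALUE;
        for (double key : bins.keySet()) {
            if (prev != null && key - prev < bestGap) {
                bestGap = key - prev; left = prev; right = key;
            }
            prev = key;
        }
        long lc = bins.remove(left), rc = bins.remove(right);
        double merged = (left * lc + right * rc) / (lc + rc);
        bins.merge(merged, lc + rc, Long::sum);
    }

    public static void main(String[] args) {
        StreamingHistogram h = new StreamingHistogram(8);
        java.util.Random r = new java.util.Random(42);
        for (int i = 0; i < 10_000; i++) h.add(r.nextGaussian());
        h.bins.forEach((centroid, count) ->
            System.out.printf("%.3f -> %d%n", centroid, count));
    }
}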
via:d2fn  histograms  streaming  big-data  data  dvo  dc  sado  dado  dynamic-histograms  papers  toread 
may 2013 by jm
The Why
How the Irish media are partly to blame for the catastrophic property bubble, from a paper entitled _The Role Of The Media In Propping Up Ireland’s Housing Bubble_, by Dr Julien Mercille, in the _Social Europe Journal_:
“The overall argument is that the Irish media are part and parcel of the political and corporate establishment, and as such the news they convey tend to reflect those sectors’ interests and views. In particular, the Celtic Tiger years involved the financialisation of the economy and a large property bubble, all of it wrapped in an implicit neoliberal ideology. The media, embedded within this particular political economy and itself a constitutive element of it, thus mostly presented stories sustaining it. In particular, news organisations acquired direct stakes in an inflated real estate market by purchasing property websites and receiving vital advertising revenue from the real estate sector. Moreover, a number of their board members were current or former high officials in the finance industry and government, including banks deeply involved in the bubble’s expansion."
economics  irish-times  ireland  newspapers  media  elite  insiders  bubble  property-bubble  property  celtic-tiger  papers  news  bias 
april 2013 by jm
It’s the Sugar, Folks
A study published in the Feb. 27 issue of the journal PLoS One links increased consumption of sugar with increased rates of diabetes by examining the data on sugar availability and the rate of diabetes in 175 countries over the past decade. And after accounting for many other factors, the researchers found that increased sugar in a population’s food supply was linked to higher diabetes rates independent of rates of obesity. In other words, according to this study, obesity doesn’t cause diabetes: sugar does.

The study demonstrates this with the same level of confidence that linked cigarettes and lung cancer in the 1960s. As Rob Lustig, one of the study’s authors and a pediatric endocrinologist at the University of California, San Francisco, said to me, “You could not enact a real-world study that would be more conclusive than this one.”
nytimes  health  food  via:fanf  sugar  eating  diabetes  papers  medicine 
february 2013 by jm
Distributed Streams Algorithms for Sliding Windows [PDF]
'Massive data sets often arise as physically distributed, parallel data streams, and it is important to estimate various aggregates and statistics on the union of these streams. This paper presents algorithms for estimating aggregate functions over a “sliding window” of the N most recent data items in one or more streams. [...] Our results are obtained using a novel family of synopsis data structures called waves.'
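For contrast with the approximate "waves" synopses, the exact version of a sliding-window aggregate is trivial but needs O(N) space per stream, which is exactly the cost the paper attacks -- quick sketch:

import java.util.ArrayDeque;
import java.util.Deque;

// Exact sliding-window sum over the N most recent items: trivial, but O(N) space
// per stream, which is precisely what the paper's synopsis structures avoid.
public class SlidingWindowSum {
    private final int windowSize;
    private final Deque<Long> window = new ArrayDeque<>();
    private long sum;

    SlidingWindowSum(int windowSize) { this.windowSize = windowSize; }

    void add(long item) {
        window.addLast(item);
        sum += item;
        if (window.size() > windowSize) sum -= window.removeFirst();
    }

    long sum() { return sum; }

    public static void main(String[] args) {
        SlidingWindowSum w = new SlidingWindowSum(3);
        for (long x : new long[]{1, 2, 3, 4, 5}) w.add(x);
        System.out.println(w.sum()); // 3 + 4 + 5 = 12
    }
}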
waves  papers  streaming  algorithms  percentiles  histogram  distcomp  distributed  aggregation  statistics  estimation  streams 
february 2013 by jm
'Efficient Computation of Frequent and Top-k Elements in Data Streams' [paper, PDF]
The Space-Saving algorithm to compute top-k in a stream. I've been asking a variation of this problem as an interview question for a while now, pretty cool to find such a neat solution. Pity that neither I nor anyone I've interviewed has come up with it ;)
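My reading of the algorithm, as a hedged sketch (linear-scan eviction here; the real thing uses a "stream summary" structure for O(1) minimum lookup): keep at most m counters, and let an unseen item evict the current minimum and inherit its count plus one.

import java.util.HashMap;
import java.util.Map;

// Space-Saving sketch: at most `capacity` counters are kept. An unseen item
// replaces the current minimum counter and takes over its count + 1, so every
// estimate overestimates by at most the evicted minimum.
public class SpaceSaving {
    private final int capacity;
    private final Map<String, Long> counts = new HashMap<>();

    SpaceSaving(int capacity) { this.capacity = capacity; }

    void offer(String item) {
        if (counts.containsKey(item)) {
            counts.merge(item, 1L, Long::sum);
        } else if (counts.size() < capacity) {
            counts.put(item, 1L);
        } else {
            // Evict the item with the smallest count.
            Map.Entry<String, Long> min = null;
            for (Map.Entry<String, Long> e : counts.entrySet())
                if (min == null || e.getValue() < min.getValue()) min = e;
            long inherited = min.getValue();
            counts.remove(min.getKey());
            counts.put(item, inherited + 1);
        }
    }

    Map<String, Long> estimates() { return counts; }

    public static void main(String[] args) {
        SpaceSaving topK = new SpaceSaving(3);
        for (String s : "a a a b b c d a b e".split(" ")) topK.offer(s);
        System.out.println(topK.estimates()); // the heavy hitters a and b survive
    }
}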
space-saving  approximation  streams  stream-processing  cep  papers  pdf  algorithms 
february 2013 by jm
Fast Packed String Matching for Short Patterns [paper, PDF]
'Searching for all occurrences of a pattern in a text is a fundamental problem in computer science with applications in many other fields, like NLP, information retrieval and computational biology. In the last two decades a general trend has appeared trying to exploit the power of the word RAM model to speed-up the performances of classical string matching algorithms. [...] In this paper we use specialized word-size packed string matching instructions, based on the Intel streaming SIMD extensions (SSE) technology, to design very fast string matching algorithms in the case of short patterns.' Reminds me of http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm , but taking advantage of SIMD extensions, which should make things nice and speedy, at the cost of tying it to specific hardware platforms. (via Tony Finch)
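For reference, the scalar textbook Rabin-Karp looks like this -- nothing SSE-specific, just the rolling hash the packed variants speed up:

import java.util.ArrayList;
import java.util.List;

// Textbook Rabin-Karp: a rolling polynomial hash of the pattern is compared
// against a rolling hash of each text window; only hash matches are verified.
public class RabinKarp {
    static List<Integer> search(String text, String pattern) {
        final long base = 256, mod = 1_000_000_007L;
        int n = text.length(), m = pattern.length();
        List<Integer> hits = new ArrayList<>();
        if (m == 0 || m > n) return hits;

        long high = 1;  // base^(m-1) mod mod, used to drop the leading char
        for (int i = 0; i < m - 1; i++) high = (high * base) % mod;

        long ph = 0, th = 0;
        for (int i = 0; i < m; i++) {
            ph = (ph * base + pattern.charAt(i)) % mod;
            th = (th * base + text.charAt(i)) % mod;
        }

        for (int i = 0; i + m <= n; i++) {
            if (ph == th && text.regionMatches(i, pattern, 0, m)) hits.add(i);
            if (i + m < n) {
                th = (th - text.charAt(i) * high % mod + mod) % mod; // drop leading char
                th = (th * base + text.charAt(i + m)) % mod;         // add trailing char
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(search("abracadabra", "abra")); // [0, 7]
    }
}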
rabin-karp  algorithms  strings  string-matching  papers  via:fanf 
january 2013 by jm
_The Pauseless GC Algorithm_ [pdf]
Paper from USENIX VEE '05, by Cliff Click, Gil Tene, and Michael Wolf of Azul Systems, describing some details of the Azul secret sauce (via b6n)
via:b3n  azul  gc  jvm  java  usenix  papers 
december 2012 by jm
Spanner: Google's Globally-Distributed Database [PDF]

Abstract: Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.

To appear in:
OSDI'12: Tenth Symposium on Operating System Design and Implementation, Hollywood, CA, October, 2012.
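The clock-uncertainty point is worth a toy sketch: with TrueTime-style bounds, commit-wait means picking the top of the uncertainty interval as the timestamp and not acknowledging until it's definitely in the past. (EPSILON_MILLIS below is a made-up constant standing in for TrueTime's measured bound; this is obviously not Spanner's API.)

// Toy commit-wait sketch. TrueTime exposes now() as an interval
// [earliest, latest]; a transaction takes latest as its timestamp and then
// waits until earliest has passed it before acknowledging the commit.
public class CommitWaitSketch {
    static final long EPSILON_MILLIS = 7;   // assumed clock uncertainty, illustrative only

    static long nowEarliest() { return System.currentTimeMillis() - EPSILON_MILLIS; }
    static long nowLatest()   { return System.currentTimeMillis() + EPSILON_MILLIS; }

    static long commit() throws InterruptedException {
        long commitTimestamp = nowLatest();
        // Commit wait: block until the timestamp is guaranteed to be in the past
        // on every node whose clock error is within epsilon.
        while (nowEarliest() <= commitTimestamp) {
            Thread.sleep(1);
        }
        return commitTimestamp;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("commit visible at timestamp " + commit());
    }
}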
database  distributed  google  papers  toread  pdf  scalability  distcomp  transactions  cap  consistency 
september 2012 by jm
Universal properties of mythological networks - Abstract - EPL (Europhysics Letters) - IOPscience
Abstract:

As in statistical physics, the concept of universality plays an important, albeit qualitative, role in the field of comparative mythology. Here we apply statistical mechanical tools to analyse the networks underlying three iconic mythological narratives with a view to identifying common and distinguishing quantitative features. Of the three narratives, an Anglo-Saxon and a Greek text are mostly believed by antiquarians to be partly historically based while the third, an Irish epic [jm: "An Táin Bó Cúailnge", The Tain, to be specific], is often considered to be fictional. Here we use network analysis in an attempt to discriminate real from imaginary social networks and place mythological narratives on the spectrum between them. This suggests that the perceived artificiality of the Irish narrative can be traced back to anomalous features associated with six characters. Speculating that these are amalgams of several entities or proxies, renders the plausibility of the Irish text comparable to the others from a network-theoretic point of view.


Here's what the Irish Times said:

The society in the 1st century story of the Táin Bó Cúailnge looked artificial at first analysis of the networks between 404 characters in the story. However, the researchers found the society reflected real rather than fictional networks when the weakest links to six of the characters are removed.

These six characters included Medb, Queen of Connacht; Conchobor, King of Ulster and Cúchulainn. They were "similar to superheroes of the Marvel universe" and are "too superhuman" or too well-connected to be real, researchers said. The researchers suggest that each of these superhuman characters may be an amalgam of many which became fused and exaggerated as the story was passed down orally through generations.
networks  society  the-tain  epics  history  mythology  ireland  statistics  network-analysis  papers 
july 2012 by jm
'Poisoning Attacks against Support Vector Machines', Battista Biggio, Blaine Nelson, Pavel Laskov
The perils of auto-training SVMs on unvetted input.
We investigate a family of poisoning attacks against Support Vector Machines (SVM). Such attacks inject specially crafted training data that increases the SVM's test error. Central to the motivation for these attacks is the fact that most learning algorithms assume that their training data comes from a natural or well-behaved distribution. However, this assumption does not generally hold in security-sensitive settings. As we demonstrate, an intelligent adversary can, to some extent, predict the change of the SVM's decision function due to malicious input and use this ability to construct malicious data. The proposed attack uses a gradient ascent strategy in which the gradient is computed based on properties of the SVM's optimal solution. This method can be kernelized and enables the attack to be constructed in the input space even for non-linear kernels. We experimentally demonstrate that our gradient ascent procedure reliably identifies good local maxima of the non-convex validation error surface, which significantly increases the classifier's test error.

Via Alexandre Dulaunoy
papers  svm  machine-learning  poisoning  auto-learning  security  via:adulau 
july 2012 by jm
"K* : A Directed On-The-Fly Algorithm for Finding the k Shortest Paths", Husain Aljazzar and Stefan Leue, 2008
"We present a new algorithm, called K*, for finding the k shortest paths between a designated pair of vertices in a given directed weighted graph. Compared to Eppstein’s algorithm, which is the most prominent algorithm for solving this problem, K* has two advantages. First, K* performs on-the-fly, which means that
it does not require the graph to be explicitly available and stored in main memory. Portions of the graph will
be generated as needed. Second, K* is a directed algorithm which enables the use of heuristic functions
to guide the search. This leads to significant improvements in the memory and runtime demands for many
practical problem instances. We prove the correctness of K* and show that it maintains a worst-case runtime
complexity of O(m+k n log(k n)) and a space complexity of O(k n + m), where n is the number of vertices
and m is the number of edges of the graph. We provide experimental results which illustrate the scalability of
the algorithm."
graphs  k-shortest-paths  algorithms  papers 
june 2012 by jm
_Building High-level Features Using Large Scale Unsupervised Learning_ [paper, PDF]
"We consider the problem of building highlevel, class-specific feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images using unlabeled images? To answer this, we train a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (the model has 1 billion connections, the dataset has 10 million 200x200 pixel images downloaded from the Internet). We train this network using model parallelism and asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days. Contrary to what appears to be a widely-held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not. Control experiments show that this feature detector is robust not only to translation but also to scaling and out-of-plane rotation. We also find that the same network is sensitive to other high-level concepts such as cat faces and human bodies. Starting with these learned features, we trained our network to obtain 15.8% accuracy in recognizing 20,000 object categories from ImageNet, a leap of 70% relative improvement over the previous state-of-the-art."
algorithms  machine-learning  neural-networks  sgd  labelling  training  unlabelled-learning  google  research  papers  pdf 
june 2012 by jm