
Facial recognition software is not ready for use by law enforcement | TechCrunch
This is a pretty amazing op-ed from the CEO of a facial recognition software development company:

Facial recognition technologies, used in the identification of suspects, negatively affect people of color. To deny this fact would be a lie. And clearly, facial recognition-powered government surveillance is an extraordinary invasion of the privacy of all citizens — and a slippery slope to losing control of our identities altogether. There’s really no “nice” way to acknowledge these things.

I’ve been pretty clear about the potential dangers associated with current racial biases in face recognition, and open in my opposition to the use of the technology in law enforcement. As the black chief executive of a software company developing facial recognition services, I have a personal connection to the technology, both culturally and socially.

Having the privilege of a comprehensive understanding of how the software works gives me a unique perspective that has shaped my positions about its uses. As a result, I (and my company) have come to believe that the use of commercial facial recognition in law enforcement or in government surveillance of any kind is wrong — and that it opens the door for gross misconduct by the morally corrupt.
techcrunch  facial-recognition  computer-vision  machine-learning  racism  algorithms  america 
24 days ago by jm
Halton sequence
In statistics, Halton sequences are sequences used to generate points in space for numerical methods such as Monte Carlo simulations. Although these sequences are deterministic, they are of low discrepancy, that is, appear to be random for many purposes. They were first introduced in 1960 and are an example of a quasi-random number sequence.
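A minimal sketch of the standard radical-inverse construction (my illustration, not from the article; pairing coprime bases gives multi-dimensional points):

    def halton(index, base):
        """Element `index` (1-based) of the Halton sequence for `base`:
        reflect the base-`base` digits of `index` about the radix point."""
        result, f = 0.0, 1.0
        while index > 0:
            f /= base
            result += f * (index % base)
            index //= base
        return result

    # a 2-D low-discrepancy point set: base-2 and base-3 sequences paired
    points = [(halton(i, 2), halton(i, 3)) for i in range(1, 9)]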
algorithms  graphics  random  randomness  coding  monte-carlo-simulation  number-sequences 
4 weeks ago by jm
Fibonacci Hashing: The Optimization that the World Forgot (or: a Better Alternative to Integer Modulo)
Turns out I was wrong. This is a big one. And everyone should be using it. Hash tables should not be prime number sized and they should not use an integer modulo to map hashes into slots. Fibonacci hashing is just better. Yet somehow nobody is using it and lots of big hash tables (including all the big implementations of std::unordered_map) are much slower than they should be because they don’t use Fibonacci Hashing.


Apparently this is binary multiplicative hashing, and Google's brotli, webp, and Snappy libs all use a constant derived heuristically from a compression test corpus along the same lines (see comments).

(Via Michael Fogleman)
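The core trick is tiny -- a minimal sketch (assuming 64-bit hashes and a power-of-two table size):

    GOLDEN = 11400714819323198485   # 2**64 / phi, rounded to an odd integer

    def fib_slot(h, table_bits):
        """Map a 64-bit hash h to a slot in a table of size 2**table_bits."""
        return ((h * GOLDEN) & 0xFFFFFFFFFFFFFFFF) >> (64 - table_bits)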
algorithms  hashing  hash  fibonacci  golden-ratio  coding  hacks  brotli  webp  snappy  hash-tables  hashmaps  load-distribution 
4 weeks ago by jm
_Random Slicing: Efficient and Scalable Data Placement for Large-Scale Storage Systems_, ACM Transactions on Storage, July 2014
'The ever-growing amount of data requires highly scalable storage solutions. The most flexible approach is to use storage pools that can be expanded and scaled down by adding or removing storage devices. To make this approach usable, it is necessary to provide a solution to locate data items in such a dynamic environment. This article presents and evaluates the Random Slicing strategy, which incorporates lessons learned from table-based, rule-based, and pseudo-randomized hashing strategies and is able to provide a simple and efficient strategy that scales up to handle exascale data. Random Slicing keeps a small table with information about previous storage system insert and remove operations, drastically reducing the required amount of randomness while delivering a perfect load distribution.'
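The lookup side is easy to picture -- a toy sketch (the interval boundaries here are illustrative; the paper's contribution is how to recompute them cheaply as devices are added or removed):

    import bisect, hashlib

    boundaries = [0.25, 0.5, 0.75, 1.0]   # right edges of unit-range intervals
    owners     = ["dev0", "dev1", "dev2", "dev3"]

    def locate(key: bytes) -> str:
        h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        x = h / 2**64                      # uniform point in [0, 1)
        return owners[bisect.bisect_right(boundaries, x)]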
randomness  architecture  algorithms  storage  hashing  slicing  scaling 
7 weeks ago by jm
Palantir Knows Everything About You
This is so fucking dystopian:
Operation Laser has made L.A. cops more surgical — and, according to community activists, unrelenting. Once targets are enmeshed in a [Palantir] spidergram, they’re stuck.

Manuel Rios, 22, lives in the back of his grandmother’s house at the top of a hill in East L.A., in the heart of the city’s gang area. [...] He grew up surrounded by friends who joined Eastside 18, the local affiliate of the 18th Street gang, one of the largest criminal syndicates in Southern California. Rios says he was never “jumped in”—initiated into 18. He spent years addicted to crystal meth and was once arrested for possession of a handgun and sentenced to probation. But except for a stint in county jail for a burglary arrest inside a city rec center, he’s avoided further trouble and says he kicked his meth habit last year.

In 2016, Rios was sitting in a parked car with an Eastside 18 friend when a police car pulled up. His buddy ran, pursued by the cops, but Rios stayed put. “Why should I run? I’m not a gang member,” he says over steak and eggs at the IHOP near his home. The police returned and handcuffed him. One of them took his picture with a cellphone. “Welcome to the gang database!” the officer said.

Since then he’s been stopped more than a dozen times, he says, and told that if he doesn’t like it he should move. He has nowhere to go. His girlfriend just had a baby girl, and he wants to be around for them. “They say you’re in the system, you can’t lie to us,” he says. “I tell them, ‘How can I be in the hood if I haven’t got jumped in? Can’t you guys tell people who bang and who don’t?’ They go by their facts, not the real facts.”

The police, on autopilot with Palantir, are driving Rios toward his gang friends, not away from them, worries Mariella Saba, a neighbor and community organizer who helped him get off meth. When whole communities like East L.A. are algorithmically scraped for pre-crime suspects, data is destiny, says Saba. “These are systemic processes. When people are constantly harassed in a gang context, it pushes them to join. They internalize being told they’re bad.”
palantir  surveillance  privacy  precrime  spidergrams  future  la  gangs  justice  algorithms  data-protection  data-privacy  policing  harrassment 
april 2018 by jm
_Building a Bw-Tree Takes More Than Just Buzz Words_, SIGMOD 2018
'An account of our disappointing journey to build an open-source lock-free Bw-Tree for the Peloton DBMS.'

'In 2013, Microsoft Research proposed the Bw-Tree (humorously termed the “Buzz Word Tree”), a lock-free index that provides high throughput for transactional database workloads in SQL Server’s Hekaton engine. The Bw-Tree avoids locks by appending delta records to tree nodes and using an indirection layer that allows it to atomically update physical pointers using compare-and-swap (CaS). Correctly implementing these techniques requires careful attention to detail. Unfortunately, the Bw-Tree papers from Microsoft are missing important details and the source code has not been released.

This paper has two contributions: First, it is the missing guide for how to build a lock-free Bw-Tree. We clarify missing points in Microsoft’s original design documents and then present techniques to improve the index’s performance. Although our focus here is on the Bw-Tree, many of our methods apply more broadly to designing and implementing future lock-free in-memory data structures. Our experimental evaluation shows that our optimized variant achieves 1.1–2.5× better performance than the original Microsoft proposal for highly concurrent workloads. Second, our evaluation shows that despite our improvements, the Bw-Tree still does not perform as well as other concurrent data structures that use locks.'
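The delta-record trick is easy to model even if the engineering isn't -- a toy sketch of the mapping-table CAS idea (Python has no real CAS; compare_and_set below just stands in for the primitive):

    class Delta:
        def __init__(self, op, key, value, next_node):
            self.op, self.key, self.value = op, key, value
            self.next = next_node            # older state of the node

    mapping_table = {}                       # logical node ID -> physical state

    def compare_and_set(table, node_id, expected, new):
        if table.get(node_id) is expected:   # stand-in for an atomic CAS
            table[node_id] = new
            return True
        return False

    def insert(node_id, key, value):
        while True:                          # classic lock-free retry loop
            old = mapping_table.get(node_id)
            delta = Delta("insert", key, value, old)
            if compare_and_set(mapping_table, node_id, old, delta):
                return                       # readers traverse the delta chain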

Finally: https://twitter.com/andy_pavlo/status/986647389820747776 :

'Our results show that @ViktorLeis's ART index and @xexd's MassTree and a non-fancy B+Tree are currently the best for in-memory workloads. Skip Lists are always terrible.'
skip-lists  algorithms  data-structures  storage  bw-trees  mass-trees  benchmarks  performance  multithreading  lock-free  locking  trees 
april 2018 by jm
Austerity is an Algorithm
Fucking hell, things sound grim Down Under:
Things changed in December 2016, when the government announced that the system had undergone full automation. Humans would no longer investigate anomalies in earnings. Instead, debt notices would be automatically generated when inconsistencies were detected. The government’s rationale for automating the process was telling. “Our aim is to ensure that people get what they are entitled to—no more and no less,” read the press release. “And to crack down hard when people deliberately defraud the system.”

The result was a disaster. I’ve had friends who’ve received an innocuous email urging them to check their MyGov account—an online portal available to Australian citizens with an internet connection to access a variety of government services—only to log in and find they’re hundreds or thousands of dollars in arrears, supposedly because they didn’t accurately report their income. Some received threats from private debt collectors, who told them their wages would be seized if they didn’t submit to a payment plan.

Those who wanted to contest their debts had to lodge a formal complaint, and were subjected to hours of Mozart’s Divertimento in F Major before they could talk to a case worker. Others tried taking their concerns directly to the Centrelink agency on Twitter, where they were directed to calling Lifeline, a 24-hour hotline for crisis support and suicide prevention.

At the end of 2015, my friend Chloe received a notice claiming she owed $20,000 to the government. She was told that she had reported her income incorrectly while on Youth Allowance, which provides financial assistance to certain categories of young people.

The figure was shocking and, like others in her position, she grew suspicious. She decided to contest the debt: she contacted all of her previous employers so she could gather pay slips, and scanned them into the MyGov app. “I gave them all of my information to prove that there was no way I owed them $20,000,” she says.

The bean counters were unmoved. They maintained that Chloe had reported her after-tax income instead of her before-tax income. As a result, they increased the amount she owed to $30,000. She agreed to a payment plan, which will see her pay off the debt in fortnightly installments of $50 over the course of two decades. “I even looked into bankruptcy because I was so stressed by it,” she says. “All I could think about was the Centrelink debt, and once they upped it to 30k, I was so ashamed and sad and miserable,” she says.
austerity  algorithms  automation  dystopia  australia  government  debt-collectors  robo-debt  dole  benefit  grim-meathook-future 
april 2018 by jm
lemire/JavaFastPFOR: A simple integer compression library in Java

a library to compress and uncompress arrays of integers very fast. The assumption is that most (but not all) values in your array use much less than 32 bits, or that the gaps between the integers use much less than 32 bits. These sorts of arrays often come up when using differential coding in databases and information retrieval (e.g., in inverted indexes or column stores).

Please note that random integers are not compressible, by this library or by any other means. If you ever had the means of systematically compressing random integers, you could compress any data source to nothing, by recursive application of your technique.

This library can decompress integers at a rate of over 1.2 billion integers per second (4.5 GB/s). It is significantly faster than generic codecs (such as Snappy, LZ4 and so on) when compressing arrays of integers.

The library is used in LinkedIn Pinot, a realtime distributed OLAP datastore. Part of this library has been integrated in Parquet (http://parquet.io/). A modified version of the library is included in the search engine Terrier (http://terrier.org/). This library is used by ClueWeb Tools (https://github.com/lintool/clueweb). It is also used by Apache NiFi.
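A toy sketch of why such arrays compress well (differential coding only; the library's speed comes from the SIMD-friendly bit packing it layers on top):

    def delta_encode(sorted_ints):
        prev, out = 0, []
        for v in sorted_ints:
            out.append(v - prev)   # small gaps instead of large absolute values
            prev = v
        return out

    def delta_decode(deltas):
        total, out = 0, []
        for d in deltas:
            total += d
            out.append(total)
        return out

    assert delta_decode(delta_encode([5, 9, 12, 400])) == [5, 9, 12, 400]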
compression  java  pfor  encoding  integers  algorithms  storage 
april 2018 by jm
'Fiction is outperforming reality': how YouTube's algorithm distorts truth
"no matter which political side the researcher started from, the platform pushed pro-Trump, anti-Clinton videos."
youtube  truth  fake-news  conspiracy-theories  google  algorithms  politics  brexit  trump 
february 2018 by jm
google/highwayhash: Fast strong hash functions: SipHash/HighwayHash
HighwayHash: 'We have devised a new way of mixing inputs with AVX2 multiply and permute instructions. The multiplications are 32x32 -> 64 bits and therefore infeasible to reverse. Permuting equalizes the distribution of the resulting bytes. The internal state occupies four 256-bit AVX2 registers. Due to limitations of the instruction set, the registers are partitioned into two 512-bit halves that remain independent until the reduce phase. The algorithm outputs 64 bit digests or up to 256 bits at no extra cost. In addition to high throughput, the algorithm is designed for low finalization cost. The result is more than twice as fast as SipTreeHash.

We also provide an SSE4.1 version (80% as fast for large inputs and 95% as fast for short inputs), an implementation for VSX on POWER and a portable version (10% as fast). A third-party ARM implementation is referenced below.

Statistical analyses and preliminary cryptanalysis are given in https://arxiv.org/abs/1612.06257.'

(via Tony Finch)
siphash  highwayhash  via:fanf  hashing  hashes  algorithms  mac  google  hash 
january 2018 by jm
How Syria's White Helmets became victims of an online propaganda machine | World news | The Guardian
The way the Russian propaganda machine has targeted the White Helmets is a neat case study in the prevailing information wars. It exposes just how rumours, conspiracy theories and half-truths bubble to the top of YouTube, Google and Twitter search algorithms.

“This is the heart of Russian propaganda. In the old days they would try and portray the Soviet Union as a model society. Now it’s about confusing every issue with so many narratives that people can’t recognise the truth when they see it,” said David Patrikarakos, author of War in 140 Characters: How Social Media is Reshaping Conflict in the 21st Century.
propaganda  white-helmets  russia  disinfo  syria  facebook  assad  google  youtube  fud  algorithms 
december 2017 by jm
The 10 Top Recommendations for the AI Field in 2017 from the AI Now Institute
I am 100% behind this. There's so much potential for hidden bias and unethical discrimination in careless AI/ML deployment.
While AI holds significant promise, we’re seeing significant challenges in the rapid push to integrate these systems into high stakes domains. In criminal justice, a team at ProPublica, and multiple academics since, have investigated how an algorithm used by courts and law enforcement to predict recidivism in criminal defendants may be introducing significant bias against African Americans. In a healthcare setting, a study at the University of Pittsburgh Medical Center observed that an AI system used to triage pneumonia patients was missing a major risk factor for severe complications. In the education field, teachers in Texas successfully sued their school district for evaluating them based on a ‘black box’ algorithm, which was exposed to be deeply flawed.

This handful of examples is just the start — there’s much more we do not yet know. Part of the challenge is that the industry currently lacks standardized methods for testing and auditing AI systems to ensure they are safe and not amplifying bias. Yet early-stage AI systems are being introduced simultaneously across multiple areas, including healthcare, finance, law, education, and the workplace. These systems are increasingly being used to predict everything from our taste in music, to our likelihood of experiencing mental illness, to our fitness for a job or a loan.
ai  algorithms  machine-learning  ai-now  ethics  bias  racism  discrimination 
november 2017 by jm
A Branchless UTF-8 Decoder
This week I took a crack at writing a branchless UTF-8 decoder: a function that decodes a single UTF-8 code point from a byte stream without any if statements, loops, short-circuit operators, or other sorts of conditional jumps. [...] Why branchless? Because high performance CPUs are pipelined. That is, a single instruction is executed over a series of stages, and many instructions are executed in overlapping time intervals, each at a different stage.


Neat hack (via Tony Finch)
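A condensed Python model of the table-driven idea (the post's C version unrolls the loop and tracks errors; this only shows the length-table-plus-masks shape):

    LEN = [1]*16 + [0]*8 + [2]*4 + [3]*2 + [4, 0]   # indexed by top 5 bits
    FIRST_MASK = [0, 0x7F, 0x1F, 0x0F, 0x07]

    def decode_one(buf, i):
        """Decode one code point starting at buf[i]; returns (cp, next_i)."""
        n = LEN[buf[i] >> 3]                  # 0 marks an invalid lead byte
        cp = buf[i] & FIRST_MASK[n]
        for k in range(1, n):                 # fixed-bound loop, no data branches
            cp = (cp << 6) | (buf[i + k] & 0x3F)
        return cp, i + max(n, 1)

    assert decode_one("é".encode(), 0) == (0xE9, 2)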
algorithms  optimization  unicode  utf8  branchless  coding  c  via:fanf 
october 2017 by jm
Google and Facebook Have Failed Us - The Atlantic
There’s no hiding behind algorithms anymore. The problems cannot be minimized. The machines have shown they are not up to the task of dealing with rare, breaking news events, and it is unlikely that they will be in the near future. More humans must be added to the decision-making process, and the sooner the better.
algorithms  facebook  google  las-vegas  news  filtering  hoaxes  4chan  abuse  breaking-news  responsibility  silicon-valley 
october 2017 by jm
London police’s use of AFR facial recognition falls flat on its face
A “top-of-the-line” automated facial recognition (AFR) system trialled for the second year in a row at London’s Notting Hill Carnival couldn’t even tell the difference between a young woman and a balding man, according to a rights group worker invited to view it in action. Because yes, of course they did it again: London’s Met police used controversial, inaccurate, largely unregulated automated facial recognition (AFR) technology to spot troublemakers. And once again, it did more harm than good.

Last year, it proved useless. This year, it proved worse than useless: it blew up in their faces, with 35 false matches and one wrongful arrest of somebody erroneously tagged as being wanted on a warrant for a rioting offense.

[...] During a recent, scathing US House oversight committee hearing on the FBI’s use of the technology, it emerged that 80% of the people in the FBI database don’t have any sort of arrest record. Yet the system’s recognition algorithm inaccurately identifies them during criminal searches 15% of the time, with black women most often being misidentified.
face-recognition  afr  london  notting-hill-carnival  police  liberty  met-police  privacy  data-privacy  algorithms 
september 2017 by jm
"Use trees. Not too deep. Mostly ensembles."
snarky summary of 'Data-driven Advice for Applying Machine Learning to Bioinformatics Problems', a recent paper benchmarking ML algorithms
algorithms  machine-learning  bioinformatics  funny  advice  classification 
september 2017 by jm
Sattolo's algorithm
produces a randomized permutation of a list, with exactly one cycle (which guarantees that we will reach every element of the list even though we’re traversing it in random order)
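It's a one-character-ish change to Fisher-Yates -- a minimal version:

    import random

    def sattolo(xs):
        for i in range(len(xs) - 1, 0, -1):
            j = random.randrange(i)   # excludes i itself, unlike Fisher-Yates
            xs[i], xs[j] = xs[j], xs[i]
        return xs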
algorithms  lists  permutation  random  randomization  cycles 
august 2017 by jm
consistent hashing with bounded loads
'an algorithm that combined consistent hashing with an upper limit on any one server’s load, relative to the average load of the whole pool.'

Lovely blog post from Vimeo's eng blog on a new variation on consistent hashing -- incorporating a concept of overload-avoidance -- and adding it to HAProxy and using it in production in Vimeo. All sounds pretty nifty! (via Toby DiPasquale)
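A toy sketch of the bounded-load rule (my illustration of the paper's idea of capping each server at c times the average load, not Vimeo's HAProxy patch):

    import bisect, hashlib, math

    def _h(s):
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    class BoundedRing:
        def __init__(self, servers, c=1.25):
            self.ring = sorted((_h(s), s) for s in servers)
            self.load = {s: 0 for s in servers}
            self.c, self.total = c, 0

        def assign(self, key):
            cap = math.ceil(self.c * (self.total + 1) / len(self.load))
            i = bisect.bisect(self.ring, (_h(key), ""))
            for k in range(len(self.ring)):   # walk clockwise past full servers
                s = self.ring[(i + k) % len(self.ring)][1]
                if self.load[s] + 1 <= cap:
                    self.load[s] += 1
                    self.total += 1
                    return s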
via:codeslinger  algorithms  networking  performance  haproxy  consistent-hashing  load-balancing  lbs  vimeo  overload  load 
august 2017 by jm
The inventor of dynamic programming had to hide the fact he was inventing it from the Secretary of Defense
"His face would suffuse, he would turn red, and he would get violent if people used the term "research" in his presence. You can imagine how he felt, then, about the term "mathematical". [....] I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside RAND"
rand  funny  history  insane  dr-strangelove  1950s  dynamic-programming  mathematics  algorithms 
june 2017 by jm
The Dark Secret at the Heart of AI - MIT Technology Review
'The mysterious mind of [NVidia's self-driving car, driven by machine learning] points to a looming issue with artificial intelligence. The car’s underlying AI technology, known as deep learning, has proved very powerful at solving problems in recent years, and it has been widely deployed for tasks like image captioning, voice recognition, and language translation. There is now hope that the same techniques will be able to diagnose deadly diseases, make million-dollar trading decisions, and do countless other things to transform whole industries.

But this won’t happen—or shouldn’t happen—unless we find ways of making techniques like deep learning more understandable to their creators and accountable to their users. Otherwise it will be hard to predict when failures might occur—and it’s inevitable they will. That’s one reason Nvidia’s car is still experimental.

Already, mathematical models are being used to help determine who makes parole, who’s approved for a loan, and who gets hired for a job. If you could get access to these mathematical models, it would be possible to understand their reasoning. But banks, the military, employers, and others are now turning their attention to more complex machine-learning approaches that could make automated decision-making altogether inscrutable. Deep learning, the most common of these approaches, represents a fundamentally different way to program computers. “It is a problem that is already relevant, and it’s going to be much more relevant in the future,” says Tommi Jaakkola, a professor at MIT who works on applications of machine learning. “Whether it’s an investment decision, a medical decision, or maybe a military decision, you don’t want to just rely on a ‘black box’ method.”'
ai  algorithms  ml  machine-learning  legibility  explainability  deep-learning  nvidia 
may 2017 by jm
How your selfie could affect your life insurance
Noping so hard. Imagine the levels of algorithmic discrimination inherent in this shit.
"Your face is something you wear all your life, and it tells a very unique story about you," says Karl Ricanek Jr., co-founder and chief data scientist at Lapetus Solutions Inc. in Wilmington, N.C.

Several life insurance companies are testing Lapetus technology that uses facial analytics and other data to estimate life expectancy, he says. (Lapetus would not disclose the names of companies testing its product.) Insurers use life expectancy estimates to make policy approval and pricing decisions. Lapetus says its product, Chronos, would enable a customer to buy life insurance online in as little as 10 minutes without taking a life insurance medical exam.
discrimination  computer-says-no  algorithms  selfies  face  lapetus  photos  life-insurance  life-expectancy 
may 2017 by jm
Rule by Nobody
'Algorithms update bureaucracy’s long-standing strategy for evasion.'
The need to optimize yourself for a network of opaque algorithms induces a sort of existential torture. In The Utopia of Rules: On Technology, Stupidity, and the Secret Joys of Bureaucracy, anthropologist David Graeber suggests a fundamental law of power dynamics: “Those on the bottom of the heap have to spend a great deal of imaginative energy trying to understand the social dynamics that surround them — including having to imagine the perspectives of those on top — while the latter can wander about largely oblivious to much of what is going on around them. That is, the powerless not only end up doing most of the actual, physical labor required to keep society running, they also do most of the interpretive labor as well.” This dynamic, Graeber argues, is built into all bureaucratic structures. He describes bureaucracies as “ways of organizing stupidity” — that is, of managing and reproducing these “extremely unequal structures of imagination” in which the powerful can disregard the perspectives of those beneath them in various social and economic hierarchies. Employees need to anticipate the needs of bosses; bosses need not reciprocate. People of color are forced to learn to accommodate and anticipate the ignorance and hostility of white people. Women need to be acutely aware of men’s intentions and feelings. And so on. Even benevolent-seeming bureaucracies, in Graeber’s view, have the effect of reinforcing “the highly schematized, minimal, blinkered perspectives typical of the powerful” and their privileges of ignorance and indifference toward those positioned as below them.
algorithms  bureaucracy  democracy  life  society  via:raycorrigan  technology  power 
may 2017 by jm
A Mind Is Born
A C=64 demo in 256 bytes! Awesome work. Use of an LFSR number generator to create the melody is particularly clever (via Craig)
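For the curious, the melody trick boils down to something like this (an 8-bit Galois LFSR; illustrative only, not the demo's actual 6502 routine):

    def lfsr_notes(state=0xAC, taps=0xB8, n=16):
        """A deterministic pseudo-random byte stream from one byte of state."""
        out = []
        for _ in range(n):
            lsb = state & 1
            state >>= 1
            if lsb:
                state ^= taps    # the feedback polynomial shapes the sequence
            out.append(state)
        return out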
art  programming  computers  demos  demoscene  c-64  via:craig  lfsr  algorithms 
april 2017 by jm
'Mathwashing,' Facebook and the zeitgeist of data worship
Fred Benenson: Mathwashing can be thought of as using math terms (algorithm, model, etc.) to paper over a more subjective reality. For example, a lot of people believed Facebook was using an unbiased algorithm to determine its trending topics, even if Facebook had previously admitted that humans were involved in the process.
maths  math  mathwashing  data  big-data  algorithms  machine-learning  bias  facebook  fred-benenson 
april 2017 by jm
Image comparison algorithms
Awesome StackOverflow answer for detecting "similar" images -- promising approach to reimplement ffffound's similarity feature in mltshp, maybe
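One approach from that family of answers, sketched as a tiny average-hash (my sketch; assumes the Pillow imaging library):

    from PIL import Image

    def ahash(path, size=8):
        img = Image.open(path).convert("L").resize((size, size))
        px = list(img.getdata())
        avg = sum(px) / len(px)
        return sum(1 << i for i, p in enumerate(px) if p > avg)

    def hamming(a, b):
        return bin(a ^ b).count("1")   # small distance => visually similar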
algorithms  hashing  comparison  diff  images  similarity  search  ffffound  mltshp 
april 2017 by jm
HyperBitBit
jomsdev notes:

'Last year, at the AofA'16 conference, Robert Sedgewick proposed a new algorithm for cardinality estimation. Sedgewick is a professor at Princeton with a long track record of publications on combinatorial/randomized algorithms. He was a good friend of Philippe Flajolet (creator of HyperLogLog), and HyperBitBit is based on the same ideas. However, it uses less memory than HyperLogLog and can provide the same results. On practical data, for N < 2^64, HyperBitBit estimates cardinality to within 10% using only 128 + 6 bits.'
algorithms  programming  cs  hyperloglog  estimation  cardinality  counting  hyperbitbit 
march 2017 by jm
Artificial intelligence is ripe for abuse, tech researcher warns: 'a fascist's dream' | Technology | The Guardian
“We should always be suspicious when machine learning systems are described as free from bias if it’s been trained on human-generated data,” Crawford said. “Our biases are built into that training data.”

In the Chinese research it turned out that the faces of criminals were more unusual than those of law-abiding citizens. “People who had dissimilar faces were more likely to be seen as untrustworthy by police and judges. That’s encoding bias,” Crawford said. “This would be a terrifying system for an autocrat to get his hand on.” [...]

With AI this type of discrimination can be masked in a black box of algorithms, as appears to be the case with a company called Faception, for instance, a firm that promises to profile people’s personalities based on their faces. In its own marketing material, the company suggests that Middle Eastern-looking people with beards are “terrorists”, while white looking women with trendy haircuts are “brand promoters”.
bias  ai  racism  politics  big-data  technology  fascism  crime  algorithms  faceception  discrimination  computer-says-no 
march 2017 by jm
[1606.08813] European Union regulations on algorithmic decision-making and a "right to explanation"
We summarize the potential impact that the European Union's new General Data Protection Regulation will have on the routine use of machine learning algorithms. Slated to take effect as law across the EU in 2018, it will restrict automated individual decision-making (that is, algorithms that make decisions based on user-level predictors) which "significantly affect" users. The law will also effectively create a "right to explanation," whereby a user can ask for an explanation of an algorithmic decision that was made about them. We argue that while this law will pose large challenges for industry, it highlights opportunities for computer scientists to take the lead in designing algorithms and evaluation frameworks which avoid discrimination and enable explanation.


oh this'll be tricky.
algorithms  accountability  eu  gdpr  ml  machine-learning  via:daveb  europe  data-protection  right-to-explanation 
march 2017 by jm
US immigration asking tech interview trivia questions now
what the absolute fuck. Celestine Omin on Twitter: "I was just asked to balance a Binary Search Tree by JFK's airport immigration. Welcome to America."
twitter  celestine-omin  us-politics  immigration  tests  interviews  bst  trees  data-structures  algorithms 
february 2017 by jm
tdunning/t-digest
A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. The t-digest algorithm is also very parallel friendly making it useful in map-reduce and parallel streaming applications.

The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a data structure that is related to the Q-digest. This t-digest data structure can be used to estimate quantiles or compute other rank statistics. The advantage of the t-digest over the Q-digest is that the t-digest can handle floating point values while the Q-digest is limited to integers. With small changes, the t-digest can handle any values from any ordered set that has something akin to a mean. The accuracy of quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by Q-digests in spite of the fact that t-digests are more compact when stored on disk.


Super-nice feature is that it's mergeable, so amenable to parallel usage across multiple hosts if required. Java implementation, ASL licensing.
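Quick usage sketch (assuming the Python `tdigest` port's API; the reference implementation is the Java library above). The mergeability is the selling point:

    from tdigest import TDigest

    d1, d2 = TDigest(), TDigest()
    for i in range(5000):
        d1.update(i)           # e.g. latencies seen by host 1
        d2.update(i + 5000)    # ... and by host 2

    merged = d1 + d2           # digests merge, so hosts can sketch independently
    print(merged.percentile(99))   # approx. 99th percentile over all 10k values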
data-structures  algorithms  java  t-digest  statistics  quantiles  percentiles  aggregation  digests  estimation  ranking 
december 2016 by jm
Fast Forward Labs: Probabilistic Data Structure Showdown: Cuckoo Filters vs. Bloom Filters
Nice comparison of a counting Bloom filter and a Cuckoo Filter, implemented in Python:
This post provides an update by exploring Cuckoo filters, a new probabilistic data structure that improves upon the standard Bloom filter. The Cuckoo filter provides a few advantages: 1) it enables dynamic deletion and addition of items 2) it can be easily implemented compared to Bloom filter variants with similar capabilities, and 3) for similar space constraints, the Cuckoo filter provides lower false positives, particularly at lower capacities. We provide a python implementation of the Cuckoo filter here, and compare it to a counting Bloom filter (a Bloom filter variant).
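For reference, the plain textbook Bloom filter is only a few lines (my sketch, not the post's implementation):

    import hashlib

    class Bloom:
        def __init__(self, m=1024, k=3):
            self.m, self.k, self.bits = m, k, 0

        def _indexes(self, item):
            for i in range(self.k):
                h = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, item):
            for i in self._indexes(item):
                self.bits |= 1 << i

        def __contains__(self, item):
            # may false-positive, never false-negative; no deletion --
            # exactly the gap that counting Blooms and Cuckoo filters fill
            return all(self.bits >> i & 1 for i in self._indexes(item))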
algorithms  probabilistic  approximation  bloom-filters  cuckoo-filters  sets  estimation  python 
november 2016 by jm
Accidentally Quadratic — Rust hash iteration+reinsertion
It was recently discovered that some surprising operations on Rust’s standard hash table types could go quadratic.


Quite a nice unexpected accidental detour into O(n^2)
big-o  hashing  robin-hood-hashing  siphash  algorithms  hashtables  rust 
november 2016 by jm
Jeff Erickson's Algorithms, Etc.
This page contains lecture notes and other course materials for various algorithms classes I have taught at the University of Illinois, Urbana-Champaign. The notes are numbered in the order I cover the material in a typical undergraduate class, with notes on more advanced material (indicated by the symbol ♥) interspersed appropriately. [...] In addition to the algorithms notes I have been maintaining since 1999, this page also contains new notes on "Models of Computation", which cover a small subset of the material normally taught in undergraduate courses in formal languages and automata. I wrote these notes for a new junior-level course on "Algorithms and Models of Computation" that Lenny Pitt and I developed, which is now required for all undergraduate computer science and computer engineering majors at UIUC.


Via Tony Finch
via:fanf  book  cs  algorithms  jeff-erickson  uiuc 
november 2016 by jm
MemC3: Compact and concurrent Memcache with dumber caching and smarter hashing
An improved hashing algorithm called optimistic cuckoo hashing, and a CLOCK-based eviction algorithm that works in tandem with it. They are evaluated in the context of Memcached, where combined they give up to a 30% memory usage reduction and up to a 3x improvement in queries per second as compared to the default Memcached implementation on read-heavy workloads with small objects (as is typified by Facebook workloads).
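The basic (non-optimistic) cuckoo scheme underneath, sketched: two candidate slots per key, and inserts evict-and-relocate until everything fits:

    import random

    class Cuckoo:
        def __init__(self, size=64):
            self.t = [None] * size
            self.size = size

        def _slots(self, key):
            return hash(("a", key)) % self.size, hash(("b", key)) % self.size

        def insert(self, key, max_kicks=32):
            for _ in range(max_kicks):
                for s in self._slots(key):
                    if self.t[s] is None:
                        self.t[s] = key
                        return True
                s = random.choice(self._slots(key))
                self.t[s], key = key, self.t[s]   # evict a victim, re-place it
            return False                          # time to resize/rehash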
memcached  performance  key-value-stores  storage  databases  cuckoo-hashing  algorithms  concurrency  caching  cache-eviction  memory  throughput 
november 2016 by jm
Facebook scuppers Admiral Insurance plan to base premiums on your posts
Well, this is amazingly awful:
The Guardian claims to have further details of the kind of tell-tale signs that Admiral's algorithmic analysis would have looked out for in Facebook posts. Good traits include "writing in short concrete sentences, using lists, and arranging to meet friends at a set time and place, rather than just 'tonight'." On the other hand, "evidence that the Facebook user might be overconfident—such as the use of exclamation marks and the frequent use of 'always' or 'never' rather than 'maybe'—will count against them."


The future is shitty.
insurance  facebook  scoring  computer-says-no  algorithms  text-analysis  awful  future 
november 2016 by jm
Here's Why Facebook's Trending Algorithm Keeps Promoting Fake News - BuzzFeed News
Kalina Bontcheva leads the EU-funded PHEME project working to compute the veracity of social media content. She said reducing the amount of human oversight for Trending heightens the likelihood of failures, and of the algorithm being fooled by people trying to game it.
“I think people are always going to try and outsmart these algorithms — we’ve seen this with search engine optimization,” she said. “I’m sure that once in a while there is going to be a very high-profile failure.”
Less human oversight means more reliance on the algorithm, which creates a new set of concerns, according to Kate Starbird, an assistant professor at the University of Washington who has been using machine learning and other technology to evaluate the accuracy of rumors and information during events such as the Boston bombings.
“[Facebook is] making an assumption that we’re more comfortable with a machine being biased than with a human being biased, because people don’t understand machines as well,” she said.
facebook  news  gaming  adversarial-classification  pheme  truth  social-media  algorithms  ml  machine-learning  media 
october 2016 by jm
Remarks at the SASE Panel On The Moral Economy of Tech
Excellent talk. I love this analogy for ML applied to real-world data which affects people:
Treating the world as software promotes fantasies of control. And the best kind of control is control without responsibility. Our unique position as authors of software used by millions gives us power, but we don't accept that this should make us accountable. We're programmers—who else is going to write the software that runs the world? To put it plainly, we are surprised that people seem to get mad at us for trying to help. Fortunately we are smart people and have found a way out of this predicament. Instead of relying on algorithms, which we can be accused of manipulating for our benefit, we have turned to machine learning, an ingenious way of disclaiming responsibility for anything. Machine learning is like money laundering for bias. It's a clean, mathematical apparatus that gives the status quo the aura of logical inevitability. The numbers don't lie.


Particularly apposite today given Y Combinator's revelation that they use an AI bot to help 'sift admission applications', and don't know what criteria it's using: https://twitter.com/aprjoy/status/783032128653107200
culture  ethics  privacy  technology  surveillance  ml  machine-learning  bias  algorithms  software  control 
october 2016 by jm
[net-next,14/14] tcp_bbr: add BBR congestion control
This commit implements a new TCP congestion control algorithm: BBR
(Bottleneck Bandwidth and RTT). A detailed description of BBR will be
published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
"BBR: Congestion-Based Congestion Control".

BBR has significantly increased throughput and reduced latency for
connections on Google's internal backbone networks and google.com and
YouTube Web servers.

BBR requires only changes on the sender side, not in the network or
the receiver side. Thus it can be incrementally deployed on today's
Internet, or in datacenters. [....]

Signed-off-by: Van Jacobson <vanj@google.com>
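Sender-side-only also means an application can opt in per socket once the kernel supports it -- a small Linux-specific sketch:

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    TCP_CONGESTION = getattr(socket, "TCP_CONGESTION", 13)   # Linux sockopt
    s.setsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, b"bbr")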
google  linux  tcp  ip  congestion-control  bufferbloat  patches  algorithms  rtt  bandwidth  youtube  via:bradfitz 
september 2016 by jm
Highly Available Counters Using Cassandra
solid discussion of building HA counters using CRDTs and similar eventually-consistent data structures
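The simplest of the family, the grow-only G-counter, fits in a few lines and shows the convergence trick (element-wise max makes merges commute, so replicas agree regardless of delivery order):

    class GCounter:
        def __init__(self, node):
            self.node, self.counts = node, {}

        def incr(self, n=1):
            self.counts[self.node] = self.counts.get(self.node, 0) + n

        def merge(self, other):
            for k, v in other.counts.items():
                self.counts[k] = max(self.counts.get(k, 0), v)

        def value(self):
            return sum(self.counts.values())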
crdts  algorithms  data-structures  cassandra  ha  counters 
september 2016 by jm
Algorithmic management as the new Taylorism
'its legacy can be seen in factories, call centres and warehouses today, although new technology has taken the place of Taylor’s instruction cards and stopwatches. Many warehouse workers for companies such as Amazon use handheld devices that give them step-by-step instructions on where to walk and what to pick from the shelves when they get there, all the while measuring their “pick rate” in real time. For Jeremias Prassl, a law professor at Oxford university, the algorithmic management techniques of Uber and Deliveroo are Taylorism 2.0. “Algorithms are providing a degree of control and oversight that even the most hardened Taylorists could never have dreamt of,” he says.'
algorithms  labour  work  labor  taylorism  management  silicon-valley  tech  deliveroo  uber  piece-work 
september 2016 by jm
Ratas - A hierarchical timer wheel
excellent explanation and benchmarks of a timer wheel implementation
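The single-level version is a page of code -- a bare-bones sketch (the post's hierarchical variant cascades timers between wheels of coarser resolution):

    class TimerWheel:
        def __init__(self, slots=256, tick_ms=10):
            self.slots = [[] for _ in range(slots)]
            self.tick_ms, self.now = tick_ms, 0

        def schedule(self, delay_ms, callback):
            ticks = delay_ms // self.tick_ms
            slot = (self.now + ticks) % len(self.slots)
            rounds = ticks // len(self.slots)   # full revolutions to survive
            self.slots[slot].append([rounds, callback])

        def tick(self):                         # called every tick_ms
            self.now = (self.now + 1) % len(self.slots)
            bucket, self.slots[self.now] = self.slots[self.now], []
            for entry in bucket:
                if entry[0] == 0:
                    entry[1]()                  # expired: fire the callback
                else:
                    entry[0] -= 1               # survives another revolution
                    self.slots[self.now].append(entry)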
timer-wheels  timing-wheels  algorithms  c  linux  timers  data-structures 
august 2016 by jm
A fast alternative to the modulo reduction
(x * N) div 2^32 maps a 32-bit value x into [0, N) just as fairly as x mod N, but runs much faster on modern 64-bit CPUs because it avoids the hardware divide
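In Python terms (assumes x is an unbiased 32-bit hash):

    def fast_reduce(x, n):
        """Map x in [0, 2**32) to [0, n) without a division."""
        return (x * n) >> 32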
fairness  modulo  arithmetic  algorithms  fair-mapping  reduce  daniel-lemire 
june 2016 by jm
In Wisconsin, a Backlash Against Using Data to Foretell Defendants’ Futures - The New York Times
More trial-by-algorithm horrors:
Company officials say the algorithm’s results are backed by research, but they are tight-lipped about its details. They do acknowledge that men and women receive different assessments, as do juveniles, but the factors considered and the weight given to each are kept secret.

“The key to our product is the algorithms, and they’re proprietary,” said Jeffrey Harmon, Northpointe’s general manager. “We’ve created them, and we don’t release them because it’s certainly a core piece of our business. It’s not about looking at the algorithms. It’s about looking at the outcomes.”

That secrecy is at the heart of Mr. Loomis’s lawsuit. His lawyer, Michael D. Rosenberg, who declined to be interviewed because of the pending appeal, argued that Mr. Loomis should be able to review the algorithm and make arguments about its validity as part of his defense. He also challenges the use of different scales for each sex.

The Compas system, Mr. Rosenberg wrote in his brief, “is full of holes and violates the requirement that a sentence be individualized.”
ethics  compas  sentencing  wisconsin  northpointe  law  trial-by-algorithm  algorithms 
june 2016 by jm
Differential Privacy
Apple have announced they plan to use it; Google use a DP algorithm called RAPPOR in Chrome usage statistics. In summary: "novel privacy technology that allows inferring statistics about populations while preserving the privacy of individual users".
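The flavour of the idea in miniature -- classic randomized response, the textbook differentially private mechanism (RAPPOR is an elaborate descendant of this):

    import random

    def randomized_response(truth: bool) -> bool:
        if random.random() < 0.5:
            return truth                  # answer honestly half the time
        return random.random() < 0.5      # otherwise answer at random

    # any single answer is deniable, but with p the observed "yes" rate,
    # the true population rate is approximately 2*p - 0.5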
apple  privacy  anonymization  google  rappor  algorithms  sampling  populations  statistics  differential-privacy 
june 2016 by jm
The tyranny of the algorithm yet again...
Paypal will no longer handle payments if the user's address includes the word "Isis":
That these place names exist won't be a surprise to anyone familiar with English limnology - the study of rivers and inland waters. As Wikipedia helpfully tells us, "The Isis is the name given to the part of the River Thames above Iffley Lock which flows through the university city of Oxford". In at least one local primary school I'm familiar with, the classes are called Windrush, Cherwell, Isis and Thames.

[...] Now PayPal has decided that they are not prepared to facilitate payments for goods to be delivered to an address which includes the word "Isis". An Isis street resident ran into some unexpected difficulties when attempting to purchase a small quantity of haberdashery on the internet with the aid of a PayPal account. The transaction would not process. In puzzlement she eventually got irritated enough to brave the 24/7 customer support telephone tag labyrinth. The short version of the response from the eventual real person she managed to get through to was that PayPal have blacklisted addresses which include the name "Isis". They will not process payments for goods to be delivered to an Isis related address, whatever state of privileged respectability the residents of such properties may have earned or inherited in their lifetimes to this point.


One has to wonder if this also brings the risk of adding the user to a secret list, somewhere. Trial by algorithm.
isis  algorithms  automation  fail  law-enforcement  paypal  uk  rivers 
june 2016 by jm
Social Network Algorithms Are Distorting Reality By Boosting Conspiracy Theories | Co.Exist | ideas + impact
In his 1962 book, The Image: A Guide to Pseudo-Events in America, former Librarian of Congress Daniel J. Boorstin describes a world where our ability to technologically shape reality is so sophisticated, it overcomes reality itself. "We risk being the first people in history," he writes, "to have been able to make their illusions so vivid, so persuasive, so ‘realistic’ that they can live in them."
algorithms  facebook  ethics  filtering  newsfeed  conspiracy-theories  twitter  viral  crazy 
may 2016 by jm
Darts, Dice, and Coins
Earlier this year, I asked a question on Stack Overflow about a data structure for loaded dice. Specifically, I was interested in answering this question: "You are given an n-sided die where side i has probability p_i of being rolled. What is the most efficient data structure for simulating rolls of the die?"

This data structure could be used for many purposes. For starters, you could use it to simulate rolls of a fair, six-sided die by assigning probability 1/6 to each of the sides of the die, or to simulate a fair coin by simulating a two-sided die where each side has probability 1/2 of coming up. You could also use this data structure to directly simulate the total of two fair six-sided dice being thrown by having an 11-sided die (whose faces were 2, 3, 4, ..., 12), where each side was appropriately weighted with the probability that this total would show if you used two fair dice. However, you could also use this data structure to simulate loaded dice. For example, if you were playing craps with dice that you knew weren't perfectly fair, you might use the data structure to simulate many rolls of the dice to see what the optimal strategy would be. You could also consider simulating an imperfect roulette wheel in the same way.

Outside the domain of game-playing, you could also use this data structure in robotics simulations where sensors have known failure rates. For example, if a range sensor has a 95% chance of giving the right value back, a 4% chance of giving back a value that's too small, and a 1% chance of handing back a value that's too large, you could use this data structure to simulate readings from the sensor by generating a random outcome and simulating the sensor reading in that case.

The answer I received on Stack Overflow impressed me for two reasons. First, the solution pointed me at a powerful technique called the alias method that, under certain reasonable assumptions about the machine model, is capable of simulating rolls of the die in O(1) time after a simple preprocessing step. Second, and perhaps more surprisingly, this algorithm has been known for decades, but I had not once encountered it! Considering how much processing time is dedicated to simulation, I would have expected this technique to be better-known. A few quick Google searches turned up a wealth of information on the technique, but I couldn't find a single site that compiled together the intuition and explanation behind the technique.


(via Marc Brooker)
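The preprocessing (Vose's variant of the alias method) and the O(1) roll, condensed into a sketch:

    import random

    def build_alias(probs):
        n = len(probs)
        scaled = [p * n for p in probs]
        small = [i for i, s in enumerate(scaled) if s < 1]
        large = [i for i, s in enumerate(scaled) if s >= 1]
        prob, alias = [0.0] * n, [0] * n
        while small and large:
            s, l = small.pop(), large.pop()
            prob[s], alias[s] = scaled[s], l    # column s: stay vs. alias
            scaled[l] -= 1 - scaled[s]          # donate mass to fill column s
            (small if scaled[l] < 1 else large).append(l)
        for i in small + large:
            prob[i] = 1.0                       # mop up float leftovers
        return prob, alias

    def roll(prob, alias):
        i = random.randrange(len(prob))         # pick a column uniformly
        return i if random.random() < prob[i] else alias[i]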
via:marcbrooker  algorithms  probability  algorithm  coding  data-structures  alias  dice  random 
april 2016 by jm
Rendezvous hashing - Wikipedia, the free encyclopedia

Rendezvous or Highest Random Weight (HRW) hashing[1][2] is an algorithm that allows clients to achieve distributed agreement on a set of k options out of a possible set of n options. A typical application is when clients need to agree on which sites (or proxies) objects are to be assigned to. When k is 1, it subsumes the goals of consistent hashing, using an entirely different method.
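Which is most of why it's so pleasant to implement -- a minimal version:

    import hashlib

    def _score(site, key):
        return hashlib.sha256(f"{site}:{key}".encode()).digest()

    def owners(sites, key, k=1):
        """Every client computes this independently and gets the same answer."""
        return sorted(sites, key=lambda s: _score(s, key), reverse=True)[:k]

    # owners(["a", "b", "c"], "object-42", k=2)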
hrw  hashing  hashes  consistent-hashing  rendezvous-hashing  algorithms  discovery  distributed-computing 
april 2016 by jm
BLAKE2: simpler, smaller, fast as MD5
'We present the cryptographic hash function BLAKE2, an improved version of the SHA-3 finalist BLAKE optimized for speed in software. Target applications include cloud storage, intrusion detection, or version control systems. BLAKE2 comes in two main flavors: BLAKE2b is optimized for 64-bit platforms, and BLAKE2s for smaller architectures. On 64-bit platforms, BLAKE2 is often faster than MD5, yet provides security similar to that of SHA-3. We specify parallel versions BLAKE2bp and BLAKE2sp that are up to 4 and 8 times faster, by taking advantage of SIMD and/or multiple cores. BLAKE2 has more benefits than just speed: BLAKE2 uses up to 32% less RAM than BLAKE, and comes with a comprehensive tree-hashing mode as well as an efficient MAC mode.'
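BLAKE2 later landed in Python's hashlib (3.6+), keyed MAC mode included:

    import hashlib

    digest = hashlib.blake2b(b"hello", digest_size=32).hexdigest()
    mac = hashlib.blake2b(b"message", key=b"secret", digest_size=16).hexdigest()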
crypto  hash  blake2  hashing  blake  algorithms  sha1  sha3  simd  performance  mac 
april 2016 by jm
“Racist algorithms” and learned helplessness
Whenever I’ve had to talk about bias in algorithms, I’ve tried be  careful to emphasize that it’s not that we shouldn’t use algorithms in search, recommendation and decision making. It’s that we often just don’t know how they’re making their decisions to present answers, make recommendations or arrive at conclusions, and it’s this lack of transparency that’s worrisome. Remember, algorithms aren’t just code.

What’s also worrisome is the amplifier effect. Even if “all an algorithm is doing” is reflecting and transmitting biases inherent in society, it’s also amplifying and perpetuating them on a much larger scale than your friendly neighborhood racist. And that’s the bigger issue. [...] even if the algorithm isn’t creating bias, it’s creating a feedback loop that has powerful perception effects.
feedback  bias  racism  algorithms  software  systems  society 
april 2016 by jm
Elias gamma coding
'used most commonly when coding integers whose upper-bound cannot be determined beforehand.'
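The whole code is two rules -- N-1 zeros, then the N bits of the value:

    def elias_gamma(n: int) -> str:
        assert n >= 1, "gamma codes are defined for positive integers"
        b = bin(n)[2:]
        return "0" * (len(b) - 1) + b

    assert elias_gamma(1) == "1" and elias_gamma(9) == "0001001"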
data-structures  algorithms  elias-gamma-coding  encoding  coding  numbers  integers 
april 2016 by jm
Hashed Wheel Timer
nice java impl of this efficient data structure, broken out from Project Reactor
scalability  java  timers  hashed-wheel-timers  algorithms  data-structures 
march 2016 by jm
Hey Microsoft, the Internet Made My Bot Racist, Too
All machine learning algorithms strive to exaggerate and perpetuate the past. That is, after all, what they are learning from. The fundamental assumption of every machine learning algorithm is that the past is correct, and anything coming in the future will be, and should be, like the past. This is a fine assumption to make when you are Netflix trying to predict what movie you’ll like, but is immoral when applied to many other situations. For bots like mine and Microsoft’s, built for entertainment purposes, it can lead to embarrassment. But AI has started to be used in much more meaningful ways: predictive policing in Chicago, for example, has already led to widespread accusations of racial profiling.
This isn’t a little problem. This is a huge problem, and it demands a lot more attention than it’s getting now, particularly in the community of scientists and engineers who design and apply these algorithms. It’s one thing to get cursed out by an AI, but wholly another when one puts you in jail, denies you a mortgage, or decides to audit you.
machine-learning  ml  algorithms  future  society  microsoft 
march 2016 by jm
Conversant ConcurrentQueue and Disruptor BlockingQueue
'Disruptor is the highest performing intra-thread transfer mechanism available in Java. Conversant Disruptor is the highest performing implementation of this type of ring buffer queue because it has almost no overhead and it exploits a particularly simple design.

Conversant has been using this in production since 2012 and the performance is excellent. The BlockingQueue implementation is very stable, although we continue to tune and improve it. The latest release, 1.2.4, is 100% production ready.

Although we have been working on it for a long time, we decided to open source our BlockingQueue this year to contribute something back to the community. ... it's a drop-in for BlockingQueue, so it's a very easy test. Conversant Disruptor will crush ArrayBlockingQueue and LinkedTransferQueue for thread to thread transfers.

In our system, we noticed a 10-20% reduction in overall system load and latency when we introduced it.'
disruptor  blocking-queues  queues  queueing  data-structures  algorithms  java  conversant  concurrency  performance 
march 2016 by jm
How to do distributed locking
A critique of the "Redlock" locking algorithm from Redis by Martin Kleppmann. antirez responds here: http://antirez.com/news/101 ; summary of followups: https://storify.com/martinkl/redlock-discussion
distributed  locking  redis  algorithms  coding  distcomp  redlock  martin-kleppman  zookeeper 
february 2016 by jm
research!rsc: Zip Files All The Way Down
quine.zip, quine.gz, and quine.tar.gz. Here's what happens when you mail it through bad AV software: https://twitter.com/FioraAeterna/status/694655296707297281
zip  algorithms  compression  quines  fun  hacks  gzip 
february 2016 by jm
Very Fast Reservoir Sampling
via Tony Finch. 'In this post I will demonstrate how to do reservoir sampling orders of magnitude faster than the traditional “naive” reservoir sampling algorithm, using a fast high-fidelity approximation to the reservoir sampling-gap distribution.'
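For contrast, the classic per-item "algorithm R" that the post speeds up (its trick is sampling the gap between acceptances instead of rolling a random number for every item):

    import random

    def reservoir_sample(stream, k):
        sample = []
        for i, x in enumerate(stream):
            if i < k:
                sample.append(x)          # fill the reservoir first
            else:
                j = random.randrange(i + 1)
                if j < k:
                    sample[j] = x         # keep with probability k/(i+1)
        return sample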
statistics  reservoir-sampling  sampling  algorithms  poisson  bernoulli  performance 
december 2015 by jm
[LUCENE-6917] Deprecate and rename NumericField/RangeQuery to LegacyNumeric - ASF JIRA
Interesting performance-related tweak going into Lucene -- based on the Bkd-Tree I think: https://users.cs.duke.edu/~pankaj/publications/papers/bkd-sstd.pdf . Being used for all numeric index types, not just multidimensional ones?
lucene  performance  algorithms  patches  bkd-trees  geodata  numeric  indexing 
december 2015 by jm
ELS: latency based load balancer, part 1
ELS measures the following things:

Success latency and success rate of each machine;
Number of outstanding requests between the load balancer and each machine. These are the requests that have been sent out but we haven’t yet received a reply;
Fast failures are better than slow failures, so we also measure failure latency for each machine.

Since users care a lot about latency, we prefer machines that are expected to answer quicker. ELS therefore converts all the measured metrics into expected latency from the client’s perspective.[...]

In short, the formula ensures that slower machines get less traffic and failing machines get much less traffic. Slower and failing machines still get some traffic, because we need to be able to detect when they come back up again.
latency  spotify  proxies  load-balancing  els  algorithms  c3  round-robin  load-balancers  routing 
december 2015 by jm
Accretion Disc Series - Clint Fulkerson
available as prints -- vector art with a hint of the bacterial
algorithms  art  graphics  vector  bacteria  petri-dish  clint-fulkerson 
november 2015 by jm
Cache-friendly binary search
by reordering items to optimize locality. Via aphyr's dad!
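The usual reordering is the Eytzinger (breadth-first) layout -- a small sketch of the idea, not the article's code (the hot top levels of the tree end up packed into the same few cache lines):

    def eytzinger(xs, out=None, i=0, k=1):
        """Re-lay a sorted list in breadth-first order (1-based heap indexing)."""
        if out is None:
            out = [None] * len(xs)
        if k <= len(xs):
            i = eytzinger(xs, out, i, 2 * k)        # left subtree
            out[k - 1] = xs[i]; i += 1              # this node
            i = eytzinger(xs, out, i, 2 * k + 1)    # right subtree
        return out if k == 1 else i

    def search(tree, x):
        k = 1
        while k <= len(tree):             # children of slot k are 2k and 2k+1
            if tree[k - 1] == x:
                return k - 1
            k = 2 * k + (x > tree[k - 1])
        return -1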
caches  cache-friendly  optimization  data-locality  performance  coding  algorithms 
november 2015 by jm
Caffeine cache adopts Window TinyLfu eviction policy
'Caffeine is a Java 8 rewrite of Guava's cache. In this version we focused on improving the hit rate by evaluating alternatives to the classic least-recently-used (LRU) eviction policy. In collaboration with researchers at Israel's Technion, we developed a new algorithm that matches or exceeds the hit rate of the best alternatives (ARC, LIRS). A paper of our work is being prepared for publication.'

Specifically:
W-TinyLfu uses a small admission LRU that evicts to a large Segmented LRU if accepted by the TinyLfu admission policy. TinyLfu relies on a frequency sketch to probabilistically estimate the historic usage of an entry. The window allows the policy to have a high hit rate when entries exhibit a high temporal / low frequency access pattern which would otherwise be rejected. The configuration enables the cache to estimate the frequency and recency of an entry with low overhead. This implementation uses a 4-bit CountMinSketch, growing at 8 bytes per cache entry to be accurate. Unlike ARC and LIRS, this policy does not retain non-resident keys.
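The admission decision itself is disarmingly simple -- sketched here with a dict standing in for the CountMinSketch:

    freq = {}                      # in Caffeine this is a 4-bit CountMinSketch

    def record(key):               # bump on every access
        freq[key] = freq.get(key, 0) + 1

    def admit(candidate, victim):
        """On eviction, the new entry must beat the LRU victim's history."""
        return freq.get(candidate, 0) > freq.get(victim, 0)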
tinylfu  caches  caching  cache-eviction  java8  guava  caffeine  lru  count-min  sketching  algorithms 
november 2015 by jm
How Facebook avoids failures
Great paper from Ben Maurer of Facebook in ACM Queue.
A "move-fast" mentality does not have to be at odds with reliability. To make these philosophies compatible, Facebook's infrastructure provides safety valves.


This is full of interesting techniques.

* Rapidly deployed configuration changes: Make everybody use a common configuration system; Statically validate configuration changes; Run a canary; Hold on to good configurations; Make it easy to revert.

* Hard dependencies on core services: Cache data from core services. Provide hardened APIs. Run fire drills.

* Increased latency and resource exhaustion: Controlled Delay (based on the anti-bufferbloat CoDel algorithm -- this is really cool; see the sketch after this list); Adaptive LIFO (last-in, first-out) for queue busting; Concurrency Control (essentially a form of circuit breaker).

* Tools that Help Diagnose Failures: High-Density Dashboards with Cubism (horizon charts); What just changed?

* Learning from Failure: the DERP (!) methodology.
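A loose sketch of the controlled-delay queueing idea, combined with the adaptive-LIFO dequeue (my illustration of the paper's description, with illustrative constants): timeouts shrink only when the queue has had no idle moment lately, which sheds standing queues but tolerates bursts.

    import time
    from collections import deque

    class CoDelQueue:
        TARGET_S, INTERVAL_S = 0.005, 0.100    # illustrative constants

        def __init__(self):
            self.q = deque()
            self.last_empty = time.monotonic()

        def put(self, req):
            if not self.q:
                self.last_empty = time.monotonic()
            standing = time.monotonic() - self.last_empty > self.INTERVAL_S
            deadline = self.TARGET_S if standing else self.INTERVAL_S
            self.q.append((req, time.monotonic() + deadline))

        def get(self):
            while self.q:
                req, deadline = self.q.pop()   # LIFO under load = adaptive LIFO
                if time.monotonic() <= deadline:
                    return req                 # expired requests are shed
            return None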
ben-maurer  facebook  reliability  algorithms  codel  circuit-breakers  derp  failure  ops  cubism  horizon-charts  charts  dependencies  soa  microservices  uptime  deployment  configuration  change-management 
november 2015 by jm
Apache Kafka, Purgatory, and Hierarchical Timing Wheels
In the new design, we use Hierarchical Timing Wheels for the timeout timer and DelayQueue of timer buckets to advance the clock on demand. Completed requests are removed from the timer queue immediately with O(1) cost. The buckets remain in the delay queue, however, the number of buckets is bounded. And, in a healthy system, most of the requests are satisfied before timeout, and many of the buckets become empty before pulled out of the delay queue. Thus, the timer should rarely have the buckets of the lower interval. The advantage of this design is that the number of requests in the timer queue is the number of pending requests exactly at any time. This allows us to estimate the number of requests need to be purged. We can avoid unnecessary purge operation of the watcher lists. As the result we achieve a higher scalability in terms of request rate with much better CPU usage.
algorithms  timers  kafka  scheduling  timing-wheels  delayqueue  queueing 
october 2015 by jm
The New InfluxDB Storage Engine: A Time Structured Merge Tree
The new engine has similarities with LSM Trees (like LevelDB and Cassandra’s underlying storage). It has a write ahead log, index files that are read only, and it occasionally performs compactions to combine index files. We’re calling it a Time Structured Merge Tree because the index files keep contiguous blocks of time and the compactions merge those blocks into larger blocks of time. Compression of the data improves as the index files are compacted. Once a shard becomes cold for writes it will be compacted into as few files as possible, which yield the best compression.
influxdb  storage  lsm-trees  leveldb  tsm-trees  data-structures  algorithms  time-series  tsd  compression 
october 2015 by jm
Introduction to HDFS Erasure Coding in Apache Hadoop
How Hadoop implemented erasure coding. Erasure-coding support ("HDFS-EC") is set to be released in Hadoop 3.0, apparently.
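HDFS-EC's codec is Reed-Solomon, but the simplest erasure code -- single-block XOR parity, as in RAID-5 -- shows the reconstruct-from-survivors idea in a few lines. A sketch, not the HDFS implementation:

```java
/**
 * Single-parity erasure coding (the RAID-5 idea): k data blocks plus one
 * XOR parity block survive the loss of any one block. Reed-Solomon
 * striping, as used by HDFS-EC, generalises this to tolerate several
 * simultaneous losses, but reconstruction has the same shape:
 * recompute the missing block from the survivors.
 */
class XorParity {
    /** Parity block = XOR of all data blocks (all blocks equal length). */
    static byte[] parity(byte[][] dataBlocks) {
        byte[] p = new byte[dataBlocks[0].length];
        for (byte[] block : dataBlocks)
            for (int i = 0; i < p.length; i++) p[i] ^= block[i];
        return p;
    }

    /** Rebuild the one lost data block by XOR-ing the survivors with the parity. */
    static byte[] reconstruct(byte[][] survivingDataBlocks, byte[] parity) {
        byte[] lost = parity.clone();
        for (byte[] block : survivingDataBlocks)
            for (int i = 0; i < lost.length; i++) lost[i] ^= block[i];
        return lost;
    }
}
```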
erasure-coding  reed-solomon  algorithms  hadoop  hdfs  cloudera  raid  storage 
september 2015 by jm
Brotli: a new compression algorithm for the internet from Google
While Zopfli is Deflate-compatible, Brotli is a whole new data format. This new format allows us to get 20–26% higher compression ratios over Zopfli. In our study ‘Comparison of Brotli, Deflate, Zopfli, LZMA, LZHAM and Bzip2 Compression Algorithms’ we show that Brotli is roughly as fast as zlib’s Deflate implementation. At the same time, it compresses slightly more densely than LZMA and bzip2 on the Canterbury corpus. The higher data density is achieved by a 2nd order context modeling, re-use of entropy codes, larger memory window of past data and joint distribution codes. Just like Zopfli, the new algorithm is named after Swiss bakery products. Brötli means ‘small bread’ in Swiss German.
brotli  zopfli  deflate  gzip  compression  algorithms  swiss  google 
september 2015 by jm
The Pixel Factory
An amazing slideshow/WebGL demo covering graphics programming, the maths behind it, and GPUs.
maths  graphics  webgl  demos  coding  algorithms  slides  tflops  gpus 
september 2015 by jm
Algorithmist
The Algorithmist is a resource dedicated to anything algorithms - from the practical realm to the theoretical realm. There are also links and explanations for problem sets.


A wiki for algorithms. Not sure this is likely to improve on Wikipedia, though, which of course covers the same subject matter quite well.
algorithms  reference  wikis  coding  data-structures 
september 2015 by jm
Mining High-Speed Data Streams: The Hoeffding Tree Algorithm
This paper proposes a decision tree learner for data streams, the Hoeffding Tree algorithm, which comes with the guarantee that the learned decision tree is asymptotically nearly identical to that of a non-incremental learner using infinitely many examples. This work constitutes a significant step in developing methodology suitable for modern ‘big data’ challenges and has initiated a lot of follow-up research. The Hoeffding Tree algorithm has been covered in various textbooks and is available in several public domain tools, including the WEKA Data Mining platform.
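The guarantee rests on the Hoeffding bound: after n independent observations of a variable with range r, the sample mean lies within epsilon of the true mean with probability 1 - delta. The split test is then tiny (names mine):

```java
/** The Hoeffding bound at the heart of the split decision. */
class HoeffdingBound {
    /**
     * With probability 1 - delta, after n independent observations the
     * sample mean of a variable with range r is within epsilon of the
     * true mean: epsilon = sqrt(r^2 * ln(1/delta) / (2n)).
     */
    static double epsilon(double r, double delta, long n) {
        return Math.sqrt((r * r * Math.log(1.0 / delta)) / (2.0 * n));
    }
    // The tree splits on the best attribute once its observed information
    // gain beats the runner-up's by more than epsilon: at that point, seeing
    // more examples can no longer change the winner (with prob. 1 - delta).
}
```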
hoeffding-tree  algorithms  data-structures  streaming  streams  cep  decision-trees  ml  learning  papers 
august 2015 by jm
Sorting out graph processing
Some nice real-world experimentation around large-scale data processing in differential dataflow:
If you wanted to do an iterative graph computation like PageRank, it would literally be faster to sort the edges from scratch each and every iteration, than to use unsorted edges. If you want to do graph computation, please sort your edges.

Actually, you know what: if you want to do any big data computation, please sort your records. Stop talking sass about how Hadoop sorts things it doesn't need to, read some papers, run some tests, and then sort your damned data. Or at least run faster than me when I sort your data for you.
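The sort he's recommending is cheap: a least-significant-digit radix sort over edge sources is a few linear passes, after which each PageRank-style iteration streams memory sequentially instead of chasing random vertices. A sketch over parallel src/dst arrays (names mine; assumes non-negative vertex IDs):

```java
/**
 * LSD radix sort of an edge list (parallel src/dst arrays) by source ID:
 * four counting-sort passes, one per byte, each a sequential scan.
 */
class EdgeSort {
    static void sortBySrc(int[] src, int[] dst) {
        int n = src.length;
        int[] srcBuf = new int[n], dstBuf = new int[n];
        for (int shift = 0; shift < 32; shift += 8) {
            int[] count = new int[257];
            for (int i = 0; i < n; i++) count[((src[i] >>> shift) & 0xFF) + 1]++;
            for (int b = 0; b < 256; b++) count[b + 1] += count[b]; // prefix sums
            for (int i = 0; i < n; i++) {                 // stable scatter pass
                int pos = count[(src[i] >>> shift) & 0xFF]++;
                srcBuf[pos] = src[i];
                dstBuf[pos] = dst[i];
            }
            int[] t = src; src = srcBuf; srcBuf = t;      // ping-pong buffers;
            t = dst; dst = dstBuf; dstBuf = t;            // 4 passes end in the originals
        }
    }
}
```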
algorithms  graphs  coding  data-processing  big-data  differential-dataflow  radix-sort  sorting  x-stream  counting-sort  pagerank 
august 2015 by jm
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Extremely authoritative slide deck on building a recommendation system, from Xavier Amatriain, Research/Engineering Manager at Netflix.
netflix  recommendations  recommenders  ml  machine-learning  cmu  clustering  algorithms 
august 2015 by jm
choco-solver.org
Choco is [FOSS] dedicated to Constraint Programming. It is a Java library written under the BSD license. It aims at describing hard combinatorial problems in the form of Constraint Satisfaction Problems and solving them with Constraint Programming techniques. The user models their problem in a declarative way by stating the set of constraints that need to be satisfied in every solution. Then, Choco solves the problem by alternating constraint-filtering algorithms with a search mechanism. [...]
Choco is among the fastest CP solvers on the market. In 2013 and 2014, Choco was awarded several medals at the MiniZinc challenge, the worldwide competition of constraint-programming solvers.
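Declarative modelling in Choco looks roughly like this -- written against my recollection of the Choco 4 API (org.chocosolver.solver), so treat the method names as assumptions to check against the docs:

```java
// Assumed API: Choco 4's org.chocosolver.solver.Model -- verify versions.
import org.chocosolver.solver.Model;
import org.chocosolver.solver.variables.IntVar;

public class TinyChocoExample {
    public static void main(String[] args) {
        Model model = new Model("x + y = 10, with x < y");
        IntVar x = model.intVar("x", 0, 10);     // domain [0, 10]
        IntVar y = model.intVar("y", 0, 10);
        model.arithm(x, "+", y, "=", 10).post(); // declare constraints...
        model.arithm(x, "<", y).post();          // ...no search code needed
        while (model.getSolver().solve()) {      // enumerate all solutions
            System.out.printf("x=%d y=%d%n", x.getValue(), y.getValue());
        }
    }
}
```

Each call to solve() yields the next solution; internally the solver interleaves constraint propagation (filtering) with backtracking search, exactly the alternation the description mentions.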
choco  constraint-programming  solving  search  combinatorial  algorithms 
august 2015 by jm
