Learning to Hash for Indexing Big Data—A Survey - IEEE Journals & Magazine

december 2017 by arsyed

The explosive growth in Big Data has attracted much attention in designing efficient indexing and search methods recently. In many critical applications such as large-scale search and pattern matching, finding the nearest neighbors to a query is a fundamental research problem. However, the straightforward solution using exhaustive comparison is infeasible due to the prohibitive computational complexity and memory requirement. In response, approximate nearest neighbor (ANN) search based on hashing techniques has become popular due to its promising performance in both efficiency and accuracy. Prior randomized hashing methods, e.g., locality-sensitive hashing (LSH), explore data-independent hash functions with random projections or permutations. Although having elegant theoretic guarantees on the search quality in certain metric spaces, performance of randomized hashing has been shown insufficient in many real-world applications. As a remedy, new approaches incorporating data-driven learning methods in development of advanced hash functions have emerged. Such learning-to-hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions. Importantly, the learned hash codes are able to preserve the proximity of neighboring data in the original feature spaces in the hash code spaces. The goal of this paper is to provide readers with systematic understanding of insights, pros, and cons of the emerging techniques. We provide a comprehensive survey of the learning-to-hash framework and representative techniques of various types, including unsupervised, semisupervised, and supervised. In addition, we also summarize recent hashing approaches utilizing the deep learning models. Finally, we discuss the future direction and trends of research in this area.
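The data-independent random-projection hashing the abstract contrasts with learned hashes can be sketched in a few lines. Everything below (dimensions, bit counts, vectors) is illustrative rather than taken from the survey:

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes; each contributes one code bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def hash_code(vec, planes):
    """Bit b is 1 iff vec lies on the positive side of hyperplane b."""
    return tuple(1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
                 for plane in planes)

def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))

planes = make_hyperplanes(dim=4, n_bits=8)
vec = [1.0, 0.9, -0.2, 0.3]
near = [1.1, 1.0, -0.1, 0.2]              # a near-duplicate of vec
a, b = hash_code(vec, planes), hash_code(near, planes)
c = hash_code([-x for x in vec], planes)  # exactly the opposite direction
# Nearby vectors collide on most bits; the opposite vector flips every bit.
```

Learning-to-hash methods replace these random hyperplanes with projections fit to the data or labels, but the code-and-lookup machinery stays the same.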

papers
surveys
learning-to-hash
hashing
indexing
searching
december 2017 by arsyed

[1710.06993] Improved Search in Hamming Space using Deep Multi-Index Hashing

november 2017 by arsyed

"Similarity-preserving hashing is a widely-used method for nearest neighbour search in large-scale image retrieval tasks. There has been considerable research on generating efficient image representation via the deep-network-based hashing methods. However, the issue of efficient searching in the deep representation space remains largely unsolved. To this end, we propose a simple yet efficient deep-network-based multi-index hashing method for simultaneously learning the powerful image representation and the efficient searching. To achieve these two goals, we introduce the multi-index hashing (MIH) mechanism into the proposed deep architecture, which divides the binary codes into multiple substrings. Due to the non-uniformly distributed codes will result in inefficiency searching, we add the two balanced constraints at feature-level and instance-level, respectively. Extensive evaluations on several benchmark image retrieval datasets show that the learned balanced binary codes bring dramatic speedups and achieve comparable performance over the existing baselines."

papers
hashing
search
hamming
via:vaguery
november 2017 by arsyed

[1706.03993] Getting deep recommenders fit: Bloom embeddings for sparse binary input/output networks

july 2017 by arsyed

"Recommendation algorithms that incorporate techniques from deep learning are becoming increasingly popular. Due to the structure of the data coming from recommendation domains (i.e., one-hot-encoded vectors of item preferences), these algorithms tend to have large input and output dimensionalities that dominate their overall size. This makes them difficult to train, due to the limited memory of graphical processing units, and difficult to deploy on mobile devices with limited hardware. To address these difficulties, we propose Bloom embeddings, a compression technique that can be applied to the input and output of neural network models dealing with sparse high-dimensional binary-coded instances. Bloom embeddings are computationally efficient, and do not seriously compromise the accuracy of the model up to 1/5 compression ratios. In some cases, they even improve over the original accuracy, with relative increases up to 12%. We evaluate Bloom embeddings on 7 data sets and compare it against 4 alternative methods, obtaining favorable results. We also discuss a number of further advantages of Bloom embeddings, such as 'on-the-fly' constant-time operation, zero or marginal space requirements, training time speedups, or the fact that they do not require any change to the core model architecture or training configuration."

papers
recsys
dimensionality-reduction
hashing
july 2017 by arsyed

[1509.05472] Learning to Hash for Indexing Big Data - A Survey

september 2015 by arsyed

"The explosive growth in big data has attracted much attention in designing efficient indexing and search methods recently. In many critical applications such as large-scale search and pattern matching, finding the nearest neighbors to a query is a fundamental research problem. However, the straightforward solution using exhaustive comparison is infeasible due to the prohibitive computational complexity and memory requirement. In response, Approximate Nearest Neighbor (ANN) search based on hashing techniques has become popular due to its promising performance in both efficiency and accuracy. Prior randomized hashing methods, e.g., Locality-Sensitive Hashing (LSH), explore data-independent hash functions with random projections or permutations. Although having elegant theoretic guarantees on the search quality in certain metric spaces, performance of randomized hashing has been shown insufficient in many real-world applications. As a remedy, new approaches incorporating data-driven learning methods in development of advanced hash functions have emerged. Such learning to hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions. Importantly, the learned hash codes are able to preserve the proximity of neighboring data in the original feature spaces in the hash code spaces. The goal of this paper is to provide readers with systematic understanding of insights, pros and cons of the emerging techniques. We provide a comprehensive survey of the learning to hash framework and representative techniques of various types, including unsupervised, semi-supervised, and supervised. In addition, we also summarize recent hashing approaches utilizing the deep learning models. Finally, we discuss the future direction and trends of research in this area."

papers
surveys
algorithms
hashing
machine-learning
learning-to-hash
september 2015 by arsyed

[0902.2206] Feature Hashing for Large Scale Multitask Learning

july 2015 by arsyed

Empirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation. In this paper we provide exponential tail bounds for feature hashing and show that the interaction between random subspaces is negligible with high probability. We demonstrate the feasibility of this approach with experimental results for a new use case -- multitask learning with hundreds of thousands of tasks.
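The feature-hashing scheme the paper analyzes can be sketched in a few lines: each feature name is hashed to a bucket, and a second hash supplies a +/-1 sign so that collisions cancel in expectation. Bucket count and hash choices below are illustrative:

```python
import hashlib

def hashed_features(feats, n_buckets=16):
    """Map {feature_name: value} into a fixed-size dense vector."""
    vec = [0.0] * n_buckets
    for name, value in feats.items():
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        bucket = h % n_buckets
        sign = 1.0 if (h >> 64) % 2 == 0 else -1.0  # unbiases collisions
        vec[bucket] += sign * value
    return vec

# Arbitrary string-valued features land in a fixed 16-dim space,
# with no dictionary to store or synchronize across tasks.
x = hashed_features({"word:the": 3.0, "word:hash": 1.0, "user:42": 1.0})
```

For multitask learning, each task can prepend a task id to its feature names so all tasks share one hashed parameter space, which is the use case the paper's tail bounds address.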

papers
machine-learning
hashing
feature-hashing
multitask-learning
july 2015 by arsyed

Compressing Neural Networks with the Hashing Trick | ICML 2015 | JMLR W&CP

june 2015 by arsyed

As deep nets are increasingly used in applications suited for mobile devices, a fundamental dilemma becomes apparent: the trend in deep learning is to grow models to absorb ever-increasing data set sizes; however mobile devices are designed with very little memory and cannot store such large models. We present a novel network architecture, HashedNets, that exploits inherent redundancy in neural networks to achieve drastic reductions in model sizes. HashedNets uses a low-cost hash function to randomly group connection weights into hash buckets, and all connections within the same hash bucket share a single parameter value. These parameters are tuned to adjust to the HashedNets weight sharing architecture with standard backprop during training. Our hashing procedure introduces no additional memory overhead, and we demonstrate on several benchmark data sets that HashedNets shrink the storage requirements of neural networks substantially while mostly preserving generalization performance.
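The weight-sharing trick in the abstract can be sketched as follows: the virtual weight matrix is never stored, and each (i, j) entry is hashed to one of K real parameters, so memory is O(K) regardless of layer size. Layer shape, K, and the hash are illustrative, not the paper's settings:

```python
import hashlib

K = 8                                  # number of real, trainable parameters
params = [0.1 * k for k in range(K)]   # stand-in for learned values

def bucket(i, j):
    """Low-cost hash mapping a virtual weight position to a bucket."""
    return int(hashlib.md5(f"{i},{j}".encode()).hexdigest(), 16) % K

def virtual_weight(i, j):
    return params[bucket(i, j)]

def layer_forward(x):
    """y = W x for a virtual 4x4 weight matrix backed by only K params."""
    return [sum(virtual_weight(i, j) * x[j] for j in range(4))
            for i in range(4)]

y = layer_forward([1.0, 0.0, 2.0, -1.0])
```

During training, the gradient for each shared parameter is the sum of the gradients of all virtual weights hashed to its bucket, which is why standard backprop applies unchanged.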

papers
hashing
hashing-trick
neural-net
model-compression
machine-learning
june 2015 by arsyed

Hashing for Distributed Data | ICML 2015 | JMLR W&CP

june 2015 by arsyed

Recently, hashing based approximate nearest neighbors search has attracted much attention. Extensive centralized hashing algorithms have been proposed and achieved promising performance. However, due to the large scale of many applications, the data is often stored or even collected in a distributed manner. Learning hash functions by aggregating all the data into a fusion center is infeasible because of the prohibitively expensive communication and computation overhead. In this paper, we develop a novel hashing model to learn hash functions in a distributed setting. We cast a centralized hashing model as a set of subproblems with consensus constraints. We find these subproblems can be analytically solved in parallel on the distributed compute nodes. Since no training data is transmitted across the nodes in the learning process, the communication cost of our model is independent to the data size. Extensive experiments on several large scale datasets containing up to 100 million samples demonstrate the efficacy of our method.

papers
algorithms
hashing
knn
approximation
parallel
scaling
june 2015 by arsyed

[1504.04788] Compressing Neural Networks with the Hashing Trick

april 2015 by arsyed

As deep nets are increasingly used in applications suited for mobile devices, a fundamental dilemma becomes apparent: the trend in deep learning is to grow models to absorb ever-increasing data set sizes; however mobile devices are designed with very little memory and cannot store such large models. We present a novel network architecture, HashedNets, that exploits inherent redundancy in neural networks to achieve drastic reductions in model sizes. HashedNets uses a low-cost hash function to randomly group connection weights into hash buckets, and all connections within the same hash bucket share a single parameter value. These parameters are tuned to adjust to the HashedNets weight sharing architecture with standard backprop during training. Our hashing procedure introduces no additional memory overhead, and we demonstrate on several benchmark data sets that HashedNets shrink the storage requirements of neural networks substantially while mostly preserving generalization performance.

papers
neural-net
compression
hashing
hashing-trick
model-compression
april 2015 by arsyed

Large-Scale Speaker Identification

february 2015 by arsyed

"Speaker identification is one of the main tasks in speech processing. In addition to identification accuracy, large-scale applications of speaker identification give rise to another challenge: fast search in the database of speakers. In this paper, we propose a system based on i-vectors, a current approach for speaker identification, and locality sensitive hashing, an algorithm for fast nearest neighbor search in high dimensions. The connection between the two techniques is the cosine distance: on the one hand, we use the cosine distance to compare i-vectors, on the other hand, locality sensitive hashing allows us to quickly approximate the cosine distance in our retrieval procedure. We evaluate our approach on a realistic data set from YouTube with about 1,000 speakers. The results show that our algorithm is approximately one to two orders of magnitude faster than a linear search while maintaining the identification accuracy of an i-vector-based system."

papers
speaker-recognition
i-vector
scaling
searching
indexing
knn
hashing
february 2015 by arsyed

Performance comparison among LSH Forest, ANNOY and FLANN

august 2014 by arsyed

"It is evident that index building times of LSH Forest and FLANN are almost incomparable to that of Annoy for almost all the datasets. Moreover, for larger datasets, LSH Forest outperforms Annoy at large margins with respect to accuracy and query speed. Observations from these graphs prove that LSH Forest is competitive with FLANN for large datasets."

python
sklearn
knn
lsh
flann
annoy
hashing
indexing
performance
comparison
august 2014 by arsyed

andrewclegg/sketchy

july 2014 by arsyed

"Simple approximate-nearest-neighbours in Python using locality sensitive hashing."

python
libs
knn
lsh
hashing
approximate-nearest-neighbors
july 2014 by arsyed

Mohammad Norouzi

may 2014 by arsyed

Hamming Distance Metric Learning

people
research
computer-vision
machine-learning
metric-learning
hamming
knn
hashing
may 2014 by arsyed

Feature hashing for large scale multitask learning

april 2014 by arsyed

"Empirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation. In this paper we provide exponential tail bounds for feature hashing and show that the interaction between random subspaces is negligible with high probability. We demonstrate the feasibility of this approach with experimental results for a new use case --- multitask learning with hundreds of thousands of tasks."

papers
hashing
machine-learning
multitask-learning
april 2014 by arsyed

yahoo/Optimal-LSH

march 2014 by arsyed

"This package provides code to implement locality-sensitive hashing (LSH)

in an optimum fashion.

There are two pieces. A Python library that implements LSH and a Matlab

routine that calculates the optimum parameters for LSH."

hashing
lsh
code
python
matlab
performance
march 2014 by arsyed

StefanKarpinski/bam

january 2014 by arsyed

"Bam: fast, simple TSV key-value server"

software
database
cache
hashing
key-value
c
january 2014 by arsyed

Learning Hash Functions Using Column Generation | ICML 2013 | JMLR W&CP

november 2013 by arsyed

"Fast nearest neighbor searching is becoming an increasingly important tool in solving many large-scale problems. Recently a number of approaches to learning data-dependent hash functions have been developed. In this work, we propose a column generation based method for learning data-dependent hash functions on the basis of proximity comparison information. Given a set of triplets that encode the pairwise proximity comparison information, our method learns hash functions that preserve the relative comparison relationships in the data as well as possible within the large-margin learning framework. The learning procedure is implemented using column generation and hence is named CGHash. At each iteration of the column generation procedure, the best hash function is selected. Unlike most other hashing methods, our method generalizes to new data points naturally; and has a training objective which is convex, thus ensuring that the global optimum can be identified. Experiments demonstrate that the proposed method learns compact binary codes and that its retrieval performance compares favorably with state-of-the-art methods when tested on a few benchmark datasets."

papers
machine-learning
hashing
margin-methods
knn
november 2013 by arsyed

Efficient spoken term discovery using randomized algorithms (Jansen, A., Van Durme, B.)

july 2013 by arsyed

"Spoken term discovery is the task of automatically identifying words and phrases in speech data by searching for long repeated acoustic patterns. Initial solutions relied on exhaustive dynamic time warping-based searches across the entire similarity matrix, a method whose scalability is ultimately limited by the O(n2) nature of the search space. Recent strategies have attempted to improve search efficiency by using either unsupervised or mismatched-language acoustic models to reduce the complexity of the feature representation. Taking a completely different approach, this paper investigates the use of randomized algorithms that operate directly on the raw acoustic features to produce sparse approximate similarity matrices in O(n) space and O(n log n) time. We demonstrate these techniques facilitate spoken term discovery performance capable of outperforming a model-based strategy in the zero resource setting."

papers
spoken-term-discovery
speech
asr
lsh
dtw
hashing
july 2013 by arsyed

Nearest Neighbor Retrieval Using Distance-Based Hashing

july 2013 by arsyed

"A method is proposed for indexing spaces with arbitrary distance measures, so as to achieve efficient approximate nearest neighbor retrieval. Hashing methods, such as Locality Sensitive Hashing (LSH), have been successfully applied for similarity indexing in vector spaces and string spaces under the Hamming distance. The key novelty of the hashing technique proposed here is that it can be applied to spaces with arbitrary distance measures, including non-metric distance measures. First, we describe a domain-independent method for constructing a family of binary hash functions. Then, we use these functions to construct multiple multibit hash tables. We show that the LSH formalism is not applicable for analyzing the behavior of these tables as index structures. We present a novel formulation, that uses statistical observations from sample data to analyze retrieval accuracy and efficiency for the proposed indexing method. Experiments on several real-world data sets demonstrate that our method produces good trade-offs between accuracy and efficiency, and significantly outperforms VP-trees, which are a well-known method for distance-based indexing."

papers
knn
search
distance
hashing
dtw
july 2013 by arsyed

Indexing Raw Acoustic Features for Scalable Zero Resource Search

july 2013 by arsyed

"We present a new speech indexing and search scheme called Randomized Acoustic Indexing and Logarithmic-time Search (RAILS) that enables scalable query-by-example spoken term detection in the zero resource regime. RAILS is derived from our recent investigation into the application of randomized hashing and approximate nearest neighbor search algorithms to raw acoustic features. Our approach permits an approximate search through hundreds of hours of speech audio in a matter of seconds, and may be applied to any language without the need of a training corpus, acoustic model, or pronunciation lexicon. The fidelity of the approximation is controlled through a small number of easily interpretable parameters that allow a trade-off between search accuracy and speed."

hashing
indexing
dtw
lsh
speech
algorithms
performance
july 2013 by arsyed

Don’t Hash Secrets (Ben Adida)

february 2010 by arsyed

"So, again, don’t hash secrets. HMAC them."

security
hashing
encryption
february 2010 by arsyed

why publishing your email's hash is not a good idea

january 2010 by arsyed

"Running my program on a list of 80871 users I was able to extract 8597 email addresses, associated to their users. This means that for a bit more than 10% of the users, the username and the gravatar URL are enough to deduce the email address they used to register to the website."

email
hashing
security
spam
january 2010 by arsyed

Hash Table with Preemptive Bloom Filter (Michael Mitzenmacher)

march 2009 by arsyed

"Before you do a lookup into the table, you check the Bloom filter. A false positive in the Bloom filter means you do a lookup when you didn't have to, but this allows you to avoid a lookup for a key that's not in your hash table the large majority of the time. The approach works for any type of hash table."

bloom
filter
hashing
performance
march 2009 by arsyed

Astrometry.net

february 2009 by arsyed

"If you have astronomical imaging of the sky with celestial coordinates you do not know— or do not trust — then Astrometry.net is for you. Input an image and we'll give you back astrometric calibration meta-data, plus lists of known objects falling inside the field of view."

astronomy
astrometry
image
dip
geometric
hashing
february 2009 by arsyed

Anti-RDBMS: A list of distributed key-value stores (Richard Jones)

january 2009 by arsyed

"This article represents my notes and research to date on distributed key-value stores (and some other stuff) that might be suitable as RDBMS replacements under the right conditions."

database
keyvalue
hashing
comparison
january 2009 by arsyed

tuulos's ringo at master — GitHub

december 2008 by arsyed

"Ringo is an experimental, distributed, replicating key-value store based on consistent hashing and immutable data. Unlike many general-purpose databases, Ringo is designed for a specific use case: For archiving small (less than 4KB) or medium-size data items (<100MB) in real-time so that the data can survive K - 1 disk breaks, where K is the desired number of replicas, without any downtime, in a manner that scales to terabytes of data. ... An experimental interface for Disco .. which implements the Disco's map reader interface, makes it possible to use data stored in Ringo as input to a Disco job. "

database
distributed
consistent
hashing
mapReduce
erlang
python
caching
december 2008 by arsyed

The Hash Function Lounge

november 2008 by arsyed

"The goal of this site is to give an overview of hash functions intended for cryptographic applications."

algorithms
hashing
crypto
november 2008 by arsyed
