arsyed + hashing (65 bookmarks)

Learning to Hash for Indexing Big Data—A Survey - IEEE Journals & Magazine
The explosive growth in Big Data has recently attracted much attention to the design of efficient indexing and search methods. In many critical applications such as large-scale search and pattern matching, finding the nearest neighbors to a query is a fundamental research problem. However, the straightforward solution using exhaustive comparison is infeasible due to the prohibitive computational complexity and memory requirement. In response, approximate nearest neighbor (ANN) search based on hashing techniques has become popular due to its promising performance in both efficiency and accuracy. Prior randomized hashing methods, e.g., locality-sensitive hashing (LSH), explore data-independent hash functions with random projections or permutations. Although they come with elegant theoretical guarantees on search quality in certain metric spaces, randomized hashing methods have proven insufficient in many real-world applications. As a remedy, new approaches incorporating data-driven learning methods into the development of advanced hash functions have emerged. Such learning-to-hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions. Importantly, the learned hash codes preserve, in the hash code space, the proximity of neighboring data in the original feature space. The goal of this paper is to provide readers with a systematic understanding of the insights, pros, and cons of the emerging techniques. We provide a comprehensive survey of the learning-to-hash framework and representative techniques of various types, including unsupervised, semisupervised, and supervised. In addition, we summarize recent hashing approaches that utilize deep learning models. Finally, we discuss future directions and trends of research in this area.
papers  surveys  learning-to-hash  hashing  indexing  searching 
december 2017 by arsyed
[1710.06993] Improved Search in Hamming Space using Deep Multi-Index Hashing
"Similarity-preserving hashing is a widely-used method for nearest neighbour search in large-scale image retrieval tasks. There has been considerable research on generating efficient image representation via the deep-network-based hashing methods. However, the issue of efficient searching in the deep representation space remains largely unsolved. To this end, we propose a simple yet efficient deep-network-based multi-index hashing method for simultaneously learning the powerful image representation and the efficient searching. To achieve these two goals, we introduce the multi-index hashing (MIH) mechanism into the proposed deep architecture, which divides the binary codes into multiple substrings. Due to the non-uniformly distributed codes will result in inefficiency searching, we add the two balanced constraints at feature-level and instance-level, respectively. Extensive evaluations on several benchmark image retrieval datasets show that the learned balanced binary codes bring dramatic speedups and achieve comparable performance over the existing baselines."
papers  hashing  search  hamming  via:vaguery 
november 2017 by arsyed
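
A minimal sketch of the multi-index hashing lookup the abstract above builds on (generic MIH, not the paper's deep network; all names and parameters are illustrative): split each binary code into m substrings and index each substring in its own hash table. By the pigeonhole principle, any code within Hamming distance r < m of the query matches it exactly in at least one substring, so the substring tables yield a small candidate set to verify.

```python
from collections import defaultdict

def split_code(code, m):
    """Split a bit string such as '01101100' into m equal substrings."""
    step = len(code) // m
    return [code[i * step:(i + 1) * step] for i in range(m)]

class MultiIndexHash:
    def __init__(self, m=4):
        self.m = m
        self.tables = [defaultdict(list) for _ in range(m)]  # one table per substring
        self.codes = []

    def add(self, code):
        idx = len(self.codes)
        self.codes.append(code)
        for table, sub in zip(self.tables, split_code(code, self.m)):
            table[sub].append(idx)

    def query(self, code, radius=2):
        # Candidates are codes that collide with the query in at least one substring;
        # the full Hamming distance is verified only on this small set.
        candidates = set()
        for table, sub in zip(self.tables, split_code(code, self.m)):
            candidates.update(table[sub])
        hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
        return [i for i in candidates if hamming(self.codes[i], code) <= radius]
```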
[1706.03993] Getting deep recommenders fit: Bloom embeddings for sparse binary input/output networks
"Recommendation algorithms that incorporate techniques from deep learning are becoming increasingly popular. Due to the structure of the data coming from recommendation domains (i.e., one-hot-encoded vectors of item preferences), these algorithms tend to have large input and output dimensionalities that dominate their overall size. This makes them difficult to train, due to the limited memory of graphical processing units, and difficult to deploy on mobile devices with limited hardware. To address these difficulties, we propose Bloom embeddings, a compression technique that can be applied to the input and output of neural network models dealing with sparse high-dimensional binary-coded instances. Bloom embeddings are computationally efficient, and do not seriously compromise the accuracy of the model up to 1/5 compression ratios. In some cases, they even improve over the original accuracy, with relative increases up to 12%. We evaluate Bloom embeddings on 7 data sets and compare it against 4 alternative methods, obtaining favorable results. We also discuss a number of further advantages of Bloom embeddings, such as 'on-the-fly' constant-time operation, zero or marginal space requirements, training time speedups, or the fact that they do not require any change to the core model architecture or training configuration."
papers  recsys  dimensionality-reduction  hashing 
july 2017 by arsyed
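
A minimal sketch of the input-side Bloom embedding idea described above: each active index of a sparse one-hot/multi-hot vector is hashed k times into a much smaller binary vector, exactly as a Bloom filter records set membership. The function name, dimensions, and use of Python's built-in hash are illustrative, and the paper's output-side decoding is not shown.

```python
import numpy as np

def bloom_embed(active_indices, dim=1000, k=2, seed=0):
    """Hash each active index k times into a dim-sized binary vector (Bloom-filter style)."""
    v = np.zeros(dim, dtype=np.float32)
    for idx in active_indices:
        for j in range(k):
            v[hash((seed, j, idx)) % dim] = 1.0
    return v

# e.g. a user who interacted with items 7, 42 and 90210 out of millions of items:
x = bloom_embed([7, 42, 90210], dim=1000, k=2)
```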
[1509.05472] Learning to Hash for Indexing Big Data - A Survey
"The explosive growth in big data has attracted much attention in designing efficient indexing and search methods recently. In many critical applications such as large-scale search and pattern matching, finding the nearest neighbors to a query is a fundamental research problem. However, the straightforward solution using exhaustive comparison is infeasible due to the prohibitive computational complexity and memory requirement. In response, Approximate Nearest Neighbor (ANN) search based on hashing techniques has become popular due to its promising performance in both efficiency and accuracy. Prior randomized hashing methods, e.g., Locality-Sensitive Hashing (LSH), explore data-independent hash functions with random projections or permutations. Although having elegant theoretic guarantees on the search quality in certain metric spaces, performance of randomized hashing has been shown insufficient in many real-world applications. As a remedy, new approaches incorporating data-driven learning methods in development of advanced hash functions have emerged. Such learning to hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions. Importantly, the learned hash codes are able to preserve the proximity of neighboring data in the original feature spaces in the hash code spaces. The goal of this paper is to provide readers with systematic understanding of insights, pros and cons of the emerging techniques. We provide a comprehensive survey of the learning to hash framework and representative techniques of various types, including unsupervised, semi-supervised, and supervised. In addition, we also summarize recent hashing approaches utilizing the deep learning models. Finally, we discuss the future direction and trends of research in this area."
papers  surveys  algorithms  hashing  machine-learning  learning-to-hash 
september 2015 by arsyed
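
As a point of contrast with the learned hash functions surveyed above, here is a minimal sketch of the data-independent LSH scheme the abstract mentions: each bit of the code is the sign of a projection onto a random hyperplane (sign-random-projection LSH for cosine similarity). Dimensions and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_planes(dim, n_bits):
    """Data-independent LSH for cosine similarity: one random hyperplane per bit."""
    return rng.normal(size=(n_bits, dim))

def hash_code(planes, x):
    """Each bit is the sign of the projection of x onto one random hyperplane."""
    return (planes @ x >= 0).astype(np.uint8)

planes = make_planes(dim=128, n_bits=32)
a, b = rng.normal(size=128), rng.normal(size=128)
# Similar vectors collide in many bits: the expected fraction of differing bits
# equals angle(a, b) / pi, so the Hamming distance tracks the cosine similarity.
print(np.sum(hash_code(planes, a) != hash_code(planes, b)))
```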
[0902.2206] Feature Hashing for Large Scale Multitask Learning
Empirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation. In this paper we provide exponential tail bounds for feature hashing and show that the interaction between random subspaces is negligible with high probability. We demonstrate the feasibility of this approach with experimental results for a new use case -- multitask learning with hundreds of thousands of tasks.
papers  machine-learning  hashing  feature-hashing  multitask-learning 
july 2015 by arsyed
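
A minimal sketch of the feature-hashing trick analyzed above: each (optionally task-prefixed) feature name is hashed to an index, and a second hash supplies a ±1 sign so that collisions tend to cancel in inner products. The hash functions and naming scheme here are illustrative, not the paper's exact construction.

```python
import zlib

def hashed_features(token_counts, dim=2**18, task=""):
    """Hash named features into a fixed-size vector; prefixing the task name gives
    each task its own region of the shared hashed space (the multitask use case)."""
    v = [0.0] * dim
    for name, count in token_counts.items():
        key = (task + "::" + name).encode()
        idx = zlib.crc32(key) % dim                          # slot the feature lands in
        sign = 1.0 if zlib.adler32(key) % 2 == 0 else -1.0   # +/-1 sign hash
        v[idx] += sign * count
    return v

x = hashed_features({"word:spam": 3, "word:viagra": 1}, task="user123")
```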
Hashing for Distributed Data | ICML 2015 | JMLR W&CP
Recently, hashing-based approximate nearest neighbor search has attracted much attention. Many centralized hashing algorithms have been proposed and have achieved promising performance. However, due to the large scale of many applications, the data is often stored or even collected in a distributed manner. Learning hash functions by aggregating all the data into a fusion center is infeasible because of the prohibitively expensive communication and computation overhead. In this paper, we develop a novel hashing model to learn hash functions in a distributed setting. We cast a centralized hashing model as a set of subproblems with consensus constraints. We find that these subproblems can be solved analytically in parallel on the distributed compute nodes. Since no training data is transmitted across the nodes in the learning process, the communication cost of our model is independent of the data size. Extensive experiments on several large-scale datasets containing up to 100 million samples demonstrate the efficacy of our method.
papers  algorithms  hashing  knn  approximation  parallel  scaling 
june 2015 by arsyed
[1504.04788] Compressing Neural Networks with the Hashing Trick
As deep nets are increasingly used in applications suited for mobile devices, a fundamental dilemma becomes apparent: the trend in deep learning is to grow models to absorb ever-increasing data set sizes; however, mobile devices are designed with very little memory and cannot store such large models. We present a novel network architecture, HashedNets, that exploits inherent redundancy in neural networks to achieve drastic reductions in model sizes. HashedNets uses a low-cost hash function to randomly group connection weights into hash buckets, and all connections within the same hash bucket share a single parameter value. These parameters are tuned to adjust to the HashedNets weight-sharing architecture with standard backprop during training. Our hashing procedure introduces no additional memory overhead, and we demonstrate on several benchmark data sets that HashedNets shrink the storage requirements of neural networks substantially while mostly preserving generalization performance.
papers  neural-net  compression  hashing  hashing-trick  model-compression 
april 2015 by arsyed
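
A minimal sketch of the HashedNets weight-sharing idea described above: a hash of each (row, column) position in a layer's virtual weight matrix selects one of K buckets, and every connection in a bucket shares a single trainable parameter. The mixing function and class name are illustrative stand-ins, not the authors' implementation.

```python
import numpy as np

class HashedLayer:
    """A virtual n_in x n_out weight matrix backed by only n_buckets real parameters."""
    def __init__(self, n_in, n_out, n_buckets, seed=0):
        self.params = np.random.default_rng(seed).normal(scale=0.1, size=n_buckets)
        # Deterministic "hash" of each (i, j) position into a bucket
        # (simple multiplicative mixing as an illustrative stand-in for a real hash).
        positions = np.arange(n_in * n_out).reshape(n_in, n_out)
        self.buckets = (positions * 2654435761 + seed) % n_buckets

    def weights(self):
        # Expand the shared parameters into the full virtual weight matrix.
        return self.params[self.buckets]

    def forward(self, x):
        return x @ self.weights()

layer = HashedLayer(n_in=256, n_out=128, n_buckets=1024)   # ~32x fewer parameters
y = layer.forward(np.random.randn(4, 256))
# During training, the gradient of each shared parameter is the sum of the gradients
# of all virtual weights that hash to its bucket (plain backprop handles this).
```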
LARGE-SCALE SPEAKER IDENTIFICATION
"Speaker identification is one of the main tasks in speech processing. In addition to identification accuracy, large-scale applications of speaker identification give rise to another challenge: fast search in the database of speakers. In this paper, we propose a system based on i-vectors, a current approach for speaker identification, and locality sensitive hashing, an algorithm for fast nearest neighbor search in high dimensions. The connection between the two techniques is the cosine distance: on the one hand, we use the cosine distance to compare i-vectors, on the other hand, locality sensitive hashing allows us to quickly approximate the cosine distance in our retrieval procedure. We evaluate our approach on a realistic data set from YouTube with about 1,000 speakers. The results show that our algorithm is approximately one to two orders of magnitude faster than a linear search while maintaining the identification accuracy of an i-vector-based system."
papers  speaker-recognition  i-vector  scaling  searching  indexing  knn  hashing 
february 2015 by arsyed
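
A minimal sketch of the connection the abstract above relies on: for sign-random-projection LSH, the probability that a bit differs equals the angle between the two vectors divided by π, so the Hamming distance between codes yields an estimate of their cosine similarity. The dimensions below are illustrative, not the paper's i-vector setup.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_bits = 400, 256            # illustrative i-vector dimension and code length
planes = rng.normal(size=(n_bits, dim))

def code(x):
    return planes @ x >= 0

def approx_cosine(ca, cb):
    """P(bit differs) = angle(a, b) / pi, so the bit-disagreement rate recovers the angle."""
    theta = np.pi * np.mean(ca != cb)
    return np.cos(theta)

a, b = rng.normal(size=dim), rng.normal(size=dim)
true_cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(true_cos, approx_cosine(code(a), code(b)))   # the two values should be close
```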
Performance comparison among LSH Forest, ANNOY and FLANN
"It is evident that index building times of LSH Forest and FLANN are almost incomparable to that of Annoy for almost all the datasets. Moreover, for larger datasets, LSH Forest outperforms Annoy at large margins with respect to accuracy and query speed. Observations from these graphs prove that LSH Forest is competitive with FLANN for large datasets."
python  sklearn  knn  lsh  flann  annoy  hashing  indexing  performance  comparison 
august 2014 by arsyed
andrewclegg/sketchy
"Simple approximate-nearest-neighbours in Python using locality sensitive hashing."
python  libs  knn  lsh  hashing  approximate-nearest-neighbors 
july 2014 by arsyed
Feature hashing for large scale multitask learning
"Empirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation. In this paper we provide exponential tail bounds for feature hashing and show that the interaction between random subspaces is negligible with high probability. We demonstrate the feasibility of this approach with experimental results for a new use case --- multitask learning with hundreds of thousands of tasks."
papers  hashing  machine-learning  multitask-learning 
april 2014 by arsyed
yahoo/Optimal-LSH
"This package provides code to implement locality-sensitive hashing (LSH)
in an optimum fashion.

There are two pieces. A Python library that implements LSH and a Matlab
routine that calculates the optimum parameters for LSH."
hashing  lsh  code  python  matlab  performance 
march 2014 by arsyed
Python
"Generate short hashes from numbers (like YouTube and Bitly)."
python  libs  hashing  string 
march 2014 by arsyed
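
The library bookmarked above (its title is truncated here) produces short, reversible identifiers from integers. A minimal sketch of the underlying idea is plain base-62 encoding; the actual library additionally shuffles the alphabet with a salt to make the mapping non-obvious, which is not shown.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode(n):
    """Turn a non-negative integer into a short string, YouTube/Bitly style."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, len(ALPHABET))
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def decode(s):
    n = 0
    for ch in s:
        n = n * len(ALPHABET) + ALPHABET.index(ch)
    return n

print(encode(1234567))           # '5ban'
print(decode(encode(1234567)))   # 1234567
```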
StefanKarpinski/bam
"Bam: fast, simple TSV key-value server"
software  database  cache  hashing  key-value  c 
january 2014 by arsyed
Learning Hash Functions Using Column Generation | ICML 2013 | JMLR W&CP
"Fast nearest neighbor searching is becoming an increasingly important tool in solving many large-scale problems. Recently a number of approaches to learning data-dependent hash functions have been developed. In this work, we propose a column generation based method for learning data-dependent hash functions on the basis of proximity comparison information. Given a set of triplets that encode the pairwise proximity comparison information, our method learns hash functions that preserve the relative comparison relationships in the data as well as possible within the large-margin learning framework. The learning procedure is implemented using column generation and hence is named CGHash. At each iteration of the column generation procedure, the best hash function is selected. Unlike most other hashing methods, our method generalizes to new data points naturally; and has a training objective which is convex, thus ensuring that the global optimum can be identified. Experiments demonstrate that the proposed method learns compact binary codes and that its retrieval performance compares favorably with state-of-the-art methods when tested on a few benchmark datasets."
papers  machine-learning  hashing  margin-methods  knn 
november 2013 by arsyed
Efficient spoken term discovery using randomized algorithms (Jansen, A., Van Durme, B.)
"Spoken term discovery is the task of automatically identifying words and phrases in speech data by searching for long repeated acoustic patterns. Initial solutions relied on exhaustive dynamic time warping-based searches across the entire similarity matrix, a method whose scalability is ultimately limited by the O(n2) nature of the search space. Recent strategies have attempted to improve search efficiency by using either unsupervised or mismatched-language acoustic models to reduce the complexity of the feature representation. Taking a completely different approach, this paper investigates the use of randomized algorithms that operate directly on the raw acoustic features to produce sparse approximate similarity matrices in O(n) space and O(n log n) time. We demonstrate these techniques facilitate spoken term discovery performance capable of outperforming a model-based strategy in the zero resource setting."
papers  spoken-term-discovery  speech  asr  lsh  dtw  hashing 
july 2013 by arsyed
Nearest Neighbor Retrieval Using Distance-Based Hashing
"A method is proposed for indexing spaces with arbitrary distance measures, so as to achieve efficient approximate nearest neighbor retrieval. Hashing methods, such as Locality Sensitive Hashing (LSH), have been successfully applied for similarity indexing in vector spaces and string spaces under the Hamming distance. The key novelty of the hashing technique proposed here is that it can be applied to spaces with arbitrary distance measures, including non-metric distance measures. First, we describe a domain-independent method for constructing a family of binary hash functions. Then, we use these functions to construct multiple multibit hash tables. We show that the LSH formalism is not applicable for analyzing the behavior of these tables as index structures. We present a novel formulation, that uses statistical observations from sample data to analyze retrieval accuracy and efficiency for the proposed indexing method. Experiments on several real-world data sets demonstrate that our method produces good trade-offs between accuracy and efficiency, and significantly outperforms VP-trees, which are a well-known method for distance-based indexing."
papers  knn  search  distance  hashing  dtw 
july 2013 by arsyed
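
A simplified sketch of the kind of distance-based binary hash function described above: two randomly chosen pivot objects define a "line projection" computed purely from the distance function (so it works for arbitrary, even non-metric, distances), and a threshold on that projection gives one bit. The pivot sampling and median threshold here are illustrative simplifications of the paper's construction.

```python
import random

def line_projection(dist, x, p1, p2):
    """Pseudo-projection of x onto the 'line' through pivots p1 and p2,
    defined only through the distance function."""
    d12 = dist(p1, p2)
    return (dist(x, p1) ** 2 + d12 ** 2 - dist(x, p2) ** 2) / (2.0 * d12)

def make_hash(dist, data):
    """One binary hash function: two random pivots plus a median threshold."""
    p1, p2 = random.sample(data, 2)
    projections = sorted(line_projection(dist, x, p1, p2) for x in data)
    t = projections[len(projections) // 2]   # median splits the data roughly in half
    return lambda x: int(line_projection(dist, x, p1, p2) < t)

# Example with an arbitrary (here one-dimensional) distance measure:
dist = lambda a, b: abs(a - b)
data = [random.gauss(0, 1) for _ in range(1000)]
bits = [make_hash(dist, data) for _ in range(16)]
code = tuple(h(0.5) for h in bits)           # a 16-bit code for the query point 0.5
```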
Indexing Raw Acoustic Features for Scalable Zero Resource Search
"We present a new speech indexing and search scheme called Randomized Acoustic Indexing and Logarithmic-time Search (RAILS) that enables scalable query-by-example spoken term detection in the zero resource regime. RAILS is derived from our recent investigation into the application of randomized hashing and approximate nearest neighbor search algorithms to raw acoustic features. Our approach permits an approximate search through hundreds of hours of speech audio in a matter of seconds, and may be applied to any language without the need of a training corpus, acoustic model, or pronunciation lexicon. The fidelity of the approximation is controlled through a small number of easily interpretable parameters that allow a trade-off between search accuracy and speed."
hashing  indexing  dtw  lsh  speech  algorithms  performance 
july 2013 by arsyed
Don’t Hash Secrets (Ben Adida)
"So, again, don’t hash secrets. HMAC them."
security  hashing  encryption 
february 2010 by arsyed
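
A minimal illustration of the advice above using Python's standard library: authenticate messages with HMAC instead of hashing a secret concatenated with the data (the latter is vulnerable to length-extension attacks with Merkle-Damgård hashes such as SHA-256). The key and message are of course made up.

```python
import hmac
import hashlib

secret = b"server-side-secret-key"           # hypothetical key
message = b"user_id=42&expires=1700000000"   # hypothetical message to authenticate

# Don't: hashlib.sha256(secret + message) -- prone to length-extension attacks.
# Do: use HMAC, which is designed for keyed message authentication.
tag = hmac.new(secret, message, hashlib.sha256).hexdigest()

def verify(message, tag):
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)   # constant-time comparison

print(verify(message, tag))   # True
```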
why publishing your email's hash is not a good idea
"Running my program on a list of 80871 users I was able to extract 8597 email addresses, associated to their users. This means that for a bit more than 10% of the users, the username and the gravatar URL are enough to deduce the email address they used to register to the website."
email  hashing  security  spam 
january 2010 by arsyed
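
A minimal sketch of why the attack above works: Gravatar URLs embed the MD5 of the lowercased email address, so anyone who sees the hash next to a username can hash candidate addresses and compare. The candidate domains below are illustrative; the post's author used a larger candidate set.

```python
import hashlib

def gravatar_hash(email):
    return hashlib.md5(email.strip().lower().encode()).hexdigest()

def guess_email(username, published_hash,
                domains=("gmail.com", "yahoo.com", "hotmail.com")):
    """Try obvious candidate addresses; a real attack would also try wordlists,
    dots/underscores, and addresses from previous leaks."""
    for domain in domains:
        candidate = f"{username}@{domain}"
        if gravatar_hash(candidate) == published_hash:
            return candidate
    return None

# If a site exposes gravatar_hash('alice@gmail.com') next to the username 'alice':
print(guess_email("alice", gravatar_hash("alice@gmail.com")))
```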
Hash Table with Preemptive Bloom Filter (Michael Mitzenmacher)
"Before you do a lookup into the table, you check the Bloom filter. A false positive in the Bloom filter means you do a lookup when you didn't have to, but this allows you to avoid a lookup for a key that's not in your hash table the large majority of the time. The approach works for any type of hash table."
bloom  filter  hashing  performance 
march 2009 by arsyed
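
A minimal sketch of the lookup pattern described above: consult a small in-memory Bloom filter before the hash table, so that lookups for absent keys are usually skipped, and a false positive only costs one wasted lookup. The two-hash (double hashing) Bloom filter below is illustrative.

```python
import zlib

class BloomGuardedTable:
    """Check a small Bloom filter before touching the (potentially large or slow) table."""
    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.bits = bytearray(n_bits)
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.table = {}

    def _positions(self, key):
        # Derive n_hashes bit positions from two base hashes (double hashing).
        h1, h2 = zlib.crc32(key.encode()), zlib.adler32(key.encode())
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def put(self, key, value):
        for p in self._positions(key):
            self.bits[p] = 1
        self.table[key] = value

    def get(self, key):
        if not all(self.bits[p] for p in self._positions(key)):
            return None                  # definitely absent: skip the table lookup
        return self.table.get(key)       # present, or a Bloom false positive

t = BloomGuardedTable()
t.put("foo", 1)
print(t.get("foo"), t.get("bar"))   # 1 None
```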
Astrometry.net
"If you have astronomical imaging of the sky with celestial coordinates you do not know— or do not trust — then Astrometry.net is for you. Input an image and we'll give you back astrometric calibration meta-data, plus lists of known objects falling inside the field of view."
astronomy  astrometry  image  dip  geometric  hashing 
february 2009 by arsyed
Anti-RDBMS: A list of distributed key-value stores (Richard Jones)
"This article represents my notes and research to date on distributed key-value stores (and some other stuff) that might be suitable as RDBMS replacements under the right conditions."
database  keyvalue  hashing  comparison 
january 2009 by arsyed
tuulos's ringo at master — GitHub
"Ringo is an experimental, distributed, replicating key-value store based on consistent hashing and immutable data. Unlike many general-purpose databases, Ringo is designed for a specific use case: For archiving small (less than 4KB) or medium-size data items (<100MB) in real-time so that the data can survive K - 1 disk breaks, where K is the desired number of replicas, without any downtime, in a manner that scales to terabytes of data. ... An experimental interface for Disco .. which implements the Disco's map reader interface, makes it possible to use data stored in Ringo as input to a Disco job. "
database  distributed  consistent  hashing  mapReduce  erlang  python  caching 
december 2008 by arsyed
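
A minimal sketch of the consistent-hashing ring that stores like Ringo are built on (illustrative Python, not Ringo's Erlang implementation): nodes are placed on a hash ring at several virtual points, a key is owned by the first node clockwise from its hash, and the next K - 1 distinct nodes hold its replicas, so adding or removing a node only moves a small fraction of the keys.

```python
import bisect
import hashlib

def _h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each physical node gets several virtual points on the ring.
        self.ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def replicas(self, key, k=3):
        """Return the first k distinct nodes clockwise from the key's position."""
        i = bisect.bisect(self.points, _h(key)) % len(self.ring)
        owners = []
        while len(owners) < k:
            node = self.ring[i % len(self.ring)][1]
            if node not in owners:
                owners.append(node)
            i += 1
        return owners

ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.replicas("user:1234", k=3))   # three distinct nodes responsible for this key
```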
The Hash Function Lounge
"The goal of this site is to give an overview of hash functions intended for cryptographic applications."
algorithms  hashing  crypto 
november 2008 by arsyed

related tags

algorithms  all-pairs  ann  annoy  approximate-nearest-neighbors  approximation  asr  astrometry  astronomy  audio  basics  bloom  bloom-filter  bloom-filters  c  cache  caching  code  collaborative-filtering  comparison  compression  computer-vision  consistent  crypto  database  dataproc  dedup  dimensionality-reduction  dimsum  dip  distance  distance-matrix  distributed  dtw  email  encryption  erlang  feature-hashing  feature-selection  filter  flann  geometric  hamming  hashing  hashing-trick  highdim  i-vector  image  indexing  ir  jaccard  kaggle  key-value  keyvalue  knn  learning-to-hash  libs  locality-sensitive-hashing  logistic-regression  lsh  mac  machine-learning  mapReduce  margin-methods  matlab  mega  memory  metric-learning  minhash  model-compression  mp3  multitask-learning  neural-net  papers  parallel  password  people  performance  prefix-filtering  python  recsys  ref  research  scalability  scaling  search  searching  security  simhashing  similarity  sketching  sklearn  software  spam  sparsity  speaker-recognition  speech  spoken-term-discovery  string  surveys  text  tips  tricks  tticks  twitter  via:euler  via:vaguery  voldemort 
