foodbaby + word (41 bookmarks)

Factors Influencing the Surprising Instability of Word Embeddings
Despite the recent popularity of word embedding methods, there is only a small body of work exploring the limitations of these representations. In this paper, we consider one aspect of embedding spaces, namely their stability. We show that even relatively high frequency words (100-200 occurrences) are often unstable. We provide empirical evidence for how various factors contribute to the stability of word embeddings, and we analyze the effects of stability on downstream tasks.
word  embeddings  word2vec  evaluation 
june 2018 by foodbaby
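The stability discussed in the bookmark above is commonly quantified as the overlap between a word's nearest neighbors across embedding spaces trained under different conditions. A rough NumPy sketch of that kind of overlap measure (the neighborhood size k, the aligned vocabulary, and the toy data are assumptions, not the paper's exact protocol):

```python
import numpy as np

def top_k_neighbors(emb, k=10):
    """Indices of each word's k nearest neighbors by cosine similarity."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)              # a word is not its own neighbor
    return np.argsort(-sims, axis=1)[:, :k]

def stability(emb_a, emb_b, k=10):
    """Mean overlap (0..1) of k-nearest-neighbor sets between two embedding
    spaces over the same vocabulary (rows aligned word-for-word)."""
    nn_a, nn_b = top_k_neighbors(emb_a, k), top_k_neighbors(emb_b, k)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]))

# Toy example: a 1,000-word vocabulary and a slightly perturbed re-run.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(1000, 100))
emb_b = emb_a + 0.1 * rng.normal(size=(1000, 100))
print(round(stability(emb_a, emb_b), 3))
```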
[1801.00388] Beyond Word Embeddings: Learning Entity and Concept Representations from Large Scale Knowledge Bases
Text representation using neural word embeddings has proven efficacy in many NLP applications. Recently, much research interest has gone beyond word embeddings by adapting traditional word embedding models to learn vectors of multiword expressions (concepts/entities). However, current methods are limited to textual knowledge bases only (e.g., Wikipedia). In this paper, we propose a novel approach for learning concept vectors from two large scale knowledge bases (Wikipedia and Probase). We adapt the skip-gram model to seamlessly learn from the knowledge in Wikipedia text and the Probase concept graph. We evaluate our concept embedding models intrinsically on two tasks: 1) analogical reasoning, where we achieve state-of-the-art performance of 91% on semantic analogies; 2) concept categorization, where we achieve state-of-the-art performance on two benchmark datasets, with categorization accuracy of 100% on one and 98% on the other. Additionally, we present a case study to extrinsically evaluate our model on unsupervised argument type identification for neural semantic parsing. We demonstrate the competitive accuracy of our unsupervised method and its ability to better generalize to out-of-vocabulary entity mentions compared to the tedious and error-prone methods that depend on gazetteers and regular expressions.
concept  word  embeddings 
january 2018 by foodbaby
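The analogical-reasoning evaluation mentioned in the abstract above is normally run with the standard vector-offset (3CosAdd) method; the short gensim sketch below illustrates that general procedure, not the authors' code (the file names are placeholders):

```python
from gensim.models import KeyedVectors

# Placeholder path: any word/concept vectors in word2vec text format.
vectors = KeyedVectors.load_word2vec_format("concept_vectors.txt", binary=False)

# 3CosAdd: which vector is closest to  king - man + woman ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Accuracy over an analogy file (four words per line, sections start with ':').
score, sections = vectors.evaluate_word_analogies("questions-words.txt")
print("overall accuracy:", score)
```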
Advances in Pre-Training Distributed Word Representations | Hacker News
Many Natural Language Processing applications nowadays rely on pre-trained word representations estimated from large text corpora such as news collections, Wikipedia and Web Crawl. In this paper, we show how to train high-quality word vector representations by using a combination of known tricks that are however rarely used together. The main result of our work is the new set of publicly available pre-trained models that outperform the current state of the art by a large margin on a number of tasks.
pre-trained  word  embeddings 
december 2017 by foodbaby
GitHub - commonsense/conceptnet-numberbatch
ConceptNet Numberbatch consists of state-of-the-art semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning.
word  embeddings 
december 2017 by foodbaby
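If memory serves, the English-only Numberbatch release is a gzipped text file in word2vec format, so it loads directly with gensim; the exact filename below is an assumption and should be checked against the repository's releases:

```python
from gensim.models import KeyedVectors

# Assumed filename; see the repository's releases for the current version.
vectors = KeyedVectors.load_word2vec_format("numberbatch-en-19.08.txt.gz", binary=False)

print(vectors.most_similar("coffee", topn=5))
print(vectors.similarity("cat", "dog"))
```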
[1602.01137] A Dual Embedding Space Model for Document Ranking
A fundamental goal of search engines is to identify, given a query, documents that have relevant text. This is intrinsically difficult because the query and the document may use different vocabulary, or the document may contain query words without being relevant. We investigate neural word embeddings as a source of evidence in document ranking. We train a word2vec embedding model on a large unlabelled query corpus, but in contrast to how the model is commonly used, we retain both the input and the output projections, allowing us to leverage both the embedding spaces to derive richer distributional relationships. During ranking we map the query words into the input space and the document words into the output space, and compute a query-document relevance score by aggregating the cosine similarities across all the query-document word pairs.
We postulate that the proposed Dual Embedding Space Model (DESM) captures evidence on whether a document is about a query term in addition to what is modelled by traditional term-frequency based approaches. Our experiments show that the DESM can re-rank top documents returned by a commercial Web search engine, like Bing, better than a term-matching based signal like TF-IDF. However, when ranking a larger set of candidate documents, we find the embeddings-based approach is prone to false positives, retrieving documents that are only loosely related to the query. We demonstrate that this problem can be solved effectively by ranking based on a linear mixture of the DESM and the word counting features.
word  embeddings  IR  relevance  papers  2016 
december 2017 by foodbaby
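As I read the abstract above, the DESM score aggregates cosine similarities between query words (in the IN space) and document words (in the OUT space), with the published formulation, if memory serves, using the centroid of the normalized document vectors, and the final ranker mixing that with a term-matching score. A NumPy sketch under those assumptions (the mixing weight and variable names are mine):

```python
import numpy as np

def desm_score(query_in_vecs, doc_out_vecs):
    """IN-OUT DESM: mean cosine between each query word's IN vector and the
    centroid of the document's unit-normalized OUT vectors."""
    doc_unit = doc_out_vecs / np.linalg.norm(doc_out_vecs, axis=1, keepdims=True)
    centroid = doc_unit.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    query_unit = query_in_vecs / np.linalg.norm(query_in_vecs, axis=1, keepdims=True)
    return float((query_unit @ centroid).mean())

def mixture_score(desm, lexical, alpha=0.1):
    """Linear mixture of the embedding-based and term-matching signals
    (alpha is a guess, to be tuned)."""
    return alpha * desm + (1.0 - alpha) * lexical

# Toy example: 2 query words, 5 document words, 8-dimensional vectors.
rng = np.random.default_rng(1)
q_in, d_out = rng.normal(size=(2, 8)), rng.normal(size=(5, 8))
print(mixture_score(desm_score(q_in, d_out), lexical=3.2))
```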
Using Text Embeddings for Information Retrieval
theory and intuition behind word embeddings plus applications
word  embeddings  IR  slides  2016  session2vec  theory  word2vec 
december 2017 by foodbaby
[1708.07903] Nationality Classification Using Name Embeddings
Nationality identification unlocks important demographic information, with many applications in biomedical and sociological research. Existing name-based nationality classifiers use name substrings as features and are trained on small, unrepresentative sets of labeled names, typically extracted from Wikipedia. As a result, these methods achieve limited performance and cannot support fine-grained classification.
We exploit the phenomenon of homophily in communication patterns to learn name embeddings, a new representation that encodes gender, ethnicity, and nationality and is readily applicable to building classifiers and other systems. Through our analysis of 57M contact lists from a major Internet company, we are able to design a fine-grained nationality classifier covering 39 groups representing over 90% of the world population. In an evaluation against other published systems over 13 common classes, our F1 score (0.795) is substantially better than that of our closest competitor, Ethnea (0.580). To the best of our knowledge, this is the most accurate, fine-grained nationality classifier available.
As a social media application, we apply our classifiers to the followers of major Twitter celebrities over six different domains. We demonstrate stark differences in the ethnicities of the followers of Trump and Obama, and in the sports and entertainments favored by different groups. Finally, we identify an anomalous political figure whose presumably inflated following appears largely incapable of reading the language he posts in.
ethnicity  word  embeddings  papers 
december 2017 by foodbaby
Relevance Feedback-based Query Expansion Model using Ranks Combining and Word2Vec... | Request PDF
Query expansion is a well-known method for improving the performance of information retrieval systems. Pseudo-relevance feedback (PRF)-based query expansion is a type of query expansion approach that assumes the top-ranked retrieved documents are relevant. Adding all the terms from the PRF documents is neither important nor appropriate for expanding the original user query, so the selection of proper expansion terms is very important for improving retrieval system performance. Various individual query expansion term selection methods have been widely investigated for improving system performance, and each has its own weaknesses and strengths. In order to minimize the weaknesses and utilize the strengths of the individual methods, we use multiple term selection methods together. First, this paper explores the possibility of improving overall system performance by using individual query expansion term selection methods. Further, a rank-aggregation method named Borda count is used for combining multiple query expansion term selection methods. Finally, a word2vec approach is used to select terms semantically similar to the query after applying the Borda count rank combination. Our experimental results on both the TREC and FIRE datasets demonstrate that the proposed approaches achieve significant improvements over each individual term selection method and other related state-of-the-art methods.
relevance  feedback  query  expansion  word  embeddings  IR  papers 
december 2017 by foodbaby
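One plausible reading of the pipeline described above: each individual selection method produces a ranked candidate-term list, Borda count aggregates the ranks, and word2vec similarity to the query then filters the aggregate. A hedged sketch of that reading (the candidate lists, the threshold, and the helper names are assumptions, not the paper's code):

```python
from collections import defaultdict

def borda_combine(ranked_lists):
    """Borda count: a term at 0-indexed position r in a list of length n gets n - r points."""
    scores = defaultdict(float)
    for terms in ranked_lists:
        n = len(terms)
        for rank, term in enumerate(terms):
            scores[term] += n - rank
    return sorted(scores, key=scores.get, reverse=True)

def semantic_filter(candidates, query_terms, wv, threshold=0.4, top_n=10):
    """Keep candidates whose mean word2vec similarity to the query terms is high enough."""
    kept = []
    for term in candidates:
        if term not in wv or term in query_terms:
            continue
        sim = sum(wv.similarity(term, q) for q in query_terms if q in wv) / len(query_terms)
        if sim >= threshold:
            kept.append(term)
    return kept[:top_n]

# ranked_lists would come from the individual term selection methods;
# wv is any gensim KeyedVectors instance.
# expansion = semantic_filter(borda_combine(ranked_lists), ["query", "terms"], wv)
```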
Query Expansion with Locally-Trained Word Embeddings - Microsoft Research
Continuous space word embeddings have received a great deal of attention in the natural language processing and machine learning communities for their ability to model term similarity and other relationships. We study the use of term relatedness in the context of query expansion for information retrieval. We demonstrate that word embeddings such as word2vec and GloVe, when trained globally, underperform corpus and query specific embeddings for retrieval tasks. These results suggest that other tasks benefiting from global embeddings may also benefit from local embeddings.
word  embeddings  query  expansion  IR  papers 
december 2017 by foodbaby
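A minimal sketch of the locally-trained idea as I understand it: retrieve top documents for the query, train a small word2vec model on just those, and expand from that local space (gensim 4.x API; the initial retrieval step and the naive tokenization are stubbed-out assumptions):

```python
from gensim.models import Word2Vec

def expand_query_locally(query_terms, top_docs, topn=10):
    """Train a small word2vec model on the top-retrieved documents only and
    return expansion terms near the query in that local space."""
    sentences = [doc.lower().split() for doc in top_docs]   # naive tokenization
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=2,
                     epochs=20, workers=1, seed=0)
    seeds = [t for t in query_terms if t in model.wv]
    if not seeds:
        return []
    return [term for term, _ in model.wv.most_similar(positive=seeds, topn=topn)]

# top_docs would be the text of the top-k results from an initial (e.g. BM25) run.
# expanded = query_terms + expand_query_locally(query_terms, top_docs)
```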
[1606.07608] Using Word Embeddings for Automatic Query Expansion
In this paper, a framework for Automatic Query Expansion (AQE) is proposed using the distributed neural language model word2vec. Using semantic and contextual relations in a distributed and unsupervised framework, word2vec learns a low-dimensional embedding for each vocabulary entry. Using such a framework, we devise a query expansion technique where terms related to a query are obtained by a K-nearest neighbor approach. We explore the performance of the AQE methods, with and without feedback query expansion, and a variant of simple K-nearest neighbor in the proposed framework. Experiments on standard TREC ad-hoc data (Disks 4 and 5 with query sets 301-450 and 601-700) and web data (WT10G data with query set 451-550) show significant improvements over standard term-overlap-based retrieval methods. However, the proposed method fails to achieve performance comparable to statistical co-occurrence-based feedback methods such as RM3. We also find that the word2vec-based query expansion methods perform similarly with and without any feedback information.
word  embeddings  query  expansion  IR  papers 
december 2017 by foodbaby
Word Embedding Causes Topic Shifting; Exploit Global Context! - Semantic Scholar
Exploitation of the term relatedness provided by word embeddings has gained considerable attention in recent IR literature. However, an emerging question is whether this sort of relatedness fits the needs of IR with respect to retrieval effectiveness. While we observe a high potential of word embeddings as a resource for related terms, several cases of topic shifting deteriorate the final performance of the applied retrieval models. To address this issue, we revisit the use of global context (i.e., term co-occurrence in documents) to measure term relatedness. We hypothesize that in order to avoid topic shifting among terms with high word embedding similarity, they should often share similar global contexts as well. We therefore study the effectiveness of post-filtering related terms by various global-context relatedness measures. Experimental results show significant improvements in two out of three test collections, and support our initial hypothesis regarding the importance of considering global context in retrieval.
word2vec  word  embeddings  IR 
december 2017 by foodbaby
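As described above, the fix is to keep an embedding-similar term only if it also shares global context (document-level co-occurrence) with the query term. One hedged way to score that is a Dice coefficient over an inverted index; the postings structure below is an assumption, not the paper's specific measure:

```python
def dice_global_context(term_a, term_b, postings):
    """Dice coefficient over document sets: 2*df(a,b) / (df(a) + df(b)).
    `postings` maps a term to the set of document ids containing it."""
    docs_a, docs_b = postings.get(term_a, set()), postings.get(term_b, set())
    if not docs_a or not docs_b:
        return 0.0
    return 2.0 * len(docs_a & docs_b) / (len(docs_a) + len(docs_b))

def filter_by_global_context(query_term, candidates, postings, min_dice=0.1):
    """Drop embedding-derived candidates that rarely share documents with the query term."""
    return [c for c in candidates
            if dice_global_context(query_term, c, postings) >= min_dice]

# candidates would come from word2vec nearest neighbors of query_term;
# postings from the collection's inverted index.
```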
Joint Embedding of Query and Ad by Leveraging Implicit Feedback - Semantic Scholar
Sponsored search is at the center of a multibillion-dollar market established by search technology. Accurate ad click prediction is a key component for this market to function, since the pricing mechanism heavily relies on the estimation of click probabilities. Lexical features derived from the text of both the query and the ads play a significant role, complementing features based on historical click information. The purpose of this paper is to explore the use of word embedding techniques to generate effective text features that can capture not only the lexical similarity between query and ads but also the latent user intents. We identify several potential weaknesses of the plain application of conventional word embedding methodologies to ad click prediction. These observations motivated us to propose a set of novel joint word embedding methods that leverage implicit click feedback. We verify the effectiveness of these new word embedding models by adding features derived from them to the click prediction system of a commercial search engine. Our evaluation results clearly demonstrate the effectiveness of the proposed methods. To the best of our knowledge, this work is the first successful application of word embedding techniques to the task of click prediction in sponsored search.
ctr  prediction  word  embeddings  papers  sponsored  search  click2vec 
november 2017 by foodbaby
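The abstract does not spell out the model, but a common "click2vec"-style baseline in this spirit treats each clicked query-ad pair as one training sentence so that query and ad tokens share contexts. A gensim sketch of that baseline only, not the paper's method (the q:/ad: prefixes and toy click data are my convention):

```python
from gensim.models import Word2Vec

# Each clicked impression contributes one pseudo-sentence: query tokens + ad title tokens.
clicks = [
    ("cheap flights london", "budget airline tickets to london"),
    ("running shoes", "lightweight running shoes sale"),
]
sentences = [[f"q:{t}" for t in q.split()] + [f"ad:{t}" for t in a.split()]
             for q, a in clicks]

# A large window keeps the whole query-ad pair in one context.
model = Word2Vec(sentences, vector_size=64, window=20, min_count=1, epochs=50, seed=0)

# Query-ad affinity as cosine similarity in the joint space.
print(model.wv.similarity("q:london", "ad:london"))
```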
Relevance-based Word Embedding - Semantic Scholar
Learning a high-dimensional dense representation for vocabulary terms, also known as a word embedding, has recently attracted much attention in natural language processing and information retrieval tasks. The embedding vectors are typically learned based on term proximity in a large corpus. This means that the objective in well-known word embedding algorithms, e.g., word2vec, is to accurately predict adjacent word(s) for a given word or context. However, this objective is not necessarily equivalent to the goal of many information retrieval (IR) tasks. The primary objective in various IR tasks is to capture relevance instead of term proximity, syntactic, or even semantic similarity. This is the motivation for developing unsupervised relevance-based word embedding models that learn word representations based on query-document relevance information. In this paper, we propose two learning models with different objective functions; one learns a relevance distribution over the vocabulary set for each query, and the other classifies each term as belonging to the relevant or non-relevant class for each query. To train our models, we used over six million unique queries and the top ranked documents retrieved in response to each query, which are assumed to be relevant to the query. We extrinsically evaluate our learned word representation models using two IR tasks: query expansion and query classification. Both query expansion experiments on four TREC collections and query classification experiments on the KDD Cup 2005 dataset suggest that the relevance-based word embedding models significantly outperform state-of-the-art proximity-based embedding models, such as word2vec and GloVe.
IR  word  embeddings  word2vec 
november 2017 by foodbaby
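The models above are trained from query-document relevance information; one standard way to obtain a per-query relevance distribution over the vocabulary from top-ranked documents is a relevance-model (RM1-style) estimate. The sketch below shows only that estimation step, not the paper's embedding models, and uses normalized retrieval scores as a stand-in for document weights (an assumption):

```python
from collections import Counter, defaultdict

def relevance_distribution(top_docs, doc_weights):
    """Estimate a relevance distribution p(w | Q) over the vocabulary by mixing the
    top-ranked documents' unigram language models; doc_weights should sum to 1
    (normalized retrieval scores are a common stand-in for p(Q | D))."""
    p_w = defaultdict(float)
    for doc, w_d in zip(top_docs, doc_weights):
        tokens = doc.lower().split()
        counts = Counter(tokens)
        for term, c in counts.items():
            p_w[term] += w_d * c / len(tokens)     # maximum-likelihood p(term | doc)
    return dict(sorted(p_w.items(), key=lambda kv: kv[1], reverse=True))

# Example: two pseudo-relevant documents with equal weight.
docs = ["neural word embeddings for retrieval", "retrieval with relevance feedback"]
print(relevance_distribution(docs, [0.5, 0.5]))
```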
Query Expansion with Locally-Trained Word Embeddings - Semantic Scholar
Continuous space word embeddings have received a great deal of attention in the natural language processing and machine learning communities for their ability to model term similarity and other relationships. We study the use of term relatedness in the context of query expansion for ad hoc information retrieval. We demonstrate that word embeddings such as word2vec and GloVe, when trained globally, underperform corpus and query specific embeddings for retrieval tasks. These results suggest that other tasks benefiting from global embeddings may also benefit from local embeddings.
IR  query  expansion  word  embeddings  papers  word2vec 
november 2017 by foodbaby
Understanding Styles in Microsoft Word - A Tutorial in the Intermediate Users Guide to Microsoft Word
Understanding Styles in Microsoft Word. A chapter in the Intermediate User's Guide to Microsoft Word.
imported  Computing  Microsoft_Word  word  styles  tutorial  msword  microsoft  tips  reference 
june 2009 by foodbaby
