We present Document Vector through Corruption (Doc2VecC), an efficient framework for learning document representations. Doc2VecC represents each document as a simple average of its word embeddings and, during learning, ensures that this averaged representation captures the semantic meaning of the document. It includes a corruption model that introduces a data-dependent regularization, which favors informative or rare words while forcing the embeddings of common, non-discriminative words close to zero. Doc2VecC produces significantly better word embeddings than Word2Vec. We compare Doc2VecC with several state-of-the-art document representation learning algorithms. Its simple architecture matches or outperforms the state of the art in generating high-quality document representations for sentiment analysis, document classification, and semantic relatedness tasks. The simplicity of the model enables training on billions of words per hour on a single machine, and the model is also very efficient at generating representations of unseen documents at test time.
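The averaging and corruption ideas in the abstract above can be illustrated with a minimal sketch. The function names, the drop probability `q`, and the rescaling of surviving words by `1 / (1 - q)` are assumptions made for this illustration; the abstract itself does not spell out the corruption scheme, and this sketch does not include the training objective.

```python
import random


def average_embedding(word_ids, embeddings):
    """Represent a non-empty document as the plain average of its word embeddings."""
    dim = len(embeddings[0])
    doc = [0.0] * dim
    for w in word_ids:
        for k in range(dim):
            doc[k] += embeddings[w][k]
    return [v / len(word_ids) for v in doc]


def corrupted_average(word_ids, embeddings, q, rng):
    """Average after dropping each word independently with probability q.

    Assumption: surviving words are rescaled by 1 / (1 - q), so the
    corrupted average equals the plain average in expectation.
    """
    dim = len(embeddings[0])
    scale = 1.0 / (1.0 - q)
    doc = [0.0] * dim
    for w in word_ids:
        if rng.random() >= q:  # keep this word
            for k in range(dim):
                doc[k] += scale * embeddings[w][k]
    # Divide by the original length, not the number of surviving words.
    return [v / len(word_ids) for v in doc]


# Hypothetical toy vocabulary of three 2-dimensional embeddings.
toy_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_document = [0, 1, 2, 1]
plain = average_embedding(toy_document, toy_embeddings)
noisy = corrupted_average(toy_document, toy_embeddings, 0.5, random.Random(0))
```

At test time only `average_embedding` would be needed, which is consistent with the abstract's claim that representing unseen documents is cheap.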
Factors Influencing the Surprising Instability of Word Embeddings
Despite the recent popularity of word embedding methods, there is only a small body of work exploring the limitations of these representations. In this paper, we consider one aspect of embedding spaces, namely their stability. We show that even relatively high frequency words (100-200 occurrences) are often unstable. We provide empirical evidence for how various factors contribute to the stability of word embeddings, and we analyze the effects of stability on downstream tasks.
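One way to make "stability" concrete is to compare a word's nearest neighbors across two embedding spaces trained under different conditions. The abstract above does not define its measure, so the neighbor-overlap definition, the function names, and the toy vectors below are all assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(word, space, k):
    """Return the k words closest to `word` by cosine similarity."""
    others = [w for w in space if w != word]
    others.sort(key=lambda w: cosine(space[word], space[w]), reverse=True)
    return set(others[:k])


def stability(word, space_a, space_b, k):
    """Fraction of the k nearest neighbors shared by both spaces (assumed measure)."""
    shared = nearest_neighbors(word, space_a, k) & nearest_neighbors(word, space_b, k)
    return len(shared) / k


# Hypothetical toy spaces: same vocabulary, different geometry.
space_a = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0], "bus": [0.1, 0.9]}
space_b = {"cat": [0.0, 1.0], "dog": [1.0, 0.0], "car": [0.1, 0.9], "bus": [0.9, 0.1]}
score = stability("cat", space_a, space_b, 1)
```

A score near 1 means the word keeps the same neighborhood in both spaces; a score near 0 means its neighbors changed almost entirely.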
How to easily do Topic Modeling with LSA, PLSA, LDA & lda2Vec
This article is a comprehensive overview of Topic Modeling and its associated techniques.
agnusmaximus/Word2Bits: Quantized word vectors that take 8x-16x less space than regular word vectors
Word vectors require significant amounts of memory and storage, which poses problems for resource-limited devices like mobile phones and GPUs. We show that high-quality quantized word vectors using 1-2 bits per parameter can be learned by introducing a quantization function into Word2Vec. We furthermore show that training with the quantization function acts as a regularizer. We train word vectors on English Wikipedia (2017) and evaluate them on standard word similarity and analogy tasks and on question answering (SQuAD). Our quantized word vectors not only take 8-16x less space than full-precision (32-bit) word vectors but also outperform them on word similarity tasks and question answering.
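To illustrate what a 1-bit-per-parameter quantization function might look like, here is a minimal sketch. The sign-based rule, the magnitude `level`, and the bit-packing helper are assumptions for illustration; the abstract above does not give the exact function, and this sketch covers only the quantization step, not training.

```python
def quantize_1bit(x, level=1.0 / 3.0):
    """Map a real value to one of two levels based on its sign.

    Assumption: `level` is an illustrative magnitude, not a value
    taken from the abstract.
    """
    return level if x >= 0 else -level


def quantize_vector(vec, level=1.0 / 3.0):
    """Quantize every parameter of a word vector to 1 bit of information."""
    return [quantize_1bit(x, level) for x in vec]


def pack_signs(vec):
    """Store one bit per parameter: bit i is set when vec[i] >= 0."""
    packed = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x >= 0:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


# Hypothetical full-precision word vector.
full_precision = [0.5, -0.1, 0.0, -2.3]
quantized = quantize_vector(full_precision)
stored = pack_signs(full_precision)
```

Because every parameter takes one of two values, only the sign needs to be stored, which is where the space saving over 32-bit floats comes from.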
