Abstract for Sparse Word Embeddings Using ℓ1 Regularized Online Learning - Semantic Scholar
Recently, the Word2Vec tool has attracted a lot of interest for its promising performance in a variety of natural language processing (NLP) tasks. However, a critical issue is that the dense word representations learned by Word2Vec lack interpretability. It is natural to ask whether one could improve their interpretability while keeping their performance. Inspired by the success of sparse models in enhancing interpretability, we propose to introduce a sparsity constraint into Word2Vec. Specifically, we take the Continuous Bag of Words (CBOW) model as an example in our study and add the ℓ1 regularizer to its learning objective. One optimization challenge is that stochastic gradient descent (SGD) cannot directly produce sparse solutions with the ℓ1 regularizer in online training. To solve this problem, we employ the Regularized Dual Averaging (RDA) method, an online optimization algorithm for regularized stochastic learning. In this way, the learning process is very efficient and our model can scale up to very large corpora to derive sparse word representations. The proposed model is evaluated on both expressive power and interpretability. The results show that, compared with the original CBOW model, the proposed model can obtain state-of-the-art results with better interpretability using less than 10% non-zero elements.
embeddings  ML-interpretability 
2 days ago
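The sparse-embedding abstract above says that plain SGD does not yield exact zeros under an ℓ1 penalty, while Regularized Dual Averaging does. A minimal sketch of one generic ℓ1-RDA step follows. It assumes the standard closed-form update with the auxiliary function ½‖w‖² and step schedule γ√t; the function name `l1_rda_step` and the parameters `lam` and `gamma` are illustrative, and how the paper wires this into CBOW is not stated in the abstract.

```python
import math


def l1_rda_step(grad_sum, grad, t, lam, gamma):
    """One l1-RDA update (generic form, not the paper's exact code).

    grad_sum: running sum of all past gradients, updated in place.
    grad:     gradient observed at step t (t starts at 1).
    lam:      l1 regularization strength (assumed name).
    gamma:    step-size constant in beta_t = gamma * sqrt(t) (assumed name).
    Returns the new weight vector as a list.
    """
    for i, g in enumerate(grad):
        grad_sum[i] += g

    scale = math.sqrt(t) / gamma
    weights = []
    for s in grad_sum:
        g_bar = s / t  # average gradient over the first t steps
        if abs(g_bar) <= lam:
            # Coordinates whose average gradient is small are set to exactly 0.
            weights.append(0.0)
        else:
            # Otherwise shrink the average gradient by lam and step against it.
            weights.append(-scale * (g_bar - lam * math.copysign(1.0, g_bar)))
    return weights


grad_sum = [0.0, 0.0]
w = l1_rda_step(grad_sum, [0.05, -1.0], t=1, lam=0.1, gamma=1.0)
```

The truncation acts on the averaged gradient rather than on a single noisy step, which is why coordinates can stay at exactly zero during online training.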
Efficient Vector Representation for Documents through Corruption | OpenReview
We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of word embeddings. It ensures a representation generated as such captures the semantic meanings of the document during learning. A corruption model is included, which introduces a data-dependent regularization that favors informative or rare words while forcing the embeddings of common and non-discriminative ones to be close to zero. Doc2VecC produces significantly better word embeddings than Word2Vec. We compare Doc2VecC with several state-of-the-art document representation learning algorithms. The simple model architecture introduced by Doc2VecC matches or out-performs the state-of-the-art in generating high-quality document representations for sentiment analysis, document classification as well as semantic relatedness tasks. The simplicity of the model enables training on billions of words per hour on a single machine. At the same time, the model is very efficient in generating representations of unseen documents at test time.
word2vec  embeddings 
2 days ago
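The Doc2VecC abstract above describes a document as a simple average of its word embeddings, with a corruption model applied during learning. The sketch below shows only that representation step. The function name `doc_vector`, the drop probability `q`, and the 1/(1-q) rescaling of kept words are assumptions for illustration; the abstract does not specify the corruption scheme, and no training is performed here.

```python
import random


def doc_vector(tokens, embeddings, q=0.0, rng=None):
    """Average the embeddings of the known tokens in a document.

    q is an assumed corruption rate in [0, 1): each word is dropped with
    probability q and kept words are rescaled by 1 / (1 - q), so the
    corrupted average matches the plain average in expectation.
    With q=0 this is the plain average used at test time.
    """
    rng = rng or random.Random()
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    known = [tok for tok in tokens if tok in embeddings]
    if not known:
        return total

    keep_scale = 1.0 / (1.0 - q)
    for tok in known:
        if q > 0.0 and rng.random() < q:
            continue  # word removed by the corruption step
        for i, value in enumerate(embeddings[tok]):
            total[i] += keep_scale * value
    return [x / len(known) for x in total]


toy_embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0]}  # hypothetical vectors
vec = doc_vector(["a", "b", "a", "unknown"], toy_embeddings)
```

Because an unseen document needs only a lookup and an average, this step involves no optimization at test time, which fits the efficiency claim in the abstract.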
[1611.01116] Binary Paragraph Vectors
Recently Le & Mikolov described two log-linear models, called Paragraph Vector, that can be used to learn state-of-the-art distributed representations of documents. Inspired by this work, we present Binary Paragraph Vector models: simple neural networks that learn short binary codes for fast information retrieval. We show that binary paragraph vectors outperform autoencoder-based binary codes, despite using fewer bits. We also evaluate their precision in transfer learning settings, where binary codes are inferred for documents unrelated to the training corpus. Results from these experiments indicate that binary paragraph vectors can capture semantics relevant for various domain-specific documents. Finally, we present a model that simultaneously learns short binary codes and longer, real-valued representations. This model can be used to rapidly retrieve a short list of highly relevant documents from a large document collection.
IR  embeddings  index  papers 
2 days ago
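The Binary Paragraph Vectors abstract above relies on short binary codes to retrieve a shortlist of documents quickly. A minimal sketch of that retrieval step: threshold real-valued activations into bits, then rank documents by Hamming distance. The names `to_code`, `hamming`, and `shortlist`, the 0.5 threshold, and the toy codes are all assumptions; the abstract does not describe how the codes are binarized or indexed.

```python
def to_code(activations, threshold=0.5):
    """Pack thresholded activations into one integer, first value as the high bit."""
    code = 0
    for a in activations:
        code = (code << 1) | (1 if a > threshold else 0)
    return code


def hamming(a, b):
    """Number of bit positions where two codes differ."""
    return bin(a ^ b).count("1")


def shortlist(query_code, doc_codes, k):
    """Return the ids of the k documents closest to the query in Hamming distance."""
    ranked = sorted(doc_codes, key=lambda doc_id: hamming(query_code, doc_codes[doc_id]))
    return ranked[:k]


doc_codes = {"x": 0b101, "y": 0b010, "z": 0b100}  # hypothetical 3-bit codes
query = to_code([0.9, 0.1, 0.8])
nearest = shortlist(query, doc_codes, k=2)
```

A shortlist found this way could then be re-ranked with the longer real-valued representations that the abstract mentions.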
DeepTest: automated testing of deep-neural-network-driven autonomous cars | the morning paper
In this paper, we design, implement and evaluate DeepTest, a systematic testing tool for automatically detecting erroneous behaviors of DNN-driven vehicles that can potentially lead to fatal crashes. First, our tool is designed to automatically generate test cases leveraging real-world changes in driving conditions like rain, fog, lighting conditions, etc. DeepTest systematically explores different parts of the DNN logic by generating test inputs that maximize the number of activated neurons. DeepTest found thousands of erroneous behaviors under different realistic driving conditions (e.g., blurring, rain, fog, etc.), many of which lead to potentially fatal crashes in three top-performing DNNs in the Udacity self-driving car challenge.
DNN  testing 
3 days ago
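The DeepTest abstract above describes generating test inputs that maximize the number of activated neurons. A minimal sketch of that idea treats each candidate input as a list of neuron activations and greedily keeps the candidates that activate neurons not yet covered. The names `activated` and `greedy_select`, the activation threshold, and the toy activation values are assumptions; the tool's actual search procedure is not given in the abstract.

```python
def activated(activations, threshold=0.0):
    """Indices of neurons whose activation exceeds the threshold."""
    return {i for i, a in enumerate(activations) if a > threshold}


def greedy_select(candidates, threshold=0.0):
    """Keep each candidate input that activates at least one new neuron.

    candidates: list of (name, activations) pairs, one per transformed input.
    Returns the kept names and the set of covered neuron indices.
    """
    covered = set()
    kept = []
    for name, acts in candidates:
        new = activated(acts, threshold) - covered
        if new:
            kept.append(name)
            covered |= new
    return kept, covered


# Hypothetical activations of a 3-neuron layer under three image transformations.
candidates = [
    ("rain", [1.0, 0.0, 0.0]),
    ("fog", [1.0, 0.0, 0.0]),
    ("blur", [0.0, 2.0, 0.0]),
]
kept, covered = greedy_select(candidates)
coverage = len(covered) / 3  # fraction of neurons activated by the kept inputs
```

Here the fog input is discarded because it activates only a neuron that the rain input already covers.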