natural-language-processing   77

thinkroth/Sentimental
Sentiment analysis tool for node.js based on the AFINN-111 wordlist.
javascript  node  natural-language-processing 
7 weeks ago by vailripper
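For readers unfamiliar with wordlist-based sentiment scoring, here is a minimal sketch of the general idea in Python. It is not Sentimental's actual API: the function name `score`, the shape of the returned dictionary, and the tiny `TOY_LEXICON` are all assumptions made for illustration, and the valence numbers are placeholders rather than values copied from AFINN-111 (which assigns integer scores from -5 to +5).

```python
import re

# Hypothetical miniature lexicon in the AFINN style: word -> integer valence.
# Values are illustrative placeholders, not taken from AFINN-111.
TOY_LEXICON = {
    "good": 3, "great": 3, "happy": 3,
    "bad": -3, "terrible": -3, "sad": -2,
}

def score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every token found in the lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [(tok, lexicon[tok]) for tok in tokens if tok in lexicon]
    total = sum(value for _, value in hits)
    return {
        "score": total,
        # Average valence per token, so long and short texts are comparable.
        "comparative": total / len(tokens) if tokens else 0.0,
        "words": [tok for tok, _ in hits],
    }

if __name__ == "__main__":
    print(score("A great day with happy people"))
```

Because the score is a plain sum, negation ("not good") and sarcasm are ignored; that is a known limit of this family of methods.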
A Picture of Language - NYTimes.com
"The book was enormously popular, and Mr. Reed and Mr. Brainerd’s diagramming swept through American schools like a refreshing breeze. By the latter half of the 19th century, chalkboards had become increasingly common in classrooms; for students, the impact of watching a sentence take shape on that large surface as a comprehensible, often elegant, and sometimes downright ingenious drawing must have been significant. It’s hard to believe anyone but the most dedicated pedant could have actually enjoyed parsing, but plenty of students — including me — loved diagramming.

A century and a half later, diagramming sentences is even more out of date than writing lessons on a piece of slate. When the book I wrote about it was published in 2006, a couple of hundred people sent me e-mails. One writer accused me of succumbing to Stockholm syndrome because I wrote so benignly about the nun who brainwashed me into thinking diagramming was fun. Another asked me for a date. Two objected to my political attitudes, as they deduced them between the lines. A dozen or so either faulted some of the diagrams or challenged me with a particularly tricky sentence."
grammar  pedagogy  styles-of-thinking  sentence-diagrams  mathematical-recreations  natural-language-processing  it-was-fun 
8 weeks ago by Vaguery
[1112.6045] Comparing intermittency and network measurements of words and their dependency on authorship
Many features from texts and languages can now be inferred from statistical analyses using concepts from complex networks and dynamical systems. In this paper we quantify how topological properties of word co-occurrence networks and intermittency (or burstiness) in word distribution depend on the style of authors. Our database contains 40 books from 8 authors who lived in the 19th and 20th centuries, for which the following network measurements were obtained: clustering coefficient, average shortest path lengths, and betweenness. We found that the two factors with stronger dependency on the authors were the skewness in the distribution of word intermittency and the average shortest paths. Other factors such as the betweenness and the Zipf's law exponent show only weak dependency on authorship. Also assessed was the contribution from each measurement to authorship recognition using three machine learning methods. The best performance was a ca. 65% accuracy upon combining complex network and intermittency features with the nearest neighbor algorithm. From a detailed analysis of the interdependence of the various metrics it is concluded that the methods used here are complementary for providing short- and long-scale perspectives of texts, which are useful for applications such as identification of topical words and information retrieval.
natural-language-processing  document-clustering  clustering  feature-selection  algorithms  nudge-targets 
january 2012 by Vaguery
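A rough Python sketch of two of the quantities named in the abstract above may help make them concrete. The definitions here are assumptions, not necessarily the paper's: the co-occurrence network simply links adjacent words, the clustering coefficient is the usual local one, and intermittency is taken as the coefficient of variation of the gaps between successive occurrences of a word. All function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def cooccurrence_graph(tokens):
    """Undirected graph linking each word to its immediate neighbours."""
    graph = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def clustering_coefficient(graph, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    neighbours = list(graph[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbours[j] in graph[neighbours[i]]
    )
    return 2.0 * links / (k * (k - 1))

def intermittency(tokens, word):
    """Std/mean of the gaps between occurrences (assumed definition)."""
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if not gaps:
        return 0.0
    return pstdev(gaps) / mean(gaps)

if __name__ == "__main__":
    text = "x x x y y y y y y x".split()
    print(intermittency(text, "x"))
```

Under this definition an evenly spaced word scores 0, while a word that appears in clumps scores higher.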
[1110.1391] A Comparison of Different Machine Transliteration Models
"Machine transliteration is a method for automatically converting words in one language into phonetically equivalent ones in another language. Machine transliteration plays an important role in natural language applications such as information retrieval and machine translation, especially for handling proper nouns and technical terms. Four machine transliteration models -- grapheme-based transliteration model, phoneme-based transliteration model, hybrid transliteration model, and correspondence-based transliteration model -- have been proposed by several researchers. To date, however, there has been little research on a framework in which multiple transliteration models can operate simultaneously. Furthermore, there has been no comparison of the four models within the same framework and using the same data. We addressed these problems by 1) modeling the four models within the same framework, 2) comparing them under the same conditions, and 3) developing a way to improve machine transliteration through this comparison. Our comparison showed that the hybrid and correspondence-based models were the most effective and that the four models can be used in a complementary manner to improve machine transliteration performance."
natural-language-processing  machine-learning  review  nudge-targets 
october 2011 by Vaguery
[1106.5264] Acquiring Correct Knowledge for Natural Language Generation
"Natural language generation (NLG) systems are computer software systems that produce texts in English and other human languages, often from non-linguistic input data. NLG systems, like most AI systems, need substantial amounts of knowledge. However, our experience in two NLG projects suggests that it is difficult to acquire correct knowledge for NLG systems; indeed, every knowledge acquisition (KA) technique we tried had significant problems. In general terms, these problems were due to the complexity, novelty, and poorly understood nature of the tasks our systems attempted, and were worsened by the fact that people write so differently. This meant in particular that corpus-based KA approaches suffered because it was impossible to assemble a sizable corpus of high-quality consistent manually written texts in our domains; and structured expert-oriented KA techniques suffered because experts disagreed and because we could not get enough information about special and unusual cases to build robust systems. We believe that such problems are likely to affect many other NLG systems as well. In the long term, we hope that new KA techniques may emerge to help NLG system builders. In the shorter term, we believe that understanding how individual KA techniques can fail, and using a mixture of different KA techniques with different strengths and weaknesses, can help developers acquire NLG knowledge that is mostly correct."
natural-language-processing  artificial-intelligence  interesting-problems  high-hanging-fruit  machine-learning  nudge-targets 
october 2011 by Vaguery
[1107.1322] Text Classification: A Sequential Reading Approach
"We propose to model the text classification process as a sequential decision process. In this process, an agent learns to classify documents into topics while reading the document sentences sequentially and learns to stop as soon as enough information was read for deciding. The proposed algorithm is based on a modelisation of Text Classification as a Markov Decision Process and learns by using Reinforcement Learning. Experiments on four different classical mono-label corpora show that the proposed approach performs comparably to classical SVM approaches for large training sets, and better for small training sets. In addition, the model automatically adapts its reading process to the quantity of training information provided."
text-classification  natural-language-processing  machine-learning  nudge-targets 
august 2011 by Vaguery
Weka 3 - Data Mining with Open Source Machine Learning Software in Java
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.

Weka is open source software issued under the GNU General Public License.
java  machine-learning  foss  data-mining  NLP  natural-language-processing  algorithms 
may 2011 by approximatelylinear
ashleyw/phrasie - GitHub
Determines important terms within a given piece of content. It uses linguistic tools such as Parts-Of-Speech (POS) and some simple statistical analysis to determine the terms and their strength.
Ruby  library  tagging  natural-language-processing  NLP  statistics  text-mining 
may 2011 by Vaguery
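As a hedged illustration of the "simple statistical analysis" half of that description, the Python sketch below ranks terms by raw frequency. It is not phrasie's algorithm or API: the part-of-speech step is replaced by a small, hypothetical stopword filter, and `key_terms` and `STOPWORDS` are names invented for this example.

```python
import re
from collections import Counter

# Hypothetical stopword list standing in for a real part-of-speech filter.
STOPWORDS = {
    "the", "a", "an", "of", "and", "to", "in", "is", "it",
    "that", "for", "on", "with", "as", "are",
}

def key_terms(text, top_n=3):
    """Return the top_n most frequent non-stopword terms with their counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        tok for tok in tokens if tok not in STOPWORDS and len(tok) > 2
    )
    return counts.most_common(top_n)

if __name__ == "__main__":
    sample = (
        "The parser reads the sentence. The parser tags the sentence, "
        "and the parser stops."
    )
    print(key_terms(sample, top_n=2))
```

A real tagger would also let the extractor keep multi-word noun phrases, which this frequency count cannot do.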
