mjaniec + text_mining   21

R Text Data Compilation
Repository of textual data sets and packages for text mining/NLP in R
R  text_mining  NLP 
12 weeks ago by mjaniec
[2018-09-09] An Introduction to Text Processing and Analysis with R
Dealing with text is typically not even considered in the applied statistical training of most disciplines. This is in direct contrast with how often it has to be dealt with prior to more common analysis, or how interesting it might be to have text be the focus of analysis. This document and corresponding workshop will aim to provide a sense of the things one can do with text, and the sorts of analyses that might be useful.

text_mining  R 
may 2019 by mjaniec
Tensorflow - Vector Representations of Words
Vector space models (VSMs) represent (embed) words in a continuous vector space where semantically similar words are mapped to nearby points ('are embedded nearby each other'). VSMs have a long, rich history in NLP, but all methods depend in some way or another on the Distributional Hypothesis, which states that words that appear in the same contexts share semantic meaning. The different approaches that leverage this principle can be divided into two categories: count-based methods (e.g. Latent Semantic Analysis), and predictive methods (e.g. neural probabilistic language models).
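The two method families named above can be contrasted in code. Below is a minimal sketch of the count-based route: build a word-word co-occurrence matrix from a toy corpus and factor it with truncated SVD, the core move of Latent Semantic Analysis. The corpus, window size, and dimensionality are invented for illustration.

```python
# Count-based embeddings: co-occurrence counts + truncated SVD (LSA-style).
import numpy as np

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +/-2 word window.
window = 2
C = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                C[idx[w], idx[sent[j]]] += 1

# Truncated SVD turns sparse counts into dense low-dimensional vectors.
U, S, Vt = np.linalg.svd(C)
k = 2
embeddings = U[:, :k] * S[:k]           # one k-dimensional vector per word
print(embeddings.shape)                 # (vocab_size, k)
```

Words that co-occur with similar neighbours (here, "cat" and "dog") end up near each other in the reduced space, which is exactly the Distributional Hypothesis at work.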
text_mining  word2vec  TensorFlow  Google  Python 
may 2019 by mjaniec
A Beginner's Guide to Word2Vec and Neural Word Embeddings
Word2vec is a two-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep nets can understand.
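As a sketch of the idea (not the optimized word2vec tool itself), here is the two-layer skip-gram network in NumPy: layer one is an embedding lookup, layer two a softmax over the vocabulary, trained to predict each center word's context words. The corpus, dimensions, and learning rate are illustrative.

```python
# Minimal skip-gram word2vec: a two-layer net trained with plain SGD.
import numpy as np

rng = np.random.default_rng(0)
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                          # vocabulary size, embedding dim

# (center, context) training pairs from a +/-1 word window.
pairs = [(idx[corpus[i]], idx[corpus[j]])
         for i in range(len(corpus))
         for j in (i - 1, i + 1) if 0 <= j < len(corpus)]

W_in = rng.normal(0, 0.1, (V, D))             # layer 1: word embeddings
W_out = rng.normal(0, 0.1, (D, V))            # layer 2: output weights

for _ in range(50):
    for center, context in pairs:
        h = W_in[center]                      # hidden layer = embedding lookup
        scores = h @ W_out
        p = np.exp(scores - scores.max())
        p /= p.sum()                          # softmax over the vocabulary
        grad = p.copy()
        grad[context] -= 1.0                  # softmax cross-entropy gradient
        g_in = W_out @ grad                   # compute both grads before updating
        W_out -= 0.05 * np.outer(h, grad)
        W_in[center] -= 0.05 * g_in

print(W_in[idx["fox"]].shape)                 # each word is now an 8-dim vector
```

After training, the rows of `W_in` are the feature vectors the excerpt describes: a numerical form of the text that downstream deep nets can consume.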
text_mining  word2vec  deep_learning 
may 2019 by mjaniec
medium 20180512 - Word2Vec — a baby step in Deep Learning but a giant leap towards Natural Language Processing
Word2Vec model is used for learning vector representations of words called “word embeddings”. This is typically done as a preprocessing step, after which the learned vectors are fed into a discriminative model (typically an RNN) to generate predictions and perform all sort of interesting things.
text_mining  word2vec 
may 2019 by mjaniec
R wrapper to Google's word2vec. The word2vec tool takes a text corpus as input and produces word vectors as output. It first constructs a vocabulary from the training text data and then learns vector representations of words.

text_mining  R  Google 
may 2019 by mjaniec
R and Python text analysis packages performance comparison
The performance gain of quanteda’s new architecture became apparent in the head-to-head comparison with gensim. Quanteda’s execution time is around 50% shorter, and peak memory consumption is 40% smaller than gensim.
text_mining  R  Python  quanteda  gensim 
may 2019 by mjaniec
FT 20131114 - IBM’s Watson artificial intelligence to get new homes with rivals
IBM has opened up its Watson artificial intelligence technology for use by other software developers, potentially paving the way for a new generation of online services capable of responding to their users in smarter and more natural ways.

In a move that could bring the hardcore computing science to mass consumer markets, IBM said that other developers would be able to “plug into” Watson, drawing on its analytical power to inject an extra level of intelligence into their own applications.

Echoing the “Intel inside” branding that turned the chipmaker into a household name, applications that use IBM’s most advanced cognitive computing technology will be branded as “Powered by Watson”.

Users would pay for the technologies based on the number of queries they put through Watson and the amount of storage they used in the system.
Watson  IBM  AI  text_mining  natural_language_processing 
november 2013 by mjaniec
Forbes 20130310 - Google's 4 Biggest Technical Challenges, According To Search Guru Amit Singhal - SXSW
Those four technical challenges are: the knowledge graph, speech recognition, natural language understanding and (understanding) conversation, he said. These four areas are technical problems that, despite Google’s improvements, are still “not solved,” Singhal said.

The knowledge graph is about “understanding the world as you and I do”–in other words, the connections between things and ideas and how they relate to each other, which underlies Google’s core search focus. Speech recognition is translating the human voice into text, which is key to things like searching by voice. Natural language is understanding the nuances of language, which allows the conversion of voice transcription into meaningful information. Conversation is related to natural language.

Singhal, who joined Google 12 years ago, emphasized a vision of search based on the sci-fi vision in Star Trek. In the show, Captain Kirk would ask the computer any question and the computer would spit out an answer. Singhal pointed to Google Now, which is designed to send users information before they even search for it, such as flight delay information, or when someone should leave for a meeting, taking into account traffic.

“It should tell you things when you don’t ask it. If your flight is delayed you shouldn’t have to ask what’s the status. It should just know. Or you have a meeting an hour away and there’s bad traffic. Google should tell you, you’d better leave now. Our vision of Google is things you need to know just come to you… Our dream is for search to become the Star Trek computer. That’s what we’re building today.”
Google  Knowledge_Graph  text_mining  AI  semantics  Amit_Singhal 
march 2013 by mjaniec
PCWorld 20130205 - IBM puts supercomputer Watson's smarts in SMB servers
The company's new Power Express servers announced on Tuesday will integrate some hardware and software elements derived from Watson. The servers start at US$5,947, and IBM is targeting the new products at businesses with over 100 employees.

The new Power Express 710, 720, 730 and 740 servers include IBM's Power7+ chips, which were introduced in October. By lowering the price of the servers, IBM hopes to take on rivals like Hewlett-Packard and Dell, which sell large volumes of commodity servers based on x86 chips.

With Watson technologies, companies can use the new servers to analyze warehouses of data, and to answer complex queries with high levels of confidence. The technologies will provide insights into structured and unstructured data at a cheaper cost, said Steve Sibley, director of Power Systems offering management at IBM.

Watson used advanced algorithms and a natural language interface to answer questions on Jeopardy, but not all advanced technologies will make it to the new entry-level servers. Some common features such as the core customized software to analyze warehouses of data will be available depending on the price, configuration and target market.

* * *

note: "Ninety percent of the world’s data was created in the last two years and 80 percent of that is unstructured — a gold mine for those seeking to make breakthroughs in “big data” research."
IBM  Watson  Power_Express  Power7  text_mining  big_data 
february 2013 by mjaniec
Bloomberg 20130111 - DCM Capital Starts Trading Venue Mining Twitter Sentiment
DCM Capital Ltd. plans to open a trading venue that it says will be the first to offer analysis of posts on social networks such as Twitter Inc. and Facebook Inc. (FB) to help traders gauge market mood.

The spread-betting platform, called DCM Dealer, will include a feature that calculates a sentiment rating for individual stocks, indexes and commodities when it opens next week, Paul Hawtin, founder and chief executive officer of DCM Capital, said in a phone interview today. An algorithm gives securities a rating from zero to 100, where the top end indicates the most positive feelings from social media.

“Financial markets are driven by greed and fear but we’ve never had a way to quantify emotion before,” Hawtin said. “We monitor 350 million tweets daily, 2.5 billion weekly. So now investors can monitor investors’ sentiment and human emotion on specific instruments in real time.”
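A toy version of the 0-to-100 rating described above: score each post against a small polarity lexicon and map the average polarity from [-1, 1] onto [0, 100]. The lexicon and posts are invented; the article does not describe DCM's actual algorithm.

```python
# Toy social-media sentiment rating on a 0-100 scale.
POSITIVE = {"bullish", "buy", "great", "up", "beat"}
NEGATIVE = {"bearish", "sell", "bad", "down", "miss"}

def post_polarity(post):
    """Polarity of one post in [-1, 1] from lexicon hits."""
    words = post.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

def sentiment_rating(posts):
    """Average polarity mapped from [-1, 1] onto [0, 100]."""
    mean = sum(post_polarity(p) for p in posts) / len(posts)
    return 50 * (mean + 1)

posts = ["AAPL looking bullish buy", "great earnings beat", "might sell soon"]
print(sentiment_rating(posts))
```

A real system would of course need far more than lexicon matching (negation, sarcasm, entity linking), but the scaling to a 0-100 rating is the same idea.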
social_networking  Twitter  Facebook  text_mining  DCM_Capital  Derwen_Capital_Markets 
january 2013 by mjaniec
ZeroHedge 20120831 - Algos Set New Speed Reading Record: 4549 Words In 20 Milliseconds
The market is indeed a discounting mechanism it appears. In a mere 20 milliseconds, the world's 'traders' had managed to read Bernanke's 4549-word script, interpret it (as bearish in this case - which apparently is wrong now?) and start to sell down the major equity indices.

As Nanex points out, not only was the reaction lightning fast (actually faster than lightning) but it occurred in their newly created 'fantaseconds' as trades were timestamped 'before' the bids and offers were even seen in the data-feed.
high_frequency_trading  text_mining  algorithmic_trading 
september 2012 by mjaniec
TechReview 20120614 - Google's New Brain Could Have a Big Impact
The Knowledge Graph is already starting to appear in a few other Google products, and could be used to add intelligence to all of the company's software.

Knowledge Graph has already been plugged into YouTube, where it is being used to organize videos by topic and to suggest new videos to users, based on what they just watched. It could also be used to connect and recommend news articles based on the specific facts mentioned in stories.

"Search was mostly based on matching words and phrases, and not what they actually mean," says Shashidar Thakur, the tech lead for the Knowledge Graph in Google's search team. Thakur says the project was invented to change that.

When a person searches on Google, the conventional results are based on algorithms that look for matches with the terms rather than the meaning of the information entered into the search box. Google's algorithms first refer to data from past searches to determine which words in the query string are most likely to be important (based on how often they have been used by previous searchers). Next, software accesses a list of Web pages known to contain information related to those terms—known as reverse indexes. Finally, another calculation is used to rank the results shown to the searcher.
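The “reverse index” step can be sketched as a toy inverted index: map each term to the documents containing it, then rank candidate pages by how many query terms they match. The documents and the one-point-per-term scoring are invented simplifications of the pipeline described above.

```python
# Toy inverted ("reverse") index with naive term-overlap ranking.
from collections import defaultdict

docs = {
    1: "google search ranks web pages",
    2: "the knowledge graph connects facts",
    3: "search algorithms rank pages by relevance",
}

index = defaultdict(set)                # term -> ids of docs containing it
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    """Rank documents by number of query terms matched."""
    scores = defaultdict(int)
    for term in query.split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1         # one point per matched term
    return sorted(scores, key=scores.get, reverse=True)

print(search("search pages"))           # docs 1 and 3 match both terms
```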

Thakur says that his current priority is to find ways to use the Knowledge Graph to answer more complex questions—some of which seem similar to those tackled by "knowledge engine" Wolfram Alpha. "Right now what we have is answering questions about entities, but there are harder questions," he says. "For example: 'volcanoes that exploded in the eighteenth century,' or 'movies based on books.'"

The Knowledge Graph can be thought of as a vast database that allows Google's software to connect facts on people, places, and things to one another. Google got the Knowledge Graph project started when it bought a startup called Metaweb in 2010; at that time, the resource contained only 12 million entries. Today it has more than 500 million entries, with more than 3.5 billion (3.5*10^9) links between them.
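The facts-and-links structure can be sketched as (subject, relation, object) triples with a query helper, including a toy version of the “volcanoes that exploded in the eighteenth century” question mentioned above. The triples are illustrative, not drawn from the actual Knowledge Graph.

```python
# A knowledge graph in miniature: facts as (subject, relation, object) triples.
TRIPLES = [
    ("Vesuvius", "type", "volcano"),
    ("Vesuvius", "erupted_in", "1794"),
    ("Etna", "type", "volcano"),
    ("Etna", "erupted_in", "1766"),
    ("Blade Runner", "based_on", "Do Androids Dream of Electric Sheep?"),
]

def query(relation, obj):
    """Entities linked to `obj` by `relation`."""
    return [s for s, r, o in TRIPLES if r == relation and o == obj]

def erupted_in_century(century):
    """Toy version of 'volcanoes that erupted in the 18th century'."""
    volcanoes = set(query("type", "volcano"))
    return sorted(s for s, r, o in TRIPLES
                  if r == "erupted_in" and s in volcanoes
                  and (int(o) - 1) // 100 + 1 == century)

print(erupted_in_century(18))
```

Answering such a question requires joining two kinds of links (entity type and eruption date), which is exactly why connections between facts, not keyword matches, are the point of the graph.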

[note: Each of the 10^11 (one hundred billion) neurons in human brain has on average 7,000 synaptic connections to other neurons. It has been estimated that the brain of a three-year-old child has about 10^15 synapses (1 quadrillion). This number declines with age, stabilizing by adulthood. Estimates vary for an adult, ranging from 10^14 to 5 x 10^14 synapses (100 to 500 trillion).

source: http://en.wikipedia.org/wiki/Neuron#Connectivity]

Kingsley Idehen, founder of semantic technology company OpenLink Software, says that Knowledge Graph is not really helping advance the semantic Web because it is not openly accessible—despite being compiled using open data sources such as Wikipedia and Freebase. If Google were to open up its Knowledge Graph for others to use, then the Web as a whole could get much smarter.
Google  Knowledge_Graph  AI  semantics  Shashidar_Thakur  brain  text_mining 
june 2012 by mjaniec
ExtremeTech 20120522 - Google’s Knowledge Graph: Wikipedia on steroids, or the beginning of the end for the web?
Google’s Knowledge Graph is brand new — “only” containing an estimated 3.5 billion facts about 500 million objects — but of course it will grow as rapidly as the Googleplex can organize additional information. Even now it is a powerful tool for those who want quick answers, and don’t like wasting their time surfing to get them — loosely described as semantic search. For comparison, Wikipedia currently has less than 30 million pages.

Once Google has effectively tied its Knowledge Graph into its digitized library of almost every book on the planet and scraped the contents of the semantic web into its Googleplex, it will have a practical monopoly on access to many kinds of information.
Google  Knowledge_Graph  AI  text_mining 
may 2012 by mjaniec
PC World 20120517 - Google Knowledge Graph: The Birth of a Siri Rival?
In a blog post, Google's Amit Singhal dropped a strong hint that there's more to Knowledge Graph than meets the eye:

“We’re proud of our first baby step—the Knowledge Graph—which will enable us to make search more intelligent, moving us closer to the ‘Star Trek computer’ that I’ve always dreamt of building.”

“If [Siri]’s Star Wars, you have these robot personalities like C-3PO who runs around and he tries to do stuff for you, messes up and makes jokes, he’s kind of a comic relief guy. Our approach is more like Star Trek, right, starship Enterprise; every piece of computing surface, everything is voice-aware. It’s not that there’s a personality, it doesn’t have a name, it’s just 'Computer.'”

Add these comments to the rumors that Google is building a virtual assistant codenamed Majel--named after the wife of late Star Trek creator Gene Roddenberry--and it's easy to speculate where Google is going.

Google  Google_X  Knowledge_Graph  Majel  AI  Siri  text_mining 
may 2012 by mjaniec
Bloomberg 20120306 - IBM’s Watson Computer Gets a Wall Street Job
International Business Machines Corp. (IBM)’s Watson computer, which beat champions of the quiz show “Jeopardy!” a year ago, will soon be advising Wall Street on risks, portfolios and clients.

IBM plans to use Watson in financial services “mostly for portfolio risk management, they’re not going to do stock picking,” CLSA’s Maguire said in a Feb. 17 phone interview. “They think that Watson can make a difference.” Still, Watson isn’t perfect. It is weak in languages other than English, and its processing of social media streams from platforms including Facebook and Twitter can be sluggish.

In addition to Citigroup, Armonk, New York-based IBM has been working with financial institutions teaching Watson the language of Wall Street, and adding content including regulatory announcements, news and social media feeds.

Watson the financial assistant will be delivered as a cloud-based service and earn a percentage of the additional revenue and cost savings it is able to help financial institutions realize.
IBM  Watson  Citigroup  risk_management  AI  text_mining  DeepQA 
march 2012 by mjaniec
S.R.K. Branavan et al. - Learning to Win by Reading Manuals in a Monte-Carlo Framework
This paper presents a novel approach for leveraging automatically extracted textual knowledge to improve the performance of control applications such as games. Our ultimate goal is to enrich a stochastic player with high-level guidance expressed in text. Our model jointly learns to identify text that is relevant to a given game state in addition to learning game strategies guided by the selected text. Our method operates in the Monte-Carlo search framework, and learns both text analysis and game strategies based only on environment feedback. We apply our approach to the complex strategy game Civilization II using the official game manual as the text guide. Our results show that a linguistically-informed game-playing agent significantly outperforms its language-unaware counterpart, yielding a 27% absolute improvement and winning over 78% of games when playing against the built-in AI of Civilization II.
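The core Monte-Carlo search loop (without the paper's text-conditioned rollout policy) can be sketched as: evaluate each candidate action by averaging the outcomes of many random rollouts, then pick the best. The actions and their hidden win probabilities below are invented stand-ins for actual game play-outs.

```python
# Flat Monte-Carlo action selection: average rollout outcomes per action.
import random

random.seed(0)

def rollout(action):
    # Stand-in for playing the game to completion after `action`;
    # here each action simply has a hidden win probability.
    win_prob = {"attack": 0.3, "build": 0.6, "research": 0.5}[action]
    return 1.0 if random.random() < win_prob else 0.0

def monte_carlo_choose(actions, n_rollouts=2000):
    """Pick the action with the highest mean rollout outcome."""
    means = {a: sum(rollout(a) for _ in range(n_rollouts)) / n_rollouts
             for a in actions}
    return max(means, key=means.get)

print(monte_carlo_choose(["attack", "build", "research"]))
```

The paper's contribution sits inside `rollout`: instead of acting randomly, the rollout policy is biased by sentences from the manual judged relevant to the current game state.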
MIT  machine_learning  monte_carlo_simulation  Monte_Carlo_search_framework  Markov_Decision_Process  text_mining  neural_networks  filetype:pdf  media:document 
july 2011 by mjaniec
MIT news 20110712 - Computer learns language by playing games
The system begins with virtually no prior knowledge about the task it’s intended to perform or the language in which the instructions are written. It has a list of actions it can take, like right-clicks or left-clicks, or moving the cursor; it has access to the information displayed on-screen; and it has some way of gauging its success, like whether the software has been installed or whether it wins the game. But it doesn’t know what actions correspond to what words in the instruction set, and it doesn’t know what the objects in the game world represent. 

So initially, its behavior is almost totally random. But as it takes various actions, different words appear on screen, and it can look for instances of those words in the instruction set. It can also search the surrounding text for associated words, and develop hypotheses about what actions those words correspond to. Hypotheses that consistently lead to good results are given greater credence. 
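The credence-updating loop can be sketched as follows; the words, actions, reward rule, and update step are all invented stand-ins for the system's actual learning procedure.

```python
# Hypotheses about word -> action mappings, weighted by feedback.
import random

random.seed(1)
words = ["click", "install"]
actions = ["left_click", "right_click"]
# The (unknown) true mapping that the environment rewards.
TRUE = {"click": "left_click", "install": "right_click"}

# Start with equal credence in every word -> action hypothesis.
credence = {(w, a): 0.5 for w in words for a in actions}

for _ in range(200):
    w = random.choice(words)
    # Act on the currently most credible hypothesis (random tie-break).
    a = max(actions, key=lambda act: (credence[(w, act)], random.random()))
    reward = 1 if TRUE[w] == a else -1
    # Hypotheses that lead to good results gain credence; bad ones lose it.
    credence[(w, a)] += 0.1 * (reward - credence[(w, a)])

best = {w: max(actions, key=lambda act: credence[(w, act)]) for w in words}
print(best)
```

Because wrong guesses only ever lose credence while correct ones gain it, the initially random behavior converges on the true mapping, which is the gist of the learning dynamic described above.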
machine_learning  AI  text_mining  MIT 
july 2011 by mjaniec
Product Named Entity Recognition Based on Hierarchical Hidden Markov Model
This paper presented a hierarchical HMM (hidden Markov model) based approach to product named entity recognition from Chinese free text. By unifying some heuristic rules into a statistical framework based on a mixture model of HHMM, the approach we proposed can leverage a diverse range of linguistic features and knowledge sources to make probabilistically reasonable decisions for a global optimization. The prototype system we built achieved overall F-measures of 79.7%, 86.9% and 75.8% for PRO, TYP and BRA respectively, which also provides experimental evidence, to some extent, of its portability to different domains.

Product named entity recognition involves the identification of product-related proper names in free text and their classification into different kinds of product named entities, referred to as PRO, TYP and BRA in this paper. In comparison with general NER, nested product NEs should be tagged separately rather than being tagged just as a single item.
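A plain (non-hierarchical) HMM tagger over a toy version of this tag set, decoded with Viterbi, illustrates the statistical core; all probabilities below are invented, and the paper's actual model is a hierarchical HMM with far richer features.

```python
# Toy HMM product-NE tagger: Viterbi decoding over BRA/TYP/O tags.
import numpy as np

states = ["BRA", "TYP", "O"]                      # brand, product type, other
obs_vocab = {"acme": 0, "laptop": 1, "the": 2}

start = np.array([0.4, 0.2, 0.4])                 # P(tag at t=0)
trans = np.array([[0.1, 0.7, 0.2],                # P(tag_t | tag_{t-1})
                  [0.1, 0.2, 0.7],
                  [0.5, 0.2, 0.3]])
emit = np.array([[0.8, 0.1, 0.1],                 # P(word | tag)
                 [0.1, 0.8, 0.1],
                 [0.1, 0.1, 0.8]])

def viterbi(words):
    """Most probable tag sequence for `words` under the toy HMM."""
    obs = [obs_vocab[w] for w in words]
    T, N = len(obs), len(states)
    delta = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    delta[0] = start * emit[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * trans    # N x N transition scores
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * emit[:, obs[t]]
    path = [int(delta[-1].argmax())]              # backtrack the best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [states[i] for i in reversed(path)]

print(viterbi(["the", "acme", "laptop"]))         # ['O', 'BRA', 'TYP']
```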
text_mining  information_extraction  named_entity_recognition  HMM  filetype:pdf  media:document 
july 2011 by mjaniec
IBM 201104 - IBM Content Analytics software powers Watson to a Jeopardy! win
The same core NLP technology used in Watson is available now to deliver business value. Watson uses the same capabilities found in IBM Content Analytics software to unlock the value embedded in the massive amounts of unstructured information in the many systems and formats you have today.

• Aggregating and extracting content from multiple internal and external sources and types, including enterprise content management (ECM) repositories, structured data, social media, call center logs, research reports, transcripts, email, safety reports and legal contracts

• Organizing, analyzing and visualizing enterprise content (and data) using NLP and other analytics to understand meaning and identify trends, patterns, correlations, anomalies and business context

• Interactively searching and exploring to derive rapid insight by confirming what is suspected or by uncovering something new—all without building models or deploying complex systems
IBM  Watson  Jeopardy  text_mining  filetype:pdf  media:document 
july 2011 by mjaniec
FT 20100716 - Decoding the psychology of trading
From the tens of thousands of newspaper articles, blogs, corporate presentations and Twitter messages being analysed every day, MarketPsy builds a picture of investor feelings about 6,000 companies. When emotions are running high, the hedge fund steps in and trades. MarketPsy is tiny by hedge fund standards, but it is at the cutting edge of behavioural finance, the intellectual offspring of psychology and economics. Like others focused on behaviour, MarketPsy has long since discarded the idea that investors are rational, and tries instead to judge exactly how irrational they are. The answer is simple to explain, if hard to implement: two standard deviations of moves in its measures of emotion is a strong trading signal. When people are deeply gloomy about a stock, it is time to buy; when they are raving about its brilliance, sell. The strategy worked beautifully through the crisis, returning about 45% in its 1st 12 months, and 30% last year. This year it has lost money.
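The two-standard-deviation rule can be sketched as a z-score signal: compare today's sentiment reading with its trailing history and trade contrarian beyond plus or minus two sigma. The sentiment series below is invented.

```python
# Contrarian sentiment signal: trade when emotion is 2+ std devs from normal.
import statistics

def contrarian_signal(history, today, threshold=2.0):
    """'buy' on extreme gloom, 'sell' on extreme euphoria, else 'hold'."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    z = (today - mean) / sd
    if z <= -threshold:
        return "buy"                    # deep gloom about a stock -> buy
    if z >= threshold:
        return "sell"                   # raving about its brilliance -> sell
    return "hold"

history = [0.1, 0.0, 0.2, -0.1, 0.1, 0.0, 0.1, -0.2, 0.0, 0.1]
print(contrarian_signal(history, today=-0.9))
```

The hard part, as the article notes, is not the threshold rule but producing a reliable emotion measure from tens of thousands of daily articles and messages in the first place.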
MarketPsy_Capital  hedge_funds  behavioural_finance  social_networking  algorithmic_trading  text_mining 
july 2010 by mjaniec
