phnk + stats:text-analysis   214

Overcoming the limitations of topic models with a semi-supervised approach
"… we tested a third algorithm called CorEx (short for Correlation Explanation), a semi-supervised topic model that — unlike LDA and NMF — allowed us to provide the model with “anchor words” that represented potential topics we thought the model should attempt to find."
june 2019 by phnk
SocArXiv Papers | Rhetorics of Radicalism
"What rhetorics run throughout radical discourse, and why do some gain prominence over others? The scholarship on radicalism has largely portrayed radical discourse as opposition to powerful ideas and enemies, but radicals often evince great interest in personal and local concerns. To shed new light on how radicals use and adopt rhetoric, we analyze an original corpus of 23,000 pages produced by Afghan radical groups between 1979 and 2001 using a novel computational abductive approach. We first identify how radicalism not only attacks dominant ideas, actors, and institutions using a rhetoric of subversion, but also how it can employ a rhetoric of reversion to urge intimate transformations in morals and behavior. Next, we find evidence that radicals' networks of support affect the rhetorical mixture they adopt. This, we argue, is due to social ties drawing radicals into encounters with backers' social domains. Our study advances a relational understanding of radical discourse, while also demonstrating how a combination of computational and abductive methods can help theorize and analyze discourses of contention."

Full repl. archive (data, fractional logit models in Stata, topic models in R, etc.) at
r  stata  stats:text-analysis 
may 2019 by phnk
Fighting from the Pulpit: Religious Leaders and Violent Conflict in Israel - Michael Freedman, 2019
Full repl. archive with lots of Web scrapers (R and Python), text preparation and topic modelling.
stats:text-analysis  r  python  world:israel-palestine  religion 
may 2019 by phnk
Thread by @PolicyAuckland: "Learning about text analysis from @kenbenoit Lesson 1. Text is fundamentally qualitative Lesson 2: Lesson 3. Bag of words is a (useful) fict […]"
"Lesson 1. Text is fundamentally qualitative
Lesson 2: Unsupervised methods should be used with great caution
Lesson 3. Bag of words is a (useful) fiction
Lesson 4. Trust no-one. Especially yourself
Lesson 5. Work cooperatively with others
Lesson 6. Making mistakes is good - as long as you fix them
Lesson 7. The hardest part of QTA is getting texts into the computer
Lesson 8. Design is really important
Lesson 9. Software contributions need better recognition
Lesson 10. A project is only as good as its dissemination"
march 2019 by phnk
Speech and Language Processing (3rd ed. draft)
Dan Jurafsky and James H. Martin
Draft chapters in progress, Sep 23, 2018
january 2019 by phnk
Classification Accuracy as a Substantive Quantity of Interest: Measuring Polarization in Westminster Systems | Political Analysis | Cambridge Core
"Measuring the polarization of legislators and parties is a key step in understanding how politics develops over time. But in parliamentary systems—where ideological positions estimated from roll calls may not be informative—producing valid estimates is extremely challenging. We suggest a new measurement strategy that makes innovative use of the “accuracy” of machine classifiers, i.e., the number of correct predictions made as a proportion of all predictions." -- I'm not that convinced: the results show unexpected drops at several points, with no clear explanation.
stats:machine-learning  stats:text-analysis  polisci:parliaments  uk:politics 
january 2019 by phnk
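A minimal sketch of the "accuracy" quantity quoted in the entry above (correct predictions as a proportion of all predictions). The function name and the party labels are hypothetical illustrations, not taken from the paper or its replication code.

```python
def accuracy(predicted, actual):
    """Share of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one prediction is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical party labels for five speeches (illustrative data only).
actual = ["Lab", "Con", "Con", "Lab", "Con"]
predicted = ["Lab", "Con", "Lab", "Lab", "Con"]

score = accuracy(predicted, actual)
print(f"accuracy = {score:.2f}")
```

Under the paper's logic, a classifier that separates parties more easily (higher accuracy) would be read as a sign of more polarized speech; the sketch only shows the arithmetic, not that measurement strategy.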
neubig/nlptutorial: A Tutorial about Programming for Natural Language Processing
"Tutorial 0: Programming Basics
Tutorial 1: Unigram Language Models
Tutorial 2: Bigram Language Models
Tutorial 3: Word Segmentation
Tutorial 4: Part-of-Speech Tagging with Hidden Markov Models
Tutorial 5: The Perceptron Algorithm
Tutorial 6: Advanced Discriminative Training
Tutorial 7: Neural Networks
Tutorial 8: Recurrent Neural Networks
Tutorial 9: Topic Models
Tutorial 10: Phrase Structure Parsing
Tutorial 11: Dependency Parsing
Tutorial 12: Structured Perceptron
Tutorial 13: Search Algorithms
Bonus 1: Kana-Kanji Conversion for Japanese Input" -- Both in English and Japanese. Impressssssive.
january 2019 by phnk
Text Mining with Python
Neal Caren (University of North Carolina, Chapel Hill)
stats:text-analysis  python 
january 2019 by phnk
Structural Topic Models for Open‐Ended Survey Responses - Roberts - 2014 - American Journal of Political Science - Wiley Online Library
"Collection and especially analysis of open‐ended survey responses are relatively rare in the discipline and when conducted are almost exclusively done through human coding. We present an alternative, semiautomated approach, the structural topic model (STM) (Roberts, Stewart, and Airoldi 2013; Roberts et al. 2013), that draws on recent developments in machine learning based analysis of textual data. A crucial contribution of the method is that it incorporates information about the document, such as the author's gender, political affiliation, and treatment assignment (if an experimental study). This article focuses on how the STM is helpful for survey researchers and experimentalists. The STM makes analyzing open‐ended responses easier, more revealing, and capable of being used to estimate treatment effects. We illustrate these innovations with analysis of text from surveys and experiments."
stats:text-analysis  stats:surveys 
june 2018 by phnk
CRAN - Package textfeatures
"A tool for extracting some generic features (e.g., number of words, line breaks, characters per word, URLs, lower case, upper case, commas, periods, exclamation points, etc.) from strings of text."
r  stats:text-analysis 
march 2018 by phnk
CRAN - Package slowraker
"A mostly pure-R implementation of the RAKE algorithm (Rose, S., Engel, D., Cramer, N. and Cowley, W. (2010) <doi:10.1002/9780470689646.ch1>), which can be used to extract keywords from documents without any training data." -- Via R-Views 2017/09, as previous links.
r  stats:text-analysis 
november 2017 by phnk
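For readers unfamiliar with RAKE, mentioned in the slowraker entry above, here is a rough Python sketch of the idea: split text into candidate phrases at stopwords and punctuation, score each word by degree divided by frequency, and sum word scores per phrase. This is not slowraker's code or API; the tiny stopword list and the function names are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumption: a tiny illustrative stopword list; real uses need a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "is", "are", "to",
             "which", "can", "be", "from", "without", "any", "used"}


def candidate_phrases(text):
    """Split text into runs of non-stopwords, never crossing punctuation."""
    phrases = []
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake(text):
    """Return (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # summed length of the phrases containing it
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    scores = {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in set(phrases)
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


ranked = rake("Keyword extraction is used to extract keywords "
              "from documents without any training data.")
for phrase, score in ranked:
    print(f"{score:4.1f}  {phrase}")
```

Because scoring relies only on word co-occurrence within the text itself, no training data is involved, which matches the package description.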
pommedeterresautee/fastrtext: R wrapper for fastText
"fastText is a library for efficient learning of word representations and sentence classification."
r  stats:text-analysis 
november 2017 by phnk
Exploring the World of Philippine Online News
Interesting use of topic models + word2vec to sort out what people are reading about online (on Facebook – data collected via the Graph API).

r  stats:text-analysis  web:facebook 
august 2017 by phnk
BigBang by Datactive and sbenthall
"BigBang is a tool for scientific analysis of open source and internet governance communities.

Functionality included in this release:

- Mailman, W3C, IETF, and ICANN mailing list crawling.
- .mbox parsing and data cleaning, including invalid date detection.
- Preprocessing functionality for viewing archival data as daily activity and interaction networks.
- Entity resolution with Levenshtein edit distance.
- Email threading.
- Analysis examples in IPython Notebooks demonstrating network analysis and cohort visualization."
python  stats:text-analysis 
july 2017 by phnk
performance - Fast Levenshtein distance in R? - Stack Overflow
stringdist for short strings, RecordLinkage for longer ones.
r  stats:text-analysis 
july 2017 by phnk
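The two entries above both rely on Levenshtein edit distance (BigBang for entity resolution, the Stack Overflow thread for speed in R). As a reference point, here is a plain dynamic-programming sketch in Python. It is not how stringdist, RecordLinkage, or BigBang implement it, and the example names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # Keep the shorter string as b so each row stays as small as possible.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Hypothetical sender names, as an entity-resolution step might compare them.
print(levenshtein("Jon Postel", "John Postel"))
print(levenshtein("kitten", "sitting"))
```

Keeping only two rows brings memory down to the length of the shorter string, but running time still grows with the product of the two lengths, which is why optimized libraries matter for long strings.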
christophergandrud/EIUCrisesMeasure: Real-time perceptions of financial market stress measured using kernel PCA
"We create a continuous measure of real-time perceived stress using a kernel principal component analysis (KPCA) of Economist monthly country reports. We demonstrate the usefulness of our measure by showing that it more accurately captures the effect of financial market stress levels on electoral volatility. We also show how KPCA can be used to efficiently summarize large quantities of texts into cross-sectional time-series variables."
r  stats:pca  political-economy  stats:text-analysis 
may 2017 by phnk
CRAN - Package spacyr
"An R wrapper to the 'Python' 'spaCy' 'NLP' library, from <>."
r  stats:text-analysis 
may 2017 by phnk