robincamille + thesis   3

Introducing DREaM | Early Modern Conversions
VARD normalization for all texts - % problematic
thesis 
november 2015 by robincamille
HTRC Portal - About
Extracted Features From the HTRC
Note that this is an alpha data release. Please send feedback to htrc-support-l@list.indiana.edu.

A great deal of fruitful research can be performed using non-consumptive pre-extracted features. For this reason, HTRC has put together a select set of page-level features extracted from the HathiTrust's non-Google-digitized public domain volumes. The source texts for this set of feature files are primarily in English.

Features are notable or informative characteristics of the text. We have processed a number of useful features, including part-of-speech-tagged token counts, header and footer identification, and various line-level information, all provided per page. Providing token information at the page level makes it possible to separate text from paratext (for example, thirty pages of publishers’ ads at the back of a book). We have also broken each page into three parts: header, body, and footer. The specific features that we extract from the text are described in more detail below.
linguistics  thesis 
june 2014 by robincamille
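
The page-level layout described in the HTRC entry above (part-of-speech-tagged token counts, split into header, body, and footer) can be illustrated with a minimal Python sketch. The record layout and the names `pages`, `section_token_counts`, and `volume_body_counts` are assumptions made for illustration; they are not the actual HTRC feature file schema, which the excerpt does not show.

```python
from collections import Counter

# Hypothetical page records: each section maps token -> {POS tag: count}.
# This layout is an assumption for illustration, not the HTRC file format.
pages = [
    {
        "header": {"CHAPTER": {"NN": 1}},
        "body": {"whale": {"NN": 3}, "swim": {"VB": 1, "NN": 1}},
        "footer": {"12": {"CD": 1}},
    },
    {
        "header": {},
        "body": {"whale": {"NN": 2}},
        "footer": {"13": {"CD": 1}},
    },
]


def section_token_counts(page, section):
    """Collapse POS-tagged counts for one section of a page into per-token totals."""
    counts = Counter()
    for token, pos_counts in page.get(section, {}).items():
        counts[token] += sum(pos_counts.values())
    return counts


def volume_body_counts(pages):
    """Sum body-only token counts over all pages, ignoring headers and footers."""
    total = Counter()
    for page in pages:
        total.update(section_token_counts(page, "body"))
    return total


body_totals = volume_body_counts(pages)
```

Because counts are kept per page and per section, running heads, page numbers, or whole pages of advertisements can be dropped before aggregating, which is the text/paratext separation the description mentions.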
