robincamille + thesis 3
Automatic standardisation of texts containing spelling variation: How much training data do you need?
november 2015 by robincamille
Alistair Baron and Paul Rayson
Computing Department
Lancaster University
thesis
Introducing DREaM | Early Modern Conversions
november 2015 by robincamille
VARD normalization for all texts - % problematic
thesis
HTRC Portal - About
june 2014 by robincamille
Extracted Features From the HTRC
Note that this is an alpha data release. Please send feedback to htrc-support-l@list.indiana.edu.
A great deal of fruitful research can be performed using non-consumptive pre-extracted features. For this reason, HTRC has put together a select set of page-level features extracted from the HathiTrust's non-Google-digitized public domain volumes. The source texts for this set of feature files are primarily in English.
Features are notable or informative characteristics of the text. We have processed a number of useful features, including part-of-speech-tagged token counts, header and footer identification, and various line-level information. All of this is provided per page. Providing token information at the page level makes it possible to separate text from paratext (an example of the latter: thirty pages of publishers’ ads at the back of a book). We have also decided to break each page into three parts: header, body, and footer. The specific features that we extract from the text are described in more detail below.
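As a rough illustration of why per-page, per-section counts are useful, the sketch below sums token counts for one section across pages, so that running headers and page numbers stay out of the body totals. The record layout, the `pages` sample data, and the `section_token_counts` helper are all hypothetical; they are not the actual HTRC file schema, which is described on the linked page.

```python
from collections import Counter

# Hypothetical page records (an assumption, not the real HTRC schema):
# each page has three sections, and each section maps a token to its
# part-of-speech counts.
pages = [
    {
        "header": {"CHAPTER": {"NN": 1}},
        "body": {"the": {"DT": 2}, "whale": {"NN": 1}},
        "footer": {"12": {"CD": 1}},
    },
    {
        "header": {},
        "body": {"the": {"DT": 1}, "sea": {"NN": 2}},
        "footer": {"13": {"CD": 1}},
    },
]


def section_token_counts(pages, section="body"):
    """Sum token counts over all pages for one section, ignoring POS tags."""
    totals = Counter()
    for page in pages:
        # A missing section is treated as empty.
        for token, pos_counts in page.get(section, {}).items():
            totals[token] += sum(pos_counts.values())
    return totals


body_counts = section_token_counts(pages, "body")
footer_counts = section_token_counts(pages, "footer")
```

Because the counts are kept per page, the same loop could skip a range of pages (for example, back-matter advertisements) before summing, which is the text/paratext separation the description mentions.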
linguistics
thesis