doccano/doccano · GitHub
doccano is an open source text annotation tool for humans. It provides annotation features for text classification, sequence labeling and sequence to sequence tasks. So, you can create labeled data for sentiment analysis, named entity recognition, text summarization and so on. Just create a project, upload data and start annotating. You can build a dataset in hours.
textprocessing  nlp  annotation 
12 days ago by vrt
Introduction to text manipulation on UNIX-based systems – IBM Developer
This introduction to text manipulation on UNIX platforms provides an overview of some common commands widely available and installed standard on most UNIX-based releases. Many times these standard utilities are ignored in favor of more modern text-processors such as Perl, Python, or Ruby, which are not always installed on a system. An introductory review of these tools helps practitioners who are learning UNIX or Linux or those who may be looking to renew forgotten knowledge.
bash  cli  commandline  shell  unix  textprocessing  textmanipulation  guide  text 
13 days ago by justusthane
Paste to Markdown
Uses Turndown (JavaScript library). Good for Mac where 'pbpaste' can't be coerced into outputting raw HTML in a pipe to, say, 'pandoc'.
html  rtf  markdown  conversion  textprocessing  webapp  utility  solution 
5 weeks ago by kme
Useful Unix commands for data science
via: http://johnkerl.org/miller/doc/originality.html
Imagine you have a 4.2GB CSV file. It has over 12 million records and 50 columns. All you need from this file is the sum of all values in one particular column.

OK, but I'd mention the useless use of 'cat' to anyone learning from this guide. Alternatives:
<code class="language-bash">
<data.csv awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }'
awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }' data.csv
</code>
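A quick way to sanity-check the 'awk' form before pointing it at the real 4.2GB file is to run it on a tiny sample. This is only a sketch: the three-row sample and the 'tmpfile' and 'total' names are made up for illustration, and it assumes the same layout as above (pipe-delimited, amounts in field 4).

```shell
# Hypothetical sample data: pipe-delimited, with the amounts in field 4.
tmpfile=$(mktemp)
printf '%s\n' \
  'a|b|c|1.50|e' \
  'a|b|c|2.25|e' \
  'a|b|c|3.00|e' > "$tmpfile"

# awk opens the file itself, so no 'cat' is needed.
total=$(awk -F '|' '{ sum += $4 } END { printf "%.2f\n", sum }' "$tmpfile")
echo "$total"   # prints 6.75

rm -f "$tmpfile"
```

Quoting the separator with single quotes just keeps the shell from touching it; 'awk' treats a one-character separator like '|' literally.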
unix  textprocessing  datascience  commandline  reference  newbie 
7 weeks ago by kme
text processing - how to massage or format html in order to parse with xmstarlet? - Unix & Linux Stack Exchange
Pretty key when the input is HTML but not XHTML:
<code class="language-bash">xmlstarlet fo -H -R </code>
xmlstarlet  malformed  html  webdevel  textprocessing  commandline  cli  solution 
8 weeks ago by kme
xml - how to? xmlstarlet to extract HTML data by id - Stack Overflow
Essential tip for namespaced HTML, otherwise you get... NOTHING out of 'xmlstarlet'

Just passing HTML through 'xml fo -H -R' (process as HTML and recover as much as possible) is enough to get un-namespaced HTML that is also valid XML (source: https://unix.stackexchange.com/a/382928/278323).

The HTML data has a default namespace that you have to declare in the xmlstarlet command:
<code class="language-bash">
xmlstarlet sel \
-N n="http://www.w3.org/1999/xhtml" \
-t \
-c "/n:html/n:body/n:table[@id='test_table']/descendant::*/text()" \
htmlfile 2>/dev/null
</code>

UPDATE: I didn't know this, but as the error message says, there is no need to declare the namespace when it's the default one. The '_' prefix refers to it, so this also works:
<code class="language-bash">
xmlstarlet sel \
-t \
-c "/_:html/_:body/_:table[@id='test_table']/descendant::*/text()" \
htmlfile 2>/dev/null
</code>
xml  xmlstarlet  textprocessing  malformed  html  reference  namespaced  xhtml  solution  fuckina 
8 weeks ago by kme
slowcat.py - print a file slowly
This works, whereas 'slowcat.c' didn't for some reason (on macOS).
unix  linux  terminal  textprocessing  utility  software 
8 weeks ago by kme

