textprocessing   773

doccano/doccano · GitHub
doccano is an open source text annotation tool for humans. It provides annotation features for text classification, sequence labeling and sequence to sequence tasks. So, you can create labeled data for sentiment analysis, named entity recognition, text summarization and so on. Just create a project, upload data and start annotating. You can build a dataset in hours.
textprocessing  nlp  annotation 
12 days ago by vrt
Introduction to text manipulation on UNIX-based systems – IBM Developer
This introduction to text manipulation on UNIX platforms provides an overview of some common commands widely available and installed standard on most UNIX-based releases. Many times these standard utilities are ignored in favor of more modern text-processors such as Perl, Python, or Ruby, which are not always installed on a system. An introductory review of these tools helps practitioners who are learning UNIX or Linux or those who may be looking to renew forgotten knowledge.
bash  cli  commandline  shell  unix  textprocessing  textmanipulation  guide  text 
13 days ago by justusthane
Paste to Markdown
Uses Turndown (JavaScript library). Good for Mac where 'pbpaste' can't be coerced into outputting raw HTML in a pipe to, say, 'pandoc'.
html  rtf  markdown  conversion  textprocessing  webapp  utility  solution 
5 weeks ago by kme
Useful Unix commands for data science
via: http://johnkerl.org/miller/doc/originality.html
Imagine you have a 4.2GB CSV file. It has over 12 million records and 50 columns. All you need from this file is the sum of all values in one particular column.

OK, but I'd mention the useless use of 'cat' to anyone learning from this guide. Alternatives:
<code class="language-bash">
<data.csv awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }'
awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }' data.csv
</code>
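A minimal sketch of the same one-liner wrapped in a reusable function, assuming a pipe-delimited file as in the commands above. The name `sum_column` and the sample rows are hypothetical, made up for illustration; the file is passed to 'awk' as an argument, so no 'cat' is involved.

```shell
#!/bin/sh
# sum_column FILE COLUMN [SEPARATOR]
# Hypothetical helper (not from the linked guide): sums one numeric column.
# SEPARATOR defaults to "|" to match the one-liners above.
sum_column() {
  awk -F "${3:-|}" -v col="$2" '{ sum += $col } END { printf "%.2f\n", sum }' "$1"
}

# Made-up sample data standing in for the real 4.2GB file.
sample=$(mktemp)
printf '%s\n' 'a|b|c|1.50' 'd|e|f|2.25' 'g|h|i|10' > "$sample"

sum_column "$sample" 4   # prints 13.75
rm -f "$sample"
```

A header row with non-numeric text contributes 0 to the sum in 'awk', so it does not need to be skipped for this particular calculation.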
unix  textprocessing  datascience  commandline  reference  newbie 
7 weeks ago by kme
text processing - how to massage or format html in order to parse with xmstarlet? - Unix & Linux Stack Exchange
Pretty key when the input is HTML but not XHTML:
<code class="language-bash">xmlstarlet fo -H -R </code>
xmlstarlet  malformed  html  webdevel  textprocessing  commandline  cli  solution 
8 weeks ago by kme
xml - how to? xmlstarlet to extract HTML data by id - Stack Overflow
Essential tip for namespaced HTML, otherwise you get... NOTHING out of 'xmlstarlet'

Just passing HTML through 'xml fo -H -R' (process as HTML and recover as much as possible) is enough to get un-namespaced HTML that is also valid XML (source: https://unix.stackexchange.com/a/382928/278323).

The HTML data has a default namespace that you have to declare in the xmlstarlet command:
<code class="language-bash">
xmlstarlet sel \
-N n="http://www.w3.org/1999/xhtml" \
-t \
-c "/n:html/n:body/n:table[@id='test_table']/descendant::*/text()" \
htmlfile 2>/dev/null
</code>

UPDATE: I didn't know it, but as the error message says, there is no need to declare the namespace when it's the default one, so this also works:
<code class="language-bash">
xmlstarlet sel \
-t \
-c "/_:html/_:body/_:table[@id='test_table']/descendant::*/text()" \
htmlfile 2>/dev/null
</code>
xml  xmlstarlet  textprocessing  malformed  html  reference  namespaced  xhtml  solution  fuckina 
8 weeks ago by kme
slowcat.py - print a file slowly
This works, whereas 'slowcat.c' didn't for some reason (on macOS).
unix  linux  terminal  textprocessing  utility  software 
8 weeks ago by kme


