jm + parsing   7

Filter before you parse: faster analytics on raw data with Sparser
Super fast JSON parsing. Has some interesting similarities to some code I wrote in SpamAssassin, as it turns out!
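
A rough sketch of the "filter before you parse" idea, not Sparser's actual implementation: run a cheap substring check over the raw text and only hand the survivors to the real parser. The function name `filter_then_parse` and the sample records are made up for illustration. It assumes one JSON object per line and plain ASCII values serialized without escapes. The raw check can produce false positives, so the predicate is re-checked after parsing.

```python
import json


def filter_then_parse(lines, key, value):
    """Yield parsed records where record[key] == value.

    Hypothetical helper: a raw substring test discards most non-matching
    lines before any JSON parsing happens.
    """
    needle = json.dumps(value)  # serialized form of the value, e.g. '"error"'
    for line in lines:
        # Cheap raw filter: if the serialized value never appears in the
        # line, the record cannot match, so skip the expensive parse.
        if needle not in line:
            continue
        record = json.loads(line)
        # The raw filter may let false positives through (the value could
        # appear under a different key), so verify on the parsed record.
        if record.get(key) == value:
            yield record


lines = [
    '{"level": "error", "msg": "disk full"}',
    '{"level": "info", "msg": "error"}',   # passes the raw filter, fails the check
    '{"level": "info", "msg": "ok"}',      # never parsed at all
]
matches = list(filter_then_parse(lines, "level", "error"))
```

Only the first two lines are ever parsed here, and only the first one is returned.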
json  parsing  performance  coding  algorithms 
august 2018 by jm
Parsing JSON is a Minefield 💣
Crockford chose not to version [the] JSON definition: 'Probably the boldest design decision I made was to not put a version number on JSON so there is no mechanism for revising it. We are stuck with JSON: whatever it is in its current form, that’s it.' Yet JSON is defined in at least six different documents.

"Boldest". ffs. :facepalm:
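
To give a flavour of where parsers drift apart, here is a small sketch using Python's standard `json` module. By default it accepts `NaN` (which RFC 8259 does not permit) and silently keeps the last value for a duplicated key (the RFC only says names SHOULD be unique). The `strict_loads` wrapper and its helpers are hypothetical names for illustration; `parse_constant` and `object_pairs_hook` are real `json.loads` parameters.

```python
import json


def _reject_constant(name):
    # Called by json.loads for 'NaN', 'Infinity' and '-Infinity'.
    raise ValueError(f"non-standard JSON constant: {name}")


def _reject_duplicates(pairs):
    # Called with the (key, value) pairs of every decoded object.
    obj = {}
    for key, value in pairs:
        if key in obj:
            raise ValueError(f"duplicate key: {key!r}")
        obj[key] = value
    return obj


def strict_loads(text):
    """Hypothetical stricter wrapper around json.loads."""
    return json.loads(
        text,
        parse_constant=_reject_constant,
        object_pairs_hook=_reject_duplicates,
    )


# Default behaviour: the duplicate key is accepted and the last value wins.
lenient = json.loads('{"a": 1, "a": 2}')
```

With the wrapper, both `strict_loads('{"a": 1, "a": 2}')` and `strict_loads('[NaN]')` raise `ValueError` instead of quietly returning something.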
bold  courage  json  parsing  coding  data  formats  interchange  fail  standards  confusion 
october 2016 by jm
A command-line utility that performs an HTML element selection on HTML content passed to stdin, using CSS selectors that everybody knows. Since input comes from stdin and output is sent to stdout, it can easily be used inside traditional UNIX pipelines to extract content from webpages and HTML files. tq provides extra formatting options such as JSON encoding or newline squashing, so it can play nicely with everyone's favourite command-line tooling.
tq  linux  unix  cli  command-line  html  parsing  css  tools 
may 2016 by jm
Five Takeaways on the State of Natural Language Processing
Good overview of the state of the art in NLP nowadays. I find the word2vec part particularly interesting:
Embedding words as real-numbered vectors using a skip-gram, negative-sampling model (word2vec code) was mentioned in nearly every talk I attended. Either companies are using various word2vec implementations directly or they are building diffs off of the basic framework. Trained on large corpora, the vector representations encode concepts in a large dimensional space (usually 200-300 dim).

Quite similar to some tokenization approaches we experimented with in SpamAssassin, so I don't find this too surprising....
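
For reference, a minimal sketch of just the skip-gram pairing step mentioned in the quote: each token is paired with its neighbours inside a fixed window, and those (center, context) pairs are what a skip-gram model is trained on. This is not the word2vec code itself and does no training or negative sampling; `skipgram_pairs` and the window size are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbours
    within `window` positions on either side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["filter", "before", "parse"], window=1)
```

With a window of 1, the middle token is paired with both neighbours and the end tokens with one neighbour each.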
word2vec  nlp  tokenization  machine-learning  language  parsing  doc2vec  skip-grams  data-structures  feature-extraction  via:lemonodor 
may 2015 by jm
Structural Regular Expressions
'The current UNIX text processing tools are weakened by the built-in concept of a line. There is a simple notation that can describe the `shape' of files when the typical array-of-lines picture is inadequate. That notation is regular expressions. Using regular expressions to describe the structure in addition to the contents of files has interesting applications, and yields elegant methods for dealing with some problems the current tools handle clumsily. When operations using these expressions are composed, the result is reminiscent of shell pipelines.' Paper by Rob Pike, via adulau. intriguing
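
A loose sketch of the composition idea using Python's standard `re` module. The names `x` and `g` mirror the paper's extract and guard commands, but these functions are illustrative assumptions, not Pike's implementation and not the API of any particular library. The point is that the first pattern describes the shape of a record (a run of non-empty lines) instead of relying on the built-in notion of a line.

```python
import re


def x(pattern, spans):
    """Extract: yield every match of `pattern` found inside each span."""
    regex = re.compile(pattern)
    for span in spans:
        for match in regex.finditer(span):
            yield match.group(0)


def g(pattern, spans):
    """Guard: pass a span through only if `pattern` matches somewhere in it."""
    regex = re.compile(pattern)
    for span in spans:
        if regex.search(span):
            yield span


text = "name: alice\nlang: python\n\nname: bob\nlang: go\n"

# A record is a run of non-empty lines; blank lines separate records.
records = list(x(r"(?:.+\n)+", [text]))

# Compose like a pipeline: split into records, then pull a field from each.
langs = list(x(r"(?<=lang: )\w+", x(r"(?:.+\n)+", [text])))

# Keep only whole records that mention a given language.
go_records = list(g(r"lang: go", x(r"(?:.+\n)+", [text])))
```

Each stage consumes the spans produced by the previous one, which is where the resemblance to shell pipelines comes from.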
sregex  via:adulau  regexp  rob-pike  regex  library  text  structural  parsing  from delicious
november 2009 by jm
sregex - Structural Regular Expressions
'The sregex module implements Structural Regular Expressions.' Python, Apache-licensed
sregex  python  via:adulau  regexp  robpike  regex  library  text  structural  parsing  from delicious
november 2009 by jm
