jm + regexps   12

a high-performance multiple regex matching library. It follows the regular expression syntax of the commonly-used libpcre library, yet functions as a standalone library with its own API written in C. Hyperscan uses hybrid automata techniques to allow simultaneous matching of large numbers (up to tens of thousands) of regular expressions, as well as matching of regular expressions across streams of data. Hyperscan is typically used in a DPI library stack.

Hyperscan began in 2008, and evolved from a commercial closed-source product 2009-2015. It was first developed at Sensory Networks Incorporated, and later acquired and released as open source software by Intel in October 2015.

Hyperscan is under a 3-clause BSD license. We welcome outside contributors.

This is really impressive -- state of the art in parallel regexp matching has improved quite a lot since I was last looking at it.

(via Tony Finch)
via:fanf  regexps  regular-expressions  text  matching  pattern-matching  intel  open-source  bsd  c  dpi  scanning  sensory-networks 
5 weeks ago by jm
Regexp Disaster
Course notes from Gerald Jay Sussman's "Adventures in Advanced Symbolic Programming" class at MIT. Hard to argue with this:
The syntax of the regular-expression language is awful. There are various incompatible forms of the language and the quotation conventions are baroquen [sic]. Nevertheless, there is a great deal of useful software, for example grep, that uses regular expressions to specify the desired behavior.

Although regular-expression systems are derived from a perfectly good mathematical formalism, the particular choices made by implementers to expand the formalism into useful software systems are often disastrous: the quotation conventions adopted are highly irregular; the egregious misuse of parentheses, both for grouping and for backward reference, is a miracle to behold. In addition, attempts to increase the expressive power and address shortcomings of earlier designs have led to a proliferation of incompatible derivative languages.

(via Rob Pike's twitter)
regex  regexps  regular-expressions  functional  combinators  gjs  rob-pike  coding  languages 
july 2016 by jm
a high-performance multiple regex matching library. Hyperscan uses hybrid automata techniques to allow simultaneous matching of large numbers (up to tens of thousands) of regular expressions and for the matching of regular expressions across streams of data.

Via Tony Finch
via:fanf  regexps  regex  dpi  hyperscan  dfa  nfa  hybrid-automata  text-matching  matching  text  strings  streams 
october 2015 by jm
a regex-based, Turing-complete programming language. Its main feature is taking some text via standard input and repeatedly applying regex operations to it (e.g. matching, splitting, and most of all replacing). Under the hood, it uses .NET's regex engine, which means that both the .NET flavour and the ECMAScript flavour are available.

Reminiscent of sed(1); see for an example Retina program
retina  regexps  regexes  regular-expressions  coding  hacks  dot-net  languages 
september 2015 by jm
A dive into a UTF-8 validation regexp
Once again, I find myself checking over the UTF-8 validation code in websocket-driver, and once again I find I cannot ever remember how to make sense of this regex that performs the validation. I just copied it off a webpage once and it took a while (and reimplementing UTF-8 myself) to fully understand what it does. If you write software that processes text, you’ll probably need to understand this too.
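For reference, a minimal sketch of what a byte-level UTF-8 validation regexp of this kind tends to look like. This is not the pattern from websocket-driver (which is not reproduced here); it is an assumed reconstruction in the style of the widely copied one, and the names UTF8_SEQUENCES and is_valid_utf8 are made up for illustration. Each alternative matches one well-formed sequence, and the odd-looking ranges exist to reject overlong encodings, surrogates, and code points above U+10FFFF.

```python
import re

# Each alternative is one well-formed UTF-8 byte sequence.
# Illustrative sketch only; not the regexp from any particular library.
UTF8_SEQUENCES = re.compile(rb"""
    (?:
        [\x00-\x7F]                        # 1 byte: ASCII
      | [\xC2-\xDF][\x80-\xBF]             # 2 bytes; C0 and C1 would be overlong
      | \xE0[\xA0-\xBF][\x80-\xBF]         # 3 bytes, excluding overlong forms
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 bytes, general case
      | \xED[\x80-\x9F][\x80-\xBF]         # 3 bytes, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4 bytes, excluding overlong forms
      | [\xF1-\xF3][\x80-\xBF]{3}          # 4 bytes, general case
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4 bytes, capped at U+10FFFF
    )*
""", re.VERBOSE)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if every byte of `data` belongs to a well-formed sequence."""
    return UTF8_SEQUENCES.fullmatch(data) is not None
```

Something like is_valid_utf8("héllo".encode("utf-8")) should pass, while the overlong pair b"\xc0\x80" and the encoded surrogate b"\xed\xa0\x80" should both be rejected.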
utf-8  unicode  utf8  javascript  node  encoding  text  strings  validation  websockets  regular-expressions  regexps 
june 2014 by jm
Peter Norvig writes a program to play regex golf with arbitrary lists
In response to XKCD 1313. This is excellent. It's reminiscent of my SpamAssassin SOUGHT-ruleset regexp-discovery algorithm, described in , albeit without the BLAST step intended to maximise pattern length and minimise false positives
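The general idea can be sketched roughly as follows. This is a much-simplified greedy set cover written for illustration, not Norvig's actual program and not the SOUGHT algorithm: candidate_parts and regex_golf are hypothetical names, the scoring rule is an assumption, and the words are assumed to be plain alphanumerics with no regexp metacharacters or newlines.

```python
import re


def candidate_parts(winners, max_len=4):
    """Whole-word patterns plus short substrings of each anchored winner."""
    parts = set()
    for word in winners:
        anchored = "^" + word + "$"
        parts.add(anchored)
        for size in range(1, max_len + 1):
            for start in range(len(anchored) - size + 1):
                parts.add(anchored[start:start + size])
    return parts


def regex_golf(winners, losers):
    """Greedily build an alternation matching every winner and no loser."""
    winners, losers = set(winners), set(losers)
    if winners & losers:
        raise ValueError("winners and losers must be disjoint")
    # Keep only the parts that match no loser at all.
    pool = [p for p in candidate_parts(winners)
            if not any(re.search(p, loser) for loser in losers)]
    chosen, remaining = [], set(winners)
    while remaining:
        def score(part):
            covered = sum(1 for w in remaining if re.search(part, w))
            # Prefer more coverage, then shorter parts; the part itself
            # is the final tie-break so the result is deterministic.
            return (covered, -len(part), part)
        best = max(pool, key=score)
        chosen.append(best)
        remaining = {w for w in remaining if not re.search(best, w)}
    return "|".join(chosen)
```

Calling regex_golf({"washington", "adams", "jefferson"}, {"clinton", "dewey", "romney"}) returns some alternation of short parts; the whole-word "^word$" candidates guarantee that every winner can always be covered.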
python  regex  xkcd  blast  rule-discovery  spamassassin  rules  regexps  regular-expressions  algorithms  peter-norvig 
january 2014 by jm
Can regular expressions parse HTML?
'a summary of the main points:
The “regular expressions” used by programmers have very little in common with the original notion of regularity in the context of formal language theory.
Regular expressions (at least PCRE) can match all context-free languages. As such they can also match well-formed HTML and pretty much all other programming languages.
Regular expressions can match at least some context-sensitive languages.
Matching of regular expressions is NP-complete. As such you can solve any other NP problem using regular expressions.'
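A small illustration of the first and third points, using only backreferences. The set of "squares" (some string immediately followed by a copy of itself) is a textbook example of a language that is not context-free, let alone regular, yet a one-line pattern recognises it. The sketch uses Python's re rather than PCRE, and the names SQUARE and is_square are made up for the example; the context-free claim relies on PCRE's recursion support, which Python's re lacks and which is not shown here.

```python
import re

# ([ab]+) captures some non-empty string w; \1 demands an exact copy of it.
SQUARE = re.compile(r"([ab]+)\1")


def is_square(s: str) -> bool:
    """Return True if s has the form ww for some non-empty w over {a, b}."""
    return SQUARE.fullmatch(s) is not None
```

So "abab" and "abaaba" are accepted, while "abba" and "aba" are not.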
compsci  regexps  regular-expressions  programming  np-complete  chomsky-grammar  context-free  languages 
february 2013 by jm
PCRE Performance Project
Excellent stuff. Using "sljit", a stackless platform-independent JIT compiler, this compiles Perl-compatible regular expressions to machine code on ARM, x86, MIPS and PowerPC platforms, resulting in 'similar matching speed to DFA based engines (like re2) on common patterns' with Perl compatibility. 'This work has been released as part of PCRE 8.20 and above. Now (PCRE 8.31), nearly all PCRE features are supported including UTF-8/16 and partial matching.'
pcre  regexps  regex  performance  optimization  jit  compilation  dfa  re2  via:akohli 
september 2012 by jm
demerphq on "perl's regexps are slow"
His classic response to the Russ Cox DFA-over-NFA regular expressions paper. 'A general purpose regex engine like that required for perl has to be able to do a lot, and has to balance considerations ranging from memory footprint of a compiled object, construction time, flexibility, rich feature-sets, the ability to accommodate huge character sets, and of course most importantly matching performance. And it turns out that while DFA engines have a very good worst case match time, they don't actually have too many other redeeming features. Construction can be extremely slow, the memory footprint vast, all kinds of trickery is involved to do unicode or capturing properly and they aren't suitable for patterns with backreferences.' -- Also interesting to note that he mentions an approach I've used in several SpamAssassin speedup add-ons, too ;)
performance  perl  regular-expressions  perlmonks  demerphq  regexps  dfa  nfa  state-machines 
april 2011 by jm
RE2: a principled approach to regular expression matching
Russ Cox's C++ lib to provide safer, guaranteed-linear-time, non-exponential regexps, at the cost of dropping support for backreferences and generalized zero-width assertions. Actually looks quite useful, unlike most "I've fixed regexps" claims ;)
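To see the exponential behaviour that a linear-time engine avoids, here is a small sketch using Python's backtracking re module (not RE2 itself). The pattern and the helper name time_failed_match are chosen purely for illustration: nested quantifiers plus an input that almost matches force a backtracking engine to try a number of ways of splitting the run of "a"s that grows exponentially with its length.

```python
import re
import time

# Nested quantifiers: a classic trigger for catastrophic backtracking.
EVIL = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Match n 'a's followed by a 'b'; return (match_result, seconds taken)."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = EVIL.match(text)  # always fails: the trailing 'b' blocks '$'
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    # Keep n small; the running time climbs steeply as n grows.
    for n in range(12, 21, 2):
        _, seconds = time_failed_match(n)
        print(n, round(seconds, 4))
```

On a backtracking engine the timings typically climb steeply with each step in n, whereas an automaton-based matcher handles the same input in time proportional to its length.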
regular-expressions  regexps  efficiency  linear-time  exponential-time  backreferences  google  re2  from delicious
march 2010 by jm
'free, open, developer-generated APIs for a wide variety of websites. is a place to create and share them. [..] Check out [..] ways to use parselets from our web service, Ruby, Python, C/C++, or the *nix command-line.'
parselets  scraping  html  web  regexps  sitescooper  json  from delicious
december 2009 by jm
RegExr: Online Regular Expression Testing Tool
a very nice interactive editor in Flash, supporting lots of the usual perlish stuff. via Joe
via:jdrumgoole  regexps  regular-expressions  spamassassin  rule-dev  flash  regex  flex  utilities  from delicious
december 2009 by jm
