jm + nfa   4

Paper: Hyperscan: A Fast Multi-pattern Regex Matcher for Modern CPUs
a software based, large-scale regex matcher designed to match multiple patterns at once (up to tens of thousands of patterns at once) and to ‘stream‘ (that is, match patterns across many different ‘stream writes’ without holding on to all the data you’ve ever seen). To my knowledge this makes it unique.

RE2 is software based but doesn’t scale to large numbers of patterns; nor does it stream (although it could). It occupies a fundamentally different niche to Hyperscan; we compared the performance of RE2::Set (the RE2 multiple pattern interface) to Hyperscan a while back.

Most back-tracking matchers (such as libpcre) are one pattern at a time and are inherently incapable of streaming, due to their requirement to backtrack into arbitrary amounts of old input.
regex  regular-expressions  algorithms  hyperscan  sensory-networks  regexps  simd  nfa 
20 days ago by jm
Lessons Learned from Using Regexes At Scale
great post from Loggly on production usage of regular expressions on shared, multitenant architecture, where a /.*/ can really screw things up. "NFA isn't a golden ticket" paragraph included
loggly  regexp  regexes  java  dfa  nfa  architecture 
september 2016 by jm
a high-performance multiple regex matching library. Hyperscan uses hybrid automata techniques to allow simultaneous matching of large numbers (up to tens of thousands) of regular expressions and for the matching of regular expressions across streams of data.

Via Tony Finch
via:fanf  regexps  regex  dpi  hyperscan  dfa  nfa  hybrid-automata  text-matching  matching  text  strings  streams 
october 2015 by jm
demerphq on "perl's regexps are slow"
His classic response to the Russ Cox DFA-over-NFA regular expressions paper. 'A general purpose regex engine like that required for perl has to be able to do a lot, and has to balance considerations ranging from memory footprint of a compiled object, construction time, flexibility, rich feature-sets, the ability to accomodate huge character sets, and of course most importantly matching performance. And it turns out that while DFA engines have a very good worst case match time, they dont actually have too many other redeeming features. Construction can be extremely slow, the memory footprint vast, all kinds of trickery is involved to do unicode or capturing properly and they aren't suitable for patterns with backreferences.' -- Also interesting to note that he mentions an approach I've used in several SpamAssassin speedup add-ons, too ;)
performance  perl  regular-expressions  perlmonks  demerphq  regexps  dfa  nfa  state-machines 
april 2011 by jm

Copy this bookmark: