jm + text   27

Hyperscan
a high-performance multiple regex matching library. It follows the regular expression syntax of the commonly-used libpcre library, yet functions as a standalone library with its own API written in C. Hyperscan uses hybrid automata techniques to allow simultaneous matching of large numbers (up to tens of thousands) of regular expressions, as well as matching of regular expressions across streams of data. Hyperscan is typically used in a DPI library stack.

Hyperscan began in 2008, and evolved as a commercial closed-source product from 2009 to 2015. It was first developed at Sensory Networks Incorporated, and later acquired and released as open source software by Intel in October 2015.

Hyperscan is under a 3-clause BSD license. We welcome outside contributors.


This is really impressive -- state of the art in parallel regexp matching has improved quite a lot since I was last looking at it.

(via Tony Finch)
via:fanf  regexps  regular-expressions  text  matching  pattern-matching  intel  open-source  bsd  c  dpi  scanning  sensory-networks 
5 weeks ago by jm
Letters and Liquor
These are lovely! (via Ben)
Letters and Liquor illustrates the history of lettering associated with cocktails. From the 1690s to the 1990s, I’ve selected 52 of the most important drinks in the cocktail canon and rendered their names in period-inspired design. I post a new drink each week with history, photos and recipes. Don’t want to miss a single cocktail? Click here for email updates.
cocktails  text  letters  typography  graphics  history  booze 
11 weeks ago by jm
Hyperscan
a high-performance multiple regex matching library. Hyperscan uses hybrid automata techniques to allow simultaneous matching of large numbers (up to tens of thousands) of regular expressions, as well as matching of regular expressions across streams of data.


Via Tony Finch
via:fanf  regexps  regex  dpi  hyperscan  dfa  nfa  hybrid-automata  text-matching  matching  text  strings  streams 
october 2015 by jm
Dark corners of Unicode
I’m assuming, if you are on the Internet and reading kind of a nerdy blog, that you know what Unicode is. At the very least, you have a very general understanding of it — maybe “it’s what gives us emoji”.

That’s about as far as most people’s understanding extends, in my experience, even among programmers. And that’s a tragedy, because Unicode has a lot of… ah, depth to it. Not to say that Unicode is a terrible disaster — more that human language is a terrible disaster, and anything with the lofty goals of representing all of it is going to have some wrinkles.

So here is a collection of curiosities I’ve encountered in dealing with Unicode that you generally only find out about through experience. Enjoy.
unicode  characters  encoding  emoji  utf-8  utf-16  utf  mysql  text 
september 2015 by jm
minimaxir/big-list-of-naughty-strings
Late to this one -- a nice list of bad input (Unicode zero-width spaces, etc) for testing
testing  strings  text  data  unicode  utf-8  tests  input  corrupt 
august 2015 by jm
Levenshtein automata can be simple and fast
Nice algorithm for fuzzy text search with a limited Levenshtein edit distance using a DFA
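
A minimal sketch of the idea, assuming the simplest formulation: each automaton state is one row of the Levenshtein dynamic-programming table, with entries capped at max_edits + 1 so the number of distinct states stays finite. The class and function names here are illustrative, not taken from the linked post.

```python
class LevenshteinAutomaton:
    """Accepts strings within max_edits edits of word.

    A state is one row of the Levenshtein DP table, capped at
    max_edits + 1 so that only finitely many states exist.
    """

    def __init__(self, word, max_edits):
        self.word = word
        self.max_edits = max_edits

    def start(self):
        # Row for empty input: i deletions turn word[:i] into "".
        cap = self.max_edits + 1
        return tuple(min(i, cap) for i in range(len(self.word) + 1))

    def step(self, state, ch):
        # Consume one input character and compute the next DP row.
        row = [state[0] + 1]
        for i, target in enumerate(self.word):
            cost = 0 if target == ch else 1
            row.append(min(row[i] + 1, state[i] + cost, state[i + 1] + 1))
        cap = self.max_edits + 1
        return tuple(min(v, cap) for v in row)

    def is_match(self, state):
        # The last cell is the edit distance to the whole word.
        return state[-1] <= self.max_edits

    def can_match(self, state):
        # If every cell exceeds the budget, no suffix can recover.
        return min(state) <= self.max_edits


def fuzzy_match(word, candidate, max_edits):
    """Return True if candidate is within max_edits edits of word."""
    automaton = LevenshteinAutomaton(word, max_edits)
    state = automaton.start()
    for ch in candidate:
        state = automaton.step(state, ch)
        if not automaton.can_match(state):
            return False  # dead state: stop early
    return automaton.is_match(state)
```

The early exit via can_match is what makes this useful for searching an index: a whole branch of a trie can be pruned as soon as the automaton reaches a dead state.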
dfa  algorithms  levenshtein  text  edit-distance  fuzzy-search  search  python 
june 2015 by jm
CommonMark
A strongly specified, highly compatible implementation of Markdown
reference  markdown  commonmark  specs  formatting  text  compatibility 
may 2015 by jm
Input: Fonts for Code
Non-monospaced coding fonts! I'm all in favour...
As writing and managing code becomes more complex, today’s sophisticated coding environments are evolving to include everything from breakpoint markers to code folding and syntax highlighting. The typography of code should evolve as well, to explore possibilities beyond one font style, one size, and one character width.
input  fonts  via:its  typography  code  coding  font  text  ide  monospace 
may 2015 by jm
attacks using U+202E - RIGHT-TO-LEFT OVERRIDE
Security implications of in-band signalling strike again, 43 years after the "Blue Box" hit the mainstream.

Jamie McCarthy on Twitter: ".@cmdrtaco - Remember when we had to block the U+202E code point in Slashdot comments to stop siht ekil stnemmoc? https://t.co/TcHxKkx9Oo"

See also http://krebsonsecurity.com/2011/09/right-to-left-override-aids-email-attacks/ -- GMail was vulnerable too; and http://en.wikipedia.org/wiki/Unicode_control_characters for more inline control chars.

http://unicode.org/reports/tr36/#Bidirectional_Text_Spoofing has some official recommendations from the Unicode consortium on dealing with bidi override chars.
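
A rough sketch of the blunt approach (dropping the bidi embedding, override and isolate controls from untrusted input before display). This is an assumption about what a simple filter could look like, not the TR36 recommendation itself and not Slashdot's actual code; the function name is made up for illustration.

```python
# Explicit bidirectional formatting characters:
# U+202A..U+202E (LRE, RLE, PDF, LRO, RLO) and
# U+2066..U+2069 (LRI, RLI, FSI, PDI).
BIDI_CONTROLS = frozenset(
    "\u202a\u202b\u202c\u202d\u202e"
    "\u2066\u2067\u2068\u2069"
)


def strip_bidi_controls(text):
    """Return text with explicit bidi control characters removed."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


def has_bidi_controls(text):
    """Return True if text contains any explicit bidi control character."""
    return any(ch in BIDI_CONTROLS for ch in text)
```

Stripping outright can damage legitimate right-to-left text, so a real site may prefer to flag such input with has_bidi_controls rather than silently rewrite it.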
security  attacks  rlo  unicode  control-characters  codepoints  bidi  text  gmail  slashdot  sanitization  input 
april 2015 by jm
Archie Markup Language (ArchieML)
ArchieML (or "AML") was created at The New York Times to make it easier to write and edit structured text on deadline that could be rendered in web pages, or more specifically, rendered in interactive graphics. One of the main goals was to make it easy to tag text as data, without having to type a lot of special characters. Another goal was to allow the document to contain lots of notes and draft text that would not be read into the data. And finally, because we make extensive use of Google Documents's concurrent-editing features (while working on a graphic, we can have several reporters, editors and developers all pouring information into a single document), we wanted to have a format that could survive being edited by users who may never have seen ArchieML or any other markup language at all before.
aml  archie  markup  text  nytimes  archieml  writing 
march 2015 by jm
Standard Markdown
John Gruber’s canonical description of Markdown’s syntax does not specify the syntax unambiguously. In the absence of a spec, early implementers consulted the original Markdown.pl code to resolve these ambiguities. But Markdown.pl was quite buggy, and gave manifestly bad results in many cases, so it was not a satisfactory replacement for a spec.

Because there is no unambiguous spec, implementations have diverged considerably. As a result, users are often surprised to find that a document that renders one way on one system (say, a GitHub wiki) renders differently on another (say, converting to docbook using Pandoc). To make matters worse, because nothing in Markdown counts as a “syntax error,” the divergence often isn't discovered right away.

There's no standard test suite for Markdown; the unofficial MDTest is the closest thing we have. The only way to resolve Markdown ambiguities and inconsistencies is Babelmark, which compares the output of 20+ implementations of Markdown against each other to see if a consensus emerges.

We propose a standard, unambiguous syntax specification for Markdown, along with a suite of comprehensive tests to validate Markdown implementations against this specification. We believe this is necessary, even essential, for the future of Markdown.
writing  markdown  specs  standards  text  formats  html 
september 2014 by jm
How Emoji Get Lost In Translation
I recently texted a friend to say how I was excited to meet her new boyfriend, and, because "excited" doesn't look so exciting on an iPhone screen, I editorialized with what seemed then like an innocent "[dancer]". (Translation: Can't wait for the fun night out!) On an Android phone, I realized later, that panache would have been a put-down: The dancers become "[playboy bunny]." (Translation: You’re a Playboy bunny who gets around!)
emoji  icons  graphics  text  speech  phones 
june 2014 by jm
Whiteboard Picture Cleaner

This [shell one-liner] will take a picture of a whiteboard and use parts of the ImageMagick library with sane defaults to clean it up tremendously: convert "$1" -morphology Convolve DoG:15,100,0 -negate -normalize -blur 0x1 -channel RBG -level 60%,91%,0.1 "$2"


Some kind soul has put up a quickie web UI here: http://api.o2b.ru/whiteboardcleaner
graphics  tools  whiteboard  imagemagick  text  images  cleanup  gimp  photoshop  via:fanf 
june 2014 by jm
A dive into a UTF-8 validation regexp
Once again, I find myself checking over the UTF-8 validation code in websocket-driver, and once again I find I cannot ever remember how to make sense of this regex that performs the validation. I just copied it off a webpage once and it took a while (and reimplementing UTF-8 myself) to fully understand what it does. If you write software that processes text, you’ll probably need to understand this too.
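
For reference, a sketch of this kind of regex in Python. It follows the widely circulated byte-range pattern for well-formed UTF-8; whether it is character-for-character the regex used in websocket-driver is an assumption, and the names below are illustrative.

```python
import re

# Each alternative matches one well-formed UTF-8 encoded code point.
UTF8_VALID = re.compile(
    rb"""
    (?:
        [\x00-\x7F]                        # 1 byte: ASCII
      | [\xC2-\xDF][\x80-\xBF]             # 2 bytes (C0 and C1 would be overlong)
      | \xE0[\xA0-\xBF][\x80-\xBF]         # 3 bytes, excluding overlong forms
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 bytes, general case
      | \xED[\x80-\x9F][\x80-\xBF]         # 3 bytes, excluding UTF-16 surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4 bytes, planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}          # 4 bytes, planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4 bytes, plane 16 (up to U+10FFFF)
    )*
    """,
    re.VERBOSE,
)


def is_valid_utf8(data):
    """Return True if the bytes object is entirely well-formed UTF-8."""
    return UTF8_VALID.fullmatch(data) is not None
```

The special cases after E0, ED, F0 and F4 are what reject overlong encodings, surrogates, and code points above U+10FFFF; the remaining alternatives only count continuation bytes.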
utf-8  unicode  utf8  javascript  node  encoding  text  strings  validation  websockets  regular-expressions  regexps 
june 2014 by jm
"Replicated abstract data types: Building blocks for collaborative applications"
cited at https://news.ycombinator.com/item?id=7737423 as 'one of my favorite papers on CRDTs and provides practical pseudocode for learning how to implement CRDTs yourself', in a discussion on cemerick's "Distributed Systems and the End of the API": http://writings.quilt.org/2014/05/12/distributed-systems-and-the-end-of-the-api/
distcomp  networking  distributed  crdts  algorithms  text  data-structures  cap 
may 2014 by jm
Transform any text into a patent application
'An apparatus and device for staring into vacancy. The devices comprises a good cage, a narrow gangway, an electric pocket, a flower-bedecked cage, an insensitive felt.' (The Hunger Artist by Kafka)
python  patents  text  language  generator 
may 2014 by jm
An IDE is not enough
Very thought-provoking response to that 'Light Table' demo which went round the aggregators a couple of weeks back. 'The fundamental reason IDEs have dead-ended is that they are constrained by the syntax and semantics of our programming languages. Our programming languages were all designed to be used with a text editor. It is therefore not surprising that our IDEs amount to tarted-up text editors. Likewise our programming languages were all designed with an imperative semantics that efficiently matches the hardware but defies static visualization. Indeed it would be a miracle if we could slap a new IDE on top of an old language and magically alter its syntactic and semantic assumptions. I don’t believe in miracles. Languages and IDEs have co-evolved and neither can change without the other also changing. That is why three years ago I put aside my IDE work to focus on language design. Getting rid of imperative semantics is one of the goals. Another is getting rid of source text files (as well as ASTs, which carry all the baggage of a textual encoding minus the readability). This has turned out to be really really hard. And lonely – no one wants to even talk about these crazy ideas. Nevertheless I firmly believe that so long as we are programming in descendants of assembly language we will continue to program in descendants of text editors.' (via Chris Horn)
via:cjhorn  ide  programming  coding  programming-languages  semantics  syntax  source-code  text 
may 2012 by jm
Hipster Ipsum
'Adipisicing do Tumblr fugiat vinyl Pitchfork. Organic tempor laboris, esse Tumblr irure eu nostrud. Dolor Cosby sweater mustache qui consequat incididunt. McSweeney's ullamco occaecat Wes Anderson. Minim aute lomo, duis ea proident enim Carles. Eiusmod culpa photo booth ex. Pariatur incididunt minim qui, dolor Pitchfork wayfarers mollit vinyl fixie.' (via boogah)
via:boogah  hipster  lorem-ipsum  filler  text  markov-chains  funny  humour 
june 2011 by jm
TextAid - Google Chrome extension
"It's All Text" for Chrome. Annoyingly, Chrome blocks forking of processes by extensions, so a daemon process (provided) needs to be running separately, but otherwise it works nicely. Particularly nice is that the daemon is just written in dependency-hell-free perl rather than Node.JS ;)
text  editing  chrome  extensions  add-ons  browsers  web 
may 2011 by jm
boilerpipe
extract the non-boilerplate part of a web page
boilerplate  web  html  page  text  scraping  from delicious
november 2010 by jm
Structural Regular Expressions
'The current UNIX text processing tools are weakened by the built-in concept of a line. There is a simple notation that can describe the `shape' of files when the typical array-of-lines picture is inadequate. That notation is regular expressions. Using regular expressions to describe the structure in addition to the contents of files has interesting applications, and yields elegant methods for dealing with some problems the current tools handle clumsily. When operations using these expressions are composed, the result is reminiscent of shell pipelines.' Paper by Rob Pike, via adulau. Intriguing.
sregex  via:adulau  regexp  rob-pike  regex  library  text  structural  parsing  from delicious
november 2009 by jm
sregex - Structural Regular Expressions
'The sregex module implements Structural Regular Expressions.' Python, Apache-licensed
sregex  python  via:adulau  regexp  robpike  regex  library  text  structural  parsing  from delicious
november 2009 by jm
Ag Tweet: Paying Customers
Pay EUR3 per month to receive Twitter @replies by SMS on your mobile in Ireland -- a good niche
twitter  agtweet  ireland  mobiles  sms  text  revenue  from delicious
september 2009 by jm
Thunderbird "open in external editor" add-on
Seems to work nicely. Not quite as cleanly integrated as It's All Text! for Firefox, but getting there
thunderbird  editing  vim  emacs  gvim  its-all-text  mail  text  extensions  add-ons  plugins 
august 2009 by jm

