erratarob.bsky.social Profile picture
Created (BlackICE,IPS,sidejacking,masscan). Doing (blog,code,cyber-rights,Internet-scanning). @erratarob@infosec.exchange

Feb 16, 2022, 17 tweets

So I did a thing.

Back a couple years ago, people were rewriting the classic 'wc' program (word-count) in their favorite programming language to prove theirs could be as fast as C.

So I decided to rewrite using my favorite algorithm instead: a "state machine parser".

The algorithm to count words (and lines and characters) is 3 lines long, the while(){} loop at line 25.

You are supposed to marvel at how this is absolutely NOT a word/line/char counting algorithm -- and yet, it produces the same results as 'wc'.

I implemented the same algorithm in JavaScript, and it ended up being faster than all those "I rewrote wc in my favorite language" examples. But the reason isn't that JavaScript is faster than their language, but because the ALGORITHM is faster. It also jits well.

The above does ASCII, which obviously is trivial. Everybody likes to cherry pick the easy problems that best demonstrate the superiority of their language/algorithm -- that then fails in the real world.

So we let's do UTF8 instead. UTF8 has it's own nasty parsing logic.

I pose this as a challenge to all those "rewrote wc in my favorite language" posts: now do UTF8 and see if your language is any better than C.

In the case of UTF8, my "state-machine parsing" algorithm doesn't change. Instead, the underlying state-machine just gets much bigger (MUCH bigger). I still use the same three lines:
1. get byte
2. do state machine transition
3. do counts

So let's test this using a unicode string. I think a fun example would be using the Ogham script from 4th century Ireland that only exists today on around 400 tombstones/monuments, such as this:
᚛ᚋᚐᚊ ᚉᚓᚏᚐᚅᚔ ᚐᚃᚔ ᚐᚈᚆᚓᚉᚓᚈᚐᚔᚋᚔᚅ᚜

Despite not being used since the 4th century, tech today recognizes this. For example, as you can see in Word, there are 4 separate words -- the spell checker has underlined each separate word. Thus, 'wc' should count 4 words.

As you can see, the original 'wc' program counts 1 line, 4 words, and 30 characters.

And so does the 'wc2' program using state-machine parsing.

So let's benchmark the UTF8 version against the original 'wc' program. I use 4 files:
1. a file containing illegal characters (like a previous edition of PoC||GTFO)
2. a file containing UTF8 sequences
3. ASCII words
4. 18 gigs of only the space character

Again, I don't want to cherry pick things, I want to show worst case (files containing illegal stuff) and best case (only space character). The original 'wc' has different speeds for different input, my 'wc2' algorithm has constant speed.

Note that while the JavaScript version of my algorithm is slower than the same algorithm in C, it's faster than the classic 'wc' program written in C.

(Note: my JavaScript coding skills are weak, maybe I could write it so it JITs better)

Anyway, "state-machine parsers" are a thing. They are something you should be learning in university courses. They can handle a surprising amount of complexity, and are very fast -- pretty much the fastest parsers that don't use SIMD.

In this thread, instead of cherry picking the easiest problem to solve, I went after a particular hard problem to solve. I have to parse UTF8 before I can parse 'wc'.

Anyway, download your latest 'pocorgtfo21.pdf' from your favorite samizdat site today.
MD5 (pocorgtfo21.pdf) = 7f3d3b147a4ba2c099ea3d252ed4a0d7

BTW, there's a little formatting problem. You can MOSTLY copy/paste code directly from the pocorgtfo21.pdf. The quotes have gone awry, and a space was added, so you have to edit it a bit.

But the fact that you can still work with 4th century Ogham script is AMAZING.

Damn, the formatting lost a line -- the 'counts[3]' declaration disappeared.

But the original source is contained within the PDF, which you can extract thusly. (Among the oddities of PoC||GTFO is the fact that the PDF is also a ZIP).

Share this Scrolly Tale with your friends.

A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.

Keep scrolling