So I did a thing.

Back a couple years ago, people were rewriting the classic 'wc' program (word-count) in their favorite programming language to prove theirs could be as fast as C.

So I decided to rewrite using my favorite algorithm instead: a "state machine parser". Image
The algorithm to count words (and lines and characters) is 3 lines long, the while(){} loop at line 25.

You are supposed to marvel at how this is absolutely NOT a word/line/char counting algorithm -- and yet, it produces the same results as 'wc'. Image
I implemented the same algorithm in JavaScript, and it ended up being faster than all those "I rewrote wc in my favorite language" examples. But the reason isn't that JavaScript is faster than their language, but because the ALGORITHM is faster. It also jits well. Image
The above does ASCII, which obviously is trivial. Everybody likes to cherry pick the easy problems that best demonstrate the superiority of their language/algorithm -- that then fails in the real world.

So we let's do UTF8 instead. UTF8 has it's own nasty parsing logic.
I pose this as a challenge to all those "rewrote wc in my favorite language" posts: now do UTF8 and see if your language is any better than C.
In the case of UTF8, my "state-machine parsing" algorithm doesn't change. Instead, the underlying state-machine just gets much bigger (MUCH bigger). I still use the same three lines:
1. get byte
2. do state machine transition
3. do counts Image
So let's test this using a unicode string. I think a fun example would be using the Ogham script from 4th century Ireland that only exists today on around 400 tombstones/monuments, such as this:
᚛ᚋᚐᚊ ᚉᚓᚏᚐᚅᚔ ᚐᚃᚔ ᚐᚈᚆᚓᚉᚓᚈᚐᚔᚋᚔᚅ᚜
Despite not being used since the 4th century, tech today recognizes this. For example, as you can see in Word, there are 4 separate words -- the spell checker has underlined each separate word. Thus, 'wc' should count 4 words. Image
As you can see, the original 'wc' program counts 1 line, 4 words, and 30 characters.

And so does the 'wc2' program using state-machine parsing. Image
So let's benchmark the UTF8 version against the original 'wc' program. I use 4 files:
1. a file containing illegal characters (like a previous edition of PoC||GTFO)
2. a file containing UTF8 sequences
3. ASCII words
4. 18 gigs of only the space character Image
Again, I don't want to cherry pick things, I want to show worst case (files containing illegal stuff) and best case (only space character). The original 'wc' has different speeds for different input, my 'wc2' algorithm has constant speed. Image
Note that while the JavaScript version of my algorithm is slower than the same algorithm in C, it's faster than the classic 'wc' program written in C.

(Note: my JavaScript coding skills are weak, maybe I could write it so it JITs better) Image
Anyway, "state-machine parsers" are a thing. They are something you should be learning in university courses. They can handle a surprising amount of complexity, and are very fast -- pretty much the fastest parsers that don't use SIMD.
In this thread, instead of cherry picking the easiest problem to solve, I went after a particular hard problem to solve. I have to parse UTF8 before I can parse 'wc'.
Anyway, download your latest 'pocorgtfo21.pdf' from your favorite samizdat site today.
MD5 (pocorgtfo21.pdf) = 7f3d3b147a4ba2c099ea3d252ed4a0d7
BTW, there's a little formatting problem. You can MOSTLY copy/paste code directly from the pocorgtfo21.pdf. The quotes have gone awry, and a space was added, so you have to edit it a bit.

But the fact that you can still work with 4th century Ogham script is AMAZING. ImageImage
Damn, the formatting lost a line -- the 'counts[3]' declaration disappeared.

But the original source is contained within the PDF, which you can extract thusly. (Among the oddities of PoC||GTFO is the fact that the PDF is also a ZIP).

Image

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Robᵉʳᵗ Graham

Robᵉʳᵗ Graham Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @ErrataRob

Feb 16
About an hour into it, when I'm describing DNS header compression on generating the query packet for "google.com" that they'll be asking me politely to leave.
I lie. It'll take hours to get to that point, as I first explain how Chrome caches DNS names before making a request to the operating system to do DNS resolution on it's behalf -- assuming they haven't enabled DNS-over-something.
I lie. There's probably a whole day's discussion of what happens when you click with the mouse on the screen to load the page, tracing the path of execution through Windows event handlers.
Read 4 tweets
Feb 10
Note that I'm not a solid source here.
1. I experience weird disruptions trying to make calls to the Ukraine cell phone network
2. Techies (who don't want to be named) said it was because the cell network was being DDoSed.

I'm just passing along what little I know.
I do know that while cell providers are supposed to have private links to each other, I know that a lot of traffic ends up going across the Internet backone, so the scenario is plausible (though not proven).
The weirdest thing was a recorded message saying the subscriber wasn't available, in english, but breaking up severely due to disruption on the network, which as I understand it, shouldn't be possible.
Read 4 tweets
Feb 10
Your regular reminder that presidents deserve neither the credit nor the blame for things such as "inflation".

Economies around the world are experiencing inflation for the same reason: stimulus spending and disruptions in the economy due to lockdowns.
Here is inflation from the Eurozone. It has the same spike as we do. It's hard to imagine how Biden caused that.
The things that economists cite that cause the current inflation were the steps taken during Trump's administration. Stimulus happened in 2020, the effects were seen in 2021. It was sticking upwards before Biden had a chance to make any difference.
Read 5 tweets
Feb 9
I woke up last week and discovered "selfie ring lights" are now a thing and I want to go on a murder spree. Image
The number at my local barber a couple months ago: 0
The number at my local barber last week: 6
Violence is never the answer.
Violence is never justified.

Except when murder sprees are justified, such as this case.
Read 4 tweets
Feb 7
It is partisan. Because republicans refused to participate. Because most are guilty of fomenting unrest by opposing the peaceful transfer of power.
It’s been over a year and republicans still haven’t cited any evidence substantiating their claims of election fraud. Yet they still claim the election was stolen.
You are a patriot if you fight a stolen election with evidence.

Claiming an election was stolen without evidence makes you the opposite of a patriot.
Read 4 tweets
Feb 7
Fact-check: It's not a nuclear power plant. It's an old steel plant shut down before the 2008 summer Olympics because of the enormous air pollution it caused. Like power plants, steel plants need cooling towers, too.
Not all nuclear power plants have towers that look like this.

Most towers that look like this are not for nuclear power plants.
Here's an article on it:
olympics.com/ioc/news/beiji…
Read 7 tweets

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

:(