erratarob.bsky.social Profile picture
Feb 16, 2022 17 tweets 5 min read Read on X
So I did a thing.

Back a couple years ago, people were rewriting the classic 'wc' program (word-count) in their favorite programming language to prove theirs could be as fast as C.

So I decided to rewrite using my favorite algorithm instead: a "state machine parser". Image
The algorithm to count words (and lines and characters) is 3 lines long, the while(){} loop at line 25.

You are supposed to marvel at how this is absolutely NOT a word/line/char counting algorithm -- and yet, it produces the same results as 'wc'. Image
I implemented the same algorithm in JavaScript, and it ended up being faster than all those "I rewrote wc in my favorite language" examples. But the reason isn't that JavaScript is faster than their language, but because the ALGORITHM is faster. It also jits well. Image
The above does ASCII, which obviously is trivial. Everybody likes to cherry pick the easy problems that best demonstrate the superiority of their language/algorithm -- that then fails in the real world.

So we let's do UTF8 instead. UTF8 has it's own nasty parsing logic.
I pose this as a challenge to all those "rewrote wc in my favorite language" posts: now do UTF8 and see if your language is any better than C.
In the case of UTF8, my "state-machine parsing" algorithm doesn't change. Instead, the underlying state-machine just gets much bigger (MUCH bigger). I still use the same three lines:
1. get byte
2. do state machine transition
3. do counts Image
So let's test this using a unicode string. I think a fun example would be using the Ogham script from 4th century Ireland that only exists today on around 400 tombstones/monuments, such as this:
᚛ᚋᚐᚊ ᚉᚓᚏᚐᚅᚔ ᚐᚃᚔ ᚐᚈᚆᚓᚉᚓᚈᚐᚔᚋᚔᚅ᚜
Despite not being used since the 4th century, tech today recognizes this. For example, as you can see in Word, there are 4 separate words -- the spell checker has underlined each separate word. Thus, 'wc' should count 4 words. Image
As you can see, the original 'wc' program counts 1 line, 4 words, and 30 characters.

And so does the 'wc2' program using state-machine parsing. Image
So let's benchmark the UTF8 version against the original 'wc' program. I use 4 files:
1. a file containing illegal characters (like a previous edition of PoC||GTFO)
2. a file containing UTF8 sequences
3. ASCII words
4. 18 gigs of only the space character Image
Again, I don't want to cherry pick things, I want to show worst case (files containing illegal stuff) and best case (only space character). The original 'wc' has different speeds for different input, my 'wc2' algorithm has constant speed. Image
Note that while the JavaScript version of my algorithm is slower than the same algorithm in C, it's faster than the classic 'wc' program written in C.

(Note: my JavaScript coding skills are weak, maybe I could write it so it JITs better) Image
Anyway, "state-machine parsers" are a thing. They are something you should be learning in university courses. They can handle a surprising amount of complexity, and are very fast -- pretty much the fastest parsers that don't use SIMD.
In this thread, instead of cherry picking the easiest problem to solve, I went after a particular hard problem to solve. I have to parse UTF8 before I can parse 'wc'.
Anyway, download your latest 'pocorgtfo21.pdf' from your favorite samizdat site today.
MD5 (pocorgtfo21.pdf) = 7f3d3b147a4ba2c099ea3d252ed4a0d7
BTW, there's a little formatting problem. You can MOSTLY copy/paste code directly from the pocorgtfo21.pdf. The quotes have gone awry, and a space was added, so you have to edit it a bit.

But the fact that you can still work with 4th century Ogham script is AMAZING. ImageImage
Damn, the formatting lost a line -- the 'counts[3]' declaration disappeared.

But the original source is contained within the PDF, which you can extract thusly. (Among the oddities of PoC||GTFO is the fact that the PDF is also a ZIP).

Image

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with erratarob.bsky.social

erratarob.bsky.social Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @ErrataRob

Nov 16, 2024
🧵So let's talk about the difficulties Netflix is having streaming the Tyson v Paul fight, how the stream gets from there to your TV/computer. This will a longish thread.
In 1985 on his first fight, TV technology was based upon "broadcasts". That meant sending one copy of a video stream to thousands, often millions of receivers. A city would send the signal to a radio tower and broadcast that signal across a wide area.
In today's Internet, though, everybody gets their own stream. There is no broadcasting, no sharing of streams. Every viewer gets their own custom stream from a Netflix server. That we can get so many point-to-point stream across the Internet is mind boggling.
Read 24 tweets
Sep 17, 2024
By the way, the energy density of C4 is 6.7 megajoules/kilogram.
The energy density of lithium-ion batteries is about 0.5 megajoules/kilogram.
C4 will "detonate" with a bang.
Lithium-ion batteries will go "woosh" with a fireball, if you can get them to explode. They conflagrate rather than detonate. They don't even deflagrate like gun powder.
To get a lithium-ion battery to explode (in a fireball) at all, you have to cause physical damage, overcharge it, or heat it up.
Causing heat is the only way a hacker could remotely cause such an event.
Read 8 tweets
Jul 21, 2024
I don't want to get into it, but I don't think Travis is quite right. I mean, the original 25million view tweet is full of fail and you should always assume Tavis is right ....

...but I'm seeing things a little differently.
🧵1/n
2/n
DON'T TRY THIS AT HOME

I'm a professional, so I can take the risk of disagreeing with Tavis. But this is just too dangerous for non-professionals, you'll crash and burn. Even I am not likely to get out of this without some scrapes.
3/n
To be fair, we are all being lazy here. We haven't put the work in to fully reverse engineer this thing. We are just sifting the tea leaves. We aren't looking further than just these few lines of code. Image
Read 14 tweets
Jun 18, 2024
The reason IT support people are so bitter is that YOU (I mean YOU) cannot rationally describe the problem:

You: The Internet is down
IT: How do you know the Internet is down?
You: I can't get email.
IT: Is it possible that the email servers are down and the Internet is working just fine? Can you visit Twitter on your browser?
You: Yes, I can visit the twitter website.
IT: Is there any reason other than email to believe the Internet is down?
You: The last time I couldn't get email it was because the Internet was down.

The fact that IT doesn't call you a blithering idiot on every support call demonstrates saintly restraint, even if a little bit of their frustration leaks through.
A lot of good replies to my tweet, but so far this is the best:
I very much like this rebuttal. I was think of "driving a car" analogy, but this tweet says it much better.
Read 5 tweets
Apr 12, 2024
Uh, no, by any rational measure, only Trump has had respect for the forum.

Televised debates aren't about "debate" but charisma and media training, where they craft an answer regardless of whether they believe it.

Trump is the only candidate who gives sincere answers.
Trump is pure evil, the brutality of his answers appeals to ignorant brutes who reject all civilized norms.

But the yang to Trump's yin is a liberal elite like Rosen whose comfortable with the civilized norm of lying politicians who play this game of deceitful debates.
To be fair, Biden (and Obama and Bush before him) have stood up for important democratic principles, the ones that Trump flatly reject. But still, the system has gotten crusty. There's no reason to take presidential debates seriously as Rosen does.
Read 4 tweets
Mar 21, 2024
I've read through it.

It's the same as all Ben Cotton's analysis's, looking for things he doesn't understand and insisting these are evidence of something bad, that the only explanation is his conspiracy-theory.

I can't explain the anomalies he finds, either, but in my experience as a forensics expert, I know that just because I can't explain it doesn't mean there isn't a simple explanation.

For example, he points to log messages about mismatched versions. I know from experience that such messages are very common, I even see them in software that I write. It's the norm that when you build something from a lot of different software components, that they will not be perfectly synchronized.

That he would make such claims based solely on log messages of mismatched versions proves that he's really not competent -- or at least, very partisan willing to be misrepresent things.
In particular, I disagree with his description of these files. In the C#/.NET environments, creationg of new executables is common. In particular, these are represent web server files. It's quite plausible that as the user reconfigures the website, that these executables will be recreated.

I don't know for certain. I'd have to look at Dominion in more detail. I just know that if any new C#/.NET executables appear in the system that they are not automatically new software.Image
The certification process looks haphazard and sloppy to me, so it's easy for me to believe that uncertified machines were used in elections.

But nothing in Ben Cotton's report suggests to me that this happened. He's not looking for an explanation for the anomalies he finds, he already has an explanation, and is looking for things that the ignorant will believe is proof of that explanation.
Read 4 tweets

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us!

:(