Profile picture
Colm MacCárthaigh @colmmacc
, 24 tweets, 6 min read Read on Twitter
Thursday tweet thread time! This one is all about what we do in Amazon s2n to prevent security issues similar to this week's libssh problem.
First and foremost, I don't mean to knock on libssh here or make it seem like libssh is anything other than an awesome project where people volunteer their time and thinking to improving the world's security. We analyze for interesting lessons, not to poke!
O.k. let's get into it. So in this libssh bug basically a libssh server (which are rare!) can be coaxed into letting you log in if you send it a message that says "Hey authentication succeeded".
This looks silly, and there's been knock knock jokes, but it's actually hard to catch with a code review and it's not an uncommon kind of problem. A few years back there was a similar issue in OpenSSL (and not to knock OpenSSL either!) ...
You could send it an out of order "ChangeCipherSuite" message and OpenSSL would skip to the authenticated state for a TLS/SSL connection, bypassing all sorts of important checks. State machines are hard!
It's hard to catch these in code reviews because the typical approach is to mix the code that parses messages and the code the proceeds to the next state. For example you might have a parser that handles the first "Hello" message ...
before it even runs some piece of code is probably like "hey this is a hello message, call that parser!" and then at the end that parser will have a series of branches that are basically "hey, if I'm in this mode with X enabled, go here next, otherwise go there next".
When you're reading the code locally everything makes sense, because you're not thinking of other entry points or things being out of order. There's nothing glaring there.
One way to catch these errors is to use a protocol fuzzer; something that sends messages in random orders, tries to jam things up. But those are hard to write ... the messages need to be somewhat valid to get so far, and the combinatorial space to explore can be HUGE.
In s2n we do something differently. We borrow a trick first learned from S3's Doug Lawrence, combined with another trick from @signalapp's @trevp__ .... we linearize the state machine into a static table, and we try to negotiate EVERYTHING just once. O.k. what does that mean ...
O.k. so rather than having message parsers and generators that can transition states at all, we make them standalone, all they do is parse or generate messages. So how do we know which ones to call and when?
We put everything in a table, and use function pointers, see s2n's at:…
and then we fully linearize every possible valid order of states ...…
With that approach, when we negotiate what options/modes/extensions you're going to use, we do it just once and handcuff ourselves to the corresponding linearized set of state transitions. That is the only order we will then follow. An out of order message shuts everything down.
That makes our state machine itself tiny, it's here:…

all it does is increment a value! Btw one of our coding tenets is never to mix control flow and message parsing in the same function. Big win here!
O.k. so that's the basic approach, but how do we then test it? how do we make sure that all of our valid states are correct? With some help from @galois , we formally verify it! Here's how ...
So we have a declarative definition of the valid TLS states that come from the RFC. It's written in Cryptol, and is here:…

now that's done by hand and we have to trust it, it's the "bottom turtle" in the proof, but it's easy to read and review.
We then use SAW to map our s2n_handshake C code to something that can be symbolically executed for all possible inputs. Here's the SAW code that does the mapping:…
So together, these actually check that our handshake code really does all TLS valid state transitions correctly, and only those. We caught bugs when we first added it too! Not a security issue, but it turns out we didn't support an obscure combination of OSCP and resumption.
Now my favorite part ... how do we know that the verification itself actually works? We run it on every build, but what does that say?
We *also* "verify the verifier" by forcing some negative test cases. We actually patch the code with known errors:…

and check that the formal verification fails! This is super important for checking formal verf btw, and often overlooked.
We also do fuzz tests (…) and we have integration tests with other implementations, to check for inter-op.
Our goal with tooling is to ensure that we have failsafes beyond code-review. For these state machine type bugs, it actually takes quite a lot! There's a lot more code verifying it than implementing it. It's not surprising that these issues crop up in real world software.
Anyway, that's it unless you want to AMA. I'll just ask @threadreaderapp to please unroll this thread!
Missing some Tweet in this thread?
You can try to force a refresh.

Like this thread? Get email updates or save it to PDF!

Subscribe to Colm MacCárthaigh
Profile picture

Get real-time email alerts when new unrolls are available from this author!

This content may be removed anytime!

Twitter may remove this content at anytime, convert it as a PDF, save and print for later use!

Try unrolling a thread yourself!

how to unroll video

1) Follow Thread Reader App on Twitter so you can easily mention us!

2) Go to a Twitter thread (series of Tweets by the same owner) and mention us with a keyword "unroll" @threadreaderapp unroll

You can practice here first or read more on our help page!

Did Thread Reader help you today?

Support us! We are indie developers!

This site is made by just three indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member and get exclusive features!

Premium member ($30.00/year)

Too expensive? Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal Become our Patreon

Thank you for your support!