Account Share

 

Thread by @colmmacc: "O.k., tweet-thread time! We have to to talk about constant-time systems! You might think that's all about cryptography and security, but act […]"

30 tweets, 5 min read
O.k., tweet-thread time! We have to to talk about constant-time systems! You might think that's all about cryptography and security, but actually this is about availability! Bonus content: Maxwell's Daemon. It's good stuff! NO REFUNDS.
O.k. so some basics ... constant-time algorithms, code, designs, whatever are often notated as O(1) ... which means that there's no relationship between the time taken and the input. For example a pregnancy is O(1), whether you're pregnant with one kid, or twins or triplets.
O(1) is often also short-hand for constant-work; where we spend the same amount of energy or whatever, no matter the inputs/outputs.
Being O(1) often matters for cryptography and security. For example if you check a password sequentially ...

for (int i = 0; i < pwlen; i++) {
if (input[i] != pw[i]) {
return FAILURE;
}
}
return SUCCESS;

that's not O(1) ... and it's bad.
it's bad because it takes longer to run if the password nearly matches, and attackers can measure that, and this is the classic example of a side-channel, and byte by byte they can totally crack your password like this just by measuring time.
The fix is to make it O(1) with awful bitwise operations. It's ugly and mean and hard to follow, but something like:

difference = 0;
for (int i = 0; i < pwlen; i++) {
difference |= pw[i] ^ input[I];
}
return difference == 0;

don't worry if you don't understand this.
Anyway, so Crypto people have been studying this for a long time. Our Automated Reasoning Group built a tool to analyze code and tell you if it's O(1) ... or how close it is ... github.com/awslabs/s2n/tr…
But it turns out that O(1) patterns matter for distributed systems and availability EVEN MORE! Completely unrelated fields really, it's surprising, so let's dig in ...
O.k. so imagine a very simple control plane. Let's say you have a bunch of servers and they do things for customers and their users. Customer gets some knobs and dials, a configuration they can edit. This is lots and lots of web services.
So customer makes a change. What do we do? Well it goes to an API, and that API produces a diff or a delta or whatever you want to call it and sends it to the servers. The servers then apply the patch, and the new config is in place. SIMPLE, RIGHT?
Well, no, sometimes servers are down and errors happen, so you need a workflow engine to drive retries. Oh and that implies you have some way to tell if the change even made it there, so you need a poller or a pusher or something to monitor config propagation status. YOU GET ME.
But then we build all that, and it works, and changes get integrated and customer configs change in seconds, and WE'RE GOOD, RIGHT?
We're good until one day we're not. Imagine a big event happens, maybe a power outage, or a spike on the internet due to the Super Bowl or something, and for whatever reason a bunch of customers all try to make changes at the same time?
The API gets choked, the workflow gets backed up, the status monitors start to lag. OUCH. Even worse, some customers start undoing their changes because they don't see them happening quickly. Now we have pointless changes stuck in the system that no-one even wants!! OUCH OUCH.
Now, let's try a much better, O(1) control plane. DUMB AS ROCKS. Imagine if instead that the customer API pretty much edits a document directly, like a file on S3 or whatever. And imagine the servers just pull that file every few seconds, WHETHER IT CHANGED OR NOT.
This system is O(1). The whole config can change, or none of it, and the servers don't even care. They are dumb. They just carry on. This is MUCH MUCH more robust.
It never lags, it self-heals very quickly, and it's always ready for a set of events like everyone changing everything at once. The SYSTEM DOES NOT CARE.
This is how we build the most critical control planes at AWS. For example if you use Route 53 health checks, or rely on NLB, or NAT GW, the underlying health statuses are *ALWAYS* being pushed around as a bitset. ALWAYS ALWAYS.
A few statuses can change (which is normal), and the system reacts, or they can ALL change, and it will react just the same (which is abnormal). We could have lots of targets suddenly disappear .... break ...
"I felt a great disturbance in the Force, as if millions of voices suddenly cried out in terror and were suddenly silenced. I fear something terrible has happened."
... end break ... and the system does not care or even know how many changed! It just works. Very very reliable, and unde-rapreciated.
BONUS CONTENT: O.k., what does this have to do with Maxwell's Daemon? WHAT EVEN IS THAT? Well the first kind of constant time problem, for crypto, is an example of minimizing information theoretical entropy. It's about minimizing perceptible disorder and what is computable.
The second is more like traditional thermodynamic entropy. With a constant-time control plane, we're trying to build a system that has a constant temperature, because it's always doing the same amount of work.
Ok now if I lost you with that, here's the simple version: when systems change a lot, they undergo stress, and getting stressed at the worst times, like under attacks or in the middle of outages, is really bad. At a high level these problems are the same.
And it turns out at a low level they are the same too! Information theoretical entropy, which is all about computability and bits, can be related to thermodynamic entropy, which is all about energy and work!
How? Well Maxwell proposed a thought experiment that showed a problem with thermodynamics. He imagined two rooms side by side, at even temperatures. He posited that a daemon, a ghost, could open a door between the two rooms and let faster moving molecules over to one side.
This sort of disproved existing thermodynamics, because the work it takes to open a door like that is less than the energy transferred due to the faster molecules changing sides. This was unresolved for a long time ... UNTIL ...
A bunch of folks showed that you don't just open the door, you have to observe and predict, and therefore COMPUTE, the position of the molecule. THIS TAKES ENERGY ... and they showed that the energy it takes is at least equivalent to the difference.
So in a big old nuts, strange-loop sort of way, these completely unrelated things are totally the same at both a MACRO and MICRO level. This blows my mind .... but may also make no sense ;-)
O.k. that's it, and AMA if you want! Oh and @threadreaderapp please unroll this mess I made.
Missing some Tweet in this thread?
You can try to force a refresh.
This content can be removed from Twitter at anytime, get a PDF archive by mail!
This is a Premium feature, you will be asked to pay $30.00/year
for a one year Premium membership with unlimited archiving.
Don't miss anything from @colmmacc,
subscribe and get alerts when a new unroll is available!
Did Thread Reader help you today?
Support me: I'm a solo developer! Read more about the story
Become a 💎 Premium member ($30.00/year) and get exclusive features!
Too expensive?
Make a small donation instead. Buy me a coffee ($5) or help for the server cost ($10):
Donate with 😘 Paypal or  Become a Patron 😍 on Patreon.com
Trending hashtags
Did Thread Reader help you today?
Support me: I'm a solo developer! Read more about the story
Become a 💎 Premium member ($30.00/year) and get exclusive features!
Too expensive?
Make a small donation instead. Buy me a coffee ($5) or help for the server cost ($10):
Donate with 😘 Paypal or  Become a Patron 😍 on Patreon.com