Tuesday Technical Tweet Thread Time! Let's go on the roller coaster of what happens at a low level when a DNS server sends a 4,000 byte EDNS0 response to a client whose MTU is 1200 bytes. Confused already? Don't worry, we'll break it down. I promise it's super interesting.
o.k. so DNS, the so-called "phonebook of the internet" (if you look the other way and ignore that that's a better metaphor for Google) ... ANYWAY ... DNS runs over the User Datagram Protocol (UDP).
If you know anything about UDP it's that it's a "Fire and Forget" protocol, it can be lossy. From the perspective of an application, you send a packet, and it gets to the other end or it doesn't. If you want reliability, you have to retry yourself or have some kind of fallback.
UDP is a "layer 4" protocol. It runs on top of the Internet Protocol .. which is a "layer 3" protocol. Scare quotes because the OSI layering model is insanely wrong and confusing in modern contexts.
Here's what an IP header looks like. Study it. There will be an exam.
And here's a UDP header ...
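(The header screenshots don't survive the unroll, so here's a rough stand-in: a Python sketch that pulls the main fields out of an IPv4 + UDP packet. The field layout comes from RFC 791 and RFC 768; the parsing is simplified and assumes no IP options.)

    import struct

    def parse_headers(packet: bytes):
        """Pull the interesting fields out of an IPv4 + UDP packet (no IP options)."""
        # IPv4 header (20 bytes): version/IHL, DSCP/ECN, total length, identification,
        # flags + fragment offset, TTL, protocol, checksum, source, destination
        ver_ihl, tos, total_len, ident, flags_frag, ttl, proto, ip_csum, src, dst = \
            struct.unpack("!BBHHHBBH4s4s", packet[:20])
        # UDP header (8 bytes): source port, destination port, length, checksum
        sport, dport, udp_len, udp_csum = struct.unpack("!HHHH", packet[20:28])
        return {
            "ip.total_length": total_len,  # 16-bit field, so at most 65535
            "ip.id": ident,                # shared by every fragment of one datagram
            "ip.frag": flags_frag,         # flags plus 13-bit fragment offset
            "ip.ttl": ttl,
            "ip.protocol": proto,          # 17 means UDP
            "udp.src_port": sport,
            "udp.dst_port": dport,
            "udp.length": udp_len,         # also a 16-bit field
        }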
O.k. so back to DNS for a bit. So DNS is a request-response protocol. Requests are like "What's the IP address for amazon.com" and responses are like "here's the IP addresses for amazon.com".
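If it helps to see that on the wire, here's a minimal Python sketch of a raw "what's the A record for amazon.com" query going out over UDP. The resolver address (8.8.8.8) is just a stand-in for the example.

    import socket
    import struct

    def build_query(name: str, qtype: int = 1) -> bytes:
        # 12-byte DNS header: ID, flags (just RD set), 1 question, 0 answer/authority/additional
        header = struct.pack("!HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
        # The question: length-prefixed labels, a terminating zero byte, then QTYPE and QCLASS (1 = IN)
        qname = b"".join(bytes([len(label)]) + label.encode() for label in name.split(".")) + b"\x00"
        return header + qname + struct.pack("!HH", qtype, 1)

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(build_query("amazon.com"), ("8.8.8.8", 53))   # qtype 1 means "give me A records"
    response, _ = sock.recvfrom(512)                          # the classic 512-byte limit, more on that below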
IP-based networks have limits for how much data they can handle in a single packet. Now if you study the IP and UDP headers, you'll see that they both have "length" fields, and these fields are 16-bits. So you'd think that we should be able to send up to 65535 bytes in a packet.
That would be *way* too simple. Historically, the size of the packets we could send has actually been limited by the underlying technology. Ethernet is often limited to 1500 bytes per packet, or up to 9,000 if you enable a special mode called "Jumbo Frames".
Satellites, radio systems, things like POS (Packet over SONET), ATM, DSL, Token Ring, and more: these could all have different MTUs that actually limit how much data you could send or receive in one packet.
The underlying reasons are usually esoteric, like how long you're allowed to send signals for at the electrical level. Anyway, main point: different mediums = different MTUs. MTU stands for "Maximum Transmission Unit".
Different networks can be linked though, and you can only get a packet through all of the links if it's <= the lowest MTU of any of the links, so you need a way to discover the whole path's MTU. We'll come back to this, I promise. Placeholder for now.
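In other words, with some made-up hops:

    link_mtus = [9000, 1500, 1200, 1500]   # hypothetical links between client and server
    path_mtu = min(link_mtus)              # 1200: the biggest packet that can cross every hop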
There's also a minimum that *any* IP host has to support: everything has to be able to accept a datagram of at least 576 bytes. So DNS traditionally takes a shortcut around path MTU discovery and just says "let's use UDP and just stick to that minimum".
So it says "requests and responses have to fit in 512 bytes", which, when you add the 8 byte UDP header and the 20 byte IP header, still leaves 36 bytes of space for IP options. So it all fits.
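The budget, spelled out:

    MIN_DATAGRAM = 576   # every IPv4 host has to accept at least this much
    IP_HEADER    = 20    # without options
    UDP_HEADER   = 8
    DNS_MAX_UDP  = 512   # the classic DNS message limit

    spare = MIN_DATAGRAM - IP_HEADER - UDP_HEADER - DNS_MAX_UDP   # 36 bytes left for IP options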
Fitting a DNS request into 512 bytes isn't too hard, it's why DNS has a limit on how big domain names can be. But since humans have to type them sometimes, long ones would always be a pain. No big deal.
Fitting an entire DNS response into 512 bytes was easy to start out with .. just a few IPs, but it's gotten harder over time. Email, Anti-Spam measures, IPv6, and DNSSEC (which is garbage, but that's a different topic) have all pushed the size of responses up and up.
DNS has a built in mechanism to handle this, called truncation. If the response that a server needs to send is too big, it sends as much as it can and then marks a bit that means "This response was truncated".
The requester then retries the request over TCP, instead of UDP. Because TCP is intended for "bigger" messages. Two problems with this: sometimes people block TCP DNS without realizing, and TCP *still* has to figure out what the path MTU is.
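Roughly what that fallback looks like for a client, as a Python sketch (it reuses the build_query sketch from earlier, and the resolver address is still just a stand-in):

    import socket
    import struct

    def resolve(query: bytes, server: str = "8.8.8.8") -> bytes:
        # Try UDP first
        udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        udp.sendto(query, (server, 53))
        response, _ = udp.recvfrom(4096)

        flags = struct.unpack("!H", response[2:4])[0]
        if not flags & 0x0200:                 # TC bit clear: the UDP answer is complete
            return response

        # Truncated: retry over TCP. DNS-over-TCP prefixes every message with a
        # 2-byte length so messages can be framed on the byte stream.
        tcp = socket.create_connection((server, 53))
        tcp.sendall(struct.pack("!H", len(query)) + query)
        length = struct.unpack("!H", tcp.recv(2))[0]
        data = b""
        while len(data) < length:
            data += tcp.recv(length - len(data))
        return data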
So here's how path MTU discovery works for TCP. The TCP connection starts with its best guess of what the MTU is, and calculates a "Maximum Segment Size" from this. This isn't the TCP window. It's the maximum data that TCP will put in a single packet.
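For a typical Ethernet-ish guess the arithmetic is just:

    mtu = 1500
    mss = mtu - 20 - 20   # strip the IPv4 and TCP headers: 1460 bytes of payload per segment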
The client sends SYN, the server sends SYN|ACK, usual TCP handshake. These packets are small and rarely trigger path MTU discovery. For DNS, the request packet will be small too and likely won't trigger anything.
But when the server sends the response, that will be big, maybe too big. But it sends it out anyway. That packet gets as far as it gets. If it gets to a link that has a smaller MTU, that link sends back an Internet Control Message Protocol (ICMP) message that says "MTU exceeded".
It sends this back to the sender ... the DNS server in this case. And so that the sender has a clue about what even triggered the error, it includes the top part of the packet that triggered it.
Basically the server tried to send a long letter by relay mail, and got to a relay that only has small envelopes. So that relay tore the original letter in half and sent that half back in a small envelope saying "I only have small envelopes, so try again, good luck".
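In protocol terms the "small envelope" is an ICMP Destination Unreachable message, type 3 code 4 ("fragmentation needed"), and it carries the next-hop MTU plus the start of the offending packet. A rough decoding sketch:

    import struct

    def parse_frag_needed(icmp: bytes):
        """Decode an ICMP 'fragmentation needed' error (type 3, code 4)."""
        icmp_type, code = icmp[0], icmp[1]
        if icmp_type != 3 or code != 4:
            return None
        next_hop_mtu = struct.unpack("!H", icmp[6:8])[0]
        # Everything after byte 8 is the IP header plus the first bytes of the
        # packet that caused the error, enough for the sender to match it up.
        offending_packet = icmp[8:]
        return next_hop_mtu, offending_packet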
O.k. so now the sender knows a new lower MTU for that path, and it caches it. The kernel keeps a cache of all MTUs it knows for all destinations. Now it recomputes a new TCP MSS and resends the data but in smaller packets. Yay! This is part of why TCP is reliable.
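On Linux you can peek at that cache from a connected socket; a sketch (the numeric constants are from <linux/in.h>, the peer address is a placeholder):

    import socket

    IP_MTU_DISCOVER = 10   # Python may also expose these as socket.IP_MTU_DISCOVER etc.
    IP_PMTUDISC_DO  = 2    # set the DF bit and let ICMP errors update the cached path MTU
    IP_MTU          = 14

    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
    s.connect(("192.0.2.53", 53))                      # placeholder peer
    print(s.getsockopt(socket.IPPROTO_IP, IP_MTU))     # e.g. 1500, or lower if the path is narrower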
This whole process takes a damn while though; the client had to fall back to TCP, and then the path MTU discovery had to happen, and finally the client gets the response ... as long as TCP wasn't blocked to begin with.
So the DNS folks are like "Wouldn't it be great if we could just send large responses over UDP". And it's true. It would. And so it was invented, as part of EDNS0, a general extension mechanism for DNS.
With EDNS0 enabled, the client can include a little fake sort of record that says "hey, I support EDNS0, and also I'm good if you send me large UDP responses ... up to 4,000 bytes, say".
If a server also supports EDNS0, it can just send larger UDP responses, rather than truncating and falling back to TCP. We're now nearing the top of the rollercoaster.
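That "little fake sort of record" is the OPT pseudo-record: it goes in the additional section and reuses the CLASS field to carry the advertised UDP size. A byte-level sketch, tacked onto the build_query example from earlier:

    import struct

    def edns0_opt(udp_payload_size: int = 4000) -> bytes:
        return (
            b"\x00"                               # root name
            + struct.pack("!H", 41)               # TYPE 41 = OPT
            + struct.pack("!H", udp_payload_size) # CLASS field reused as "my max UDP size"
            + struct.pack("!I", 0)                # TTL field reused for extended flags/version
            + struct.pack("!H", 0)                # no option data
        )

    query = build_query("amazon.com")
    query = query[:10] + struct.pack("!H", 1) + query[12:] + edns0_opt(4000)   # bump ARCOUNT from 0 to 1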
What happens when we send a large UDP response? Well as we saw earlier the UDP and IP headers both have length fields. The way it works out is that fragmentation actually happens at the IP layer ...
Suppose we try to send a 4,000 byte UDP datagram over a 1500 byte MTU IP network, what happens is that it gets broken into 3 IP "fragments". The first contains the first part of the UDP message, including the UDP header itself, and the next packets follow on from there.
Each fragment will have the same IP "identification" header (IPID) and different fragment offsets. It's the kernel's job to wait for all of the fragments to show up and pass them on to an application.
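The offsets are counted in 8-byte units, which is why they look odd. A sketch of the math for our 4,000 byte datagram over a 1500 byte MTU:

    def fragment_offsets(udp_datagram: int, mtu: int = 1500):
        """How IPv4 chops one UDP datagram (header + data) into fragments."""
        per_frag = (mtu - 20) // 8 * 8     # 20-byte IP header; fragment data comes in 8-byte units
        frags, offset = [], 0
        while offset < udp_datagram:
            size = min(per_frag, udp_datagram - offset)
            frags.append({"offset": offset // 8, "bytes": size, "more": offset + size < udp_datagram})
            offset += size
        return frags

    print(fragment_offsets(4000))
    # [{'offset': 0,   'bytes': 1480, 'more': True},    <- the only piece with a UDP header in it
    #  {'offset': 185, 'bytes': 1480, 'more': True},
    #  {'offset': 370, 'bytes': 1040, 'more': False}]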
Path MTU discovery works the same as with TCP. If the first fragment is too big, the sender will get an error. But since applications usually don't retry UDP responses, it can also just show up as that message being entirely dropped.
It also means that fragments of the message look strange to many parts of the network, like firewalls and switches. UDP packets without UDP headers! How are they supposed to know whether the packet is allowed or not? How is flow-switching supposed to work? The internet shrugs.
We often end up building in the ability to relate these packets to one another in many places, and since that's expensive, they also have to be rate-limited. It's deeply messy.
That's what's going on down there though, and understanding that full picture is key to diagnosing some common network mysteries. O.k. I have a team meeting now, more later!
Meeting postponed! o.k. so what happens when the DNS server sends a 4K response and the MTU is 1200 bytes? Well the DNS server gets the error, and it fixes the *next* response, but that first response is lost. Super annoying. So the client has to retry, but then it works.
Some more fun: network paths don't have to be symmetrical, so MTUs don't either. If the client needs to send a large amount of data, this whole process happens for them too. A sender and a receiver can legitimately end up with different limits towards one another.
Because MTU discovery depends on state, and on ICMP messages being allowed, some folks do something like "MSS clamping" where for TCP they have the network actually meddle with the TCP connection (a little) to offer a different MSS to the other end.
It is surprising that the Internet works.