ELI5: Today's Facebook outage (based on no inside info). The Internet is designed in layers, like a house built on a foundation. There's no coordination between layers - but each one depends on the layers below it. The 1st meaningful part is "layer 3" - the Internet Protocol (IP) network layer. [thread] /1
The DNS generally (not always) depends on what is known as the User Datagram Protocol (UDP) for transport - which is layer 4 on top of IP at layer 3. Then the DNS protocol itself happens at layer 7 on top of IP addressing/routing and UDP transport. /2
Specifically most DNS runs on UDP port 53 (aka UDP/53). There are different "port" numbers - unique numbers - assigned for specific uses. This makes operations/troubleshooting/interop easier. Find 'em all at iana.org/assignments/se… /3
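To make "DNS at layer 7 riding on UDP/53 at layer 4" concrete, here is a minimal Python sketch that hand-builds a DNS question in the standard wire format and sends it over UDP. The function names, the fixed query ID and the 2-second timeout are my own illustrative choices, not anything from Facebook or a particular DNS library.

```python
import socket
import struct

DNS_PORT = 53  # the well-known port assigned to DNS


def build_dns_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query message (layer 7 payload)."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, 0 other records
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE (1 = A record) and QCLASS (1 = IN, the Internet class)
    return header + qname + struct.pack("!HH", qtype, 1)


def send_query(server_ip: str, name: str, timeout: float = 2.0) -> bytes:
    """Send the query over UDP (layer 4) to server_ip (layer 3 must route it)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_dns_query(name), (server_ip, DNS_PORT))
        data, _ = sock.recvfrom(512)  # classic DNS-over-UDP answers fit in 512 bytes
        return data
```

If the server's IP address is not reachable, `send_query` simply times out, no matter how healthy the DNS software on that server is.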
So DNS servers can be perfectly functional at layer 7 but if you can't get to them due to a lower level problem in the protocol stack, then you are out of luck. /4
Quick aside: while "it's always DNS" is often & sadly (as a DNS person) correct in various outages - that's not because it is unreliable as a protocol - but because so very much depends on it & transaction volumes (lookups) are quite high so issues/errors get amplified. /5
In this case (again, I have *no* inside knowledge) it appears it may be an IP routing (layer 3) issue, based on reports from many people smarter than me. Until that IP network layer - which runs in network routers - is working, nothing in the upper layers that depend on it will work. /6
IP routing information is summarized, announced, and distributed using the Border Gateway Protocol (BGP). This enables one network to tell other networks that "I own X and Y IP address ranges, so send those packets to me, and use these routes to do so". /7
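A rough way to picture what a router does with those announcements: it keeps a table of prefixes and picks the most specific one that covers the destination. The sketch below is a toy model in Python, not how any real router or BGP implementation is written. The prefixes and AS numbers are reserved documentation values and the function name is made up for illustration.

```python
import ipaddress


def best_route(routing_table: dict, destination: str):
    """Return (prefix, next_hop) for the longest matching prefix, or None."""
    dest = ipaddress.ip_address(destination)
    matches = []
    for prefix, next_hop in routing_table.items():
        network = ipaddress.ip_network(prefix)
        if dest in network:
            matches.append((network, next_hop))
    if not matches:
        return None  # no announced route covers this address: unreachable
    # Longest prefix wins: a /25 is more specific than a /24
    network, next_hop = max(matches, key=lambda m: m[0].prefixlen)
    return str(network), next_hop


# Hypothetical table built from announcements ("I own X, send it to me")
table = {
    "198.51.100.0/24": "AS64500",
    "198.51.100.128/25": "AS64501",
}
```

If the announcements covering an address are withdrawn (deleted from `table` in this toy model), `best_route` returns `None`, and packets for that address have nowhere to go.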
As noted by Cloudflare's CTO @jgrahamc - there were a bunch of BGP changes just before the DNS problem occurred. This *may* suggest an erroneous BGP (IP routing) change.
Because stuff breaks on the Internet, engineers have built a lot of ways to independently troubleshoot & diagnose things on a distributed basis. One neat DNS tool is @dnsviz so check out a visual of the current status of the Facebook domain at dnsviz.net/d/facebook.com… /12
There are also tools to visualize & check BGP info and changes but I'll let others fill in that info... /13
I have worked countless outages over 25+ years & it can be horrible and stressful, so my sympathies to the ops teams working this outage! #hugops /end
some more views - via the sharp team at @thousandeyes
Oh - back in 'ye old Internet' days, ppl made router changes at a command line interface (CLI) on each router, 1 by 1. These days that is usually centralized & automated - so you make 1 change to a template & push it to 1,000s of routers/servers - so things can happen quickly.
Of course network & systems engineers know this, so typically such tools let you do controlled distribution (aka soak) of a change to see what happens (because a QA lab is never 100% the same as production), do A/B tests, gradually deploy changes, manage rollback, schedule, etc.
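As a sketch of what "gradually deploy and roll back" can look like, here is a small Python function that pushes a change to growing slices of a fleet and undoes everything if a health check fails. The stage fractions and the `apply_change`, `rollback_change` and `is_healthy` callbacks are hypothetical placeholders. Real deployment tooling is far more involved, and I have no knowledge of what Facebook uses.

```python
def staged_rollout(devices, apply_change, rollback_change, is_healthy,
                   stages=(0.01, 0.1, 0.5, 1.0)):
    """Apply a change in growing stages; roll back everything on failure.

    Returns True if the change reached every device, False if rolled back.
    """
    changed = []
    for fraction in stages:
        # Number of devices that should have the change after this stage
        target = max(1, int(len(devices) * fraction))
        for device in devices[len(changed):target]:
            apply_change(device)
            changed.append(device)
        # "Soak": check everything changed so far before going wider
        if not all(is_healthy(d) for d in changed):
            for device in reversed(changed):
                rollback_change(device)
            return False
    return True
```

The point of the small first stage is that a bad change hurts 1% of the fleet instead of all of it.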
Here's another more technical take from DNS, down the authoritative DNS chain, to BGP: isc.sans.edu/forums/diary/F…
BTW, typical outage triage boils down to: (1) hi - who knows what is happening right now / what stats are we seeing (facts), (2) if it started at XX:XX UTC, what changes happened just prior to that?, (3) let's roll that change back right now, (4) working again?, (5) write it up.
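Step (2) is often just a time-window query against a change log. A minimal Python sketch, assuming the log is a list of (UTC timestamp, description) pairs and that a 30-minute look-back is a sensible default. Both are my assumptions, and any entries you feed it here would be made-up examples.

```python
from datetime import datetime, timedelta, timezone


def changes_before(change_log, outage_start, window=timedelta(minutes=30)):
    """Return changes made within `window` before the outage, newest first."""
    candidates = [
        entry for entry in change_log
        if outage_start - window <= entry[0] <= outage_start
    ]
    # Newest first: the change closest to the outage start is the first suspect
    return sorted(candidates, key=lambda entry: entry[0], reverse=True)
```

The first item in the result is the natural candidate for step (3), the rollback.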
Unfortunately there's a "fog of war" that happens & all sorts of false leads may be chased. IMO many issues boil down to either (1) a bad change pushed into systems or (2) insufficient capacity in some component part of a complex system. YMMV - the Internet is a complex organism
Oh look, is that a reachable authoritative DNS server I see now? ;-)
Looking better by the moment
Yup...
Ok, so if layer 3 (IP) is back, and layer 4 & 7 for DNS (UDP/53) are working again, then we all need that other important layer 7 app we know and love as "the web" (HTTP). That may take a little more time as each layer has to stabilize before the next one can return to normal
A terminology pet peeve: the press often treats the Internet and the Web as the same thing. They are not - the web is just one component of a larger Internet. But I digress. ;-)
Now I log off to go have dinner with a good friend at his back yard fire pit. Our friendship predates Facebook & will live on in the future on whatever platforms arise after Facebook. Time to #optoutside - go spend some device-free time w/friends/family/pets or wander in nature🙂
I see reports of issues accessing Parler. No one is blocking; whoever manages Parler's DNS let their encryption key expire on 4/17/21 at 07:44 UTC. DNS security is failing - as designed. Most key rollover failures indicate monitoring & automation gaps. See dnsviz.net/d/parler.com/Y…