ELI5: Today's Facebook outage (based on no inside info). The Internet is designed in layers, like a house built on a foundation. No coordination between layers - but there's dependency. 1st meaningful part is "layer 3" - the Internet Protocol (IP) network layer. [thread] /1
The DNS generally (not always) depends on what is known as the User Datagram Protocol (UDP) for transport - which is layer 4 on top of IP at layer 3. Then the DNS protocol itself happens at layer 7 on top of IP addressing/routing and UDP transport. /2
Specifically most DNS runs on UDP port 53 (aka UDP/53). There are different "port" numbers - unique numbers - assigned for specific uses. This makes operations/troubleshooting/interop easier. Find 'em all at iana.org/assignments/se… /3
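To make "DNS over UDP/53" concrete, here is a minimal Python sketch that builds a DNS query by hand and (optionally) sends it over UDP. The function names and the default query ID are made up for illustration, and the resolver address is whatever you choose to pass in; real code would normally use a DNS library instead.

```python
import socket
import struct

DNS_PORT = 53  # the IANA-assigned port for DNS


def build_dns_query(hostname, query_id=0x1234):
    """Build a minimal DNS query for an A record (wire format per RFC 1035)."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, 0 other records
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length; a zero byte ends the name
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (IN, i.e. Internet)
    return header + qname + struct.pack("!HH", 1, 1)


def send_query(packet, server, timeout=2.0):
    """Send the query over UDP (layer 4) to `server` on port 53.

    If the IP layer cannot route to `server`, this simply times out,
    no matter how healthy the DNS software on the far end is.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(packet, (server, DNS_PORT))
        response, _ = sock.recvfrom(512)  # classic DNS-over-UDP message size
        return response
```

Calling `send_query(build_dns_query("facebook.com"), server)` with the address of a resolver you are allowed to use returns the raw response bytes, or raises a timeout error if nothing comes back.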
So DNS servers can be perfectly functional at layer 7 but if you can't get to them due to a lower level problem in the protocol stack, then you are out of luck. /4
Quick aside: while "it's always DNS" is often & sadly (as a DNS person) correct in various outages - that's not because it is unreliable as a protocol - but because so very much depends on it & transaction volumes (lookups) are quite high so issues/errors get amplified. /5
In this case (again, I have *no* inside knowledge) it appears it may be an IP routing (layer 3) issue, according to many people smarter than me. Until that IP network layer - which runs in network routers - is working, nothing in the upper layers that depend on it will work. /6
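The dependency between layers can be sketched as a toy Python model. The layer names and the `usable_layers` function are illustrative only; real stacks have more layers and much messier failure modes.

```python
# Layers listed bottom-up; each one depends on everything before it.
LAYERS = ["IP routing (layer 3)", "UDP transport (layer 4)", "DNS (layer 7)"]


def usable_layers(healthy):
    """Return the layers a client can actually use.

    `healthy` maps layer name -> whether that layer itself is working.
    A layer is only usable if every layer beneath it is usable too.
    """
    usable = []
    for layer in LAYERS:
        if not healthy.get(layer, False):
            break  # everything above a broken layer is unreachable
        usable.append(layer)
    return usable
```

With layer 3 marked unhealthy the result is an empty list, even if the DNS servers themselves are running fine.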
IP routing information is summarized, announced, and distributed using the Border Gateway Protocol (BGP). This enables one network to tell other networks that "I own X and Y IP address ranges, so send those packets to me, and use these routes to do so". /7
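A rough Python sketch of what announcing and withdrawing routes means for a router's lookups. This is not BGP itself (no sessions, path attributes, or policy), just the resulting table with longest-prefix matching. The `RoutingTable` class is hypothetical, and the prefixes and AS numbers come from ranges reserved for documentation.

```python
import ipaddress


class RoutingTable:
    """Toy routing table: prefix -> the network that announced it."""

    def __init__(self):
        self.routes = {}

    def announce(self, prefix, origin):
        # "I own this address range, send those packets to me"
        self.routes[ipaddress.ip_network(prefix)] = origin

    def withdraw(self, prefix):
        # The announcement is taken back; the route disappears
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin of the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no route: the packet cannot be delivered
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", "AS64500")    # documentation prefix and ASN
table.announce("192.0.2.128/25", "AS64501")  # more specific, so it wins
```

If every prefix covering an address is withdrawn, `lookup` returns `None` for it, which is the "nobody knows how to reach those servers" situation.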
As noted by Cloudflare's CTO @jgrahamc - there were a bunch of BGP changes just before the DNS problem occurred. This *may* suggest an erroneous BGP (IP routing) change. /8
This is also the theory put forward in some @nanog list discussion and in an apparently now-deleted Reddit post. mailman.nanog.org/pipermail/nano… /9
And the erroneous BGP change as root cause theory has another supporter in @DougMadory with always-interesting data from @kentikinc /10
Because stuff breaks on the Internet, engineers have built a lot of ways to independently troubleshoot & diagnose things on a distributed basis. One neat DNS tool is @dnsviz so check out a visual of the current status of the Facebook domain at dnsviz.net/d/facebook.com… /12
There are also tools to visualize & check BGP info and changes but I'll let others fill in that info... /13
I have worked countless outages over 25+ years & it can be horrible and stressful, so my sympathies to the ops teams working this outage! #hugops /end
@thousandeyes Oh - back in 'ye old Internet' days, ppl made router changes at a command line interface (CLI) on each router, 1 by 1. These days that is usually centralized & automated - so you make 1 change to a template & push it to 1,000s of routers/servers - so things can happen quickly.
@thousandeyes Of course network & systems engineers know this, so typically such tools let you do controlled distribution (aka soak) of a change to see what happens (because a QA lab is never 100% the same as production), do A/B tests, gradually deploy changes, manage rollback, schedule, etc.
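The gradual deploy and rollback idea can be sketched in Python like this. It is a simplified illustration, not how any particular vendor's tooling works: `staged_rollout`, its callbacks, and the stage fractions are all assumptions, and a real soak would also wait between stages.

```python
def staged_rollout(devices, apply_change, is_healthy, rollback,
                   stages=(0.01, 0.1, 1.0)):
    """Push a change to a growing fraction of devices, checking health each stage.

    Returns (succeeded, touched_devices). On a failed health check every
    touched device is rolled back and the rollout stops.
    """
    done = []
    for fraction in stages:
        target = max(1, int(len(devices) * fraction))
        for device in devices[len(done):target]:
            apply_change(device)
            done.append(device)
        # A real soak would wait here and watch production metrics
        if not all(is_healthy(d) for d in done):
            for d in reversed(done):
                rollback(d)
            return False, done
    return True, done
```

With 1,000 devices and the default stages, a bad change caught in the first stage touches only 10 devices instead of all of them.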
Here's another more technical take from DNS, down the authoritative DNS chain, to BGP: isc.sans.edu/forums/diary/F…
BTW, typical outage triage boils down to: (1) hi - who knows what is happening right now / what stats are we seeing (facts), (2) if it started at XX:XX UTC, what changes happened just prior to that?, (3) let's roll that change back right now, (4) working again?, (5) write it up.
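Step (2) of that triage, finding what changed just before things broke, can be sketched in Python. The change-log format (a list of dicts with `time` and `summary` keys) and the 30-minute window are assumptions for illustration.

```python
from datetime import timedelta


def changes_before(change_log, outage_start, window=timedelta(minutes=30)):
    """Return changes made within `window` before the outage, most recent first.

    `change_log` is a list of dicts with "time" (a datetime) and "summary".
    The most recent change is usually the first rollback candidate.
    """
    recent = [
        change for change in change_log
        if outage_start - window <= change["time"] <= outage_start
    ]
    return sorted(recent, key=lambda change: change["time"], reverse=True)
```

Keep all timestamps in UTC, as in the triage question itself, so changes logged in different time zones line up.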
Unfortunately there's a "fog of war" that happens & all sorts of false leads may be chased. IMO many issues boil down to either (1) a bad change pushed into systems or (2) insufficient capacity in some component part of a complex system. YMMV - the Internet is a complex organism.
Oh look, is that a reachable authoritative DNS server I see now? ;-)
Looking better by the moment
Yup...
Ok, so if layer 3 (IP) is back, and layer 4 & 7 for DNS (UDP/53) are working again, then we all need that other important layer 7 app we know and love as "the web" (HTTP). That may take a little more time as each layer has to stabilize before the next one can return to normal
Terminology pet peeve: the press often treats the two as the same thing, but the Internet is not the Web. The Web is just one component of the larger Internet. But I digress. ;-)
Now I log off to go have dinner with a good friend at his back yard fire pit. Our friendship predates Facebook & will live on in the future on whatever platforms arise after Facebook. Time to #optoutside - go spend some device-free time w/friends/family/pets or wander in nature🙂
