I'm a former Site Reliability Engineer at Facebook (albeit one from many moons ago).
Here's my modest take on #facebookdown :

Assuming the account of events below is true (it does sound plausible), we could still be looking at a few more hours of outage. Here's why ⤵️
1/n
What happened ?
It appears Facebook has inadvertently cut itself off from the rest of the Internet. More accurately, it mistakenly removed every "road sign" worldwide pointing at its network.
The "why" will be a very interesting post mortem to read. How hard is it to fix ?
2/n
It depends on 2 factors :
1. Whether there's a *workable* backchannel for remote engineers to access not only the systems themselves, but crucially the comms tools and documentation they need.
2. How well rehearsed their disaster recovery plan was for this kind of issue.
3/n
You can reasonably assume they had some sort of emergency out-of-band remote access set up.
But do they still have access to all the fancy internal comms + incident management tools + documentation right now ?
This is less certain and could slow down remediation hugely.
4/n
How often did they run drills for a "network down" situation ? Did they have contact numbers in their phones ? Documentation printouts with them ?

Honestly, an outage like this one is so far-fetched that it's unlikely they would have had every base covered.
5/n
WFH, of course, would have made this worse.
When the tools you rely on for daily comms with colleagues are unavailable, it adds an extra burden to the already sky-high cognitive load of troubleshooting a thorny, high-stakes technical issue.
6/n
Needless to say this is an extremely stressful event for the engineers involved, but Site Reliability / Operations engineers are in it for the adrenaline. They will no doubt remember this day for the rest of their careers. Sparing a thought for my former colleagues !
7/7
Ok, if the below is confirmed and having also spoken to former colleagues, I'm clearly more pessimistic as to their preparedness for the issues I explained above 😓
(for my French speaking peeps)
Excellent, more technical explanation from @AtaxyaNetwork here :
Ouch. Looks like we have part of the answer on whether remote access is still workable :
Now on a lighter note :

Here's a screenshot I made from one of Facebook's internal tools back in the day (~2010, I hear they're more sophisticated now !)

No pressure at all on the SREs 😅

Someone must have pressed OK this morning ? 🙂
(I think I clicked OK that day, YOLO)
For the 🇫🇷 : I should be on @_Techco_ on BFM Business at 8pm to discuss this Facebook outage.

Thread by Renaud Guerin (@RenaudGuerin)
