Here we go, last talk of the day on the production track at #QConLondon, with @rdelvira and "an entertaining outage story" (his own words) about when Slack rolled out DNSSEC
"Who here has tried to roll out DNSSEC? OK, one person... Now, who failed when trying to roll out DNSSEC? Welcome to the club!" 😂 @rdelvira #QConLondon
"We planned DNSSEC carefully, with the necessary changes and replicated most of our DNS use cases... And you'll see later why I said 'most'..." @rdelvira #QConLondon
The challenge with DNSSEC is that it has to be applied per domain: it cannot be split per subdomain or rolled out progressively.
The rollout was done domain by domain, with slack.com being the last one... causing 3 different outages @rdelvira #QConLondon
Using a NetLog capture to debug issues in the Chromium engine behind the Slack client. @rdelvira #QConLondon
When it ran into issues with the rollout, the traffic team decided to roll back DNSSEC, confident after having done it many times in testing. That didn't take into account the 24h caching by DNS resolvers worldwide. @rdelvira #QConLondon
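To make the caching point concrete, here is a minimal sketch of the arithmetic, assuming the 24h figure from the talk is the TTL on the cached records. The function name and the timestamp are hypothetical, not from the talk.

```python
from datetime import datetime, timedelta, timezone


def worst_case_recovery(rollback_time: datetime, cached_ttl_seconds: int) -> datetime:
    """Latest moment a resolver may still serve the pre-rollback record.

    A resolver that cached the old record just before the rollback
    is allowed to keep serving it for the full TTL.
    """
    return rollback_time + timedelta(seconds=cached_ttl_seconds)


# Hypothetical rollback time, for illustration only.
rollback = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
stale_until = worst_case_recovery(rollback, 24 * 60 * 60)
print(stale_until.isoformat())
```

In other words, a rollback that looks instant in a test environment can take up to a full TTL to reach every resolver in production.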
You know you're in a really bad spot when you need to ask all DNS resolver operators to clear their caches for your main domain 😱. "This was a very big spreadsheet..." @rdelvira #QConLondon
For the last attempt, the traffic team at Slack went back and strengthened its runbooks (especially for very risky rollbacks) and increased observability on DNS (Route 53 logs for full visibility of DNS requests, with a breakdown by resolver). @rdelvira #QConLondon
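A breakdown by resolver could look roughly like the sketch below. It assumes the space-separated Route 53 public query log layout (version, timestamp, hosted zone ID, query name, query type, response code, protocol, edge location, resolver IP, EDNS client subnet). The sample lines and the function name are made up for illustration and are not Slack's actual tooling.

```python
from collections import Counter


def breakdown_by_resolver(lines):
    """Count queries per (resolver IP, response code) pair."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        response_code = fields[5]
        resolver_ip = fields[8]
        counts[(resolver_ip, response_code)] += 1
    return counts


# Made-up sample lines using documentation IP ranges.
sample = [
    "1.0 2024-01-01T00:00:00.000Z ZEXAMPLE example.com A NOERROR UDP FRA6 192.0.2.1 -",
    "1.0 2024-01-01T00:00:01.000Z ZEXAMPLE example.com A NOERROR UDP FRA6 192.0.2.1 -",
    "1.0 2024-01-01T00:00:02.000Z ZEXAMPLE example.com DS SERVFAIL UDP LHR3 198.51.100.7 -",
]

counts = breakdown_by_resolver(sample)
for (resolver, rcode), n in counts.most_common():
    print(resolver, rcode, n)
```

Grouping by resolver makes it easier to see whether failures are concentrated on a few large resolvers or spread across all of them.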