Here we go, last talk of the day on the production track at #QConLondon, with @rdelvira and "an entertaining outage story" (his own words) from when Slack rolled out DNSSEC
"Who here tried to roll out DNSSEC? OK, one person... Now who failed when trying to roll out DNSSEC? Welcome to the club!" 😂 @rdelvira #QConLondon
"We planned DNSSEC carefully, with the necessary changes and replicated most of our DNS use cases... And you'll see later why I said 'most'..." @rdelvira #QConLondon
The challenge with DNSSEC is that it has to be applied per domain: it cannot be split per subdomain or rolled out progressively.
The rollout was done per domain, with slack.com being the last one... causing 3 different outages @rdelvira #QConLondon
Using a netlog capture to debug issues in the Chromium engine behind the Slack client. @rdelvira #QConLondon
When the rollout hit issues, the traffic team decided to roll back DNSSEC, confident after having done it many times in testing. That didn't take into account the 24h caching by DNS resolvers worldwide. @rdelvira #QConLondon
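To make the caching point concrete, here is a minimal sketch of the arithmetic, not anything shown in the talk: a resolver that cached a record just before the rollback can keep serving it for the full TTL. The function name and the rollback timestamp are hypothetical; only the 24h figure comes from the talk.

```python
from datetime import datetime, timedelta, timezone


def worst_case_cache_expiry(last_served: datetime, ttl_seconds: int) -> datetime:
    """Latest time a resolver may still hold a record it cached at `last_served`."""
    return last_served + timedelta(seconds=ttl_seconds)


# Hypothetical rollback time, chosen only for illustration.
rollback_time = datetime(2021, 1, 1, 12, 0, tzinfo=timezone.utc)
ttl = 24 * 60 * 60  # the 24h cache duration mentioned in the talk

# Until this moment, some resolvers may still answer from the pre-rollback state.
print(worst_case_cache_expiry(rollback_time, ttl))
```

A rollback that looks instant in a test environment can therefore take up to a full TTL to reach every resolver in production.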
You know you're in a really bad spot when you need to ask all DNS resolver operators to clear their cache for your main domain 😱. "This was a very big spreadsheet..." @rdelvira #QConLondon
For the last attempt, the traffic team at Slack went back to strengthen runbooks (especially for very risky rollbacks) and increase observability on DNS (Route 53 logs for full visibility of DNS requests, with a breakdown by resolver). @rdelvira #QConLondon
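A breakdown by resolver could look roughly like the sketch below. This is not Slack's tooling: the space-separated field order is an assumption about the query log format (check it against your own logs), and the sample lines, zone ID, and IP addresses are made up for illustration.

```python
from collections import Counter

# Assumed field order of one space-separated query log line (verify before use).
FIELDS = [
    "version", "timestamp", "zone_id", "qname", "qtype",
    "rcode", "protocol", "edge", "resolver_ip", "edns_subnet",
]


def queries_by_resolver(lines):
    """Count queries per (resolver IP, query type), skipping malformed lines."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # not a line in the assumed format
        record = dict(zip(FIELDS, parts))
        counts[(record["resolver_ip"], record["qtype"])] += 1
    return counts


# Hypothetical sample lines using documentation-only IP ranges.
SAMPLE = [
    "1.0 2021-01-01T12:00:00Z ZEXAMPLE app.example.com A NOERROR UDP LHR1 192.0.2.10 -",
    "1.0 2021-01-01T12:00:01Z ZEXAMPLE example.com DNSKEY NOERROR UDP LHR1 198.51.100.7 -",
    "1.0 2021-01-01T12:00:02Z ZEXAMPLE example.com DNSKEY NOERROR UDP LHR1 198.51.100.7 -",
    "malformed line",
]

counts = queries_by_resolver(SAMPLE)
for (resolver, qtype), n in counts.most_common():
    print(resolver, qtype, n)
```

Grouping by resolver and query type shows which resolvers are asking for DNSSEC records at all, which helps when deciding which operators to contact.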
Continuing the #QConLondon production track with @yurynino, and using visual metaphors to understand our production data in a different way.
"In our field, observability is about humans and about how humans interact with technology" @yurynino #QConLondon
Collecting metrics and signals is only one part of the solution - observability has to come with good visualisation, and engineering a solution for humans. @yurynino #QConLondon
"Observability is the capability to continuously generate and discover actionable insights based on signals from the system under observation with the goal to influence that system" and that's for both people (e.g. debugging) and automation (e.g. autoscaling) @mhausenblas #QConLondon
Observability can go beyond usual metrics, logs and traces: @mhausenblas introducing profiles and eBPF #QConLondon
First talk of the day on the #QConLondon production track, by @glenathan and a challenge: can we build observable services without logs?
"We needed to build a new service in Go, without our usual existing scaffolding in Clojure... That led to some bikeshedding but also gave a chance for experimentation!" @glenathan #QConLondon
"Before this, we spent a lot of money to know what our applications were doing in production"