#strangeloop@frankc: How Tracing Uncovers Half-Truths in Slack’s CI Infrastructure
#strangeloop@frankc: Hierarchy of needs: bottom — observability, middle — resilience, top — [too slow]
#strangeloop@frankc: Slack has grown a lot, very fast, over 6 years. How do we create better engineering tools for creating, testing, and deploying code and features?
#strangeloop@frankc: Slack has 60 people on internal tooling teams. User experiences in CI/CD span multiple days, platforms, and workflows.
#strangeloop@frankc: You can create massive business results when teams share a language and reason about problems using that shared language. How do we create a shared infrastructure “language”?
#strangeloop@frankc: Why use distributed tracing? “It’s slow” is the hardest problem to debug in distributed systems. “It’s flaky” is the problem most infrastructure teams struggle with.
#strangeloop@frankc: To reason about this, we need to create an event that contains context — a span. Hello, @honeycomb_io!
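[Editor’s sketch, not from the talk: what a context-carrying span for a CI step might look like, using the OpenTelemetry Python SDK. The span and attribute names are hypothetical, not Slack’s schema.]

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to stdout for this sketch; a real setup would ship them
# to a tracing backend such as Honeycomb.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ci")

def run_build_step() -> None:
    pass  # stand-in for the actual CI work

# A span is an event plus context: dimensions describing this CI run.
with tracer.start_as_current_span("checkout_and_build") as span:
    span.set_attribute("ci.branch", "main")
    span.set_attribute("ci.commit_sha", "abc123")
    span.set_attribute("ci.worker_pool", "jenkins-us-east")
    run_build_step()
```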
#strangeloop@frankc: Slack uses an internal CI/CD tool to bridge different systems that are involved in build, deployment, testing, etc.
#strangeloop@frankc: Cardinality for CI traces is very different from other use cases, because CI has lower volume and higher criticality. You don’t need sampling, but you do need to address almost all the faults.
#strangeloop@frankc: [this talk is extremely dense, so if you want a better view of the content, download it from the StrangeLoop site tomorrow]
#strangeloop@frankc: In 2019, we realized that there was variance in unexpected areas. We built a hypothesis-testing observability tool to see whether it would be useful for addressing observability problems.
#strangeloop@frankc: Cross-service tracing in CI starts with the story of a bad day called “Jenkins Queue Large and Cursed”. We added tracing to try to find the problem, found something previously uninstrumented, and could address the real problem.
#strangeloop@frankc: Agreeing on dimensions that are useful to one or many teams, and making those dimensions visible, lets other teams reuse them.
#strangeloop@frankc: We worked with engineers to instrument their platforms, and then we could see what was slowing them down. We used a hypothesis to decide whether to revert or change the things we were tweaking, but that was for confidence; we didn’t strictly need it.
#strangeloop@frankc: Circuit breaker pattern used to gate test behavior! Fascinating. When the circuit breaker is open, the tests are deferred because they would come back flaky anyway. The tests are turned back on when the build stabilizes.
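[Editor’s sketch, not from the talk: a minimal circuit breaker gating a flaky test suite on its recent failure rate. The class, thresholds, and helpers are all hypothetical.]

```python
class TestCircuitBreaker:
    """Defer tests when recent failure rates say they'd be flaky anyway."""

    def __init__(self, failure_threshold: float = 0.5, window: int = 20):
        self.failure_threshold = failure_threshold
        self.window = window            # number of recent runs considered
        self.results: list[bool] = []   # True = pass, False = fail/flake

    def record(self, passed: bool) -> None:
        self.results = (self.results + [passed])[-self.window:]

    @property
    def open(self) -> bool:
        # Opens (tests deferred) once a full window's failure rate crosses
        # the threshold; closes again as passing runs accumulate.
        if len(self.results) < self.window:
            return False
        failure_rate = self.results.count(False) / len(self.results)
        return failure_rate >= self.failure_threshold


def run_suite() -> bool:
    return True  # stand-in for actually running the tests

breaker = TestCircuitBreaker()
if breaker.open:
    print("build unstable: deferring suite until it stabilizes")
else:
    breaker.record(run_suite())
```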
#strangeloop@frankc: I love that we are, as an industry, talking about sociotechnical systems as part of everything, but especially observability.
#strangeloop@frankc: ”I’d love to take questions …. except about DNS”. [context — Slack had an outage caused by DNS this week]
#strangeloop@frankc: In response to a question, we are now diving through a Honeycomb query. 🐝 💖
#strangeloop@jessitron Ian Wilkes: Oh, cool! One of the presenters is on stage speaking, and the other is in the chat channel answering questions.
#strangeloop@jessitron Ian Wilkes: What is AWS Lambda good at, and what makes it painful?
#strangeloop@jessitron Ian Wilkes: Honeycomb has specific strengths, and therefore specific needs. Honeycomb is for observability: being able to find out what’s going on in your system, in production, that you didn’t know you needed to know.
#strangeloop@jessitron Ian Wilkes: Monitoring is scar tissue of things you know you’ve already had failures in. Observability is being able to ask questions in real-time about surprising events.
#strangeloop@jessitron Ian Wilkes: I decided to look at our lambdas. I wanted to find out why all of them seem so spiky, and figure out what’s different about the slow responses.
#strangeloop@jessitron Ian Wilkes: Honeycomb allows rapid-fire interactive investigation of production behavior. But the query has to be fast to be interactive. So our datastore has to respond very quickly over un-indexed, un-pre-aggregated data. We just have the raw data.
#strangeloop@jessitron Ian Wilkes: How does Retriever do that? It’s a distributed datastore: the query fans out to all the disks that have data, and those are read, aggregated, and sent back to the query originator.
#strangeloop@jessitron Ian Wilkes: Every field in the event is a column, and every event has a timestamp, so they can be linked together. When a query comes in, you only have to read the columns that are being queried.
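[Editor’s sketch, not from the talk: a toy illustration of column-oriented storage, where each field becomes its own column and a query touches only the columns it references. This is not Retriever’s actual format.]

```python
events = [
    {"timestamp": 1, "service": "api", "duration_ms": 12},
    {"timestamp": 2, "service": "api", "duration_ms": 250},
    {"timestamp": 3, "service": "web", "duration_ms": 9},
]

# Ingest: pivot row-shaped events into one list per field.
columns: dict[str, list] = {}
for event in events:
    for field, value in event.items():
        columns.setdefault(field, []).append(value)

# A query like MAX(duration_ms) WHERE service = "api" reads exactly two
# columns; "timestamp" and every other field are never touched.
matching = [
    d for s, d in zip(columns["service"], columns["duration_ms"]) if s == "api"
]
print(max(matching))  # -> 250
```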
#strangeloop@jessitron Ian Wilkes: Dynamic aggregation of any fields across any time range. We need an aggregator that can operate even faster, and with bigger data, because we’re dealing with much bigger volumes of data.
#strangeloop@jessitron Ian Wilkes: We can’t have these queries take a long time; we don’t want people wandering off. So we need to increase our compute so that we can handle these large, gnarly queries. We don’t want to spin up new instances: too slow. But… Lambda!
#strangeloop@jessitron Ian Wilkes: Lambda scales up our compute, and it is 3x to 4x as expensive as EC2, measured in CPU-seconds. But of course, we don’t run it nearly as much. Like, a hundredth of the time.
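[Editor’s note, not from the talk: a back-of-the-envelope check of that claim, assuming the midpoint of “3x to 4x”.]

```python
lambda_price_multiple = 3.5  # assumed midpoint of "3x to 4x" per CPU-second
duty_cycle = 1 / 100         # "a hundredth of the time"

relative_cost = lambda_price_multiple * duty_cycle
print(f"burst Lambda ≈ {relative_cost:.1%} the cost of always-on EC2")
# -> burst Lambda ≈ 3.5% the cost of always-on EC2
```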
#strangeloop@jessitron Ian Wilkes: Lambda is on-demand compute, but AWS didn’t build it for this. Because there is a burst concurrency limit.
#strangeloop@jessitron Ian Wilkes: Of course, we discovered this all using our own tool! We are using this concurrency operator, released last Thursday!
#strangeloop@jessitron Ian Wilkes: A Lambda can return up to 6 MB. Some of our query results can be over 20 MB. Put those in S3, and use a modern compression algorithm.
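[Editor’s sketch, not from the talk: one way to implement that spill-to-S3 workaround. The bucket, key scheme, and gzip (standing in for whatever compression was actually chosen) are all assumptions.]

```python
import gzip
import json

import boto3

LIMIT_BYTES = 6 * 1024 * 1024  # Lambda's synchronous response ceiling
s3 = boto3.client("s3")

def respond(result: dict) -> dict:
    payload = json.dumps(result).encode()
    if len(payload) <= LIMIT_BYTES:
        return {"inline": result}
    # Too big to return directly: compress it, park it in S3, and hand
    # back a pointer for the caller to fetch instead.
    key = "query-results/example.json.gz"  # hypothetical key scheme
    s3.put_object(Bucket="my-results-bucket", Key=key,
                  Body=gzip.compress(payload))
    return {"s3_key": key}
```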
#DevOpsLoop@editingemily: DevOps built on the foundation of agile to become a default, a standard that we reach for to understand what we're doing and what we should be doing.
#DevOpsLoop@editingemily: When DevOps emerged, everything, from the application to the deployment, was centralized. We are finding novel solutions and unique takes on what we have accepted and been working around in DevOps.
#strangeloop@cristalopes: Modern conferences probably started in the Renaissance, as a way for (rich, leisured) men to exchange knowledge, especially before fast and free printing.
#strangeloop@cristalopes: When conferences were part of academic life, they supplemented and promoted scholarly articles and knowledge sharing.
#TrajectoryCon@adrianco: In order to understand our trajectory, we need to understand where we are starting from and where we are going.
#TrajectoryCon@adrianco: In the old world, if you made a door lock, you shaped a hunk of metal, shipped it, and never thought about it again. But now your lock calls you every five minutes, and if it doesn’t there’s a problem.
#srecon@randyshoup: Outage 1: Google App Engine Outage. App Engine was down globally for 8 hours. The playbook failed and triggered a cascading failure.
#srecon@randyshoup: Resolutions: increased traffic routing capacity, but more importantly, created a program to reduce probability of the same problem happening again.
#pycaribbean Jessie Hedges: Mental Health in Tech Shops
#pycaribbean Jessie Hedges: Tech has a problem — we are working hard even when we’re not working.
#pycaribbean Jessie Hedges: When we think of burnout, we think of toxic situations, poor culture, and poor management, but this talk is about normalized chronic stress caused by productive involvement.