#strangeloop @frankc: How Tracing Uncovers Half-Truths in Slack’s CI Infrastructure
#strangeloop @frankc: Hierarchy of needs: bottom: observability, middle: resilience, top: [too slow]
#strangeloop @frankc: Slack has grown a lot, very fast, over 6 years. How do we create better engineering tools for creating, testing, and deploying code and features?
#strangeloop @frankc: Slack has 60 people on internal tooling teams. User experiences in CI/CD span multiple days, platforms, and workflows.
#strangeloop @frankc: You can create massive business results when teams share a language and reason about problems using that shared language. How do we create a shared infrastructure “language”?
#strangeloop @frankc: Why use distributed tracing? “It’s slow” is the hardest problem to debug in distributed systems. “It’s flaky” is the problem most infrastructure teams struggle with.
#strangeloop @frankc: To reason about this, we need to create an event that contains context — a span. Hello, @honeycomb_io!
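The idea of a span as "an event that contains context" can be sketched in a few lines of Python. This is a minimal illustration, not Slack's instrumentation: the `span` helper, the in-memory `SPANS` list, and the field names (`branch`, `suite`) are all hypothetical.

```python
import time
import uuid
from contextlib import contextmanager

# Hypothetical in-memory sink; a real setup would ship spans to a tracing backend.
SPANS = []


@contextmanager
def span(name, trace_id=None, parent_id=None, **fields):
    """Record one unit of work as an event that carries its own context."""
    record = {
        "name": name,
        "trace_id": trace_id or uuid.uuid4().hex,  # shared by every span in one trace
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,                    # links a child span to its parent
        **fields,                                  # arbitrary dimensions, e.g. branch or suite
    }
    start = time.monotonic()
    try:
        yield record
    except Exception as exc:
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_ms"] = (time.monotonic() - start) * 1000.0
        SPANS.append(record)


def run_ci_job():
    """Emit a root span for a CI job and a child span for its test step."""
    with span("ci_job", branch="main") as root:
        with span("run_tests", trace_id=root["trace_id"],
                  parent_id=root["span_id"], suite="unit"):
            time.sleep(0.01)  # stand-in for real work
    return root
```

Because every span records its duration alongside its dimensions, "it's slow" becomes a question you can ask of the data: which span, with which field values, took the time.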
#strangeloop @frankc: Slack uses an internal CI/CD tool to bridge different systems that are involved in build, deployment, testing, etc.
#strangeloop @frankc: Cardinality for CI traces is very different from other use cases, because they have lower volume and higher criticality. You don’t need sampling, but you do need to address almost all the faults.
#strangeloop @frankc: [this talk is extremely dense, so if you want a better view of the content, download it from the StrangeLoop site tomorrow]
#strangeloop @frankc: In 2019, we realized that there was variance in unexpected areas. We built a hypothesis/testing observability tool to see if it was going to be useful to address observability problems.
#strangeloop @frankc: Cross-service tracing in CI: start with a story of a bad day called “Jenkins Queue Large and Cursed”. We added tracing to try to find the problem. We found something previously uninstrumented and could address the real problem.
#strangeloop @frankc: Agreeing on dimensions that are useful to one or many teams and making those visible allows teams to reuse dimensions that might be useful to them.
#strangeloop @frankc: We worked with engineers to instrument their platforms, and then we could see what was slowing them down. We used a hypothesis to see if we needed to revert or change the things we were tweaking, but that was for confidence; we didn’t need it.
#strangeloop @frankc: Circuit breaker pattern used to gate test behavior! Fascinating. When the circuit breaker is open, the tests are deferred because they would come back flaky anyway. The tests go back on when we stabilize the build.
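A circuit breaker that gates test runs might look roughly like the sketch below. It is an assumption-laden illustration, not Slack's implementation: the `CIBreaker` class, its `threshold` parameter, and the rule "open after N consecutive unhealthy checks, close on the first healthy one" are all invented for the example.

```python
class CIBreaker:
    """Defer tests while the build infrastructure looks unhealthy.

    Hypothetical policy: the breaker opens after `threshold` consecutive
    unhealthy checks and closes again on the first healthy check.
    """

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0
        self.deferred = []  # (test_name, callable) pairs waiting for a stable build

    @property
    def is_open(self):
        return self.failures >= self.threshold

    def record_health(self, healthy):
        """Feed in the result of an infrastructure health check."""
        if healthy:
            self.failures = 0
        else:
            self.failures += 1

    def submit(self, test_name, run):
        """Run the test now, or defer it if the breaker is open."""
        if self.is_open:
            self.deferred.append((test_name, run))
            return "deferred"
        return run()

    def drain(self):
        """Run deferred tests once the breaker has closed."""
        if self.is_open:
            return []
        pending, self.deferred = self.deferred, []
        return [(name, run()) for name, run in pending]
```

The point of deferring rather than failing is the one made in the talk: a test run against an unstable build would come back flaky anyway, so its result would carry no information.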
#strangeloop @frankc: I love that we are, as an industry, talking about sociotechnical systems as part of everything, but especially observability.
#strangeloop @frankc: “I’d love to take questions ... except about DNS.” [context: Slack had an outage caused by DNS this week]
#strangeloop @frankc: In response to a question, we are now diving through a Honeycomb query. 🐝 💖
#strangeloop @jessitron Ian Wilkes: How we used serverless to speed up our servers
#strangeloop @jessitron Ian Wilkes: Oh, cool! One of the presenters is on stage speaking, and the other is in the chat channel answering questions.
#strangeloop @jessitron Ian Wilkes: What is AWS Lambda good at, and what makes it painful?
#strangeloop @jessitron Ian Wilkes: Honeycomb has specific strengths, and therefore specific needs. Honeycomb is for observability: being able to find out what’s going on in your system, in production, that you didn’t know you needed to know.
#strangeloop @jessitron Ian Wilkes: Monitoring is scar tissue of things you know you’ve already had failures in. Observability is being able to ask questions in real-time about surprising events.
#strangeloop @jessitron Ian Wilkes: I decided to look at our lambdas, and I wanted to find out why all of them seem so spiky, and figure out what’s different about the slow responses?
#strangeloop @jessitron Ian Wilkes: Honeycomb allows rapid-fire interactive investigation of production behavior. But the query has to be fast to be interactive. So our datastore has to respond very quickly over un-indexed, un-pre-aggregated data. We just have the raw data.
#strangeloop @jessitron Ian Wilkes: How does Retriever do that? It’s a distributed datastore: the query fans out to all the disks that have data, and those are read and aggregated and sent back to the query originator.
#strangeloop @jessitron Ian Wilkes: Every field in the event is a column, and every event has a time stamp, so they can be linked together. But when the query comes in, you only have to read the columns that are being queried.
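The fan-out and column-only reads described in the last two tweets can be sketched as follows. This is a toy model, not Retriever: the `SEGMENTS` data, the `scan_segment` and `query_avg` functions, and the use of a thread pool in place of separate storage nodes are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Each hypothetical "segment" stores its events column by column.
SEGMENTS = [
    {"timestamp": [1, 2, 3], "duration_ms": [10, 250, 30], "endpoint": ["/a", "/b", "/a"]},
    {"timestamp": [4, 5], "duration_ms": [40, 900], "endpoint": ["/b", "/b"]},
]


def scan_segment(segment, column, start, end):
    """Touch only the timestamp column and the queried column.

    Returns a partial aggregate (count, sum) for events inside the time range.
    """
    count, total = 0, 0
    for ts, value in zip(segment["timestamp"], segment[column]):
        if start <= ts <= end:
            count += 1
            total += value
    return count, total


def query_avg(segments, column, start, end):
    """Fan the query out to every segment, then merge the partial aggregates."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda s: scan_segment(s, column, start, end), segments))
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return total / count if count else None
```

Note that `query_avg(SEGMENTS, "duration_ms", 2, 5)` never reads the `endpoint` column at all, which is the property that keeps un-indexed queries cheap.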
#strangeloop @jessitron Ian Wilkes: The dynamic aggregation of any fields across any time range. We need an aggregator that can operate even faster, and with bigger data. Because we’re dealing with much bigger volumes of data.
#strangeloop @jessitron Ian Wilkes: We can’t be having these queries take a long time. We don’t need people wandering off. So we need to increase our compute so that we can handle these large, gnarly queries. We don’t want to spin up new instances, too slow. But… Lambda!
#strangeloop @jessitron Ian Wilkes: Lambda scales up our compute, and it is 3-4 times as expensive as EC2, measured per CPU-second. But of course, we don’t run it nearly as much. Like, a hundredth of the time.
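The cost argument is simple arithmetic, worked through below. It reads the tweet's figure as "3 to 4 times" the per-CPU-second price and takes "a hundredth of the time" literally as a 1% duty cycle; the function name and the assumption that the always-on alternative has equal capacity are hypothetical.

```python
def relative_on_demand_cost(price_multiplier, duty_cycle):
    """Cost of on-demand compute relative to an always-on instance of equal capacity.

    price_multiplier: how much more one CPU-second costs on demand (e.g. 3 to 4).
    duty_cycle: fraction of the time the on-demand compute actually runs (e.g. 0.01).
    """
    return price_multiplier * duty_cycle


low = relative_on_demand_cost(3, 0.01)
high = relative_on_demand_cost(4, 0.01)
print(f"On-demand cost is about {low:.0%} to {high:.0%} of the always-on cost")
```

Under those assumptions the burst capacity costs a few percent of what keeping the same capacity running all the time would, even at the higher unit price.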
#strangeloop @jessitron Ian Wilkes: Lambda is on-demand compute, but AWS didn’t build it for this: there is a burst concurrency limit.
#strangeloop @jessitron Ian Wilkes: Of course, we discovered this all using our own tool! We are using this concurrency operator, released last Thursday!
#strangeloop @jessitron Ian Wilkes: A Lambda can return up to 6 MB. Some of our queries can be over 20 MB. Put that in S3. And use a modern compression algorithm.
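The "compress it, and spill to object storage when it is too big" pattern can be sketched like this. It is not Honeycomb's code: `build_response`, `read_response`, and the `put_object`/`get_object` callables are hypothetical stand-ins for an S3 client, and the standard library's `zlib` stands in for whatever modern compression algorithm the team actually uses.

```python
import json
import zlib

LAMBDA_RESPONSE_LIMIT = 6 * 1024 * 1024  # bytes; the 6 MB figure quoted in the talk


def build_response(result, put_object, key, limit=LAMBDA_RESPONSE_LIMIT):
    """Return small results inline; spill large ones to object storage.

    put_object(key, data) is a hypothetical stand-in for an S3 upload.
    """
    body = zlib.compress(json.dumps(result).encode("utf-8"))
    if len(body) < limit:
        return {"inline": True, "encoding": "zlib", "body": body}
    put_object(key, body)
    return {"inline": False, "encoding": "zlib", "key": key}


def read_response(response, get_object):
    """Fetch the body (inline or from storage) and decode it.

    get_object(key) is a hypothetical stand-in for an S3 download.
    """
    body = response["body"] if response["inline"] else get_object(response["key"])
    return json.loads(zlib.decompress(body).decode("utf-8"))
```

With a plain dict as the store, `build_response(result, store.__setitem__, "q1")` followed by `read_response(resp, store.__getitem__)` round-trips the result whichever path it took, so the caller does not need to know whether the payload was spilled.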
