0/ Sometimes we should philosophize about observability… and sometimes we should just get ultra-pragmatic and examine real use cases from real systems!

Here is one about a bad deploy we had at @LightstepHQ the other day. Let’s get started with a picture…

Thread 👇
1/ In this example, we are contending with a failed deploy within Lightstep’s own (internal, multi-tenant) system. It was easy enough to *detect* the regression and roll back, but in order to fix the underlying issue, of course we had to understand it.
2/ We knew the failure was related to a bad deploy of the `liveview` service. The screenshot above shows `liveview` endpoints, ranked by the “biggest change” for the new release; at the top is “ExplorerService/Create” with a huge (!!) increase in error ratio.
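As a rough illustration of that "biggest change" ranking, here is a minimal Python sketch. It is not how Lightstep computes it; the span dictionary shape (`endpoint`, `release`, `error`), the `ExplorerService/Get` endpoint, and the sample counts are all made up for the example.

```python
from collections import defaultdict


def rank_by_error_ratio_change(spans):
    """Rank endpoints by the absolute change in error ratio, old release vs. new."""
    # counts[endpoint][release] = [error_count, total_count]
    counts = defaultdict(lambda: {"old": [0, 0], "new": [0, 0]})
    for span in spans:
        bucket = counts[span["endpoint"]][span["release"]]
        bucket[0] += 1 if span["error"] else 0
        bucket[1] += 1

    def ratio(errors, total):
        return errors / total if total else 0.0

    changes = {
        endpoint: ratio(*by_release["new"]) - ratio(*by_release["old"])
        for endpoint, by_release in counts.items()
    }
    # Largest change (in either direction) first.
    return sorted(changes.items(), key=lambda item: abs(item[1]), reverse=True)


def make_spans(endpoint, release, errors, total):
    """Build hypothetical spans: `errors` failed calls out of `total`."""
    return [
        {"endpoint": endpoint, "release": release, "error": i < errors}
        for i in range(total)
    ]


# Invented sample data, for illustration only.
spans = (
    make_spans("ExplorerService/Create", "old", 0, 4)
    + make_spans("ExplorerService/Create", "new", 3, 4)
    + make_spans("ExplorerService/Get", "old", 0, 4)  # hypothetical endpoint
    + make_spans("ExplorerService/Get", "new", 0, 4)
)
ranked = rank_by_error_ratio_change(spans)
print(ranked[0])
```

With this sample data, `ExplorerService/Create` lands at the top of the ranking because its error ratio moved the most between releases.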
3/ It’s worth noting that this dashboard was created automatically from aggregated span (i.e., tracing) data, and the “ExplorerService/Create” endpoint rose to the top automatically as well. There is no need to manually create, maintain, or stare at ad hoc dashboards.
4/ … but where do we go from here? This is when things used to get particularly painful – one would open up a bunch of dashboards, stare at logs, and start guessing-and-checking. Not good.

What if we could just click on the spike in error rate?? Let’s try it:
5/ In 99% of incidents (certainly including this one), the big question is “What Changed?!!”

Observability should directly answer that question. Here we see an entire view populated with color-coded data showing, well, “what changed” with respect to our error rate spike:
6/ Scrolling down the page, there are many avenues we can pursue to further understand this regression. The one that jumps out is an “InvalidArgument” tag that’s strongly correlated with our originating issue. Let’s click on that:
7/ Simply by grouping the regression transactions by `response_code`, we find the smoking gun: more than 98% of our error spike is due to these InvalidArgument responses!
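The group-by itself is simple arithmetic. A minimal sketch, assuming each error span carries a `tags` dictionary; the `Internal` value and the counts below are invented for illustration and are not the numbers from this incident:

```python
from collections import Counter


def error_share_by_tag(error_spans, tag):
    """Group error spans by one tag and return (value, count, share) rows."""
    counts = Counter(span["tags"].get(tag, "<missing>") for span in error_spans)
    total = sum(counts.values())
    if total == 0:
        return []
    # most_common() orders rows from the largest group to the smallest.
    return [(value, n, n / total) for value, n in counts.most_common()]


# Invented sample: 99 InvalidArgument errors and 1 other error.
error_spans = (
    [{"tags": {"response_code": "InvalidArgument"}} for _ in range(99)]
    + [{"tags": {"response_code": "Internal"}}]  # hypothetical other code
)
shares = error_share_by_tag(error_spans, "response_code")
for value, count, share in shares:
    print(f"{value}: {count} spans, {share:.0%} of errors")
```

When one tag value owns nearly all of the errors, as `InvalidArgument` does here, that row is the obvious place to look next.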
8/ If we want to understand this in more detail, we can click on that row in the table to see (many) example transactions. They are all:

a) part of the original spike in `liveview` errors
b) exhibiting the InvalidArgument response code

… and there are 545 to choose from!
9/ At this point we have very high confidence that the `InvalidArgument` responses caused our error spike. Observability gave us this confidence by analyzing many thousands of distributed traces, logs, and metrics, though we didn’t have to dig through any of that by hand.
10/ We can select any one of the trace examples from the table of `InvalidArgument` examples above, and immediately get our diagnosis. By automatically joining (transactional) logs with our traces, we see this error message:

“invalid - cannot have empty analyzer query”
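Conceptually, that join is a lookup keyed on trace and span identifiers. A minimal sketch, assuming logs and spans both carry `trace_id` and `span_id` fields; the field names and the IDs below are assumptions for the example, not Lightstep's actual data model:

```python
from collections import defaultdict


def attach_logs(spans, logs):
    """Return copies of the spans, each with the log messages that share its IDs."""
    logs_by_span = defaultdict(list)
    for log in logs:
        logs_by_span[(log["trace_id"], log["span_id"])].append(log["message"])
    return [
        {**span, "logs": logs_by_span.get((span["trace_id"], span["span_id"]), [])}
        for span in spans
    ]


# Hypothetical IDs; the message is the one quoted in the thread.
spans = [
    {"trace_id": "t1", "span_id": "s1", "operation": "ExplorerService/Create"},
    {"trace_id": "t1", "span_id": "s2", "operation": "ExplorerService/Create"},
]
logs = [
    {
        "trace_id": "t1",
        "span_id": "s1",
        "message": "invalid - cannot have empty analyzer query",
    },
]
joined = attach_logs(spans, logs)
print(joined[0]["logs"])
```

Spans with no matching log lines simply get an empty list, so every trace example can be rendered the same way.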
11/ And that’s all our developer needed to understand what had changed with the new (bad) release. The next roll-forward was successful, and that was that.
12/ To recap our workflow: we began with the affected service, then simply clicked on whichever data seemed most relevant. And we never lost context, despite depending on (many thousands of) traces, metrics, and logs.

That’s how observability should be: unified and simple.