
Rajeev N Bharshetty

Mar 1 • 15 tweets • 3 min read

Got some very good insights on Monitoring and Instrumentation from this video:
Summarising some key takeaways here: 🧵

Monitoring Philosophy: Monitoring is a way to measure your customer experience. It is about looking through the lens of your customers to understand what they actually experience.

Measuring "Above the fold" latency: Measure the latency characteristics of the top of the page for websites like amazon.com. "Above the fold" latency provides a better picture of what most of your customers experience when they visit your website.

The cycle of Monitoring: Monitoring should be treated as a cycle that is improved continuously. It starts with asking what needs to be monitored, followed by instrumentation, metric aggregation, and dashboarding. Rinse and repeat.

Noisy alarms: Identifying the right metrics to alert on is critical. False alerts and missed alerts are both bad. False alerts cause broken window syndrome in the team: people get used to ignoring alarms and then fail to respond when there is a critical issue.

Separating Signal from Noise: For example, take query latency metrics from Amazon DynamoDB. Because incoming queries have different characteristics, with some taking longer than others, the dashboard showed latency spikes, and alerts fired whenever longer queries ran.

Separating Signal from Noise: One way to resolve this issue is to identify and segregate the faster queries from the slower ones and track their latencies separately. This segregation makes sense when you expect both faster and slower queries as part of normal traffic.
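A minimal sketch of tracking latencies per query class, each with its own alert threshold. The class names (`point_read`, `scan`), the threshold values, and the `LatencyTracker` helper are assumptions made up for illustration; they do not come from the video or from DynamoDB.

```python
from collections import defaultdict

# Hypothetical per-class alert thresholds in milliseconds (illustrative values).
THRESHOLDS_MS = {"point_read": 20.0, "scan": 500.0}


class LatencyTracker:
    """Keeps latency samples separately for each query class."""

    def __init__(self):
        self.samples = defaultdict(list)

    def record(self, query_class, latency_ms):
        self.samples[query_class].append(latency_ms)

    def breaches(self):
        """Return the query classes whose worst sample exceeds their own threshold."""
        return sorted(
            query_class
            for query_class, values in self.samples.items()
            if max(values) > THRESHOLDS_MS[query_class]
        )


tracker = LatencyTracker()
for latency in (4.0, 6.0, 5.0):
    tracker.record("point_read", latency)
for latency in (300.0, 420.0):
    tracker.record("scan", latency)

# The slow scans are judged against the scan threshold, so nothing fires here.
print(tracker.breaches())
```

With a single shared threshold of 20 ms, the two scans would have fired an alert even though they behave as expected for their class.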

Server vs. Client Faults: Most of the time, client errors should not trigger alerts. Client errors (4xx) are typically caused by the user failing to meet specific validation criteria. Differentiating server errors from client errors helps avoid noisy alerts caused by client errors.
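A rough sketch of that split, assuming HTTP status codes are the only signal available. The function names and the idea of alerting on the server fault rate alone are illustrative assumptions.

```python
def classify_status(status_code):
    """Bucket an HTTP status code into server fault, client fault, or ok."""
    if 500 <= status_code <= 599:
        return "server_fault"
    if 400 <= status_code <= 499:
        return "client_fault"
    return "ok"


def server_fault_rate(status_codes):
    """Fraction of requests that failed because of the server (5xx only)."""
    if not status_codes:
        return 0.0
    faults = sum(1 for code in status_codes if classify_status(code) == "server_fault")
    return faults / len(status_codes)


# Two client errors and one server error out of four requests.
observed = [200, 404, 400, 500]
print(server_fault_rate(observed))
```

Here the alerting metric counts only the 500, so the 404 and 400 do not add noise to it.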

Server vs. Client Faults: There are some cases in which you do want to be alerted on client errors. For example, your users may be unable to sign up on your website because of a changed validation rule for the username field. You want to be notified of such failures.

Server vs. Client Faults: To get notified in the above case, set up business alerts on the signup funnel. Track the number of users who landed on the signup page and the number of users who failed to sign up. This also helps identify the reasons for drop-off.
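A small sketch of such a business alert on the signup funnel. The 30% failure threshold and the minimum of 50 visitors are made-up values, and both function names are hypothetical.

```python
def signup_failure_rate(landed, failed):
    """Share of users who reached the signup page but failed to sign up."""
    if landed == 0:
        return 0.0
    return failed / landed


def should_alert(landed, failed, max_failure_rate=0.3, min_visitors=50):
    """Alert only when enough users were seen and too many of them failed."""
    if landed < min_visitors:
        return False  # too little traffic to draw a conclusion
    return signup_failure_rate(landed, failed) > max_failure_rate


print(should_alert(landed=200, failed=20))   # normal day
print(should_alert(landed=200, failed=150))  # after a bad validation change
```

The minimum visitor count is there so that one failed signup during a quiet hour does not page anyone.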

Distributed tracing: Helps pinpoint the component that has an issue in a distributed microservice architecture.

Latency distribution: Looking at the distribution is essential for identifying percentile (pXX) latencies such as p50 or p99, which give a better picture of your customers' experience than average latencies.
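To make the difference concrete, here is a small sketch that compares the mean with nearest-rank percentiles. The sample data is invented, and nearest-rank is only one of several common percentile definitions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Invented data: 95 fast requests at 10 ms and 5 slow ones at 1000 ms.
latencies_ms = [10] * 95 + [1000] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(mean_ms, p50_ms, p99_ms)
```

The mean lands at 59.5 ms, a latency no request actually saw, while p50 (10 ms) and p99 (1000 ms) describe what typical and unlucky customers experienced.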

Success vs. Error Latency: Usually, the measured latency of your application appears to improve when it starts returning errors, often because failed requests return quickly. Separating success and error latencies helps you understand latency changes when there are issues in your system.
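A sketch of keeping the two apart, assuming each request is recorded as a `(status_code, latency_ms)` pair and that 5xx responses count as errors. Both the record shape and that cutoff are assumptions for illustration.

```python
from statistics import mean


def split_latencies(requests):
    """Split (status_code, latency_ms) pairs into success and error latency lists."""
    success = [latency for status, latency in requests if status < 500]
    errors = [latency for status, latency in requests if status >= 500]
    return success, errors


# Invented data: the failing requests return almost immediately.
requests = [(200, 120.0), (200, 130.0), (500, 5.0), (500, 7.0)]

success_ms, error_ms = split_latencies(requests)
combined_mean = mean(latency for _, latency in requests)
success_mean = mean(success_ms)
error_mean = mean(error_ms)

print(combined_mean, success_mean, error_mean)
```

The combined mean (65.5 ms) looks better than the success mean (125 ms) only because half of the requests failed fast.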

Refactoring your monitoring system is important. If the metrics are not instrumented correctly, they can lie about the actual state of your system. Asking the right questions and continuously improving your monitoring and instrumentation is critical.

Remember, #Monitoring is a cycle.

/end
