Got some very good insights on Monitoring and Instrumentation from this video:
Summarising some key takeaways here: 🧵
Monitoring Philosophy: Monitoring is a way to measure your customer experience. It is about looking through the lens of your customers to understand what they actually experience.
Measuring "Above the fold" latency: Measure the latency characteristics of the top of the page for websites like amazon.com. "Above the fold" latency provides a better picture of what most of your customers experience when they visit your website.
The cycle of Monitoring: Monitoring needs to be thought of as a cycle that needs to be improved continuously. It starts with asking what needs to be monitored, followed by instrumentation, metric aggregation, and dashboarding. Rinse and repeat.
Noisy alarms: Identifying the right metrics to alert on is critical. False alerts and missed alerts are both bad. False alerts cause broken window syndrome: the team gets used to ignoring alarms and then fails to respond when there is a critical issue.
Separating Signal from Noise: For example, take query latency metrics from Amazon DynamoDB. Incoming queries have different characteristics and some take longer than others, so the dashboard showed latency spikes and alerts fired whenever the longer queries ran.
Separating Signal from Noise: One way to resolve this issue is to identify and segregate the faster queries from the slower ones and track their latencies separately. This segregation makes sense when you expect both faster and slower queries as part of normal traffic.
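A minimal sketch of that segregation, assuming each query can be tagged with a class at instrumentation time. The class names ("point_read", "scan") and the thresholds are hypothetical, chosen for illustration, and are not DynamoDB API names. Each class is compared against its own threshold, so an expected slow scan does not trip the alarm meant for fast reads.

```python
from collections import defaultdict
from statistics import mean


class LatencyTracker:
    """Keeps latency samples in separate buckets, one per query class."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, query_class, latency_ms):
        self._samples[query_class].append(latency_ms)

    def breaches(self, thresholds_ms):
        """Return the query classes whose mean latency exceeds their own threshold."""
        return sorted(
            query_class
            for query_class, samples in self._samples.items()
            if query_class in thresholds_ms
            and mean(samples) > thresholds_ms[query_class]
        )


# Hypothetical traffic: fast point reads and slow scans, both expected.
tracker = LatencyTracker()
for ms in (4, 5, 6):
    tracker.record("point_read", ms)
for ms in (180, 220, 200):
    tracker.record("scan", ms)

# Hypothetical per-class thresholds in milliseconds.
thresholds = {"point_read": 10, "scan": 500}
print(tracker.breaches(thresholds))
```

With a single shared threshold, the scans alone would either fire the alarm constantly or force the threshold so high that a regression in point reads goes unnoticed.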
Server vs. Client Faults: Most of the time, client errors should not trigger alerts. Client errors (4xx) are triggered by the user failing to meet specific validation criteria. Differentiating server errors from client errors helps avoid noisy alerts caused by client errors.
Server vs. Client Faults: There are some cases in which you might want to be alerted on client errors. For example, when your users cannot sign up on your website because a validation rule for the username field changed, you want to be notified of such failures.
Server vs. Client Faults: To get notified in the above case, set up business alerts on the signup funnel: the number of users who landed on the signup page versus the number of users who failed to sign up. This helps identify the reasons for drop-off.
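The two ideas above can be sketched together, under stated assumptions: alerts are driven by the server error (5xx) rate only, and a separate business metric watches the signup funnel. The function names and the 1% threshold are hypothetical and only illustrate the shape of the checks.

```python
def fault_class(status_code):
    """Classify an HTTP status code as a server fault, a client fault, or ok."""
    if 500 <= status_code <= 599:
        return "server"
    if 400 <= status_code <= 499:
        return "client"
    return "ok"


def should_alert(status_codes, server_error_rate_threshold=0.01):
    """Alert on the server error rate only; client errors are ignored here."""
    if not status_codes:
        return False
    server_errors = sum(1 for code in status_codes if fault_class(code) == "server")
    return server_errors / len(status_codes) > server_error_rate_threshold


def signup_drop_off(landed, completed):
    """Fraction of users who reached the signup page but did not sign up."""
    if landed == 0:
        return 0.0
    return (landed - completed) / landed


# A burst of 4xx responses does not alert, a burst of 5xx responses does.
print(should_alert([200] * 95 + [400] * 5))
print(should_alert([200] * 95 + [500] * 5))
# The funnel metric still catches a broken validation rule.
print(signup_drop_off(landed=200, completed=150))
```

Alerting on the funnel rather than on raw 4xx counts keeps routine user mistakes quiet while still surfacing a change that stops many users from signing up.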
Distributed tracing: Helps pinpoint which component has an issue in a distributed microservice architecture.
Latency distribution: Looking at the distribution is essential for identifying percentile (pXX) latencies, which give a better picture of your customers' experience than average latencies.
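A small worked example of why percentiles tell more than the average. The sample data is made up for illustration, and the nearest-rank method used here is one common way to compute a percentile; a metrics backend may use a different estimator.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: 95 fast requests and 5 very slow ones.
latencies_ms = [10] * 95 + [1000] * 5

print(mean(latencies_ms))            # average over all requests
print(percentile(latencies_ms, 50))  # what a typical request sees
print(percentile(latencies_ms, 99))  # what the slowest requests see
```

Here the average describes no request that actually happened, while p50 and p99 show both the typical experience and the slow tail.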
Success vs. Error Latency: Usually, the measured latency of your application appears to improve when it starts returning errors, because failing requests often return faster than successful ones. Separating success and error latencies helps you understand latency changes when there are issues in your system.
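A sketch of the effect with made-up numbers, assuming each request is recorded as a hypothetical (succeeded, latency_ms) pair. When half the requests start failing fast, the combined average drops, while the success-only latency has not changed at all.

```python
from statistics import mean


def latency_summary(requests):
    """Mean latency overall, for successes only, and for errors only.

    `requests` is a list of (succeeded, latency_ms) pairs; a bucket with no
    samples is reported as None.
    """
    success = [ms for ok, ms in requests if ok]
    error = [ms for ok, ms in requests if not ok]
    everything = success + error
    return {
        "all": mean(everything) if everything else None,
        "success": mean(success) if success else None,
        "error": mean(error) if error else None,
    }


# Hypothetical traffic before and during an incident.
healthy = [(True, 100)] * 10
degraded = [(True, 100)] * 5 + [(False, 5)] * 5

print(latency_summary(healthy))
print(latency_summary(degraded))
```

A dashboard that only shows the combined number would look better during the incident than before it, which is exactly the misleading signal the split avoids.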
Refactoring your monitoring system is very important. If the metrics are not instrumented correctly, they can lie about the actual state of your system. Asking the right questions and continuously improving your monitoring and instrumentation is critical.
Remember, #Monitoring is a cycle.

/end
