• Problem Statement
• What to Monitor?
• Performance Monitoring
• Costs & Usage
• Monitoring Tools
• Benefits of Serverless Monitoring
Serverless architectures bring us a lot of known benefits:
• less operational overhead
• paying only for the resources you actually use
• reduced cycle times thanks to small, often independent deployment units
• instant scaling
... and much more.
As with everything, there's not only a bright side but also trade-offs, such as setting up proper monitoring.
There are many more units to monitor, their life cycles are short, and running monitoring agents inside functions adds directly to latency and cost.
Before digging into how to solve our monitoring dilemma, let's take a step back: what do we even need to monitor for serverless applications?
To get the maximum benefit out of serverless, we need to keep track of latency, cold starts, errors, and cost & usage.
Large data sets can make it hard to notice a small performance drop in some user-facing function calls, as averages quickly hide outliers.
We need to keep an eye on mission-critical functions & watch for outliers.
Regarding Service Level Agreements (SLAs), we often face p99 requirements, meaning that 99% of requests must not exceed a given latency threshold.
A noticeable set of outliers can quickly break such requirements if they aren't watched carefully.
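A minimal sketch of why averages hide outliers, using hypothetical latencies (not real measurements) and a hypothetical 1,000 ms threshold. The percentile uses the simple nearest-rank method; monitoring tools may compute percentiles slightly differently.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in ms: 97 fast calls and 3 slow outliers.
latencies_ms = [120] * 97 + [2500, 2800, 3100]
threshold_ms = 1000  # hypothetical SLA threshold

avg = mean(latencies_ms)
p99 = percentile(latencies_ms, 99)
print(f"mean={avg:.1f} ms, p99={p99} ms, SLA met: {p99 <= threshold_ms}")
# prints: mean=200.4 ms, p99=2800 ms, SLA met: False
```

Three slow calls out of a hundred barely move the mean, but they dominate the tail, which is exactly what a p99 requirement measures.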
Cold Starts
If a new function instance has to be provisioned, AWS starts a new micro-container (execution environment). This takes time and drastically increases the latency of that request.
Even worse: a burst of parallel requests needs multiple containers, and each new one goes through its own cold start.
That's because a function instance can only process one request at a time.
It's important to track the number of cold starts so you can make architectural improvements if necessary, as there are many possible measures to reduce customer-facing cold starts.
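One way to track cold starts is Lambda's REPORT log line, which includes an `Init Duration` field only when a new execution environment had to be initialized. A minimal sketch, assuming you already have the raw log lines (for example exported from the function's log group); the helper name and sample lines are hypothetical.

```python
import re

# "Init Duration" only shows up in a REPORT line when a new execution
# environment was initialized for that invocation, i.e. on a cold start.
INIT_RE = re.compile(r"Init Duration: ([\d.]+) ms")


def cold_start_stats(log_lines):
    """Count invocations (REPORT lines) and cold starts, collecting init times in ms."""
    invocations, init_ms = 0, []
    for line in log_lines:
        if not line.startswith("REPORT"):
            continue
        invocations += 1
        match = INIT_RE.search(line)
        if match:
            init_ms.append(float(match.group(1)))
    return invocations, len(init_ms), init_ms


# Hypothetical sample lines (request IDs shortened for readability).
sample = [
    "START RequestId: a1 Version: $LATEST",
    "REPORT RequestId: a1\tDuration: 12.3 ms\tBilled Duration: 13 ms\tMemory Size: 128 MB\tMax Memory Used: 60 MB\tInit Duration: 180.5 ms",
    "REPORT RequestId: b2\tDuration: 4.1 ms\tBilled Duration: 5 ms\tMemory Size: 128 MB\tMax Memory Used: 61 MB",
    "REPORT RequestId: c3\tDuration: 11.8 ms\tBilled Duration: 12 ms\tMemory Size: 128 MB\tMax Memory Used: 59 MB\tInit Duration: 175.0 ms",
]
total, cold, init_times = cold_start_stats(sample)
print(f"{cold}/{total} invocations were cold starts, init times: {init_times}")
```

A second cold start like `c3` is what a parallel burst produces: the warm instance is busy, so Lambda has to initialize another one.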
There are a variety of reasons why a Lambda invocation can fail.
Invocation errors are returned as an HTTP 4xx or 5xx status, which means the request was rejected before the function received it. Function errors, raised by your code or the runtime, are a separate category.
Of course, those are not the only possible problems.
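For synchronous invocations, a function error still comes back with HTTP 200, but with the `X-Amz-Function-Error` response header set. A simplified classifier, assuming you have the status code and that header value at hand; the function name and labels are hypothetical.

```python
def classify_invoke_result(status_code, function_error=None):
    """Classify a synchronous Lambda Invoke result (simplified sketch).

    status_code: HTTP status code of the Invoke API response.
    function_error: value of the X-Amz-Function-Error response header, if any.
    """
    if 400 <= status_code < 500:
        return "invocation_error: request rejected (e.g. throttled or invalid)"
    if status_code >= 500:
        return "invocation_error: service-side failure"
    if function_error:
        return "function_error: raised by your code or the runtime"
    return "success"


for status, header in [(429, None), (200, "Unhandled"), (200, None)]:
    print(status, header, "->", classify_invoke_result(status, header))
```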
Outgoing calls to third parties can fail without anybody noticing, or rate limits can be exceeded. Finding the actual bottleneck can be difficult.
Notifications of failures & pinpointing where and when an error happened will save hours and reduce downtime.
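To make silent third-party failures visible, each outgoing call can emit one structured log line with its target, duration, and outcome. A minimal sketch, assuming a hypothetical `monitored_call` decorator and a hypothetical "payments-api" target; in a real function the `sink` would typically be your logger rather than a list.

```python
import functools
import json
import time
from datetime import datetime, timezone


def monitored_call(target, sink=print):
    """Decorator: emit one JSON line per outgoing call with timing and outcome."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = time.perf_counter()
            record = {"target": target,
                      "timestamp": datetime.now(timezone.utc).isoformat()}
            try:
                result = func(*args, **kwargs)
                record["outcome"] = "ok"
                return result
            except Exception as exc:
                record["outcome"] = "error"
                record["error_type"] = type(exc).__name__
                raise  # never swallow the failure, only make it visible
            finally:
                record["duration_ms"] = round((time.perf_counter() - started) * 1000, 2)
                sink(json.dumps(record))
        return wrapper
    return decorator


captured = []


@monitored_call("payments-api", sink=captured.append)  # hypothetical target
def charge_customer(amount):
    # Stand-in for a real third-party client hitting its rate limit.
    raise RuntimeError(f"429 Too Many Requests while charging {amount}")


try:
    charge_customer(42)
except RuntimeError:
    pass  # the caller still sees the error; the log line was already emitted
print(captured[0])
```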
Looking at the problem statement, it's easy to see that all of those areas need to be monitored. Running an app blind won't work for very long.
Let's have a look at what AWS offers out of the box & how it compares to Dashbird.
What's already in the box: your function logs are collected in CloudWatch Logs as log streams, grouped into one log group per function. Additionally, CloudWatch collects metrics that can be combined into dashboards.
Even more: you can set up alarms on those metrics.
With alarms, you'll be notified if predefined thresholds are exceeded.
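CloudWatch metric alarms can be configured to fire when M out of the last N datapoints breach a threshold. A simplified sketch of that evaluation, assuming a "greater than threshold" comparison and ignoring CloudWatch's missing-data options; the function name and error counts are hypothetical.

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Simplified "M out of N" evaluation in the style of a CloudWatch metric alarm.

    Only the most recent `evaluation_periods` datapoints count; the alarm fires
    when at least `datapoints_to_alarm` of them exceed `threshold`.
    """
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Hypothetical per-minute sums of the Errors metric for one function.
errors_per_minute = [0, 1, 0, 6, 7, 2, 9]
print(alarm_state(errors_per_minute, threshold=5,
                  evaluation_periods=5, datapoints_to_alarm=3))  # ALARM
```

Requiring several breaching datapoints instead of one is a simple way to avoid being paged for a single spike.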
CloudWatch is a good starting point for your first FaaS application. As your landscape grows and your app receives more requests, you'll likely need a more comprehensive tool.
Dashbird.io provides enhanced error alerting & observability for everything around AWS Lambda without affecting your functions' performance or costs, as it gathers logs & metrics through AWS APIs.
It starts by providing a great high-level overview of your app's health.
You can drill down into invocation-level data to analyze individual functions.
Services that are closely related to Lambda and widely used are also covered: DynamoDB, SQS, API Gateway, Kinesis, Step Functions & ECS.
Furthermore, the Well-Architected Lens helps to find potential issues & implement best practices.
There are a lot of benefits at first glance: you'll save a lot of time debugging and generally have a more productive business, team & application.
No matter how well your app is built, it will regularly generate a fair number of issues.
Those issues need to be tracked, visualized, and managed efficiently.
There needs to be a friendly way of displaying open, resolved, and temporarily muted issues, so the team can collaborate around a clearly communicated resolution workflow.
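The open/resolved/muted lifecycle can be modeled as a small state machine that also decides when to alert. A hypothetical sketch (not Dashbird's implementation), assuming issues are grouped by a fingerprint such as "function:error type" and that the team wants one alert per new, reopened, or un-muted issue.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class Status(Enum):
    OPEN = "open"
    RESOLVED = "resolved"
    MUTED = "muted"


@dataclass
class Issue:
    fingerprint: str  # hypothetical grouping key, e.g. "function:error type"
    status: Status = Status.OPEN
    muted_until: Optional[datetime] = None
    occurrences: int = 0

    def mute(self, duration: timedelta, now: datetime) -> None:
        self.status = Status.MUTED
        self.muted_until = now + duration

    def resolve(self) -> None:
        self.status = Status.RESOLVED
        self.muted_until = None

    def record_occurrence(self, now: datetime) -> bool:
        """Count a new occurrence; return True if the team should be alerted."""
        self.occurrences += 1
        if self.status is Status.MUTED:
            if self.muted_until is not None and now < self.muted_until:
                return False  # still muted: count it, stay quiet
            self.status, self.muted_until = Status.OPEN, None  # mute expired
            return True
        if self.status is Status.RESOLVED:
            self.status = Status.OPEN  # a resolved issue came back
            return True
        return self.occurrences == 1  # brand-new issue: alert once


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
issue = Issue("checkout-handler:TimeoutError")  # hypothetical fingerprint
alerts = [issue.record_occurrence(now),  # new issue -> alert
          issue.record_occurrence(now)]  # already open -> no duplicate alert
issue.mute(timedelta(hours=1), now)
alerts.append(issue.record_occurrence(now + timedelta(minutes=10)))  # muted
alerts.append(issue.record_occurrence(now + timedelta(hours=2)))  # mute expired
print(alerts, issue.status)
```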
Developers shouldn't carry the burden of proactively hunting for problems; they should be able to rely on automated alerting. An automated alerting system may sound fundamental, but it's easy to miss relevant signals, especially when working with Lambda.
The alerting mechanism should not only detect app errors, but also infrastructure faults like timeouts, container crashes, memory exhaustion, and misconfigurations like incorrect access policies.
Given the immense volume of logs, that's not a trivial task.
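A first step toward detecting such faults is a rule-based pass over log lines. A heuristic sketch, assuming Python-runtime log output: "Task timed out after" is the message Lambda logs on a timeout, a REPORT line whose Max Memory Used reaches Memory Size is a strong hint (not proof) of memory exhaustion, and "AccessDenied" in an error message usually points to a missing IAM permission. The function name and sample lines are hypothetical.

```python
import re

REPORT_MEM_RE = re.compile(r"Memory Size: (\d+) MB\s+Max Memory Used: (\d+) MB")


def classify_line(line):
    """Map a Lambda log line to a coarse fault category (heuristic sketch)."""
    if "Task timed out after" in line:
        return "timeout"
    mem = REPORT_MEM_RE.search(line)
    if mem and int(mem.group(2)) >= int(mem.group(1)):
        return "memory_exhausted"  # used all of the configured memory
    if "AccessDenied" in line:
        return "access_denied"  # likely a misconfigured access policy
    if "[ERROR]" in line or "Traceback" in line:
        return "application_error"
    return None  # nothing alert-worthy detected


# Hypothetical sample lines.
sample = [
    "2024-01-01T12:00:03.000Z a1 Task timed out after 3.00 seconds",
    "REPORT RequestId: b2\tDuration: 950.0 ms\tBilled Duration: 950 ms\tMemory Size: 128 MB\tMax Memory Used: 128 MB",
    "[ERROR] ClientError: An error occurred (AccessDeniedException) when calling the GetItem operation",
    "[INFO] order 42 processed",
]
print([classify_line(line) for line in sample])
```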
For parts of the system that are more fault-tolerant, developers may disable alerting on individual issues and set up aggregated metrics instead. This shifts attention from development to debugging only when it's really required.
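An aggregated metric can be as simple as an error rate over a sliding window: single failures stay quiet, and an alert fires only when the rate crosses a threshold. A hypothetical sketch using a count-based window of recent invocations (a time-based window would work similarly); the class name and thresholds are assumptions.

```python
from collections import deque


class ErrorRateMonitor:
    """Alert on the error rate over the last `window` invocations instead of
    on every single error (hypothetical helper for fault-tolerant parts)."""

    def __init__(self, window=100, max_error_rate=0.05):
        self.outcomes = deque(maxlen=window)
        self.max_error_rate = max_error_rate

    def record(self, failed):
        """Record one invocation; return True when the aggregated rate is too high."""
        self.outcomes.append(bool(failed))
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data yet
        return sum(self.outcomes) / len(self.outcomes) > self.max_error_rate


monitor = ErrorRateMonitor(window=10, max_error_rate=0.2)
results = [monitor.record(failed) for failed in [False] * 9 + [True, True, True]]
print(results[-3:])  # [False, False, True]
```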
Many errors need to be fixed immediately because they significantly impact the user experience. That's why developers need to be notified quickly & conveniently.
Many teams use a dedicated Slack channel for critical errors.
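Slack incoming webhooks accept an HTTP POST with a JSON body containing a `text` field. A minimal standard-library sketch, assuming you've created a webhook for your critical-errors channel; the function names, message format, and placeholder URL are hypothetical, and the network call is left commented out.

```python
import json
import urllib.request


def build_slack_message(function_name, error_type, detail):
    """Payload for a Slack incoming webhook; a plain "text" field is the simplest format."""
    return {"text": f"[critical] {function_name} failed with {error_type}: {detail}"}


def send_to_slack(webhook_url, payload, timeout=5):
    """POST the JSON payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical function name and error; the webhook URL is a placeholder.
payload = build_slack_message("checkout-handler", "TimeoutError",
                              "Task timed out after 6.00 seconds")
print(payload["text"])
# send_to_slack("<your-incoming-webhook-url>", payload)
```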
Wrap Up
As we've seen, there are many reasons to have great monitoring. Most importantly, it makes the developer's job easier and more enjoyable, provides confidence in your app's reliability & frees up time!
The complete article & more serverless-related posts can be found on Dashbird's blog!
Are you building a serverless SaaS product or already running one?
Register at @thedashbird to try it out for free or send me a message to book a free demo! 👩‍💻
CloudFront is a Content Delivery Network (CDN): a globally distributed set of caching servers that store content returned by your origin servers, enabling fast, low-latency access to your content around the globe.
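The caching idea behind a CDN edge location can be shown with a toy model: serve a cached copy until its TTL expires, otherwise go back to the origin. This is a simplified sketch, not how CloudFront is actually implemented (real caching behavior depends on cache policies and headers); the class name, fake clock, and origin function are hypothetical.

```python
import time


class EdgeCache:
    """Toy model of one CDN edge location: serve cached copies until their TTL expires."""

    def __init__(self, fetch_from_origin, ttl_seconds=60, clock=time.monotonic):
        self.fetch_from_origin = fetch_from_origin
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.entries = {}  # path -> (expires_at, body)
        self.hits = self.misses = 0

    def get(self, path):
        now = self.clock()
        cached = self.entries.get(path)
        if cached and cached[0] > now:
            self.hits += 1  # served close to the user
            return cached[1]
        self.misses += 1
        body = self.fetch_from_origin(path)  # slower round trip to the origin
        self.entries[path] = (now + self.ttl_seconds, body)
        return body


fake_now = [0.0]
origin_calls = []


def origin(path):
    origin_calls.append(path)
    return f"<content of {path}>"


edge = EdgeCache(origin, ttl_seconds=60, clock=lambda: fake_now[0])
edge.get("/index.html")  # miss: fetched from the origin
edge.get("/index.html")  # hit: served from the edge
fake_now[0] = 120.0
edge.get("/index.html")  # TTL expired: fetched again
print(edge.hits, edge.misses, len(origin_calls))  # 1 2 2
```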
A lot comes out of the box to give you insights into how well your serverless app is performing.
A quick overview to get you started:
1️⃣ Amazon CloudWatch
CloudWatch automatically monitors your functions on your behalf and reports many useful metrics:
• number of invocations
• execution durations
• errors
• throttles
Everything is exposed at the function level!
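These metrics live in the `AWS/Lambda` namespace as `Invocations`, `Duration`, `Errors`, and `Throttles`, with a `FunctionName` dimension for per-function data. A minimal sketch that only builds the keyword arguments for boto3's CloudWatch `get_metric_statistics` call (so it runs without AWS access) and derives an error rate; the helper names, the chosen statistics, and "my-function" are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Lambda metric names in the AWS/Lambda namespace and the statistic used for each.
LAMBDA_METRICS = {"Invocations": "Sum", "Errors": "Sum",
                  "Throttles": "Sum", "Duration": "Average"}


def metric_query(function_name, metric_name, hours=1, period=300):
    """Keyword arguments for boto3's CloudWatch get_metric_statistics call."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Lambda",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period,  # seconds per datapoint
        "Statistics": [LAMBDA_METRICS[metric_name]],
    }


def error_rate(errors_sum, invocations_sum):
    """Share of invocations that ended in a function error."""
    return errors_sum / invocations_sum if invocations_sum else 0.0


# With boto3 installed and AWS credentials configured, for example:
#   import boto3
#   cw = boto3.client("cloudwatch")
#   points = cw.get_metric_statistics(**metric_query("my-function", "Errors"))["Datapoints"]
query = metric_query("my-function", "Errors")  # "my-function" is a placeholder
print(query["Namespace"], query["MetricName"], query["Dimensions"])
print(f"{error_rate(12, 4800):.2%}")  # 0.25%
```

For latency percentiles such as p99, `get_metric_statistics` also accepts `ExtendedStatistics` instead of `Statistics`.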
2️⃣ AWS CloudTrail
CloudTrail offers governance, compliance & auditing features for many AWS services, including Lambda.
It enables you to log actions taken on your infrastructure (with support for encrypted log files!), regardless of whether they're performed via the console UI or the AWS SDK!
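CloudTrail delivers log files as JSON with a `Records` array; each record carries fields such as `eventTime`, `eventSource`, `eventName`, and `userIdentity`. A minimal sketch that lists who changed Lambda resources, assuming one log file's contents are already loaded as a string; the helper name and trimmed sample records are hypothetical (real Lambda event names may also carry an API-version suffix).

```python
import json


def lambda_changes(cloudtrail_log_json):
    """Return (time, action, caller ARN) for every Lambda event in one CloudTrail log file."""
    records = json.loads(cloudtrail_log_json).get("Records", [])
    return [
        (r["eventTime"], r["eventName"], r.get("userIdentity", {}).get("arn", "unknown"))
        for r in records
        if r.get("eventSource") == "lambda.amazonaws.com"
    ]


# Hypothetical, heavily trimmed log file with two records.
sample_log = json.dumps({"Records": [
    {"eventTime": "2024-01-01T09:15:00Z", "eventSource": "lambda.amazonaws.com",
     "eventName": "UpdateFunctionConfiguration",
     "userIdentity": {"arn": "arn:aws:iam::123456789012:user/alice"}},
    {"eventTime": "2024-01-01T09:20:00Z", "eventSource": "s3.amazonaws.com",
     "eventName": "PutBucketPolicy",
     "userIdentity": {"arn": "arn:aws:iam::123456789012:user/bob"}},
]})
changes = lambda_changes(sample_log)
print(changes)
```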
• Introduction
• Importance of Messaging Systems
• Fundamentals
• Queue Types
• Visibility Timeouts
• Retention Periods
• Limitations
Introduction
Believe it or not: SQS was the first publicly launched AWS service!
Quoting Jeff Barr:
"We launched the Simple Queue Service in late 2004, Amazon S3 in early 2006, and Amazon EC2 later that summer."
Thanks for all your interest in my AWS 1x1 threads!
The good news: there's a lot more in the pipeline!
... also for Azure
Didn't see the previous ones yet?
Links to all my recent posts are below: