Building an event-driven, reliable serverless application is a difficult task 👨‍💻

What's also challenging: monitoring your ever-growing ecosystem of functions ƛ

My ultimate guide to monitoring serverless apps
↓
Thread Overview 🧵

• Problem Statement
• What to Monitor?
• Performance Monitoring
• Costs & Usage
• Monitoring Tools
• Benefits of Serverless Monitoring

{ 1/28 }
Serverless architectures bring us a lot of known benefits:
• less operational overhead
• only paying for actually used resources
• reduced cycle times due to small, often independent deployment units
• instant scaling

... and much more.

{ 2/28 }
As with everything, there's not only a bright side but also some trade-offs, like setting up proper monitoring.

There are many more units to monitor, their life cycles are short & running monitoring agents inside your functions adds directly to latency and cost.

{ 3/28 }
Before digging into how to solve our monitoring dilemma, let's go one step back: what do we even need to monitor for serverless applications?

To get the maximum benefit out of serverless: latency, cold starts, errors, cost & usage.

{ 4/28 }
Latency Monitoring

Large data sets can make it hard to notice a small performance drop for some user-facing function calls, as average metrics quickly hide outliers.

We need to keep an eye on mission-critical functions & watch for outliers.

{ 5/28 }
Regarding Service Level Agreements, we're often facing p99 requirements, which mean that 99% of requests must not exceed a given latency threshold.

Even a small set of outliers can quickly break such requirements if not watched carefully.
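To make the "averages hide outliers" point concrete, here is a minimal sketch that compares the mean with the 99th percentile. The sample latencies, the 500 ms threshold, and the `percentile` helper are all assumptions made up for illustration; they are not taken from any real SLA.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]


# Hypothetical sample: 97 fast requests and 3 slow outliers (milliseconds).
latencies_ms = [100] * 97 + [2500] * 3
THRESHOLD_MS = 500  # assumed p99 target for this example only

average_ms = mean(latencies_ms)
p99_ms = percentile(latencies_ms, 99)

print(f"average: {average_ms:.0f} ms")                        # average: 172 ms
print(f"p99: {p99_ms} ms")                                    # p99: 2500 ms
print(f"p99 within threshold: {p99_ms <= THRESHOLD_MS}")      # False
```

The average looks healthy at 172 ms, while the p99 sits at 2500 ms and clearly violates the assumed threshold.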

{ 6/28 }
Cold Starts

When a new function instance has to be provisioned, AWS starts a new micro-container. This takes time and drastically increases the latency of that request.

Even worse: a burst of parallel requests needs multiple containers.

{ 7/28 }
That's because a function instance can only process one request at a time.

It's important to track the number of cold starts so you can make architectural improvements if necessary, as there are a lot of possible measures to reduce customer-facing cold starts.
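One way to track cold starts is to scan the `REPORT` lines Lambda writes to its logs: an `Init Duration` field shows up only when a new execution environment had to be initialised. The sketch below assumes you already have those lines as plain strings that start with `REPORT`; the sample lines and the `cold_start_stats` helper are made up for illustration.

```python
import re

# "Init Duration" is only present on invocations that were cold starts.
INIT_DURATION = re.compile(r"Init Duration: ([\d.]+) ms")


def cold_start_stats(log_lines):
    """Count cold starts and average their init duration in milliseconds."""
    init_durations = []
    invocations = 0
    for line in log_lines:
        if not line.startswith("REPORT"):
            continue  # ignore START, END and application log lines
        invocations += 1
        match = INIT_DURATION.search(line)
        if match:
            init_durations.append(float(match.group(1)))
    cold = len(init_durations)
    return {
        "invocations": invocations,
        "cold_starts": cold,
        "cold_start_ratio": cold / invocations if invocations else 0.0,
        "avg_init_ms": sum(init_durations) / cold if cold else 0.0,
    }


# Hypothetical log excerpt with one cold start out of three invocations.
sample_lines = [
    "REPORT RequestId: a1\tDuration: 12.30 ms\tBilled Duration: 13 ms\t"
    "Memory Size: 128 MB\tMax Memory Used: 60 MB\tInit Duration: 210.50 ms",
    "REPORT RequestId: a2\tDuration: 3.10 ms\tBilled Duration: 4 ms\t"
    "Memory Size: 128 MB\tMax Memory Used: 60 MB",
    "REPORT RequestId: a3\tDuration: 2.90 ms\tBilled Duration: 3 ms\t"
    "Memory Size: 128 MB\tMax Memory Used: 61 MB",
]
print(cold_start_stats(sample_lines))
```

If your log export prefixes each line with a timestamp, the `startswith` check would need to be relaxed accordingly.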

{ 8/28 }
Invocation Errors

There are a variety of reasons why a Lambda invocation can raise an error.
Invocation errors return an HTTP 4xx or 5xx status code, which means the invocation was rejected before the function received it.

Of course, those are not the only possible problems.

{ 9/28 }
Outgoing calls to third parties can fail without anybody noticing, or rate limits can be exceeded. Finding out what the actual bottleneck is can be difficult.

Getting notified about failures & pinpointing where & when the error happened will save hours and reduce downtime.
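As a rough sketch of how a silent third-party failure can be made visible, the wrapper below logs a structured record with the target, the error type, and the elapsed time before re-raising. The `call_third_party` helper, the field names, and the `payments-api` target are hypothetical; adapt them to whatever your log tooling expects.

```python
import json
import logging
import time

logger = logging.getLogger("outgoing_calls")


def call_third_party(name, func, *args, **kwargs):
    """Run an outgoing call and log a structured record if it fails."""
    started = time.monotonic()
    try:
        return func(*args, **kwargs)
    except Exception as exc:
        logger.error(json.dumps({
            "event": "outgoing_call_failed",
            "target": name,
            "error_type": type(exc).__name__,
            "message": str(exc),
            "elapsed_ms": round((time.monotonic() - started) * 1000, 1),
        }))
        raise  # re-raise so the failure is never silently swallowed


def flaky_payment_call():
    """Stand-in for a real HTTP call; always fails for demonstration."""
    raise TimeoutError("rate limit exceeded")


try:
    call_third_party("payments-api", flaky_payment_call)
except TimeoutError:
    pass  # the caller decides how to recover; the log record already exists
```

Because every record carries the same keys, a log query or metric filter can later answer "which dependency failed, and when" without guessing.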

{ 10/28 }
Costs and Usage

Your Lambda bill is made up of three factors:
• number of requests
• compute time
• configured memory

Compute time and memory are combined into GB-seconds: how many seconds a function with 1 GB of allocated memory was running.

{ 11/28 }
You're only charged for compute after the first 400,000 GB-seconds per month, as this is the free tier limit.

With that allowance, you can run a 128 MB function for the whole month without interruption, but a 1 GB function for less than 5 days.
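The arithmetic behind those two numbers can be checked with a few lines. This is a minimal sketch that assumes 1 GB = 1024 MB and only looks at the compute allowance, not at request charges; the helper names are made up for this example.

```python
FREE_TIER_GB_SECONDS = 400_000  # monthly compute allowance named above
SECONDS_PER_DAY = 24 * 60 * 60


def gb_seconds(memory_mb, duration_seconds):
    """GB-seconds consumed by running for duration_seconds at memory_mb."""
    return (memory_mb / 1024) * duration_seconds


def free_tier_days(memory_mb):
    """Days of uninterrupted runtime the free tier covers at this memory size."""
    seconds = FREE_TIER_GB_SECONDS / (memory_mb / 1024)
    return seconds / SECONDS_PER_DAY


for memory_mb in (128, 1024):
    print(f"{memory_mb} MB: {free_tier_days(memory_mb):.1f} days")
# 128 MB: 37.0 days
# 1024 MB: 4.6 days
```

At 128 MB the allowance lasts about 37 days, longer than any month, while at 1 GB it is used up after roughly 4.6 days.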

{ 12/28 }
As you're generally not only using Lambda, there are also costs for other services like S3, DynamoDB, or SNS.

While your app is growing rapidly, it's easy to lose track of what's been spent & whether resources have been used efficiently and intentionally.

{ 13/28 }
Monitoring Tools

Looking at the problem statement, it's easy to see that there's a need for monitoring all of those areas. Running an app blind won't work for very long.

Let's have a look at what AWS brings & how it compares to Dashbird.

{ 14/28 }
AWS CloudWatch

What's already in the box: your functions' logs are collected in log streams, grouped into one log group per function. Additionally, CloudWatch collects metrics that can be visualized in dashboards.

Even more: you can set up alarms on those metrics.

{ 15/28 }
With alarms, you'll be notified if predefined thresholds are exceeded.
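For illustration, here is a hedged sketch of what such an alarm could look like when created through boto3's `put_metric_alarm`. The alarm name, the one-minute period, the threshold of one error, and the SNS topic are assumptions chosen for this example; only the builder function runs without AWS access.

```python
def error_alarm_params(function_name, sns_topic_arn, threshold=1):
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": f"{function_name}-errors",  # hypothetical naming scheme
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,  # evaluate one-minute buckets
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no invocations, no alarm
        "AlarmActions": [sns_topic_arn],  # notify this SNS topic
    }


def create_error_alarm(function_name, sns_topic_arn):
    """Create the alarm; needs boto3 and AWS credentials, so it is not run here."""
    import boto3  # imported lazily so the builder above works without it

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**error_alarm_params(function_name, sns_topic_arn))
```

Keeping the parameters in a plain function makes the alarm definition easy to review and reuse for several functions.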

CloudWatch is a good starting point for your first FaaS application. But the more your landscape grows and the more request volume your app receives, the more you'll need a comprehensive tool.

{ 16/28 }
Dashbird.io provides enhanced error alerting & observability for everything around AWS Lambda without affecting performance or costs, as it gathers logs & metrics through AWS APIs.

It starts by providing a great high-level overview of your app's health.

{ 17/28 }
You can drill down into invocation level data to analyze individual functions.

Services that are closely related to Lambda and widely used are also covered: DynamoDB, SQS, API Gateway, Kinesis, Step Functions & ECS.

{ 18/28 }
Furthermore, the Well-Architected Lens helps to find potential issues & implement best practices.

{ 19/28 }
Instant Benefits of Serverless Monitoring

There are a lot of benefits at first glance: you'll save a lot of time debugging and generally have a more productive business, team & application.

But that's not all.

{ 20/28 }
Issue Management and Team Collaboration

Regardless of how well your app is built, it will generate a fair number of issues on a regular basis.

Those issues need to be tracked, visualized, and managed in an efficient way.

{ 21/28 }
There needs to be a clear way of displaying open, resolved, and temporarily muted issues, so that the team can collaborate around a shared resolution workflow.

{ 22/28 }
Quality Tracking

A quick way of visualizing past occurrences of the same issue can be important, as some cases require further investigation.

Recurring issues may indicate that earlier fixes didn't work out as expected.

It also helps to avoid repeating the same mistakes.

{ 23/28 }
Event-Driven Debugging

Developers shouldn't carry the burden of constantly checking for problems themselves; they should be able to rely on automated alerting. An automated alerting system may sound fundamental, but it's easy to miss relevant signals, especially when working with Lambda.

{ 24/28 }
The alerting mechanism should not only detect app errors, but also infrastructure faults like timeouts, container crashes, memory exhaustion, and misconfigurations like incorrect access policies.
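A very rough sketch of how such faults could be picked out of raw log lines follows. The message fragments are assumptions based on what Lambda typically logs for timeouts, killed runtimes, and denied permissions; verify them against your own logs before relying on them. The `classify` and `count_faults` helpers are hypothetical.

```python
import re
from collections import Counter

# Assumed message fragments, checked in order; the first match wins.
FAULT_PATTERNS = [
    ("timeout", re.compile(r"Task timed out after [\d.]+ seconds")),
    ("out_of_memory", re.compile(r"Runtime exited with error: signal: killed")),
    ("runtime_crash", re.compile(r"Runtime exited with error")),
    ("access_denied", re.compile(r"AccessDenied")),
]


def classify(line):
    """Return the fault category for a log line, or None if nothing matches."""
    for name, pattern in FAULT_PATTERNS:
        if pattern.search(line):
            return name
    return None


def count_faults(lines):
    """Aggregate fault categories over many log lines."""
    return Counter(fault for fault in map(classify, lines) if fault)


# Hypothetical log excerpt.
sample_lines = [
    "2021-11-12T10:00:03Z a1 Task timed out after 3.00 seconds",
    "RequestId: a2 Error: Runtime exited with error: signal: killed",
    "RequestId: a3 Error: Runtime exited with error: exit status 1",
    "AccessDeniedException: User is not authorized to perform dynamodb:PutItem",
    "INFO order 42 processed",
]
print(count_faults(sample_lines))
```

The more specific out-of-memory pattern is listed before the generic crash pattern so that a killed runtime is not counted as an ordinary crash.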

With the immense amount of logs, that's not a trivial task.

{ 25/28 }
For parts of the system that are more tolerant to faults, developers may disable individual issue alerting and set up aggregation metrics. This allows the attention to shift from development to debugging only when it's really required.
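An aggregation metric of this kind can be as simple as an error rate per time window. The sketch below is one possible shape; the 5% limit and the minimum of 20 invocations are illustrative assumptions, not recommended values.

```python
def error_rate(invocations, errors):
    """Share of failed invocations in a window (0.0 when nothing ran)."""
    return errors / invocations if invocations else 0.0


def should_alert(invocations, errors, max_error_rate=0.05, min_invocations=20):
    """Alert only when the window is busy enough and too many calls failed."""
    if invocations < min_invocations:
        return False  # too little traffic to judge the rate
    return error_rate(invocations, errors) > max_error_rate


print(should_alert(invocations=1000, errors=10))  # False, 1% is tolerated
print(should_alert(invocations=1000, errors=80))  # True, 8% exceeds the limit
```

The minimum-invocation guard is a design choice: it keeps a single failed call in a quiet window from paging anyone, at the price of ignoring low-traffic periods entirely.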

{ 26/28 }
Fast Communication

Lots of errors need to be fixed immediately, as they significantly impact the user experience. That's why developers need to be notified in a fast & convenient way.
Most teams use a dedicated Slack channel for critical errors.
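As a minimal sketch, posting a critical error to a Slack channel can be done with an incoming webhook, which accepts a JSON body with a `text` field. The webhook URL below is a placeholder and the helper names are hypothetical; the request is only built here, not sent.

```python
import json
import urllib.request


def build_slack_request(webhook_url, function_name, error_message):
    """Prepare (but do not send) a POST request for a Slack incoming webhook."""
    payload = {"text": f":rotating_light: {function_name} failed: {error_message}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify_slack(request, timeout_seconds=3):
    """Send the prepared request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.status


# Placeholder URL; a real webhook URL comes from your Slack workspace settings.
request = build_slack_request(
    "https://example.com/hypothetical-slack-webhook",
    "checkout-handler",
    "Task timed out after 3.00 seconds",
)
print(request.data.decode("utf-8"))
```

Splitting building from sending keeps the message format testable without any network access.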

{ 27/28 }
Wrap up

As we've seen, there are a lot of reasons for having great monitoring. Most importantly, it makes the developer's job easier and more enjoyable, provides confidence in your app's reliability & frees up time!

{ 28/28 }
The complete article & more serverless-related posts can be found on Dashbird's blog! ✍️

Are you building a serverless SaaS product or already running one? 👋
Register at @thedashbird to try it out for free or send me a message to book a free demo! 👩‍💻

dashbird.io/blog/ultimate-…

