re:Invent starts tomorrow, so let me round up the biggest #serverless related announcements from the last 2 weeks (I know, crazy!) and share a few thoughts on what they mean for you.
Mega 🧵 If this gets 168 retweets then I'll add another batch!
1. Lambda released Logs API which works with the Lambda Extensions mechanism that was released in Oct. This lets you subscribe to Lambda logs in a Lambda extension and ship them elsewhere WITHOUT going through CloudWatch Logs
a. it lets you side-step CloudWatch Logs, which often costs more (sometimes 10x more) than Lambda invocations in production apps.
b. it's possible (although not really feasible right now) to ship logs in real-time
You can disable the Lambda permissions for shipping logs to CW Logs so they're never ingested by CW Logs (and so you don't pay for CW Logs), and use an extension to ship the logs to a log aggregation service (e.g. loggly, logz.io) instead.
There were a bunch of AWS launch partners with official extensions to take advantage of this capability. I expect the ecosystem around this to grow in the coming months.
The shipping logs in real-time aspect is trickier, because the invocation is not finished until all the extensions are done. So if an extension is busy waiting for logs and shipping them, that adds to the function duration and API latency. So it's not really feasible yet.
What we need is a way to let the function respond while giving the extensions the chance to carry on with some background task, maybe for another X ms. Not sure how possible this is, and how that impacts the billing model though.
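If you're curious what subscribing to the Logs API looks like, here's a rough sketch of the JSON body an extension sends during init. The field names follow the Logs API (2020-08-15) schema, but the buffering values and local listener port are my own illustrative choices, not defaults:

```python
# Sketch of the Logs API subscription body an extension sends during init.
# Field names follow the Lambda Logs API (2020-08-15); the buffering values
# and local port (8080) here are illustrative choices, not defaults.
def build_logs_subscription(port: int = 8080, types=("function", "platform")) -> dict:
    """Build the JSON body an extension PUTs to the Logs API
    so Lambda delivers log events to a local HTTP endpoint."""
    return {
        "schemaVersion": "2020-08-15",
        "types": list(types),          # which log streams to receive
        "buffering": {                 # how Lambda batches deliveries
            "maxItems": 1000,
            "maxBytes": 262144,
            "timeoutMs": 100,
        },
        "destination": {
            "protocol": "HTTP",
            "URI": f"http://sandbox:{port}",  # the extension's local listener
        },
    }

# The body is PUT to http://{AWS_LAMBDA_RUNTIME_API}/2020-08-15/logs with the
# extension's identifier in the Lambda-Extension-Identifier header.
```

From there the extension's local HTTP server receives batches of log events and can forward them wherever you like.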
2. Lambda announced code signing support through AWS Signer. Lambda checks the signature to make sure the code hasn't been tampered with.
This is probably not something most of you would care about. But in some regulated environments, this might be required to tick a box for audit and compliance requirements.
Honestly, I'm glad this is nothing that I've ever had to do! 😅
3. Lambda added support for AVX2 so you can build your app to target the AVX2 instruction set (lets CPU do more calculations per CPU cycle).
This is very relevant for those of you who are using Lambda as a cheap, on-demand Supercomputer for scientific computing, monte carlo simulations, or any other CPU-intensive tasks. @allPowerde and her work at @CSIRO springs to mind (analyzing SARS-COV2 genomes)
I don't think it'll make too much difference to your run-of-the-mill REST APIs. Although maybe you can save a few ms if you need to parse large JSON objects and the extra CPU power can help. And since power == memory == 💰💰 in Lambda, it might save you some 💰💰 too
That being said, until you've benchmarked the difference and proved that it's worth doing, plz don't make your life more complicated than it needs to be.
It's really easy to spend $10000 in eng effort to save $10/month with Lambda...
Sometimes, good enough is good enough.
4. Lambda supports a batch window of up to 5 mins for SQS. This lets Lambda batch up SQS messages before invoking your function, so you can now process up to 10k SQS messages in one invocation (instead of the previous max batch size of 10).
This is great for really high throughput systems, where it'd have otherwise required A LOT more Lambda invocations to process the messages. So you can save on Lambda invocation cost. IF the tasks are not time-sensitive, that is.
There are caveats to consider though.
Messages can be buffered for up to 5 mins before being handed to your function, and processing might take longer than before because the function is now working through a much bigger batch than 10. So adjust the visibility timeout accordingly.
AWS used to recommend using 6x the Lambda timeout for the SQS visibility timeout, I suspect this needs to be updated if you're using a batch window, maybe it needs to be 6x Lambda timeout + MaximumBatchingWindowInSeconds
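As a quick sanity check, that rule of thumb would work out like this (my extrapolation of the old "6x function timeout" advice, not official guidance):

```python
# Rule-of-thumb visibility timeout when using an SQS batch window.
# This extends the old "6x function timeout" advice by the batch window;
# it's my guess at the adjustment, not official AWS guidance.
def sqs_visibility_timeout(function_timeout_s: int, batch_window_s: int = 0) -> int:
    return 6 * function_timeout_s + batch_window_s

# e.g. a 30s function with a 5 min (300s) batch window:
# 6 * 30 + 300 = 480 seconds
```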
You know how partial failures can be tricky? Well, before you had a batch of up to 10 messages, now you can have a batch of up to 10,000, so partial failures are gonna be even trickier to deal with...
One strategy was to call DeleteMessageBatch on successful messages yourself and then bubble up the error (see this post: lumigo.io/blog/sqs-and-l…)
Now imagine, you have 1/10000 failure, so you need to call DeleteMessageBatch for 9999 messages, in batches of 10 (SQS limit)...
Would that even work? You'll probably get throttled, or at the very least, you need a long timeout for the function to cater for "extra time it takes to call DeleteMessageBatch on 9999 messages".
In practice, this edge case might never transpire though.
I hope...
Maybe in light of this, it's better to swallow the error and publish the failed messages to a separate queue for the cast-aways where they get retried a few more times (without the batch window) before being DLQ'd.
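A rough sketch of that cast-away approach (the queue URL, message shape and helper names are all mine; the 10-entry chunking is the real SQS batch API limit):

```python
def chunk(items, size=10):
    """SQS batch APIs cap out at 10 entries per call."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def forward_failures(sqs, retry_queue_url, failed_messages):
    """Instead of bubbling the error up (which re-drives the whole batch),
    send just the failed messages to a separate 'cast-away' queue where
    they can be retried without the batch window, then DLQ'd.
    Message shape ({"body": ...}) is illustrative."""
    for batch in chunk(failed_messages):
        sqs.send_message_batch(
            QueueUrl=retry_queue_url,
            Entries=[
                {"Id": str(i), "MessageBody": msg["body"]}
                for i, msg in enumerate(batch)
            ],
        )
```

With 1 failure out of 10,000 this is 1 `SendMessageBatch` call, instead of 1,000 `DeleteMessageBatch` calls for the 9,999 successes.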
5. EventBridge supports server-side encryption (SSE) and upped default throughput to 10,000 events/s in us-east-1, us-west-2 and eu-west-1.
I gotta admit, I thought EventBridge had SSE support already!
And it looks like it doesn't support customer-managed keys (CMKs) yet, which might be a showstopper for regulated environments where the customer has to import their own key materials.
And it has the same gotcha as SSE-S3, where you don't need permissions for the KMS key to access the data.
That being said, I'm not sure how CMKs would even work in a push-based service... can't think of another push-based AWS service that does SSE with CMKs... 🤔
On the other hand, the rate limit raise is definitely good news.
PS. it's still a soft limit, so you can raise it further if you need to, just pop into the Service Quotas console and raise a request.
6. Step Functions supports sync workflow executions (Express Workflows only). This means you can finally have API Gateway trigger a state machine execution and get the response back without having to poll for it.
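For the curious, here's roughly what a sync execution looks like from code - a sketch only: I'm injecting the boto3 `stepfunctions` client, and the ARN and payload are placeholders. The call is the `StartSyncExecution` API, whose response carries `status`, `output` and `error` fields:

```python
import json

def run_sync(sfn_client, state_machine_arn: str, payload: dict) -> dict:
    """Synchronously execute an Express Workflow via StartSyncExecution
    and return the parsed output. The client is injected (e.g. a boto3
    'stepfunctions' client); ARN and payload are placeholders."""
    resp = sfn_client.start_sync_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(payload),
    )
    if resp["status"] != "SUCCEEDED":
        raise RuntimeError(f"execution failed: {resp.get('error')}")
    return json.loads(resp["output"])

# usage (sketch):
#   run_sync(boto3.client("stepfunctions"), state_machine_arn, {"orderId": "123"})
```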
This is gonna open up a lot of questions around "should I use a Lambda function or model the biz logic as a state machine in Step Functions?" That deserves a whole separate blog post to analyze the pros and cons. But I suspect in 90% of cases you'd still be better off with Lambda
But there's some really cool stuff you can do with Step Functions, like using a callback pattern to wait for a human response or using a Map state to handle map-reduce type workloads, or throw in a good old saga for distributed transactions read.acloud.guru/how-the-saga-p…
7. Step Functions adds service integration with API Gateway. This lets you call an API Gateway endpoint (and sign the request with sigv4) without having to use a Lambda function.
This is really significant. More and more folks are running a centralised event bus in a separate account, and subscribing and pushing events to this centralised bus was painful - it required touching both accounts (the event bus account as well as the event consumer/publisher account)
I need to try this out for myself to form an opinion about how well it actually works. But from the docs, it looks like it *should* work the way you expect it to.
More on this later.
9. DynamoDB adds support for PartiQL.
I'm afraid this sounds like it's a much bigger deal than it actually is...
First of all, it doesn't change DynamoDB fundamentally, it hasn't introduced any new capability beyond "a different syntax to query your data" (albeit a syntax that you might be more comfortable with). It doesn't eliminate any of the limitations on how you can query your data.
In fact, there are some dark caveats to look out for.
If your statement doesn't pin down the partition key, it will transparently switch from a Query to a Scan, and we say "friends don't let friends do DynamoDB scans" for a reason - it's expensive, inefficient and SLOOOOOWWW.
And I get that it comes from a good place - to make it easier for folks coming from SQL to adopt DynamoDB. But I feel this is like handing a loaded gun to a baby and pointing it at their feet... it's gonna be so easy to do the wrong thing with this...
And I think it also just delays the point when someone realises "I need to pick up @alexbdebrie 's book and learn DynamoDB", maybe to never! Because they just need to shoot their foot once to run back to RDBMS, and they're not coming back this time...
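To make the Scan caveat concrete, here are two PartiQL statements against a hypothetical "orders" table with partition key "pk" (table and attribute names are mine), plus a deliberately naive guard - a real check would actually parse the statement:

```python
# Two PartiQL statements against a hypothetical "orders" table whose
# partition key is "pk". Names are illustrative, not from any real schema.
QUERY_LIKE = "SELECT * FROM orders WHERE pk = 'user#123'"           # -> Query
SCAN_IN_DISGUISE = "SELECT * FROM orders WHERE status = 'pending'"  # -> full table Scan!

def might_scan(statement: str, partition_key: str) -> bool:
    """Naive guard: flag PartiQL statements whose WHERE clause doesn't
    mention the partition key. A simple substring check like this has
    false negatives; a real implementation would parse the statement."""
    lowered = statement.lower()
    return "where" not in lowered or partition_key.lower() not in lowered
```

Both statements look equally innocent in SQL syntax - which is exactly the loaded-gun problem.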
anyway... let's move onto something more positive
10. (I feel like I've been typing for ever, can't believe it's only 10!) DynamoDB adds support for Kinesis streams, in addition to the existing DynamoDB stream
The Good:
✅ can use one stream to capture updates from multiple tables
✅ can use Kinesis's direct integrations - e.g. Kinesis Analytics, Firehose
✅ more control around sharding and throughput
✅ longer retention (Kinesis supports up to 1 year, vs 24 hours for DynamoDB streams)
✅ more than 2 consumers
The Bad:
❎ the sequence of records is not guaranteed
❎ you can get duplicated events
The official doc has a nice side-by-side comparison of the two streaming options
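Because of those two caveats, a Kinesis consumer for DynamoDB changes needs to dedupe and re-order per batch. A minimal sketch, assuming the DynamoDB-to-Kinesis change record shape (with `eventID` and `dynamodb.ApproximateCreationDateTime` fields):

```python
def dedupe_and_order(records):
    """Kinesis can deliver DynamoDB change records out of order and more
    than once. Dedupe by eventID and re-order by creation time within a
    batch. Field names assume the DynamoDB-to-Kinesis record format."""
    seen = {}
    for r in records:
        seen.setdefault(r["eventID"], r)  # drop duplicates, keep first copy
    return sorted(
        seen.values(),
        key=lambda r: r["dynamodb"]["ApproximateCreationDateTime"],
    )
```

Note this only fixes ordering *within* a batch - cross-batch ordering (e.g. for the same item landing on different shards) needs more care.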
Given all the excitement over Lambda's per-ms billing change today, some of you might be thinking about how much money you can save by shaving 10ms off your function.
Fight that temptation 🧘♂️ until you can prove the ROI on doing the optimization.
With the per-ms billing, you're automatically saving on your Lambda cost already, by NOT having your invocation time rounded up to the next 100ms.
However, this might not mean much in practice for a lot of you, because your Lambda bill is $5/month, so saving even 50% only buys you a cup of Starbucks coffee a month.
Imagine a team getting together to discuss the optimization: assuming $50 (which is VERY conservative) per dev per hour, it would take them 40 months to break even on just having the meeting, before writing a single line of code!
Unless you're invoking a function at such high frequency that the savings are significant, those micro-optimizations won't be worth the eng time you have to invest.
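A back-of-the-envelope calculator for the per-ms vs rounded-to-100ms duration cost (the per-GB-second price below is illustrative - check the current Lambda pricing page):

```python
# Back-of-the-envelope comparison of per-ms vs rounded-up-to-100ms billing.
# Price per GB-second is illustrative; check the current Lambda pricing page.
PRICE_PER_GB_S = 0.0000166667

def monthly_duration_cost(invocations, duration_ms, memory_mb, round_to_100ms):
    billed_ms = duration_ms
    if round_to_100ms:
        billed_ms = -(-duration_ms // 100) * 100  # ceil to the next 100ms
    gb_seconds = invocations * (billed_ms / 1000) * (memory_mb / 1024)
    return gb_seconds * PRICE_PER_GB_S

# e.g. 1M invocations of a 128MB function running 42ms:
# old model bills 100ms per invocation, new model bills 42ms - a ~58% cut in
# duration cost, but in absolute terms well under a dollar per million invocations.
```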
I sat down this weekend and had a look at my finances as I'm almost 5 months into my 2nd year as a full-time solo consultant, and noticed that my revenue streams have changed quite a bit over the last 3 years.
This is the result of a conscious effort to reduce my reliance on a few large clients, and also to offset seasonalities and other factors that can affect revenue and create a healthy mix of active and passive income streams.
Overall revenue has grown over time, and my largest client now accounts for less than 20% of my revenue. And I haven't seen too much seasonality to my work yet - summer was quieter because Europeans went on holiday, but it was still OK.
X: in light of last week's #AWS outage, should I make my app multi-region?
me: it depends.
X: on what?
me: how much did the outage cost you in lost sales, reputation cost, etc.? And how much are you willing to invest in improving your uptime in case of another region-wide outage?
X: erm... I'm not sure...
me: don't get me wrong, if you're a large enterprise, I expect you to be multi-region already! Hell, I expect you to be doing chaos engineering and proactively finding weaknesses in your architecture before disasters strike and force you into reacting.
me: but as we can see from these AWS outages, modern systems are complex, and even for companies like AWS, who have invested heavily in resilience and are doing all the right things, 💩 still happens
X: when would you NOT use #AppSync?
me: AppSync gives you a managed #GraphQL server as a service, so if you need a REST API instead then you won't use AppSync. You also wouldn't use AppSync if you need GraphQL/Apollo features that it doesn't support
X: what sorta features are you talking about?
me: you can't define custom scalar types (e.g. LatLon is a popular one), and you miss out on implementation-specific features like Apollo Federation for schema stitching, or utilities like data loaders github.com/graphql/datalo…
X: ok, do you need them to build a production app?
me: no, you can absolutely build production apps without them, but these features can be very useful in some contexts, for example, Netflix uses federation heavily netflixtechblog.com/how-netflix-sc…
X: what's your opinion on VTL templates vs direct lambda invocations with AppSync?
me: you should use VTL templates (e.g. for DynamoDB) by default until it's either impossible or the VTL is getting too complex
X: but why?
me: well, let's see..
me: you can do whatever you want with Lambda, that gives you a lot of flexibility, but also all the drawbacks
X: such as?
me: cold starts, esp for resolvers that don't see a lot of traffic or the complexity overhead for mitigating them (Provisioned Concurrency or lambda warmer)
me: and you also have to consider the operational limits specific to Lambda, such as the *soft regional concurrency limit, and the *hard limit of 500 new concurrent executions per min after the init burst capacity (3000 concurrent executions)
X: ok, fair point.. anything else?