After using AWS for ~14 years, I've internalised a handful of design patterns that I try to apply to my own software. I'm keen to know if it's the same for other folks.
Roughly: tags, IDs (thrice), limits, pagination.
(I'm not going to use the thread emoji)
1: Tags.
A lot of software has support for tags, but it's usually a set of strings. This is useful in the case of "show me all resources tagged 'engineering'". Or even "show me all resources tagged 'engineering' and 'frontend'".
AWS resources go one step further and implement key-value tags. So in addition to the above, you can ask the question "show me all resources grouped by the 'department' tag key". This is super useful for power users, folks in finance, etc.
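A minimal sketch of the grouping question, with made-up resource data (the helper name and shape of the records are my own, not an AWS API):

```python
from collections import defaultdict

# Hypothetical resources carrying AWS-style key-value tags.
resources = [
    {"id": "i-abc123", "tags": {"department": "engineering", "team": "frontend"}},
    {"id": "i-def456", "tags": {"department": "finance"}},
    {"id": "vol-789", "tags": {"department": "engineering"}},
]

def group_by_tag_key(resources, key):
    """Answer 'show me all resources grouped by the given tag key'."""
    groups = defaultdict(list)
    for r in resources:
        if key in r["tags"]:
            groups[r["tags"][key]].append(r["id"])
    return dict(groups)

print(group_by_tag_key(resources, "department"))
# {'engineering': ['i-abc123', 'vol-789'], 'finance': ['i-def456']}
```

With plain string tags you can only filter; with key-value tags the same data also supports grouping and cost-allocation style reports.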
2: Resource ID prefixes
Almost all objects in a web service will have some kind of ID. It might be an auto-incrementing integer, or a GUID, a random string, etc.
When debugging, you will often ask a user for the ID of the misbehaving object. You look, but it's not in the DB??
Turns out they gave you a session ID instead of an order ID. Doh!
This is where resource ID *prefixes* are insanely useful. Think EC2 instance IDs: i-abc123 or EBS volumes: vol-def456.
Now when you ask a user for an ID, you'll know immediately if it's the wrong thing entirely.
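A rough sketch of the pattern (the prefix names and helper functions are hypothetical):

```python
import secrets

# Hypothetical registry mapping resource types to their ID prefixes.
PREFIXES = {"instance": "i", "volume": "vol", "order": "ord", "session": "sess"}

def new_id(resource_type):
    """Mint an ID like 'ord-1a2b3c4d5e6f7a8b'."""
    return f"{PREFIXES[resource_type]}-{secrets.token_hex(8)}"

def assert_id_type(resource_id, expected_type):
    """Fail fast when a user hands you the wrong kind of ID."""
    prefix = resource_id.split("-", 1)[0]
    if prefix != PREFIXES[expected_type]:
        raise ValueError(f"expected a {expected_type} ID, got {resource_id!r}")

assert_id_type("ord-1a2b3c", "order")   # fine
# assert_id_type("sess-9z8y", "order")  # raises ValueError immediately
```

The validation costs one string comparison and turns "it's not in the DB??" into an instant, friendly error message.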
2b: Fully qualified IDs
Sometimes an object ID by itself isn't enough. You store your data hierarchically on S3, so an invoice ID alone can't find its PDF. You also need the customer ID.
This is where AWS uses ARNs. An ARN should be sufficient to locate the stored object.
E.g. instead of passing around invoice IDs like inv-12345, you pass fully qualified IDs like eu:c-acd223:inv-12345. With that string, your code knows the PDF is in the EU partition, under the c-acd223 customer directory, file name inv-12345.pdf. Much quicker troubleshooting.
AWS doesn't always get this right! When ECS first launched, task ARNs were of the format arn:aws:ecs:${region}:${account}:task/${taskId}. But you could have multiple clusters in a single region - so which cluster do you look up?
They later migrated to :task/${cluster}/${taskId}.
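The invoice example above can be sketched as a tiny parser, assuming a hypothetical layout where objects live at `${customerId}/${invoiceId}.pdf` within a per-partition bucket:

```python
def pdf_key_for(fq_id):
    """Map a fully qualified ID like 'eu:c-acd223:inv-12345'
    to a (partition, object key) pair - no DB lookup required."""
    partition, customer_id, invoice_id = fq_id.split(":")
    return partition, f"{customer_id}/{invoice_id}.pdf"

partition, key = pdf_key_for("eu:c-acd223:inv-12345")
# partition == "eu", key == "c-acd223/inv-12345.pdf"
```

The point is that the ID alone carries every path component needed to locate the object, so support tooling never has to join against another table first.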
2c: ULIDs > GUIDs >> Integers
This is a bonus one on the theme of IDs, because I don't think AWS does it.
GUIDs are better than auto-incrementing integer IDs because they don't reveal the number of objects in your DB and they don't have contention on the "next ID" counter.
But ULIDs are better than GUIDs because they encode creation timestamps in the prefix. This means they are naturally sortable by creation time. This is useful for databases like DynamoDB or Redshift where data can be stored sorted and you can easily return the latest N objects.
It's also useful in unexpected places. Maybe you think you don't need a creation timestamp in your customer IDs? I mean, how often are they created? And they rarely ever need to be sorted.
What about when you migrate data store? Your proxy (in the hot path of every request) can forward any customers created after date X to service B and older ones to service A -- all without needing a roundtrip to the DB to look up if they were created before/after the cutover.
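The proxy trick above works because a ULID's first 10 Crockford base32 characters encode a 48-bit millisecond timestamp. A sketch of the routing decision (the cutover value is made up):

```python
# Crockford base32 alphabet, as used by the ULID spec.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def ulid_timestamp_ms(ulid):
    """Decode the 48-bit millisecond timestamp from the
    first 10 characters of a ULID."""
    ts = 0
    for ch in ulid[:10]:
        ts = ts * 32 + ALPHABET.index(ch)
    return ts

CUTOVER_MS = 1_600_000_000_000  # hypothetical migration cutover

def route(customer_ulid):
    """Proxy decision: customers created after cutover go to
    service B, older ones stay on service A."""
    if ulid_timestamp_ms(customer_ulid) >= CUTOVER_MS:
        return "service-b"
    return "service-a"
```

The decode is pure string math in the hot path, which is exactly why skipping the DB roundtrip is viable.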
3: Limits.
AWS publishes limits for just about every service. This serves a number of purposes.
* They can build a service with reliable performance because they know there will never be 10,000x more objects than anticipated during system design.
* Customers can be confident that the service can handle their use case, because it's well within the published limits. Their app isn't going to break because it will hit a pathological behaviour in the AWS service.
* Limits can always be raised over time, or in response to a customer request. This means AWS can provision an appropriate level of resourcing.
IMO, limits show a level of professionalism. It means that the service operator understands what their service can -- and cannot -- do.
4: Pagination.
Every AWS service (that I can think of) uses "next page" cursors for pagination, not page indexes. Hell, maybe they *do* use page indexes under the hood, but that is an *implementation detail* that is not exposed in the opaque "next page" cursor.
This gives you the ability to implement pagination with ease. If your data store is DynamoDB, you can return a wrapped DynamoDB pagination token to your users.
You can also migrate data stores. Maybe you're moving from one that supports page indexes to one that doesn't. Now you don't need to break your API contract or introduce a new v2 API with different pagination.
Maybe your pages need an expensive query to a SQL DB. You don't want to have to repeat that query 10x for 10 pages and throw away 90% of the data each time. This makes your DBA sad.
Instead, on page 1 you retrieve the whole dataset and store it in Redis under a GUID. Now your pagination token can be that GUID + an offset into the Redis array. Now your pages are super speedy to load. You have a happy DBA.
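A sketch of that approach, with a plain dict standing in for Redis and base64-encoded JSON as the opaque cursor (field names and helpers are made up):

```python
import base64
import json
import uuid

cache = {}  # stands in for Redis in this sketch

def page_at(snapshot_id, offset, page_size):
    """Slice one page out of the cached snapshot and mint an
    opaque cursor for the next page, if there is one."""
    rows = cache[snapshot_id]
    page = rows[offset : offset + page_size]
    next_offset = offset + page_size
    cursor = None
    if next_offset < len(rows):
        # Opaque to clients: just bytes they hand back verbatim.
        cursor = base64.urlsafe_b64encode(
            json.dumps({"s": snapshot_id, "o": next_offset}).encode()
        ).decode()
    return page, cursor

def first_page(run_expensive_query, page_size):
    """Run the expensive query exactly once, stash the full
    result set, and return page 1 plus a cursor."""
    snapshot_id = str(uuid.uuid4())
    cache[snapshot_id] = run_expensive_query()
    return page_at(snapshot_id, 0, page_size)

def next_page(cursor, page_size):
    state = json.loads(base64.urlsafe_b64decode(cursor))
    return page_at(state["s"], state["o"], page_size)

page1, cur = first_page(lambda: list(range(10)), page_size=4)
page2, cur = next_page(cur, page_size=4)
# page1 == [0, 1, 2, 3], page2 == [4, 5, 6, 7]
```

Because the cursor format never leaves the server's control, you can later swap the snapshot store or the query strategy without clients noticing.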
Also on pagination: your docs should make explicit that the number of results on a page can be ≤ ${pageSize}.
This means that you can have more predictable performance for your API. You can make sure a page is never more than 1MB of data.
Or that a page doesn't spill over to a second DB partition and take 15x longer to load than the previous page.
PS: don't forget to encrypt your pagination token, or users will reverse engineer the format and start crafting their own. Ideally using AEAD, so they can't tamper with it at all - and the encryption context should include the customer ID to avoid data leaks.
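One way to do that, sketched with AES-GCM from the third-party `cryptography` package and the customer ID bound in as associated data (key handling is simplified; in reality the key comes from your KMS or secret store):

```python
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

KEY = AESGCM.generate_key(bit_length=256)  # simplification for this sketch

def seal_cursor(plaintext_cursor: bytes, customer_id: str) -> bytes:
    """AEAD-encrypt the cursor. Binding the customer ID as
    associated data means one customer's token is useless
    (and undecryptable) for another customer."""
    nonce = os.urandom(12)
    sealed = AESGCM(KEY).encrypt(nonce, plaintext_cursor, customer_id.encode())
    return nonce + sealed

def open_cursor(token: bytes, customer_id: str) -> bytes:
    """Raises InvalidTag if the token was tampered with or was
    issued to a different customer."""
    nonce, ciphertext = token[:12], token[12:]
    return AESGCM(KEY).decrypt(nonce, ciphertext, customer_id.encode())

token = seal_cursor(b'{"s": "abc", "o": 40}', "c-acd223")
assert open_cursor(token, "c-acd223") == b'{"s": "abc", "o": 40}'
```

Tampering with any byte of the token, or presenting it with a different customer ID, makes decryption fail outright rather than leak another customer's pages.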
Anyway, that's all I can think of for now. Do you have any others that I missed?
I partially wrote this list for that sweet Twitter fame, but mostly so I have a reference for when my colleague @kmcquade3 asks why I'm implementing something a particular way.
Daniel had a really good bonus one: Access key IDs.
Having long-lived access key IDs in a regex-able format (e.g. AKIA[XX…]) lets you detect secret leaks.
GitHub even lets software vendors register their credential regexes and get webhooks on leaks == auto-revoke
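For illustration, a leak scanner for AWS-style access key IDs might look like this (the regex is the commonly used community pattern of `AKIA` plus 16 base32-ish characters, not an official AWS specification; the example key below is AWS's documentation placeholder):

```python
import re

# Long-lived AWS access key IDs start with 'AKIA' followed by
# 16 uppercase alphanumeric characters - easy to grep for.
ACCESS_KEY_RE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

def find_leaked_keys(text):
    """Return every candidate access key ID found in a blob of text."""
    return ACCESS_KEY_RE.findall(text)

leaked = find_leaked_keys("config: aws_access_key_id = AKIAIOSFODNN7EXAMPLE")
# ['AKIAIOSFODNN7EXAMPLE']
```

A GUID-shaped key would be indistinguishable from every other GUID in a repo; the fixed prefix is what makes automated scanning tractable.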
I regret to inform that I am extremely back on my bullshit.
I've been thinking about connectivity in unusual places. And I got to thinking: can I establish bidirectional connectivity over the Internet between two EC2 instances in private subnets without a third-party relay?
Typically you would expect that connectivity between instances A and B isn't possible - `ping` fails to yield responses after all. But it turns out that an instance with a public IP address in a VPC with an IGW attached can _receive_ traffic - it just can't respond to it.
This is because response packets are routed via the subnet's route table and would transit via the NAT GW with a different IP. That's why TCP handshakes fail and ICMP or UDP responses just get dropped on the floor of cyberspace.
(1/13) People seemed to like last week's thread about the silly IPv6 TOTP trick that's possible in AWS EC2. But then @donkersgood said I should do something useful instead.
So here are some useful things that are possible thanks to the AWS Gateway Load Balancer.
So far I've seen a few boring firewalls (no offence) and nothing else done with the GWLB. So I made a framework and a few sample apps to demonstrate what's possible.
I made a thing. stepfn.dev is a site for rapidly iterating on AWS Step Function designs. Change a few characters, hit Cmd+Enter, see result ~300ms later. Much faster feedback loop.
The other use is sharing SFNs on Twitter for when you need help.
1/4
It's the first website I've built in React and most of the buttons in the nav don't actually do anything yet. But I figure YOLO. It's all on GitHub so I'd appreciate some frontend help from anyone with more than my one day of experience.
2/4
For example, anyone know how to fix this padding?! I spent more than an hour trying to fix it and almost tore my hair out. I can't afford to do that, my hair is already thinning out at an unacceptably rapid pace.
I like the AWS CodeBuild curated environment Docker images, but I'd like to add a couple of things without having to modify the "install" step across all my projects.
Using the curated Dockerfile adds at least 5 mins to CodeBuild provisioning time. So what can we do to fix it?
I extracted the curated images' layers from the CodeBuild environment and published them to Docker Hub.
Using this as the base for my own custom images has eliminated the 5+ min delay, as CodeBuild no longer needs to gunzip 4GB of layers on every build.
Details here github.com/glassechidna/c… I've only bothered to do the Ubuntu images as that seems to be where AWS is steering us?