Ben Kehoe
27 Nov, 11 tweets, 2 min read
I want to talk a bit about what this was like.
tl;dr: it was *long* and inconvenient timing but, as an operations team, not particularly stressful. Questions of “when”, not how or if systems would come back. A lot of waiting and watching—and that’s desirable.
Wednesday, there was not much to do. AWS IoT was hit hard by the Kinesis outage, which meant lots of stuff was simply not going to work. And the CloudWatch outage meant we couldn’t see what was and wasn’t working cloud-side.
The resolution to the main outage came late in the evening. This meant we were up late because we needed to be there when we had visibility again into Lambda logs and DynamoDB metrics. If I had known that would be 6 hours later, I would have sent everyone to bed first.
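For concreteness, here is a minimal sketch of the kind of cloud-side check that only becomes possible again once CloudWatch recovers: pulling DynamoDB metrics and recent Lambda error logs. The table and function names are hypothetical placeholders, not real resources.

```python
# Sketch only: confirm DynamoDB and Lambda health once CloudWatch is back.
# "example-events-table" and "example-ingest-function" are hypothetical names.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")
logs = boto3.client("logs")

now = datetime.datetime.utcnow()
start = now - datetime.timedelta(hours=6)

# DynamoDB: check for throttling on a table that will absorb backfill traffic.
throttles = cloudwatch.get_metric_statistics(
    Namespace="AWS/DynamoDB",
    MetricName="ThrottledRequests",
    Dimensions=[{"Name": "TableName", "Value": "example-events-table"}],
    StartTime=start,
    EndTime=now,
    Period=300,
    Statistics=["Sum"],
)
print(sum(point["Sum"] for point in throttles["Datapoints"]))

# Lambda: query recent error logs once log delivery resumes.
# (In practice you poll get_query_results until the query status is Complete.)
query = logs.start_query(
    logGroupName="/aws/lambda/example-ingest-function",
    startTime=int(start.timestamp()),
    endTime=int(now.timestamp()),
    queryString="fields @timestamp, @message | filter @message like /ERROR/ | limit 20",
)
```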
Once we had that, we had to up-provision a couple of DDB tables for the extra backfilling traffic, and there was one issue we needed developer feedback for. So we were up early to meet with them to resolve it before higher traffic later in the day.
That ended up being a one-line code change and a tweak to an env var.
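The up-provisioning itself is just a control-plane call per table. A minimal boto3 sketch, with hypothetical table names and capacity numbers standing in for the real ones:

```python
# Sketch: raise provisioned throughput on tables expected to absorb backfill.
# Table names and capacity units below are hypothetical, not actual values.
import boto3

dynamodb = boto3.client("dynamodb")

for table_name, read_units, write_units in [
    ("example-device-state", 500, 2000),
    ("example-telemetry", 1000, 4000),
]:
    dynamodb.update_table(
        TableName=table_name,
        ProvisionedThroughput={
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    )
    # Wait for each table to return to ACTIVE before moving on.
    dynamodb.get_waiter("table_exists").wait(TableName=table_name)
```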
Then it was waiting on follow-on effects in AWS services to get worked out, which took another 8 hours before everything was back to normal.
The point here is that we were never feverishly working away trying to fix something that was broken. We were calmly and methodically evaluating the state of the system, and being on hand to take the next steps when something changed.
If this happened often, we’d probably have automated detection and notification of “something changed” for an incident and gone to sleep while waiting for that. This is the first incident of any kind that’s kept us up all night.
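As a sketch of what that automation could look like: a CloudWatch alarm whose state changes, in either direction, notify an on-call SNS topic. The alarm name, thresholds, and topic ARN here are hypothetical.

```python
# Sketch: an account-wide "something changed" alarm on Lambda errors that pages
# an on-call SNS topic on both ALARM and OK transitions. Values are hypothetical.
import boto3

TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:example-oncall-topic"

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="example-incident-errors-changed",
    Namespace="AWS/Lambda",   # no dimensions: aggregate across all functions
    MetricName="Errors",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[TOPIC_ARN],  # fires when errors start
    OKActions=[TOPIC_ARN],     # fires when they stop
)
```

Notifying on both the ALARM and OK transitions is what makes this a "something changed" pager rather than just an error pager.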
All of this is to say that this experience, unpleasant as it was, changes none of my thinking about how to build or who to build it on.
With #serverless, you’re trading the ability to take positive action in any given incident *for vastly fewer incidents*.
For the operations team during an incident, you’re getting rid of a major source of stress, because the majority of the responsibility for fixing what’s broken is on the experts in the given system that’s broken. They are good at fixing it, better than we would be.
I’m absolutely not saying you shouldn’t make changes in your architecture in response to this outage. But be deliberate about it, and always focus on the TCO of those decisions.

More from @ben11kehoe

30 Jun
When doing infra as code, you want to plan your deploys around what infra state—that is, resource graphs—you want to achieve, not around what graph manipulation your tooling restricts you to. To that end, here's a maturity model for CloudFormation support for AWS services 1/10
Stage 1: CloudFormation support at launch, for every launch. For cloud-native customers, if it's not in CloudFormation, it doesn't exist. This should be table stakes at this point. But it's not, so let's make it a milestone. 2/10
But CloudFormation support at launch isn't the real story. CloudFormation resource designs are constrained by the underlying control plane API of the service, and these are often designed long before CloudFormation support is even thought about. 3/10
Read 10 tweets
8 Mar
AWS SSO has a naming problem. It’s quickly going to be the best way to enable access into AWS *using your own single sign-on identity provider for authentication.* Its capability to be an IdP or single sign-on provider to other apps is not where its value lies for most people 1/
Today, you need your IdP to understand everything about access into AWS—the mapping of people to accounts and roles, because that information needs to get packaged up in a SAML assertion. Size limits on the SAML assertion make sensible multi-account setups hard. 2/
AWS SSO moves those mappings into AWS. Authentication uses your IdP—you sign into your own single sign on system, but the SAML that gets sent over is just your identity; the access you’re granted is maintained in AWS SSO. 3/
Read 10 tweets
26 Jan
AWS has been adding a lot of features to use OAuth directly with API Gateway, skipping Cognito Identity Pools and AWS IAM. I think this is regressive. A lot of useful functionality is coming out of it, but we should hope to get that IAM-side instead.
For example, using OAuth, you can define allowed scopes on a given route (resource+method). Now you've got attribute-based access control in your app. Great! Except all the excellent tooling around AWS IAM is no longer available to you.
Or, you can use a Lambda authorizer to create a policy based on the token contents. It's super flexible, you can accomplish complicated authorization scenarios. But now you own the authorization system, including all the security monitoring and operations associated with that.
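For anyone who hasn’t written one: a Lambda token authorizer is just a function that inspects the incoming token and hands API Gateway an IAM-style policy. A minimal sketch, with the token check stubbed out (a real authorizer would validate a JWT’s signature, expiry, and scopes):

```python
# Sketch of a Lambda TOKEN authorizer for API Gateway. The token check below is
# a placeholder; real code would verify a JWT against your issuer.
def handler(event, context):
    token = event.get("authorizationToken", "")

    # Hypothetical check standing in for signature/expiry/scope validation.
    allowed = token.startswith("allow-")

    return {
        "principalId": "example-user",
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": "Allow" if allowed else "Deny",
                    "Resource": event["methodArn"],
                }
            ],
        },
    }
```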
Read 8 tweets
21 Dec 19
Extra excited about this now. sso:GetRoleCredentials takes account and role name parameters (weird that it's not a role ARN but whatever). This looks to me like the client is in control of what IAM principal the user will become. This is a good thing! ⏬1/9
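The thread is about calling these APIs from JavaScript in a single-page app, but the call shape is easiest to show with boto3. The access token comes from the OIDC device flow; the account ID and role name below are hypothetical.

```python
# Sketch: exchange an AWS SSO access token for temporary role credentials.
# Note the API takes a role *name* and account ID, not a role ARN.
import boto3

sso = boto3.client("sso", region_name="us-east-1")

creds = sso.get_role_credentials(
    accessToken="<token from the sso-oidc device authorization flow>",
    accountId="123456789012",          # hypothetical account
    roleName="ExampleDeveloperRole",   # hypothetical permission set role
)["roleCredentials"]

# The temporary credentials then back an ordinary session.
session = boto3.Session(
    aws_access_key_id=creds["accessKeyId"],
    aws_secret_access_key=creds["secretAccessKey"],
    aws_session_token=creds["sessionToken"],
)
```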
A great enabling capability for an org is creating single page apps w/ serverless backends, and now with these APIs in JS those apps can use SSO. One of the difficulties with these apps is that there may be many copies of a given app, for dev or just in many different accts 2/9
You should be able to stand up a copy of an app, go to its front page, and log in with SSO. Most SSO flows assume well-known URLs that you register with the SSO system, but with these AWS SSO APIs and the device flow, that's no longer a problem. 3/
Read 10 tweets
14 Jun 19
One of @awscloud's biggest weaknesses is that while product teams strive to create great experiences for their services, AWS users have to use many teams' services and there's very little accountability for the cross-AWS-team experience.
AWS is the best in the world at creating autonomous teams. We’re asking AWS to figure out how to scale their customer obsession in a way that cross-team feedback is appropriately prioritized.
We’re glad docs are in GitHub, but we want PRs to get addressed faster. We shouldn’t have to go to 90 individual teams to beg them to do that! There should be customer advocates at AWS who see what we see, and are empowered to help make it happen.
Read 6 tweets
18 Feb 19
Statelessness is not the critical property of #serverless compute, it's ephemerality. Being positively limited in duration means the provider can *transparently* manage the platform, no scheduled (or unscheduled, in Fargate's case) downtime needed.
Currently, most serverless compute is time-limited, run in response to fixed-size(-ish) events. But finite-input-limited compute is possible; this is AWS Batch's model.
Several academic papers, including the recent one from Berkeley, have equated serverless with existing FaaS models and complained of its inapplicability to big data processing. I think this is a failure of imagination.
Read 5 tweets
