Tasdik Rahman
Aug 28, 2019 · 27 tweets · 26 min read
Started reading @Google's #SRE book. Sharing some insights from it so far; will keep updating the thread as I finish the chapters. (1/n)
The initial chapter touches upon the idea that fixes which depend on human intervention have to scale linearly as the product and its scale grow, which is not sustainable. Building systems that take over the hand-holding #sysadmins do is a radical idea. (2/n)
100% uptime is probably never the right reliability target: not only is it impossible to achieve, it’s typically more reliability than a service’s users want or notice. The target should match the profile of the service to the risk the business is willing to take. (3/n)
Error budgets make it easier to decide the rate of releases and to defuse discussions about #outages with stakeholders. They also let multiple teams reach a consensus faster, because decisions are made by looking at data points from the #monitoring system. (4/n)
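A rough sketch of how an error budget could drive the release decision, assuming a request-based SLO. The function names and the numbers in the usage note are made up for illustration; they are not taken from the book.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Compute the error budget for a request-based SLO.

    slo is the target success ratio, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    # The budget is the share of requests that is allowed to fail.
    allowed = (1.0 - slo) * total_requests
    return {
        "allowed_failures": allowed,
        "remaining_failures": allowed - failed_requests,
    }


def can_release(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Hypothetical policy: keep releasing only while some budget is left."""
    budget = error_budget(slo, total_requests, failed_requests)
    return budget["remaining_failures"] > 0
```

With a 99.9% SLO over 1,000,000 requests, roughly 1,000 failures are allowed. 400 failures leaves budget to keep shipping; 1,500 failures means the budget is spent and the data, not opinion, says to slow down.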
This aligns the #SRE team, which pushes for reliability, with the product teams, which push for feature releases. Both matter: an unreliable system hurts customer experience, while a lack of new product features may lead customers to other products. (5/n)
Keeping toil in check is emphasized repeatedly in this chapter, and it makes sense once you have observed it first hand. As the org grows, unchecked toil will eventually take up the whole team's bandwidth. landing.google.com/sre/sre-book/c… (6/n)
It has been wonderfully phrased here: "If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow." landing.google.com/sre/sre-book/c… (7/n)
But what is toil? Work which is manual, repetitive, automatable, tactical, has no enduring value (the system remains in the same state after you do it) and grows O(n) with the service has a high chance of being categorized as toil. (8/n)
Too many pages? The #oncall engineer will start second-guessing them and may even miss a real page that's masked by the noise. Moreover, paging an employee is expensive as it breaks their workflow. Minimal noise and good signal are the sign of a mature #monitoring system. (9/n)
Rules that generate alerts for humans should be simple to understand and represent a clear failure. The "what's broken" indicates the symptom; the "why" indicates a possible cause. #SREBook #SRE (10/n)
Automating thoughtlessly will create as many problems as it solves. Even though software-based automation is better than manual work, the sweet spot is an autonomous system which self-heals. #SREBook #SRE (11/n)
What's the value of having automation? It brings consistency, a platform to build on, faster repairs and faster action, all of which save time. #SREBook #SRE (12/n)
Actions like changing a server's resolv.conf are repetitive, and as the number of machines grows, the chances of something going wrong in a manual task also go up. This inevitable lack of consistency leads to mistakes, issues with data quality and reliability problems. #SREBook #SRE (13/n)
If designed and done properly, automated systems also provide a platform which you can build on and extend, and which can be shared among numerous teams. A bug fixed once in this platform is fixed for all the other teams too. #SREBook #SRE (14/n)
A very good example of a platform for automation is @kubernetesio. In our own case at @gojektech, it is github.com/gojek/proctor, on top of which we have numerous automation scripts for teams to use. #SREBook #SRE (15/n)
Faster recovery? Typically, you would get paged, someone would ack it and analyse what went wrong, and then an action would be taken. With an autonomous, self-healing system, the MTTD & MTTR would be lower than with a human in most cases. #SREBook #SRE (16/n)
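One way to picture self-healing is a reconcile step that compares the desired state with the observed state and emits corrective actions, with no page involved. This is only a minimal sketch in the spirit of a control loop; the service names, the action tuples and the `reconcile` function itself are hypothetical.

```python
from typing import Dict, List, Tuple


def reconcile(desired: Dict[str, int], observed: Dict[str, int]) -> List[Tuple[str, str, int]]:
    """Return the actions needed to move observed instance counts to the desired ones."""
    actions = []
    for name in sorted(set(desired) | set(observed)):
        want = desired.get(name, 0)
        have = observed.get(name, 0)
        if have < want:
            # Too few healthy instances: start the missing ones.
            actions.append(("start", name, want - have))
        elif have > want:
            # Instances nobody asked for: stop the extras.
            actions.append(("stop", name, have - want))
    return actions
```

Run on a schedule, a step like this detects and repairs drift in one pass, which is where the lower MTTD and MTTR come from. When observed already matches desired, it returns no actions at all.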
One of the big reasons for having automation (if done properly) is that it saves the human toil required to hand-hold systems, which would otherwise eat all your time as scale and complexity increase. #SREBook #SRE (17/n)
Automation of something is a gradual path: no automation -> hacky script in your home directory -> script is shared among other devs -> it gets improved and added to your automation platform -> system is self-sufficient/autonomous/self-healing #SREBook #SRE (18/n)
Automation for something specific should be owned by the owners of the service which uses it. The reason? They have the highest incentive to maintain it, fix bugs and add features. If left with someone else, it will stagnate. #SREBook #SRE (19/n)
If possible, the automation should be idempotent. If it fails in the middle for some reason, it should be able to pick up from where it left off, and re-running the steps it has already done should be safe. Off-the-shelf tools like @ansible do try to follow that. #SREBook #SRE (20/n)
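A minimal sketch of what idempotent means in practice, using the resolv.conf case from earlier in the thread. The helper `ensure_line` is hypothetical and is not how @ansible implements it; it only illustrates that running the same step twice leaves the system in the same state.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file; return True only if the file changed."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        # Already in the desired state, so a re-run is a no-op.
        return False
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True
```

The first call adds the line and reports a change; every later call with the same arguments reports no change and leaves the file untouched. A blind append, by contrast, would add a duplicate entry on every retry.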
Running reliable services requires reliable release processes. The binaries and configurations need to be built in a reproducible, automated way, which makes releases repeatable rather than unique snowflakes. #SREBook #SRE (21/n)
To scale, teams should be self-sufficient, which can be achieved with processes and tooling that enforce the agreed policies. #SREBook #SRE (22/n)
Ideally, each team should be able to decide how it does its deployments depending on its service. User-facing apps would in most cases require a gradual rollout, while someone else might be OK with a full deployment. The deployment tool should be flexible. #SREBook #SRE (23/n)
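A small sketch of how one tool could serve both styles, assuming each team hands it a list of cumulative rollout percentages. The `rollout_batches` function and the host names in the usage note are made up for illustration.

```python
from typing import Iterator, List, Sequence


def rollout_batches(hosts: List[str], stages: Sequence[int] = (5, 25, 100)) -> Iterator[List[str]]:
    """Yield successive batches of hosts; each stage is a cumulative percentage."""
    done = 0
    for percent in stages:
        # Round up so that a small fleet still gets at least one canary host.
        target = min(len(hosts), (len(hosts) * percent + 99) // 100)
        if target > done:
            yield hosts[done:target]
            done = target
```

For 20 hosts, the default stages yield batches of 1, 4 and 15 hosts, so a bad release is caught while it is still on a single machine. A team that is fine with a full deployment passes `stages=(100,)` and gets everything in one batch.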
When equipped with the right tools, proper automation, and well-defined policies, developers and SREs shouldn’t have to worry about releasing software. Releases can be as painless as simply pressing a button. #SREBook #SRE (24/n)
Software should be boring! It should be predictable and not spring surprises. A simpler, minimal API is also a hallmark of a well-understood problem. Less is more in software. Dead code should be removed religiously, as it's a ticking time bomb. #SREBook #SRE (25/n)
The lower the coupling between pieces of software, the more fearlessly and confidently you can release. Measuring the effect of a single change you pushed is always easier than measuring the effect of pushing 100 changes. #SREBook #SRE (26/n)
Smaller changes also make it easier to debug and to pinpoint what caused a regression faster. Hence software simplicity is a prerequisite to reliability. #SREBook #SRE (27/n)

