Viach Kakovskyi @BackendAndBBQ, 17 tweets, 4 min read
#BackendAndBBQThread n.0

One ❤️ = one fact 💡about running backend software in production 🔥

We're starting in less than 24 hours 🍿
1. I see more and more teams following the "You Build It, You Run It" motto started by @awscloud. It works great: successful teams care not only about writing good code but also about how that code runs in the production environment.

More: safaribooksonline.com/library/view/p…
2. I think teams should know how their software will run in production from the very beginning of the project.
That includes knowledge about the infrastructure, data stores, monitoring, and the deployment pipeline.
💰 It's also good to know the budget for all of these things.
3. Today I see the following levels of maturity for running software in production:

1) Software Firefighting 🔥
2) Reactive Monitoring 👨‍🏭
3) Proactive Remediation 🧐
4) Preventive Maintenance 🦉
4. Software Firefighting is where engineers usually start. No experience is needed.

You write some code, push it to production, and hope that everything works fine 🙏
Yes, you do not have much information about how the system behaves or how healthy it is.
5. Software Firefighting is OK when it's not a big deal if your service does not operate properly. My examples:

1) Student projects
2) Pet projects
3) Hackathons

But I used to work in an environment where business-related applications did not have any telemetry.
6. I used to work in an environment where business-related applications did not have any telemetry.

My boss called me at 11 PM to tell me that the 💩 software was not working.
Not good.
After that, I jumped via SSH onto the box that runs the software to understand what was going on...
7. In Software Firefighting mode, you can try to reproduce the issue in production to see something useful in the logs.
Another option, if your logs are not very verbose or you want to go to a lower level, is to attach a debugger to the running process.

More: gnu.org/software/gdb/
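
When the logs are not verbose enough, one alternative to attaching a debugger is to let the running process raise its own log level on demand. Below is a minimal Python sketch of that idea, not something taken from the thread. It assumes a Unix-like host where SIGUSR1 is free for the application to use; the logger name `service` and the function names are hypothetical.

```python
import logging
import signal

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("service")  # hypothetical logger name


def toggle_debug(signum, frame):
    """Flip the root logger between DEBUG and INFO without a restart."""
    root = logging.getLogger()
    new_level = logging.INFO if root.level == logging.DEBUG else logging.DEBUG
    root.setLevel(new_level)
    log.warning("log level switched to %s", logging.getLevelName(new_level))


def install_debug_toggle():
    """Register the handler; call this once from the main thread at startup."""
    # Assumption: SIGUSR1 is not used for anything else in this process.
    signal.signal(signal.SIGUSR1, toggle_debug)
```

After `install_debug_toggle()` has run, sending `kill -USR1 <pid>` from the shell switches the process to DEBUG logging, and sending it again switches back to INFO.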
8. When you find the root cause, you experiment on the live production instance and try to patch the code. Eventually you create a PR that starts with the prefix `hotfix` and deploy it straight to production without proper peer review.

🤠👿

I did that. It's bad.
9. The pros of Software Firefighting - the cowboy style of running production software:

➕ Pros:
1) Heroism. You feel like a hero rescuing your business from disasters.
2) Cheap. No investments in your infrastructure are needed.

🤠🐮
10. Software Firefighting - Cons (part 1):
1) Affects the reputation of your business. You're notified only when customers or business owners find that the service does not work.
2) Time-consuming. You need to reproduce the issue in prod to gather telemetry right on the box.
11. Software Firefighting - Cons (part 2):

3) Not accountable. Hotfixes applied in production might never make it into the repository, and the other engineers on your team might learn nothing about how to fix such issues.
4) Stressful. Dangerous to your mental and physical health

☹️
12. Software Firefighting - Cons (part 3):

5) Non-cooperative. It's hard to hand over work if you need to step out.
I do not recommend running business software applications in the Firefighting mode.

To remove the disadvantages of the approach, we need to go to the next level.
13. But before moving to the next level - some tips about Software Firefighting.

1) Collect live stats - it's easier if you're trained.
2) Localize the problem.
3) Find a similar issue in the registry of incidents/bugs.
4) Learn how to reproduce it.
5) Make a hotfix.
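
As an illustration of tip 1, here is a small Python sketch that grabs a snapshot of host-level stats using only the standard library. It is an assumption-laden example rather than the author's tooling: `collect_live_stats` and the chosen fields are hypothetical, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os
import shutil
import time


def collect_live_stats(path="/"):
    """Return a small snapshot of host-level stats (Unix-like systems only)."""
    load1, load5, load15 = os.getloadavg()  # raises OSError if unavailable
    disk = shutil.disk_usage(path)
    used_pct = round(100 * disk.used / disk.total, 1) if disk.total else 0.0
    return {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "load_avg_1m": load1,
        "load_avg_5m": load5,
        "load_avg_15m": load15,
        "disk_used_pct": used_pct,
    }


if __name__ == "__main__":
    for key, value in collect_live_stats().items():
        print(f"{key}: {value}")
```

Running this first during an incident gives a quick baseline before reaching for heavier tools.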
14. Tools to practice with to become a better Software Firefighter:

1) gdb
2) top/prstat/iostat
3) tcpdump
4) strace/ktrace/truss
5) ltrace
6) dtrace
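
Not every tool in this list exists on every operating system, so it can help to check what is actually installed on a box before an incident. A minimal Python sketch, assuming the tools are looked up on the current `PATH`; the function name `available_tools` is hypothetical.

```python
import shutil

# Tool names taken from the list above; availability depends on the OS.
TOOLS = ["gdb", "top", "prstat", "iostat", "tcpdump",
         "strace", "ktrace", "truss", "ltrace", "dtrace"]


def available_tools(names=TOOLS):
    """Map each tool name to its path on PATH, or None if it is not installed."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in available_tools().items():
        print(f"{name:10} {path or 'not installed'}")
```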

That's it about firefighting 👨‍🚒👩‍🚒🔥.
We still have 3 other maturity levels to cover.
15. Reactive Monitoring is the 2nd level of maturity.

We can change our infrastructure to collect telemetry about services automatically instead of doing this manually only during incidents 💥

Purpose:
- Save time
- Remove the need to reproduce the issue in production 🏴‍☠️
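
To make "collect telemetry automatically" concrete, here is a minimal Python sketch of a decorator that records call counts, error counts, and cumulative latency for every call, not only during incidents. It is an illustration under assumptions: the in-process `METRICS` dictionary, the `instrumented` decorator, and the `get_user` function are all hypothetical, and a real setup would presumably ship these numbers to a metrics backend.

```python
import functools
import time
from collections import defaultdict

# In-process store for illustration only (assumption: no external backend).
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(name):
    """Record call count, error count, and cumulative latency under `name`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                METRICS[name]["errors"] += 1
                raise
            finally:
                # Runs for both successful and failed calls.
                METRICS[name]["calls"] += 1
                METRICS[name]["total_seconds"] += time.perf_counter() - started
        return wrapper
    return decorator


@instrumented("get_user")  # hypothetical operation name
def get_user(user_id):
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    return {"id": user_id}
```

With this in place, the numbers for `get_user` are already there when someone reports a problem, so there is no need to reproduce the failure just to measure it.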
16. When an incident happens and our customers report that the service does not operate properly, 99% of the information needed for the investigation should already be available.

I recommend making the following bits of data accessible for your backend engineering team (part 1)