One ❤️ = one fact 💡 about running backend software in production 🔥
We're starting in less than 24 hours 🍿
More: safaribooksonline.com/library/view/p…
It includes knowledge of infrastructure, data storage, monitoring, and the deployment pipeline.
💰 It's also good to know the budget for all of these things.
1) Software Firefighting 🔥
2) Reactive Monitoring 👨‍🏭
3) Proactive Remediation 🧐
4) Preventive Maintenance 🦉
You write some code, push it to production, and hope that everything works fine 🙏
Yes, you do not have a lot of information about how the system behaves and how healthy it is.
1) Student projects
2) Pet projects
3) Hackathons
But I used to work in an environment where business-related applications did not have any telemetry.
My boss called me at 11 PM to tell me that the 💩 software was not working.
Not good.
After that, I jumped onto the box that runs the software via SSH to understand what's going on...
Another option, if your logs are not very verbose or you want to go to a lower level: attach a debugger to a running process.
More: gnu.org/software/gdb/
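A minimal sketch of what attaching can look like, assuming gdb is installed on the box and you are allowed to trace the process. The helper name `gdb_backtrace_command` is hypothetical; `-p`, `-batch`, and `-ex` are regular gdb command-line options. The helper only builds the command, so nothing gets attached until you actually run it.

```python
import shlex


def gdb_backtrace_command(pid: int) -> list[str]:
    """Build a gdb command that attaches to a running process,
    prints a backtrace of every thread, and then exits."""
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return [
        "gdb",
        "-p", str(pid),                # attach to the running process
        "-batch",                      # exit after running the commands below
        "-ex", "thread apply all bt",  # dump the stack of every thread
    ]


if __name__ == "__main__":
    # 12345 is a placeholder PID; look up the real one with ps or top.
    print(shlex.join(gdb_backtrace_command(12345)))
```

Keep in mind that the process is paused while gdb is attached, so on a production box even a quick backtrace is visible to users.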
🤠👿
I did that. It's bad.
➕ Pros:
1) Heroism. You feel like a hero rescuing your business from disasters.
2) Cheap. No investments in your infrastructure are needed.
🤠🐮
1) Affects the reputation of your business. You're notified only when customers/business owners find that the service does not work.
2) Time-consuming. You need to reproduce the issue in prod to gather telemetry right on the box.
3) Not accountable. Hotfixes made in production might never make it into the repository, and the other engineers on your team might learn nothing about how to fix such issues.
4) Stressful. Dangerous to your mental and physical health.
☹️
5) Non-cooperative. It's hard to hand over work if you need to step out.
I do not recommend running business software applications in the Firefighting mode.
To remove the disadvantages of the approach, we need to go to the next level.
1) Collect live stats - it's easier if you're trained.
2) Localize the problem.
3) Find a similar issue in the registry of incidents/bugs.
4) Learn how to reproduce that.
5) Make a hotfix.
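For the first step, here is a rough sketch of grabbing a few live stats with nothing but the Python standard library. It assumes a Unix-like box (`os.getloadavg` is not available on Windows); the function name `live_snapshot` and the choice of fields are my own illustration, not a standard.

```python
import os
import shutil
import time


def live_snapshot(path: str = "/") -> dict:
    """Collect a few live stats about the box (Unix-like systems only)."""
    load1, load5, load15 = os.getloadavg()  # load averages: 1, 5, 15 minutes
    disk = shutil.disk_usage(path)          # total/used/free bytes for `path`
    used_pct = round(100 * disk.used / disk.total, 1) if disk.total else 0.0
    return {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "load_1m": load1,
        "load_5m": load5,
        "load_15m": load15,
        "disk_used_pct": used_pct,
    }


if __name__ == "__main__":
    for key, value in live_snapshot().items():
        print(f"{key}: {value}")
```

A load average well above `cpu_count` or a nearly full disk is often enough to decide where to look next.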
1) gdb
2) top/prstat/iostat
3) tcpdump
4) strace/ktrace/truss
5) ltrace
6) dtrace
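Not every box has all of these installed, and some are OS-specific, so it can help to check what you have before an incident. A small sketch using `shutil.which`; the helper name `available_tools` is hypothetical, and the tool names are simply the ones from the list above.

```python
import shutil

# Tool names taken from the list above. Availability differs per OS:
# for example, prstat and truss come from Solaris, ktrace from the BSDs.
TOOLS = ["gdb", "top", "prstat", "iostat", "tcpdump",
         "strace", "ktrace", "truss", "ltrace", "dtrace"]


def available_tools(tools=TOOLS) -> dict:
    """Map each tool name to its location on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in tools}


if __name__ == "__main__":
    for name, location in available_tools().items():
        print(f"{name:8} {location or 'not installed'}")
```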
That's it about firefighting 👨‍🚒👩‍🚒🔥.
We still have 3 other maturity levels to cover.
We can change our infrastructure to collect telemetry about services automatically instead of doing this manually only during incidents 💥
Purpose:
- Save time
- Remove the need to reproduce the issue in production 🏴‍☠️
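To make "automatically" concrete, here is a minimal sketch of a decorator that records call count, error count, and latency on every call, so the numbers already exist when an incident starts. Everything in it is an assumption for illustration: the `METRICS` dict stands in for a real metrics backend, and `instrumented` and `handle_request` are hypothetical names.

```python
import functools
import time
from collections import defaultdict

# In-memory store for the sketch; a real setup would ship these numbers
# to a metrics backend instead (which one is up to you).
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(func):
    """Record call count, error count, and total latency for `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        stats = METRICS[func.__name__]
        started = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            stats["errors"] += 1
            raise  # telemetry must not swallow the original error
        finally:
            stats["calls"] += 1
            stats["total_seconds"] += time.perf_counter() - started
    return wrapper


@instrumented
def handle_request(payload: dict) -> str:
    # Hypothetical request handler used only to demonstrate the decorator.
    if "user" not in payload:
        raise KeyError("user")
    return f"hello, {payload['user']}"
```

After a few calls, `METRICS["handle_request"]` holds the counters without anyone having to log in to the box and reproduce the problem.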
I recommend making the following bits of data accessible for your backend engineering team (part 1)