Arun Kumar Singh is now talking about a journey to self-healing infrastructure. #DevOpsDays He’s an SRE at Adobe, where each product has ~30k servers. If each server alerts once per week, that’s 30k alerts a week for the poor SRE! 😱 What to do?
Auto-remediation (self healing) is a workflow triggered by an alert or event, which fixes the problem. Simplest example: the server is off, turn it on again. #DevOpsDays Slide: general auto-remediation workflow
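A minimal sketch of that general workflow, using the "server is off, turn it on again" example. Everything here is illustrative rather than from the talk: `Server`, `Alert`, `REMEDIATIONS` and `handle_alert` are hypothetical names, and the server is just an in-memory object.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Server:
    name: str
    powered_on: bool = True


@dataclass
class Alert:
    kind: str
    server: Server


def power_on(server: Server) -> bool:
    """Simplest remediation: the server is off, so turn it on again."""
    server.powered_on = True
    return server.powered_on


# Hypothetical mapping from alert kind to the action that fixes it.
REMEDIATIONS: Dict[str, Callable[[Server], bool]] = {
    "server_down": power_on,
}


def handle_alert(alert: Alert) -> str:
    """Run the matching remediation, or escalate to a human if none exists."""
    action = REMEDIATIONS.get(alert.kind)
    if action is None:
        return "escalated"
    return "remediated" if action(alert.server) else "escalated"
```

Alerts with no registered remediation fall through to "escalated", so a human still sees anything the workflow cannot fix.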
Arun’s team used to use SaltStack for auto-remediation. Pros: already used within Adobe; easy access management with LDAP; easy secret management; based on Python and yaml, so easy to extend. #DevOpsDays
SaltStack cons: with multiple projects and large infrastructure (=heavy traffic), there was a high chance of events getting stuck in the event bus. People didn’t want to add an extra config mgmt tool. No automatic ChatOps support. #DevOpsDays
The next solution they tried was Rundeck. 🤖 Pros: Rundeck was already used in Adobe for deployments/self-service tasks; uses LDAP; no need for clients to install the agent; a rich API and ecosystem of plugins; *exceptional* role-based access control. 😄 #DevOpsDays
But the big con with Rundeck: no way to run complex workflows with if-else conditions. These had to be handled in the execution script. #DevOpsDays
This is turning into the story of Arun and the three bears. 👨🏾‍💻🐻🐻🐻 Finally, they decided on a StackStorm-based auto-remediation solution. Yay, complex workflows are easy here! #DevOpsDays
With StackStorm, all the possible scenarios that could cause an outage can be checked in parallel, saving time. It uses RabbitMQ/Kafka for streaming messages (so no event bus traffic jams). Also has chatops support, an active community, etc. #DevOpsDays
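To illustrate the idea of checking every possible cause in parallel, here is a rough sketch in plain Python. It is not StackStorm's workflow syntax; the check functions and `failing_checks` are made-up placeholders standing in for real probes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def check_disk() -> bool:
    return True  # placeholder: pretend the disk is healthy


def check_process() -> bool:
    return False  # placeholder: pretend the process has died


def check_network() -> bool:
    return True  # placeholder: pretend the network is reachable


# Hypothetical set of scenarios that could explain an outage.
CHECKS: Dict[str, Callable[[], bool]] = {
    "disk": check_disk,
    "process": check_process,
    "network": check_network,
}


def failing_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run all checks concurrently and return the names of those that failed."""
    with ThreadPoolExecutor(max_workers=max(1, len(checks))) as pool:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        return sorted(name for name, fut in futures.items() if not fut.result())
```

With real probes that each take a few seconds, the total wait is roughly the slowest check rather than the sum of all of them, which is where the time saving comes from.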
What’s the con?
Role-based access control is only available in the paid version. :( Arun’s team are now writing their own version. #DevOpsDays
Monitoring and analytics can show you where the pain points in your systems are. One server or process using loads of resources? You should fix that. But with auto-remediation, it’s not breaking things for your customers in the meantime. #DevOpsDays
When Arun’s team showed this application to different people, they came up with various suggestions for other uses for it. #DevOpsDays Auto-scaling? Scheduling tasks? Connecting microservices? Use the data for correlation, anomaly detection or machine learning?
With the system they built using StackStorm, ~70% of outages get fixed by themselves! They have 40% fewer customer service outages and save 4 SRE hours per day. :) #DevOpsDays
Arun’s final advice: don’t celebrate adding more auto-remediation workflows. They’re just a bandaid for deeper problems. Celebrate when you remove them! #DevOpsDays
Q: you changed your mind twice on the technology you use for this. How painful was it?

A: very painful! But we had to do it; you have to make decisions based on what you want for the future, not just right now. #DevOpsDays
Q: can you tell us about some of the 30% of cases that can’t be automatically remediated?

A: mostly cases where there are lots of complicated dependencies between microservices, or between clouds (Adobe has three different clouds). #DevOpsDays
Ah, I just got Arun’s Twitter handle. Thank you for the informative talk, @arun_2803! #DevOpsDays
Q: you do have a feedback loop to fix the problems that lead to these auto-remediation events, right?

A: yes, we make sure that every auto-remediation event is linked to an SRE or developer ticket. #devopsdays
That was actually only about 1% of the question, because the questioner went on and on, answering his own questions (which @arun_2803 had already covered). Not cool. Conference staff managing Q&A should be bolder about cutting questioners off!
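The feedback loop from that answer (every auto-remediation event linked to an SRE or developer ticket) could be tracked along these lines. This is a sketch under assumptions: `RemediationLog`, its methods, and the ticket and event IDs are hypothetical, and nothing here reflects Adobe's actual tooling.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RemediationLog:
    # Maps a ticket ID to the auto-remediation events filed against it.
    tickets: Dict[str, List[str]] = field(default_factory=dict)

    def record(self, event_id: str, ticket_id: str) -> None:
        """Refuse to log an event that is not linked to a ticket."""
        if not ticket_id:
            raise ValueError("every auto-remediation event needs a ticket")
        self.tickets.setdefault(ticket_id, []).append(event_id)

    def noisiest(self) -> str:
        """Return the ticket with the most events: the root cause to fix first."""
        return max(self.tickets, key=lambda t: len(self.tickets[t]))


log = RemediationLog()
log.record("evt-1", "SRE-101")
log.record("evt-2", "SRE-101")
log.record("evt-3", "DEV-7")
```

Counting events per ticket also gives a way to see which workflows are ready to be removed once the underlying problem is fixed.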
Thread by Rae Knowler.