It's been nearly a week since ~400 companies have been unable to use any @Atlassian products, like JIRA and Confluence. I've talked to several impacted teams, and they are upset at how poorly Atlassian is handling the biggest outage these teams have experienced.
A thread on what Atlassian needs to fix and why:
1. Outages happen, no matter how hard you try to avoid them. No one should be upset about this incident, nor search for who is to blame (apparently, a maintenance script).
What matters is what happens AFTER the incident is discovered.
2. Initially, Atlassian did just fine at notifying customers that something was wrong. They posted updates after the incident started.
3. However, 6 hours later, the incident was still ongoing.
This was strange... because according to Atlassian's own protocol, they are able to restore data for services like JIRA in 6 hours: atlassian.com/trust/security…
4. The cause of the outage is accidental data deletion. Happens. Should be easy for Atlassian to restore. Quote from them:
"Atlassian tests backups for restoration on a quarterly basis."
5. Ok, so Atlassian apparently has an outage where their recovery is not working. Not great, but happens.
What should you do in this case?
Tell customers what is happening.
Customers tell me it has mostly been radio silence for 6 days. This is not ok.
6. Mission-critical customer infrastructure. Several impacted customers went all-in on Atlassian, including using @Opsgenie for their on-call alerting (it's like @pagerduty, but by Atlassian).
For them, OpsGenie is also down. Atlassian offers no workaround even for this.
7. Finally, 6 days into the outage, some customers received communication.
It was an update on a ticket that told them...
"Wait more. A lot more."
That's it. No alternatives offered. Just "wait". A week into the outage. As a paying customer.
8. To rub salt in the wound, customers using on-premises JIRA installations have no such issues (the outage is specific to Atlassian Cloud).
However, Atlassian discontinued Server products, claiming the Cloud is more reliable. These customers sure don't feel it is.
9. So what should Atlassian have done differently? A lot.
A) Communicate to the world about what is happening. The official Twitter account has not tweeted in 4 days (!!). In the middle of a massive Atlassian outage? This was the last tweet.
A) (Cont'd). No Atlassian exec has issued any statement.
When @Cloudflare has issues much smaller than this, @eastdakota communicates rapidly. Take the Okta breach: a few hours in, they already had updates going out:
B) There are "only" ~400 companies impacted. Yet most of them are in the dark.
Give them updates!
Tell them the root cause so they don't ping me for it (yes, I've told several impacted customers the actual root cause, which I know from an employee).
C) Offer alternatives to "wait for ~2 more weeks until you can use *any* Atlassian products"!
Some customers just want OpsGenie back. Some want certain Confluence docs. Give them options. Offer to bring back some services earlier.
Give them SOMETHING other than "wait".
D) Start your public postmortem.
Remember when @gitlab lost customer data? I do. They livestreamed how they mitigated the outage and then posted a very detailed postmortem afterward: about.gitlab.com/blog/2017/02/1…
E) Acknowledge the incident & confirm that you take responsibility. Explain why the "How Atlassian does Resilience" article does not apply, and why the restoration SLAs are broken. How will customers be compensated?
Why should future customers trust Atlassian if this is not addressed?
F) Call out the good work your engineering teams are doing.
People are working round the clock. Use your reach, like the @Atlassian handle, to share what is happening.
I hear about this from backchannels. Why not from @Atlassian?
G) Know what is on the line. This is not just about impacted customers. The eng community is watching how this outage is being handled. Decision-makers are taking notes. People are asking: "It could have been us. Do we have a plan B?"
Atlassian's reputation is on the line.
But please, start with your customers. They deserve better. Talk with them. Communicate directly. Give them alternatives while they wait.
Unfortunately, impacted customers are telling me that @Atlassian is not doing what it communicates publicly.
This is from a company that has been down since 5 April. Atlassian, why are you not talking with your own paying customers? Why are you not offering alternatives? Shame…
• • •
A major reason I don't engage with web3/crypto/NFT in any way:
The space attracts too many people wanting to make a quick buck. Opportunists. Scammers. Con artists.
When you have a crowd of these people, all behind pseudonymous identities... why on earth would you engage?
In my DMs I'm now regularly getting messages from people with monkey avatars, claiming they work at Big Tech and wanting my take/help on web3.
My take: I have no way to tell if they are scammers, just wasting my time, or telling the truth.
And I don't have time for this.
If you're a software engineer, there are a huge number of opportunities out there.
Yes, web3 is one of these. But there are other areas where you will work with real people with real identities, and where you are less likely to enable a group of anonymous people wanting to make a lot of money.
"I joined a company which brands itself as a tech-first company. I was super excited.
As I was setting up my laptop, I noticed I have no admin rights. Turned out I needed to request permission to install anything. And my request for Visual Studio Code was rejected."
🤯
This was a real quote from a senior engineer who left said company after a year. They shared:
“It really started with the overly restrictive environment. It felt I was handcuffed to do my job.
I now work at a scaleup in the same space. It just feels like a breath of fresh air.”
Just to spell it out: all tech-first companies make it dead easy for engineers to use the tools they want, including having admin rights on their machines. They take care of security + stay compliant in less intrusive ways.
Not doing so is a sign of not caring about engineers.
He hired contractors remotely from Nigeria at ~$200/week. Once he raised funding off the back of what they built: he terminated everyone and never acknowledged their work.