Gergely Orosz
Apr 11 21 tweets 9 min read
It has been nearly a week since ~400 companies lost access to all @Atlassian products like JIRA and Confluence. I've talked to several impacted teams, and they are upset at how poorly Atlassian is handling the biggest outage these teams have experienced.

A thread on what Atlassian needs to fix and why:
(Image: Atlassian system status: all red.)
1. Outages happen, no matter how you try to avoid them. No one should be upset about this incident, nor search for who is to blame (apparently, a maintenance script)

What matters is what happens AFTER the incident is discovered.

2. Initially, Atlassian did just fine in notifying customers that something was wrong. They posted updates after the incident started.
3. However, 6 hours later, the incident was still ongoing.

This was strange... because according to Atlassian's own protocol, they are able to restore data for services like JIRA within 6 hours: atlassian.com/trust/security…
(Image: RTO (recovery time) for JIRA (Tier 1) should be <6 hours.)
4. The cause of the outage is accidental data deletion. Happens. Should be easy for Atlassian to restore. Quote from them:

"Atlassian tests backups for restoration on a quarterly basis."

And yet, backups are not working as we speak.

atlassian.com/trust/security…
5. Ok, so Atlassian apparently has an outage where their recovery is not working. Not great, but happens.

What should you do in this case?

Tell customers what is happening.

Customers tell me there has been radio silence for 6 days for the most part. This is not ok.
6. Mission-critical customer infrastructure. Several impacted customers went all-in on Atlassian, including using @Opsgenie for their on-call alerting - it's like @pagerduty, but by Atlassian.

For them, OpsGenie is also down. Atlassian offers no workaround even for this.
7. Finally, 6 days into the outage, some customers received communication.

It was an update on a ticket that told them...

"Wait more. A lot more."

That's it. No alternatives offered. Just "wait". After a week into the outage. As a paying customer.

8. To rub salt in the wound, customers using on-site JIRA installations have no such issues (the outage is specific to Atlassian Cloud).

However, Atlassian discontinued Server products, claiming the Cloud is more reliable. These customers sure don't feel it is.
9. So what should Atlassian have done differently? A lot.

A) Communicate to the world about what is happening. The official Twitter account has not tweeted in 4 days (!!). In the middle of a massive Atlassian outage? This was the last tweet.

A) (Cont'd). No Atlassian exec has issued any statement.

When @Cloudflare has issues much smaller than this, @eastdakota communicates rapidly. Take the Okta breach: just a few hours in, they already had updates going out:

B) Talk to your customers!

There are "only" ~400 companies impacted. Yet most of them are in the dark.

Give them updates!

Tell them the root cause so they don't ping me for it (yes, I've told several impacted customers the actual root cause, which I know from an employee).
C) Offer alternatives to "wait for ~2 more weeks until you can use *any* Atlassian products"!

Some customers just want OpsGenie back. Some want certain Confluence docs. Give them options. Offer to bring back some services earlier.

Give them SOMETHING other than "wait".
D) Start your public postmortem.

Remember when @gitlab lost customer data? I do. They livestreamed how they mitigated the outage and then posted a very detailed postmortem afterward: about.gitlab.com/blog/2017/02/1…

The result? People trusted @GitLab more.
E) Acknowledge the incident & confirm taking responsibility. Explain why the "How Atlassian does Resilience" article does not apply, and why the restoration SLAs are broken. How will customers be compensated?

Why should future customers trust Atlassian if this is not addressed?
F) Call out the good work your engineering teams are doing.

People are working round the clock. Use your reach like the @Atlassian handle to share what is happening.

I hear people are working round the clock. From backchannels. Why not from @Atlassian?
G) Know what is on the line. This is not just about impacted customers. The eng community is watching how this outage is being handled. Decision-makers are taking notes. People are saying, "It could have been us, do we have a plan B?"

Atlassian's reputation is on the line.
But please, start with your customers. They deserve better. Talk with them. Communicate directly. Give them alternatives while they wait.

Do this without someone like me asking to do so.

#HugOps to engineers working overtime.

You can, and should, do better, @Atlassian.
If you are impacted by the outage, you have no choice but to wait weeks - unless Atlassian changes its approach.

You'll probably try out other vendors as you can't do anything else.

I can recommend @linear. They are offering help to those caught in this:

*Finally*, after 5 days of full silence and a day after I published the above tweet suggesting how to fix things, Atlassian is communicating more openly.

Thanks for the transparency as step 1, @Atlassian.

Unfortunately, impacted customers are telling me @Atlassian is not doing what they are communicating publicly.

This is from a company that has been down since 5 April. Atlassian, why are you not talking with your own paying customers? Why do you not give alternatives? Shame…

