GeorgeWilliamHerbert
He/Him. IT/Cloud/SRE/PE, rockets, nukes not going bang, space, CS, ships. @miis Adjunct. I speak for myself only. V95.45XA @georgewherbert@sfba.social

Oct 3, 2020, 12 tweets

Friday was Danger Day. 11 years ago at this time, I was leading an extremely uncomfortable outage assessment WebEx session, more so because I didn’t technically consult there anymore and there had been some hands waved.

en.m.wikipedia.org/wiki/2009_Side…

Around 1am was when the Oracle guy watched the last disk scan for Oracle ASM headers on the LUNs on the Hitachi come back negative and indicated they couldn’t help under the circumstances.

Then the poor T-Mobile DBA said “We’re fucked.”

I had rolled back off before the disk array was repaired; the story relayed later was that a firmware upgrade reversed the order of every disk in the 24-disk RAID groups, and the data strangely became unavailable. That’s secondhand.

This was widely blamed as an early “cloud” failure but really didn’t resemble one; there was a central unified database cluster that had been created years earlier in the startup mode of the time: top-end enterprise HW for a key role, but not redundant or adequately backed up. SPOF.

The economic damage of writing off a 1.2-million-customer smartphone brand and replacing all the devices over the next year has to have approximated a billion dollars.

Microsoft was blamed but the problem preexisted the Danger acquisition, and MS after some convincing had sprung for new hardware to (largely) cover addressing it. It had just reached the loading dock that Thursday.

Fragility of complex systems in all the fields I work in is a big problem. IT, where I still deal with mixtures of thousands of physical computers as well as cloud resources; aerospace, with space launch & spacecraft; nonproliferation & geopolitics, where stability is fleeting.

I have subsequently seen mistargeted projects waste more money than that, companies fail with lesser failures. Treaties collapse for stupid reasons unrelated to their goals and effects. The occasional exploding spacecraft. These things happen.

They should happen less.

The number of organizations actually working consistently to manage and engineer resilience at all levels is shockingly small. If you’re an executive or board member, your business is more at risk than you probably have a handle on. Spend some time on that.

The NDA expired a year ago; this really needs to be a conference presentation sometime soon, but I’m not particularly enthusiastic about doing that via Zoom. When Covid has faded away.

If you’re working in any industry or organization or government, and you see something that could break and cause the organization to fail, and there’s no backup plan... call it out. People won’t always listen, but call it out and keep doing that.

Most of the time they don’t fail. Sometimes they do. Being right all along doesn’t wash the taste of ashes out of your mouth as the organization burns down. If you were fighting to fix it, that’s the best you could do. Fight for the fixes.
