Seeing the HBO email mishap got me thinking about a fuck up I once made that I still think about to this day.

A 🧵 that I hope will illustrate some of the factors that can go into these kinds of events.
I'm keeping everything, including the companies, anonymised. I know some people on here know me personally, and may have even worked with me on these projects, and I'd ask you not to name any names either.
I was still pretty junior, working as a contractor for one of the world's biggest technology companies. I was part of a small prototyping and piloting division that worked outside of (or at least around) the corporate bureaucracy. The whole sell of this team was that they could get
an end-to-end idea up and running extremely fast and cheap. We were able to do this in part by leveraging our existing codebase, which had been built up over quite a few years. We could take our general architecture and throw it at different problems, and 9/10 times we'd be able
to hack something together that would meet the goal (and we'd have a little bit more code to leverage in the future).

Keep in mind, this is still mostly internal to the company - so our team and others would basically be competing to win the budget money of other teams
throughout the organisation. This atmosphere was both awesome and awful.
Awesome because we were really streamlined and really did get stuff done in almost no time.
Awful because we did this by cutting every possible corner, including all forms of testing, and by always being on call to hotfix pilots in production, day or night.
The team consisted of a pair of iOS developers, 2-3 web front end devs, 1 backender, a product owner, and a visionary team leader who had quite a significant position in the organisation. It also included an additional frontender and backender, both of whom worked remotely, but for
various reasons were treated very much as a separate entity, and not part of the core.

I had joined the team as a frontender with pretty limited experience (I actually studied game dev at uni), but part of my growth plan was to move into backend development. And in this team,
backend basically meant "everything that wasn't frontend or mobile".
I'd been part of the team for a while, and had found my feet in the frontend. We were working on a new pilot - which was technologically more complex in all dimensions than anything we'd done before. I was onboarding a new dev, at the same time as setting up a whole new
architecture for our web app.

Our backender was working hard to make sure all of the various databases, server code, and scripts were able to fit this new pilot, and it was a big job - made especially hard because of the lack of tests. At the same time, he was very slowly
teaching me bits and pieces - sharing insights about the codebase and practices. I had been told that a proper onboarding would take place at some point, and that he would gradually move off this project and I would replace him.
Great! Except that one day, my boss from the contracting company dropped the bomb that it wasn't going to happen in a few months, it was going to happen *today*.
And it wasn't going to be gradual, it was going to be "he's not working there at all anymore, but will be available for questions you might have".
Gulp. We were in the middle of a project, and I had my hands completely full with the FE. All of a sudden, I was also in charge of the backend. As it turned out, the old dev had been burning out due to stress, and had all but threatened to quit if not removed asap.
I can completely understand that, but I felt a bit like I'd been bait-and-switched.

And very quickly, I began to see why he was so stressed. The server infrastructure was not what you would call modern.
Making changes often required going through mistrustful, faceless IT admins via email in other parts of the company. We had servers that were for testing, staging and production, but also servers that were a mix of all of these.
That meant it was very difficult to understand the impact of operations, and made every mutable DB query into an anxiety nightmare.
The codebase was huge, and a mess of legacy and hacks, stitched together with undocumented interfaces. We had databases, caching layers, and files on the disk that were incredibly interdependent on each other, and which formed some kind of monolith state machine.
It wasn't clear if the code you were working on was important or vestigial.
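One habit I've since picked up that takes the edge off those terrifying mutable queries: run the mutation inside a transaction, and refuse to commit unless the damage matches what you expected. A minimal sketch of the idea (sqlite3 is purely a stand-in here - our actual setup was nothing so tidy):

```python
import sqlite3  # stand-in for whatever driver your real DB uses

def cautious_update(conn, sql, params, expected_rows):
    """Run a mutation, but roll it back unless it changed exactly
    the number of rows we expected it to touch."""
    cur = conn.cursor()
    cur.execute(sql, params)
    if cur.rowcount != expected_rows:
        conn.rollback()  # damage undone - go investigate instead
        return False
    conn.commit()
    return True
```

If you'd expected to update one row and the query hits two thousand, nothing is committed and you get to find out why before anything breaks.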

Every new pilot would add some project specific scripts, which would help us out in the development phase - but would also often do critical state operations during the running of a pilot.
They all had innocuous names, and were copied/pasted/modified from older scripts. And we could be running 3 or 4 pilots concurrently sometimes, so there was a huge amount of context switching involved.
At this point I want to point out that this wasn't any developer's fault - especially not my predecessor's. The pace and pressure were just so high in this environment that this was the only way to achieve the results that were demanded. He had begun, against the wishes of the PO
and team leader, to write tests that would document the system - and had done some significant refactoring to improve important areas of the code. It was, however, a bit of a drop in the ocean, so to speak.
Aside from those factors, I soon found out that a culture had been fostered around the backend role that this person would be responsible for anything that the other developers felt was outside their purview. This meant every other developer would come with a bespoke request
every day - which could quickly add up to eat an entire day. Many of these requests were also expressed as "absolutely top priority" or "completely blocking". Every day. They could be anything from a bespoke API that made no sense in a wider context, and
would just add to the legacy, to server operations that, as I previously mentioned, could be absolutely terrifying (SSH and yoloing like it's 1999).
So we were more than halfway into the development of the latest pilot, with the old backender gone (and as it turned out, not very available for questions), and I still had all of my FE responsibilities to boot. My sleep was affected, and I would lie awake at night
dreading the next day. I felt like I was treading water. But somehow, I *was* making progress. The pilot looked like it was coming together. And I had a 3 week holiday fast approaching, which was helping to keep me sane.
It was the day before we were going to production (also the day before my holiday) - still hacking to the very last minute - and we were in the morning standup. During these standups I would be frantically hacking away, trying desperately to keep up with the seemingly never-ending
demands I was faced with - when one of the mobile guys told me he needed me to run the pilot reset script so they could do some last minute testing. And of course, it was a critically blocking request, which I should drop everything for.
So during this standup, I opened a terminal, SSHd into the server, and ran the script, and told him then and there it was done.
In that moment, I actually felt like I had shit under control. Like I was truly stepping into my predecessor's shoes! But that feeling vanished very quickly. Just after the standup, Slack messages started dropping in the team chat.
"I'm not getting any response from the <blah> server. None of my devices are connecting"

"I can't see it on <XYZ> frontend either."

"Francis you need to look at this asap."
They were referring to stuff outside of the project we were crunching to finish, and I was just annoyed to have to context switch yet again. But then a DM came in.
"There's no data in the <blah> database"
The <blah> database was a kind of testing/production hybrid database, full of very domain specific data we'd be collecting for years. It was considered an enormous asset to the team, and there were big plans for it related to ML and AI applications.
Now, I only peripherally understood this at the time. I had thought it was just a test server (since we were using it that way). But I also knew it was me who had fucked up. The pilot reset script - which was written by the old backender - was apparently
meant to be run ONLY on the actual pilot server. I opened it up, and after digging through the various abstract calls it made into other parts of the codebase, realised that it did in fact, amongst other things, perform a drop of all data in the DB.
No warning, no confirmation, no output in the terminal. Just *poof* gone.
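For what it's worth, even a few lines of guard rail in that script would have stopped me. A rough sketch of the kind of check I mean (every name here is hypothetical - the real script looked nothing like this):

```python
import socket
import sys

# Hypothetical allow-list: the only hosts where a pilot reset is ever legitimate.
ALLOWED_HOSTS = {"pilot-server-01"}

def safe_to_reset(hostname, allowed):
    """The reset may only proceed on an explicitly allow-listed host."""
    return hostname in allowed

def run_reset():
    host = socket.gethostname()
    if not safe_to_reset(host, ALLOWED_HOSTS):
        sys.exit(f"Refusing to run: {host} is not a pilot server.")
    # Make the operator type the hostname back before anything is dropped.
    if input(f"Type '{host}' to confirm the pilot reset: ") != host:
        sys.exit("Confirmation failed - nothing was touched.")
    # ... only here would the actual reset / DROP statements go ...
```

An allow-list plus a typed confirmation turns a one-keystroke disaster into two deliberate steps you'd have to get wrong on purpose.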
The heart palpitations started immediately, and I all but sprinted into a meeting room by myself. I couldn't breathe. I sent out some Slack messages saying that I was looking into it and would update as soon as I knew something.
The PO and other devs were virtually breathing down my neck the entire time.

Now, it shouldn't surprise you at this point to find out that we didn't have any backups - but it certainly surprised me!
Later I found out that, according to the PO, backups were expensive, both to store and to maintain - and that he was sure the data could be recovered "somehow" by the faceless IT department. I'm not so sure about that.
So I thought this was it. My company would lose this client because of my mistake, and I'd likely be fired as a result. Filled with panic, I did the only thing I could think to do, which was to call my predecessor. He was on vacation at the time, which I felt even worse about,
but he did pick up. I explained everything, completely panicked, and he listened calmly, nodding along without any condescension at all.
In some kind of deus ex machina moment, he told me he thought that he actually had a backup he'd made a few weeks before he left the team, when he needed to clone the DB somewhere else. And within a few minutes he'd found it, sent it to me,
and I was able to get it back up and running. The whole thing was down for perhaps 2 hours total.
I should make clear here that this backup would have been considered unauthorised, and probably in breach of NDA. But there was no ill intent here, no cloak and dagger, just a pure saving grace.
I thanked him the best I could in the state I was in, and went back to slack to update everyone. I framed the event in the vaguest terms I possibly could, stating that we'd lost a few weeks of data, but nothing too serious. It would have been better to be completely transparent,
but at the time I just couldn't. I was still scared about losing my position, as well as my standing in the team if I did somehow manage to stay employed. There was a bit of grumbling about the data loss, but in the end it didn't really matter.
The next day we launched and I stepped onto the plane, and while I felt relief, it took several days for me to return to a state of non-anxiety. And on the return flight everything came back up again as I imagined what conversations might be waiting for me.
Thankfully, there weren't any. They were all just happy the gears of the machine were still turning.
I learned a lot from that incident. The biggest lesson of course is that you need to be thoughtful and take your time when it comes to operations - especially when the infrastructure you have doesn't have any guard rails.
You make mistakes when you rush, and rushing is the natural reaction when pressure is applied. You need to unlearn that, and I certainly did! I also learned the importance of documentation, proper procedures for on/off-boarding, and planning ahead for change.
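One of the cheapest guard rails of all would have been a scheduled dump of that database. A sketch of the idea, assuming a Postgres-style setup (ours was different, so treat the specifics as illustrative):

```python
import datetime

def dump_command(db, out_dir, now):
    """Build a timestamped pg_dump invocation (assuming Postgres here -
    the real stack was different and considerably murkier)."""
    stamp = now.strftime("%Y%m%d-%H%M%S")
    return ["pg_dump", "--format=custom",
            f"--file={out_dir}/{db}-{stamp}.dump", db]

# Hand the result to subprocess.run(...) from a nightly cron job,
# and prune dumps older than N days as a separate step.
```

Even a crude version of this, run nightly, would have turned that day from a near-catastrophe into a minor annoyance.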
And in hindsight, and with more experience with other teams, I learned that the way we had things set up there was a ticking time bomb. There was so much firing from the hip there - so much absolute and mandated yolo behaviour - that this was bound to happen.
From the top down, these bad practices were built into our working process as "acceptable" risk. They were encouraged, even demanded, and requests to add more safety or careful procedures were shut down because of costs.
I refuse to work like that anymore, and I make sure to highlight and escalate any perceived risks so that it's clear where responsibility will lie if my advice is ignored. I'm always trying to push for better documentation and onboarding procedures because I never want
anyone else to experience the sheer dread that I went through. I feel lucky that in a lot of the teams I've worked in since then, communication and openness are valued and encouraged, and that people can face difficulties together instead of feeling isolated and cornered.
I hope that was interesting - it's honestly taken me a long time to get to the point where I don't feel ashamed to share this. Every single person makes mistakes. I'm still making mistakes!
But you have to turn those mistakes into knowledge and intuition, learning from every one of them, and trying hard not to make them again (though even that can happen!).


