061. BSoD to Watson: The Reliability Journey …rdcoresoftware.learningbyshipping.com/p/061-bsod-to-…
Super excited to share this set of stories about software quality, I mean blue screens & crashes. Ever wonder what happens when you click "Send to Microsoft?" Does it matter? Where's it from? Who invented it? 1/
2/ In the Windows 95 / internet era when so many people started with computing, "crashing" was a thing computers just did. You'd be working away on a word processor or paint program and 💥 the PC would freeze or worse.
This happened on Macs too. Mac had a very graceful fail :-)
3/ When I started writing, I wanted to go through the entire history of how the PC handled crashes. But along the way, I realized what was fundamentally a user-hostile event just got more hostile over the years.
4/ Who was General Protection in what army?
How "illegal" was the "operation"?
Was the exception really "fatal"?
These and other Ups led to internet "memes" (not the phrase back then) like this Tools Options Crash.
And of course the defining "Blue Screen of Death" in NT 3.1
5/ Turns out back during Windows 3.0 a clever engineer on the Windows team invented a tool called "Dr. Watson" (originally called Sherlock, but that name was already in use, though this was a decade before the Mac's Sherlock).
Watson captured some minimal but critical system info that could be shared w/Microsoft.
5a/ Closely related, though not obvious, was development of sophisticated "Undo".
A big problem with s/w was how "destructive" operations (editing!) led to defensive use (saving, backups, etc) which stressed the system. Undo was a first step in dramatic quality improvements.
6/ There was no internet quite yet so it sort of sat there for a few years.
Then right around holiday 1998, an engineer on Office (a hacker's hacker, Kirk Glerum) had the insight to connect Watson to the internet. He wrote a memo, which was weird b/c he mostly wrote MASM.
7/ Just a brilliant idea. It seems so trivial / obvious now, but before then software didn't do this (there were examples of copiers that signaled errors over phone lines and mainframes did some of this).
Very quickly the team jumped on this idea. Every crash was a data point.
8/ Many more details about how this changed our culture and what was interesting about it in terms of customer experience.
I always thought of it as a dramatic change in *computer science*. It turned fixing bugs at massive scale into a solvable problem. My college recruiting prez.
9/ The most important thing we learned quickly was the 80/20 rule—80% of the crashes happening (in the real world) were caused by just 20% of the bugs. In fact just a few bugs were *half* the crashes. This "Pareto" distribution was dubbed "the Watson curve".
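To make the shape of that curve concrete, here is a minimal sketch with made-up numbers. It is not how Watson worked; the bug names, the counts, and the helper functions `crash_curve` and `buckets_needed` are all hypothetical. It groups crash reports by bug, sorts by frequency, and asks how many of the top bugs cover a given share of all crashes.

```python
from collections import Counter


def crash_curve(bucket_ids):
    """Return (bucket, count, cumulative_share) rows, most frequent bucket first."""
    counts = Counter(bucket_ids)
    total = sum(counts.values())
    rows = []
    running = 0
    for bucket, count in counts.most_common():
        running += count
        rows.append((bucket, count, running / total))
    return rows


def buckets_needed(rows, share):
    """Smallest number of top buckets whose crashes reach `share` of the total."""
    for i, (_, _, cumulative) in enumerate(rows, start=1):
        if cumulative >= share:
            return i
    return len(rows)


# Hypothetical sample data: 100 crash reports spread over 10 bugs.
reports = (
    ["bug_a"] * 50 + ["bug_b"] * 30 + ["bug_c"] * 6 + ["bug_d"] * 4
    + ["bug_e"] * 3 + ["bug_f"] * 2 + ["bug_g"] * 2
    + ["bug_h", "bug_i", "bug_j"]
)

rows = crash_curve(reports)
for bucket, count, cumulative in rows:
    print(f"{bucket}: {count} crashes, {cumulative:.0%} cumulative")

print("bugs covering half the crashes:", buckets_needed(rows, 0.5))
print("bugs covering 80% of crashes:", buckets_needed(rows, 0.8))
```

In this invented sample, one bug accounts for half the crashes and two of the ten bugs (20%) account for 80%, which is the kind of skew the thread describes.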
10/ The team even wrote a paper that was published in Communications of the ACM "Debugging in the (Very) Large: Ten Years of Implementation and Experience".
11/ This was the start of a dramatic change in software quality (yes I realize people will make jokes!) It is hard to put into words how fundamental this was to software engineering.
Of course this is part of every mobile platform today but it is amazing to think of the start.
12/ I tried to capture many of the details of this evolution. It was a pivotal point in the history of engineering brought on by the internet.
We followed this with "watsonizing" everything: feature usage, help topics, spelling, and more. (Prez from 2004)
13/ I wanted to use this first post of 2022 to thank everyone who has been along for the journey of "Hardcore Software". I can't thank enough the over 200,000 unique readers. It is so amazing to be able to share these lessons and history.
Thank you 🙏
14/ Please consider subscribing for 2022. We're only halfway through and will soon be covering topics like SharePoint, .NET, NetDocs (!), the Ribbon, Courier (!), Windows 7, Windows 8, Surface, and so much in-between. hardcoresoftware.substack.com
PS/ I love the personal reflections shared by the members of the teams in "Hardcore Software". Here is Kirk Glerum on what it was like to build and stand up Watson. (ignore the "Mr.", that's Kirk 🙃).
Subscribe and check out more comments.