Or how many *can't* actually be explained or understood, given existing telemetry.
"Huh. Well if it happens again we will *definitely* need to figure out what's going on." ~everyone
You're trying to debug why a single event failed, and all you have are time-series aggregates and metrics whose context has been stripped away and discarded.
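To make that loss concrete, here is a minimal sketch in Python. The events, field names (`endpoint`, `status`, `user_id`, `build`), and helper functions are all hypothetical, invented for illustration rather than taken from any real telemetry library. It rolls a handful of raw events up into a counter and then asks a question that only the raw events can answer.

```python
from collections import Counter

# Hypothetical raw request events; every field name and value is made up.
events = [
    {"endpoint": "/checkout", "status": 200, "user_id": "u1", "build": "a1f"},
    {"endpoint": "/checkout", "status": 500, "user_id": "u2", "build": "b7c"},
    {"endpoint": "/checkout", "status": 200, "user_id": "u3", "build": "a1f"},
    {"endpoint": "/checkout", "status": 500, "user_id": "u4", "build": "b7c"},
]


def to_metric(events):
    """Aggregate like a simple counter metric: keep only endpoint and status."""
    return Counter((e["endpoint"], e["status"]) for e in events)


def failures_by(events, field):
    """Count failed events by any field; only possible while raw events exist."""
    return Counter(e[field] for e in events if e["status"] >= 500)


metric = to_metric(events)

# The aggregate can say how many requests failed...
print(metric[("/checkout", 500)])    # 2

# ...but user_id and build were dropped at aggregation time, so only the
# raw events can say what the failures had in common.
print(failures_by(events, "build"))  # Counter({'b7c': 2})
```

Once only `metric` is stored, there is no way to recover which build or which users were behind those two failures. That is the context the aggregate threw away.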
Another reason we can't explain our outages is that usually all we have are dashboards, and the scars and memories of past outages.
Can you imagine if we debugged lines of code this way? By thinking hard??
For software you own and instrument yourself, this should bend asymptotically over time toward a debuggable system.