How does one reconcile the emerging consensus for alerting exclusively on user-visible outages with the undeniable need to learn about and react to things *before* users notice? Like a high cache eviction rate?
Oh, and don't forget AIOps. Definitely some of that.
These alerts should be few, high level, and directly correlate to user experience.
As well as a common intersecting variant, daytime hours and nighttime hours.
Making on-call not suck is actually less about reducing the load to zero alerts and more about making sure none of them are urgent.
(It is ☺️)
I strongly advise against having more than two sources of alerts.
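To make the page-versus-not-urgent split concrete, here is a minimal sketch of one way it could be wired up. Everything in it is hypothetical: the `Alert` type, the `user_visible` flag, the `route` function, and the alert names are made up for illustration and don't come from any particular alerting tool. The assumption is simply that each alert is tagged by whether it tracks user pain, and only those are allowed to page.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page"      # urgent: interrupt whoever is on call
    TICKET = "ticket"  # not urgent: look at it during working hours


@dataclass(frozen=True)
class Alert:
    name: str
    user_visible: bool  # hypothetical flag: does this signal track user pain?


def route(alert: Alert) -> Route:
    """Page only for user-visible symptoms; queue everything else."""
    return Route.PAGE if alert.user_visible else Route.TICKET


if __name__ == "__main__":
    examples = (
        Alert("checkout_error_budget_burn", user_visible=True),
        Alert("cache_eviction_rate_high", user_visible=False),
    )
    for alert in examples:
        print(alert.name, "->", route(alert).value)
```

The cache eviction signal is still captured and still gets looked at. It just lands in a queue instead of waking someone up.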