"can we make a dashboard so we can find the problem immediately next time?" and
"what alert can we set up, to notify us when this happens?"
does everyone even use the same ones, or are you fragmented?
the frame you want for modern systems is debugging, not monitoring. if you can't spot a problem at a glance, you shouldn't try.
it's not hard. but it's an exploration game. you follow the breadcrumbs where they take you.
no, you should not add a monitoring check for every system state that sometimes represents a problem worth escalating.
no, not another dashboard. just no.
but you aren't going to notice many (if not most) of the bugs or problems, and the badness needs to rise to a certain level to even be worth your time.
dashboard-flipping isn't science. with science you ask questions -- you formulate a hypothesis, you test it. you follow your breadcrumbs where they lead you.
every dashboard is an artifact of some past failure, and the data sources may or may not even be working, and your team's entire view of the world has fragmented. so just fuck dashboards.
send nearly all "alerts" to a non-paging destination with an SLA of hours, not minutes.
it probably won't be worse. it might be better.)
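
a rough sketch of the routing rule above (nearly all alerts go somewhere that doesn't page, with an SLA of hours). everything named here is hypothetical: `Route`, `PAGE`, `QUEUE`, `route_alert`, the specific SLA numbers, and the two conditions for paging are assumptions for illustration, not any real alerting tool's API and not a rule stated in the thread.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Route:
    """Where an alert goes and how quickly someone is expected to look at it."""
    destination: str
    sla: timedelta


# assumed values: the thread only says "hours, not minutes" for the non-paging path
PAGE = Route("pager", timedelta(minutes=5))
QUEUE = Route("ticket-queue", timedelta(hours=4))


def route_alert(user_impacting: bool, needs_human_now: bool) -> Route:
    """Page only when users are hurting AND a human has to act right away.

    Both conditions are assumptions; everything else lands in the queue.
    """
    if user_impacting and needs_human_now:
        return PAGE
    return QUEUE


if __name__ == "__main__":
    # a disk-filling warning that can wait until morning goes to the queue
    print(route_alert(user_impacting=False, needs_human_now=False))
```

the point of the default is that the queue is the fall-through case: an alert has to earn its way onto the pager, not the other way around.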