(also! watch out for the @o11ycast recorded today with @rachelmyers and @eanakashima, wherein @lizthegrey and I debated to-log-or-not-to-log in 🌟exhaustive🌟 detail)
and ideally providing you with enough of the context and system state whence the error occurred, so you can repro it locally.
It is at least as likely to TAKE YOUR SYSTEM DOWN as it is to explain your problem.
* filling up the local volume
* saturating your IOPS or NAS
* DDoS'ing your central log store
* saturating your NIC
* causing new or exacerbating existing race conditions
* filling up any number of buffers and eventually your RAM
* ... shall I go on?
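To make the buffers-and-RAM item concrete, here is a minimal sketch, assuming Python's standard `logging` and `queue` modules. `DroppingQueueHandler` and `demo` are hypothetical names made up for illustration, not anything from this thread or the podcast. It shows one possible tradeoff: cap the log buffer and count what you throw away, instead of letting it grow until it eats your memory.

```python
import logging
import queue


class DroppingQueueHandler(logging.Handler):
    """Hypothetical handler: buffers records in a bounded queue, drops on overflow."""

    def __init__(self, maxsize):
        super().__init__()
        # A bounded queue caps how much memory buffered log records can hold.
        self.buffer = queue.Queue(maxsize=maxsize)
        self.dropped = 0

    def emit(self, record):
        try:
            # Never block the caller: if the buffer is full, lose the line.
            self.buffer.put_nowait(record)
        except queue.Full:
            self.dropped += 1


def demo(n_messages=1000, maxsize=100):
    """Emit n_messages debug lines with nothing draining the buffer."""
    logger = logging.getLogger("drop_demo")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    # Start from a clean logger so repeated calls do not stack handlers.
    for old in list(logger.handlers):
        logger.removeHandler(old)
    handler = DroppingQueueHandler(maxsize)
    logger.addHandler(handler)
    for i in range(n_messages):
        logger.debug("request %d handled", i)
    return handler.buffer.qsize(), handler.dropped


if __name__ == "__main__":
    buffered, dropped = demo()
    print(f"buffered={buffered} dropped={dropped}")
```

With the defaults and no consumer draining the queue, the buffer stops at 100 records and the other 900 are dropped. That is the point of the sketch: the process stays up, and you pay for it by losing log lines.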
Now here is where liz and I begin arguing vigorously...
she agrees, but says that based on her experience running large multi-tenant systems, logs are sometimes the only thing that will shed light on contention for shared resources.
"nonsense," I say, "based on MY experience
The full argument is on the podcast. ☺️ The difference in our experience turned out to hinge on -- actually you know what, I won't spoil it. 😈
Deprecated. Smelly. Means there is something missing, or something wrong. But useful as all fuck if you're out of clues.
You should be using your tools to explore and understand your systems. Inspecting a single host means your tools failed you. When you smell this, you should fix them.
Hrm. Does that make sense to you? 🤔 I can't tell how many folks this distinction resonates with.
Implicit: you do not know the answer when you set out, you go where the data takes you.
We are far too used to solving operational problems with guesses and pattern matching.
Instead of iterating, you search for that one magic string you remember, or grep for a user ID you remember as a good canary for the problem you suspect is happening.