1. Random latency when talking over the network
A: Check disk I/O on the host; you're probably exceeding the IOPS limits on the OS disks. I bet it's disk.
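One quick way to confirm disk pressure is to sample `/proc/diskstats` directly. A rough, Linux-only sketch (`iostat -x` from sysstat gives the same data with more polish, if it's installed):

```shell
#!/usr/bin/env bash
# Rough per-device IOPS sampler over a 1-second window (Linux).
# /proc/diskstats fields: $3 = device name, $4 = reads completed,
# $8 = writes completed (cumulative counters since boot).
snap() { awk '{print $3, $4 + $8}' /proc/diskstats | sort; }

before=$(snap)
sleep 1
after=$(snap)

# Diff the two snapshots: ops completed in the window ~= IOPS.
join <(echo "$before") <(echo "$after") |
  awk '{iops = $3 - $2; if (iops >= 0) printf "%-12s %6d IOPS\n", $1, iops}'
```

Compare what you see under load against your cloud provider's per-disk IOPS cap; OS disks are often the smallest, slowest tier.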
2. My cluster goes down during an upgrade
A: Set a PodDisruptionBudget so an upgrade can't evict every replica at once.
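A minimal PodDisruptionBudget sketch (names and values are illustrative, assuming your pods carry the label `app: my-app`):

```yaml
# Keep at least 2 replicas up during voluntary disruptions
# (node drains, cluster upgrades). "my-app" is a placeholder label.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
```

With this applied, a node drain blocks instead of evicting below the floor, so the upgrade proceeds node by node without taking the whole app down.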
Check your CPU and memory limits in your YAML. Confirm the requests and limits are correct for your app; back pressure from throttling at the container level can crash your app.
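For reference, a requests/limits sketch on a container spec (the values are placeholders; base requests on observed usage, and be wary of CPU limits, since those are what trigger throttling):

```yaml
# Fragment of a container spec in a Deployment. Values are illustrative.
resources:
  requests:
    cpu: "500m"      # scheduler guarantee; set from observed usage
    memory: "512Mi"
  limits:
    cpu: "1"         # exceeding this -> CPU throttling (latency, back pressure)
    memory: "1Gi"    # exceeding this -> container is OOMKilled
```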
Double-check that you're not blocking anything, and that your IT group isn't auto-banning ports, URLs, etc. (e.g. port forwarding being blocked by Dave's firewall in the IT closet).
Then run a few 1 GB+ containers under load; container size matters. Also check NFS and CIFS mounts.
Latency talking to the API server... probably not the network.
Disk pressure/IOPS overload.
You planned for disk space, PVC space, memory, and CPU, but not for sheer IO(PS).
Even NVMe drives, buses, and caches have limits; those limits amplify at scale *under load* and will *look* like a software, host, or managed-service failure.
You should treat the OS disk as inviolate: move all heavy I/O (Docker's data directory, scanners, monitoring/log storage, etc.) *off* the OS disk path, and give the OS disk as much I/O headroom as you can.
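For Docker specifically, the `data-root` option in `/etc/docker/daemon.json` moves image and container I/O onto a dedicated disk. A sketch, where `/mnt/docker-data` is a placeholder path on a non-OS volume:

```json
{
  "data-root": "/mnt/docker-data"
}
```

Stop Docker, copy the existing `/var/lib/docker` contents to the new path, then restart the daemon so existing images and containers survive the move.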