In preparation for fixing broken Kubernetes clusters live on @rawkode's #Klustered event, I reminded myself of some of the core K8s debugging commands and techniques
Here are my top 10 tips for platform engineers debugging Kubernetes and the machinery underneath the covers 🧵 👇
First off, use kubectl to take a look at the cluster infra:
$ kubectl get nodes
$ kubectl cluster-info dump
These commands typically give you a good idea of where to start debugging, e.g. broken nodes, infra issues, or resource pressure
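If a node looks unhealthy, describing it shows its conditions, taints, and recent events (substitute one of your own node names here):
$ kubectl describe node <node-name>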
If kubectl doesn't work, it's time for some Linux debugging! SSH to the control plane node and run:
$ ps aux | grep kube
This will show you if core components such as the kubelet, kube-apiserver, and kube-scheduler are up and running
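On kubeadm-based clusters the control plane components run as static pods managed by the kubelet, so it's also worth checking that their manifests are still in place:
$ ls /etc/kubernetes/manifests/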
Checking out the kubelet logs will also generally provide pointers to underlying infra issues e.g. on a machine with systemd:
$ journalctl -xeu kubelet.service
In addition, you can typically find the logs for core components under /var/log/kube-apiserver.log or, on kubeadm-initialized clusters, under /var/log/containers/kube-...
Tail or less is your friend here:
$ tail /var/log/kube-apiserver.log
$ less /var/log/containers/kube-apiserver-...log
The @containerd command-line tool, ctr, can also be useful to see if the control plane containers are up and running, e.g.
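Assuming a containerd-backed cluster using the default k8s.io namespace that the kubelet uses:
$ ctr --namespace k8s.io containers ls
$ ctr --namespace k8s.io tasks ls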
Once you've got the control plane up and running, I like to look at everything running in the cluster (if my cluster workloads are small enough not to overwhelm my terminal). Look for CrashLoopBackOffs or pods not starting:
$ kubectl get pods -A
If you find anything, then it's time to go spelunking in the associated Pod or Deployment config and/or logs.
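For example (substituting your own pod name and namespace):
$ kubectl describe pod <pod-name> -n <namespace>
$ kubectl logs <pod-name> -n <namespace> --previous
The --previous flag shows logs from the last terminated container, which is handy for CrashLoopBackOffs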
Getting a list of events can also be useful (and you will have seen some events when using the describe commands above)
$ kubectl get events --sort-by=.metadata.creationTimestamp
Look for obvious issues (missing images, resource problems such as OOM kills, network connectivity failures)
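A trick I find useful for cutting through the noise is filtering events down to just the warnings:
$ kubectl get events -A --field-selector type=Warning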
If you have access to the original manifests, the diff command can be super useful to see if someone has been tampering with your config in the cluster (accidentally or otherwise)
$ kubectl diff -f ./my-manifest.yaml
I could write an entire other thread on debugging networking and storage issues, but I'll save this for another day
As a general reference, the Kubernetes docs have some very useful additional guidelines for debugging clusters, too: kubernetes.io/docs/tasks/deb…
And I found Ioannis Moustakis's Certified Kubernetes Administrator (CKA) exam cheatsheet super useful for commands (even if you're not studying for the exam!): faun.pub/cka-kubernetes…
"Platform Engineering" is rapidly becoming the new DevOps or SRE. Almost every day we hear about another org building an internal developer platform or control plane.
Want to know what platform engineering is, where the trends are going, and why you should care?
Read on 🧵👇
We've all been building application/web platforms for years
- On-premises: ticket-driven, bare-metal, long lead time
- First-gen PaaS: self-service, VM-based, one-size-fits-all, on-demand
- Next-gen PaaS/Custom platform: self-service, container-based, fast feedback, good UX/DevX
The rise of more ops-savvy developers (and SREs) and developer-friendly infra tooling has led to a boom in the creation of custom platforms
The attraction of building a custom platform is that you can craft the abstractions to match exactly what your org requires (in theory)