You can't talk about K8s scalability without etcd. They minimized the number of blocking read/write operations (now available in @etcdio 3.4). etcd becoming the bottleneck can happen in smaller clusters too if you have a lot of custom resources.
Networking: even with a medium-sized cluster, the apiserver has to send a lot of data because Endpoints store data per pod, which quickly becomes an issue for deploys of larger services. EndpointSlices were introduced to mitigate this (beta in @kubernetesio 1.19)
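For a feel of the shape of the new API, here's a minimal client-go sketch (my own illustration, not from the talk; the "my-service" name and "default" namespace are made up) that lists the EndpointSlices behind one Service. Each slice holds at most ~100 endpoints by default, so a big Service fans out into many small objects instead of one giant Endpoints blob.

```go
// Not from the talk: a minimal client-go sketch listing the EndpointSlices
// behind a Service. The "my-service" name and "default" namespace are made up.
package main

import (
	"context"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Standard out-of-cluster setup from ~/.kube/config.
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// EndpointSlices are grouped per Service via this well-known label,
	// and each slice caps out at ~100 endpoints by default.
	slices, err := clientset.DiscoveryV1beta1().EndpointSlices("default").List(
		context.TODO(),
		metav1.ListOptions{LabelSelector: "kubernetes.io/service-name=my-service"},
	)
	if err != nil {
		panic(err)
	}
	for _, s := range slices.Items {
		fmt.Printf("%s: %d endpoints\n", s.Name, len(s.Endpoints))
	}
}
```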
Next up, Storage: immutability added to the Secret and ConfigMap APIs so the kubelet does not need to watch for changes, which reduces load on the control plane (also @kubernetesio 1.19)
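Sketch of what that looks like from the client side (again my own illustration, not from the keynote; name, namespace, and data are made up):

```go
// Not from the keynote: creating an immutable ConfigMap so the kubelet can
// skip watching it. To change the data you create a new ConfigMap (e.g.
// app-settings-v2) and re-point the workload; the object itself never mutates.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func createImmutableConfigMap(clientset *kubernetes.Clientset) error {
	immutable := true
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: "app-settings"},
		Data:       map[string]string{"LOG_LEVEL": "info"},
		Immutable:  &immutable, // beta in 1.19; once created, the data is frozen
	}
	_, err := clientset.CoreV1().ConfigMaps("default").Create(
		context.TODO(), cm, metav1.CreateOptions{})
	return err
}
```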
Golang memory allocation was also a bottleneck (!!!); improving it benefits the whole ecosystem
Very important slide: K8s scalability is not isolated. The rest of your infrastructure needs to scale too (Note: I will also emphasize network integration!!!)
Next up: SIG App-Delivery presents how to solve everyday problems with the landscape (CI/CD, image build/app def, scheduling/orchestration) @resouer @AloisReitbauer #CloudNativeCon #KubeCon
BTW I'm also keeping an eye on the Kata Containers Performance on ARM talk! I'm very excited by the recent developments in lightweight VMs like @katacontainers as well as ARM arriving in the cloud via Graviton at @awscloud
I love interactive demos like podtato-head (github.com/cncf/podtato-h…) because I am a kinesthetic learner; staring at slides does not work well for me! You can also get a feel for the different UX of each technology
KubeVela is the recommendation, using the Open Application Model (OAM) with developer-centric primitives, but it is also highly extensible. These are the kinds of abstractions I like!
Next, evaluating more complex use cases (where all the challenges are!). Stateful workloads, databases, external dependencies, etc. all exist in the real world. Maybe I should check out SIG App-Delivery? 🤔
Next issue: rolling out a new image, and Nodes became NotReady after a few days. Weird, because the change wasn't related to the kubelet or the runtime.
Why? `kubectl describe nodes` shows the container runtime is down, but containerd is fine.
After looking at the kubelet logs, they find an error @lbernail
They noticed there is locking logic in the CNI plugin, and it is hanging.
Take a goroutine dump to find blocked goroutines.
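If you've never done this: a Go process will print all goroutine stacks when it receives SIGQUIT, or you can dump them programmatically. A tiny sketch of the programmatic version (my illustration, not the speakers' actual tooling):

```go
// Not the speakers' tooling: dump every goroutine's stack so the blocked ones
// (stuck on a mutex / channel) stand out, along with how long they've waited.
package main

import (
	"os"
	"runtime/pprof"
)

func dumpGoroutines() {
	// debug=2 prints a full stack per goroutine, including its wait state
	// (e.g. "semacquire", "chan receive") and how long it has been blocked.
	_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}
```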
Able to reproduce the issue with a blocked Delete command and tracked it to a bug in the netlink library used by the Lyft CNI plugin (wow, so low level!)
I wasn't able to watch @robertjscott's talk on AZ-Aware routing but read the slides; this is Very Important if you are running in multiple zones/regions!
When migrating to HPA, you can accidentally scale down to zero:
- when current replicas is 0 or 1
- when migrating deployment logic (e.g. to Spinnaker)
- when using alternative "deploy" methods (e.g. kubectl scale)
K8s Master Issues:
- Do NOT drain the masters, unless you want to play 52 card pickup
- Running out of memory -> alerts firing everywhere -> everyone loses SSH access to the nodes (o no...)
A single large service deploy that is CrashLooping (esp. with high maxSurge) is problematic
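For context on the maxSurge part (my illustration, not from the talk; the percentages are made up): maxSurge is the rollout knob that lets a Deployment run extra pods above the desired count during a deploy, so a huge service with a generous surge plus a crash-looping image means a lot of churn hitting the control plane at once.

```go
// Not from the talk: where maxSurge lives on a Deployment. With a big service,
// a high surge means many extra pods get created during a rollout; if the new
// image CrashLoops, all of that churn lands on the control plane at once.
package main

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func rolloutStrategy() appsv1.DeploymentStrategy {
	surge := intstr.FromString("50%")      // up to +50% extra pods during a deploy
	unavailable := intstr.FromString("0%") // keep full capacity while rolling
	return appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxSurge:       &surge,
			MaxUnavailable: &unavailable,
		},
	}
}
```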
Meanwhile, keeping an eye on KubeVela; this looks really legit. Has anyone used it?
I'll be live tweeting #KubeCon #CloudNativeCon, and taking this year off from speaking. Excited to sit back and actually watch the content 😊. What talks are you attending?
@Lemonjet here to give a keynote on K8s @ Apple. They have MASSIVE data center scale. Looked to K8s for the pluggability, extensibility, and ecosystem. Unsurprisingly, they had to consider the learning curve and platform support to drive adoption. #kubecon #CloudNativeCon
Apple started by breaking down different users and workloads. Application developers, SRE (Note: easy to forget that infra teams are also your customers!), hardware, machine learning / batch, and finance / payments jobs. #kubecon #CloudNativeCon @Lemonjet