Micah Hausler

@micahhausler

, 29 tweets, 7 min read

I often hear people ask why Kubernetes and Firecracker (FC) can’t just be used together. It seems like an intuitive combination: Kubernetes is popular for orchestration, and Firecracker provides strong isolation boundaries. So why aren’t they compatible yet? Read on 🧵

First, a brief explanation of Firecracker. Firecracker is a virtual machine monitor (VMM) written in Rust (read: cool) that was open sourced by AWS in 2018. I _highly_ recommend reading the FC paper for a more thorough explanation of what it is and is not. amazon.science/publications/f…

The short version is that FC prioritizes isolation/security, density, compatibility (w/ Linux APIs), speed, and performance. FC manages VMs and Kubernetes manages containers, and certain Linux container features don’t exactly translate to VMs. We’ll get to specifics shortly

In order to run a microVM (unit of work in FC), you interact with the FC API over a unix domain socket. The FC API is RESTful and you can find the OpenAPI docs online, but the Kubernetes Kubelet (agent) doesn’t know how to talk to this API. github.com/firecracker-mi…
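As a rough sketch of what talking to that API looks like, the Python below sends one request over a unix domain socket using only the standard library. The socket path and the helper names are assumptions made for illustration; the `/machine-config` endpoint and its `vcpu_count` and `mem_size_mib` fields are taken from the Firecracker OpenAPI spec.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that dials a unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=5.0):
        # The host name is only used for the Host header; no TCP is involved.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def machine_config_request(vcpu_count, mem_size_mib):
    """Build the (method, path, body) triple that sizes a microVM."""
    body = json.dumps({"vcpu_count": vcpu_count, "mem_size_mib": mem_size_mib})
    return "PUT", "/machine-config", body


def send(socket_path, method, path, body):
    """Send one JSON request to the API socket and return the HTTP status."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request(method, path, body=body,
                     headers={"Content-Type": "application/json"})
        return conn.getresponse().status
    finally:
        conn.close()
```

With a Firecracker process listening on a hypothetical `/tmp/firecracker.socket`, usage would look like `send("/tmp/firecracker.socket", *machine_config_request(2, 512))`. Nothing in the Kubelet speaks this protocol, which is the point of the next few tweets.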

What the Kubelet _does_ know how to do is speak to a CRI API (Container Runtime Interface) to run a pod, or a group of containers together. Then how does Kubernetes talk to a runtime like containerd you ask? Good question! It doesn’t, or at least not directly.

It talks to the cri-containerd plugin embedded in the containerd runtime. containerd has the ability to run plugins, and these plugins can translate requests from other APIs into actions containerd knows how to do github.com/containerd/cri

Since Kubelet only knows CRI and adding support for Kubelet to directly talk to additional runtimes is a non-goal for Kubernetes, this means we need something else between containerd and Firecracker. Enter firecracker-containerd github.com/firecracker-mi…

“Ok so since Kubelet can talk CRI to containerd, and containerd can interact with Firecracker, what is the holdup?” Another astute question. The short answer is that, like using Google Translate to go from Farsi to English to Chinese, things get lost in translation.

Just like language, there isn’t always a direct translation between defining a Linux container and a microVM. One of the first hurdles is handling the workload lifecycle.

First off, containerd doesn't have a built-in notion of a group of containers today. cri-containerd assembles its own groups by namespace sharing with the pause container (if you've ever run `docker ps` and seen a pause container per pod, this is it)

One way around this for firecracker-containerd: when a pause container gets created, Firecracker could create a microVM, boot a Linux image, and start containers inside the VM using runc. (This is how firecracker-containerd works with Fargate)

But because these containers run inside a microVM and due to the lack of grouping, there isn’t an explicit event signaling that the whole microVM needs to be torn down. So again, we would have to infer this from the pause container.

This would give us the following path for managing the VM lifecycle:
kubelet->cri-containerd->containerd->firecracker-containerd->Firecracker
And for the individual container lifecycles:
kubelet->cri-containerd->containerd->firecracker-containerd->[microVM boundary]->runc
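
The inference described above can be sketched as a small event handler. Everything here is hypothetical: the `VMManager` class, its method names, and the event shapes are invented for illustration and are not firecracker-containerd's actual code. It only shows the idea of tying the microVM lifecycle to the pause container.

```python
class VMManager:
    """Toy model: one microVM per pod, keyed off the pause (sandbox) container."""

    def __init__(self):
        self.vms = {}   # sandbox id -> set of container ids running in that VM
        self.log = []   # actions a real runtime would perform, in order

    def on_sandbox_created(self, sandbox_id):
        # Pause container created: infer that a new microVM should boot.
        self.vms[sandbox_id] = set()
        self.log.append(("boot_vm", sandbox_id))

    def on_container_created(self, sandbox_id, container_id):
        # Regular container: start it inside the pod's existing microVM.
        if sandbox_id not in self.vms:
            raise KeyError(f"no microVM for sandbox {sandbox_id}")
        self.vms[sandbox_id].add(container_id)
        self.log.append(("run_in_vm", sandbox_id, container_id))

    def on_sandbox_removed(self, sandbox_id):
        # Pause container removed: the only teardown signal for the whole VM.
        self.vms.pop(sandbox_id)
        self.log.append(("stop_vm", sandbox_id))
```

Note that the VM boot and teardown are both side effects of events that were designed to describe a container, not a VM.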

Next, let’s talk about Linux containers. A container is not a primitive in Linux. It is composed of cgroups (resource constraints), namespaces (limiting what can be seen), and layered filesystems (think chroot), and all containers on a host share the same kernel.

You can think of a container as a regular Linux process with some extra, optional configuration added to it. You can use some of that configuration without using all of it

When you create a Linux container you don’t have to specify something like CPU or memory limits because without them a container will still work, it just won’t have limits. It will behave like any other non-containerized process and eat as much CPU/memory as the kernel will allow

In order for Firecracker to create a microVM, you must specify resource limits for CPU and memory. (See MachineConfiguration in the FC API) When kubelet talks to a CRI runtime, it first calls RunPodSandbox _without_ the ability to specify sizing information

“But I can set requests and limits on my pod spec, can’t I?” You can, but those fields are optional and on a per-container level. In CRI these values are communicated on subsequent CreateContainer calls referencing a pre-created PodSandbox
github.com/kubernetes/cri…

This, critically, is the second point of contention for putting all the pieces together. We really need the kubelet to be able to specify the whole pod size (CPU and Memory) up front. The short explanation is that CRI is modeled around Linux containers, not microVMs.
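
To make the gap concrete, here is a hedged sketch of the arithmetic a runtime would need before booting a microVM: summing per-container limits into one pod-level size. The function name, the dict layout, and the choice to round CPU up to whole vCPUs are assumptions for illustration, not part of CRI or Firecracker.

```python
import math


def pod_vm_size(containers):
    """Sum per-container limits into a (vcpu_count, mem_size_mib) pair.

    Each container is a dict with optional "cpu_millis" and "mem_mib" limits.
    Returns None when any limit is missing, because limits are optional in a
    pod spec but a microVM cannot be created without a size.
    """
    total_millis = 0
    total_mib = 0
    for c in containers:
        cpu = c.get("cpu_millis")
        mem = c.get("mem_mib")
        if cpu is None or mem is None:
            return None
        total_millis += cpu
        total_mib += mem
    # Assumption: round fractional CPU up to whole vCPUs, with at least one.
    vcpus = max(1, math.ceil(total_millis / 1000))
    return vcpus, total_mib
```

Even with this arithmetic in hand, the inputs only arrive across the CreateContainer calls, after RunPodSandbox has already been asked to create the sandbox.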

We (AWS) have had conversations with the folks responsible for maintaining containerd and CRI about this, but a long term solution that fits microVM requirements has not been figured out yet. If this interests you, please join in the conversation!

So let’s wave our magic wand and say CRI gets updated to specify the whole pod size up front, or there is some MutatingAdmissionController that annotates the pod with that information and it gets plumbed down into the runtime, why is that not enough?

Well again, kubelet is closely modeled around Linux containers, and so lots of functionality is also tied to that. Remember the note above about how Firecracker prioritizes isolation? That extends to storage. From the Firecracker paper:

What this is saying is “We can’t guarantee the security of a shared filesystem, so we just don’t share one.” The first obvious implication of this is that hostPath volumes won’t work in Kubernetes. Less obvious is that most other volume drivers won’t work either.

At least not at first. Particularly painful would be the initial inability for kubelet to mount secrets, configMaps, ServiceAccount tokens, in-tree NFS, and more. The kubelet performs all these by using the mount syscall and sharing host files or directories into the container.

Now most of these things can be overcome, but the channel between the kubelet and the container wouldn’t be syscalls manipulating a shared kernel and filesystem (between the kubelet and the pod container).

While FC doesn’t share a filesystem with guests, a socket can be created so that a host agent and guest agent can communicate and perform actions like file or secret injection.
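
A minimal sketch of that idea follows, assuming a made-up one-line JSON message format and using `socket.socketpair()` as a stand-in for the real host-to-guest channel. The function names `host_inject` and `guest_handle_one` are hypothetical.

```python
import base64
import json
import os
import socket


def host_inject(sock, path, data):
    """Host agent: send one file (relative path + bytes) as a JSON line."""
    msg = {"path": path, "data": base64.b64encode(data).decode("ascii")}
    sock.sendall(json.dumps(msg).encode("utf-8") + b"\n")


def guest_handle_one(sock, root):
    """Guest agent: read one JSON line and write the file under root."""
    root = os.path.abspath(root)
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("channel closed before a full message arrived")
        buf += chunk
    msg = json.loads(buf)
    target = os.path.normpath(os.path.join(root, msg["path"]))
    # Refuse paths that would escape the guest-side root directory.
    if not target.startswith(root + os.sep):
        raise ValueError("path escapes root")
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as f:
        f.write(base64.b64decode(msg["data"]))
    return target
```

The design difference from a bind mount is that the file is copied across a message channel rather than shared, so later updates on the host (a rotated token, say) need another message rather than appearing automatically.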

Note that you can use Kubernetes today with Kata Containers, and Kata has some limited support for using Firecracker as a VMM to run containers, but it is still limited by the above issues and doesn’t support CPU/Memory limits or shared files.
github.com/kata-container…

firecracker-containerd is written by good folks like @samuelkarp @nmeyerhans @mak_pav and others. For a deeper dive into the technical bits of all this and other challenges, watch this great talk Sam gave at DockerCon 2019 about firecracker-containerd.

To sum it up: CRI doesn’t let you specify resources upfront, grouping isn’t supported in containerd, and host file sharing for a microVM doesn’t work like kubelet expects. These are surmountable, but community willingness and contribution will be required. I’d love to see it happen! /end
