After @obeattie's masterful keynote 2 years ago at #KubeCon #CloudNativeCon EU in Copenhagen, I couldn't miss today's talk by @milesbxf & @suhailpatel about war stories of running @monzo on @kubernetesio in production. And oh boy, I am not disappointed. Such an incredible talk. Pure gold. Thread ⬇️
Let's start with the global landscape at Monzo:

- One single @kubernetesio production cluster
- 600+ microservices, all written in @golang
- Hundreds of nodes
- Almost entirely on AWS
- Moved from @Linkerd to @EnvoyProxy for the service mesh
- Moved from flannel to Calico for networking

#KubeCon
They wanted ALL services to explicitly define what they talk to, so they made heavy use of Network Policies to specify ALL ingress and egress connections of every service.

Under the hood: Calico to enforce the network policies by generating iptables rules ✨

#KubeCon
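
To make the "every service declares its peers" idea concrete, here is a minimal sketch of what such a policy could look like, built as a plain Python dict and dumped as JSON (which is also valid YAML). The service names, the `app` label key, and the `network_policy` helper are all made up for illustration; the talk did not show Monzo's actual manifests. Only the `networking.k8s.io/v1` NetworkPolicy fields are the real Kubernetes API.

```python
import json


def network_policy(service, allowed_callers, allowed_callees, namespace="default"):
    """Build a NetworkPolicy that only allows the listed peers (hypothetical helper)."""

    def peers(names):
        # One podSelector peer per service, matched on a hypothetical "app" label.
        return [{"podSelector": {"matchLabels": {"app": name}}} for name in names]

    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"{service}-explicit-peers", "namespace": namespace},
        "spec": {
            # The pods this policy applies to.
            "podSelector": {"matchLabels": {"app": service}},
            # Listing both types means anything not allowed below is denied.
            "policyTypes": ["Ingress", "Egress"],
            "ingress": [{"from": peers(allowed_callers)}] if allowed_callers else [],
            "egress": [{"to": peers(allowed_callees)}] if allowed_callees else [],
        },
    }


# Hypothetical services: "ledger" may be called by "api-gateway" and may call "accounts".
policy = network_policy("ledger", ["api-gateway"], ["accounts"])
print(json.dumps(policy, indent=2))
```

Note that in a real cluster a policy restricting egress would also need a rule allowing DNS lookups, which this sketch leaves out.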
They also wanted to dry-run these Network Policies before enforcing them, to avoid blocking things and making users angry. So they used the iptables logging option to log packets that would have been dropped due to missing rules 🤯

#KubeCon
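
A rough sketch of how such dry-run logs could be turned into a to-do list of missing rules: parse the key=value fields that the iptables LOG target writes and count would-be drops per flow. The `would-drop: ` prefix is a hypothetical value for `--log-prefix`, and the sample lines are hand-written and abbreviated, not taken from the talk.

```python
import re
from collections import Counter

# Hypothetical prefix; with iptables it would be set via the LOG target's --log-prefix.
LOG_PREFIX = "would-drop: "
# The LOG target writes fields as KEY=value pairs (a value may be empty, e.g. "OUT=").
FIELD = re.compile(r"(\w+)=(\S*)")


def parse_line(line):
    """Return (src, dst, proto, dport) for a would-be drop, or None for other lines."""
    if LOG_PREFIX not in line:
        return None
    fields = dict(FIELD.findall(line.split(LOG_PREFIX, 1)[1]))
    return fields.get("SRC"), fields.get("DST"), fields.get("PROTO"), fields.get("DPT")


def count_would_be_drops(lines):
    """Count logged packets per flow so the busiest missing rules surface first."""
    counts = Counter()
    for line in lines:
        flow = parse_line(line)
        if flow is not None:
            counts[flow] += 1
    return counts


# Hand-written, abbreviated sample lines (real LOG output carries more fields).
SAMPLE = [
    "kernel: would-drop: IN=eth0 OUT= SRC=10.0.1.5 DST=10.0.2.9 LEN=60 PROTO=TCP SPT=41234 DPT=8080",
    "kernel: would-drop: IN=eth0 OUT= SRC=10.0.1.5 DST=10.0.2.9 LEN=60 PROTO=TCP SPT=41240 DPT=8080",
    "kernel: some unrelated message",
]

drops = count_would_be_drops(SAMPLE)
for flow, hits in drops.most_common():
    print(hits, flow)
```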
So at first, they began to log packet drops without enforcing them. And what would our world be without awesome graphs & metrics to measure how things are going and how much work remains to be done! 📊

So they started measuring drops. So f***ing clean.

#KubeCon
Then... they started having some sporadic errors (mainly due to apps talking to things outside the Kubernetes cluster, if I got it right). Why? Because there were a LOT of things to drop, and nodes weren't able to keep up with logging all the packets that matched no rule, resulting in high CPU usage
Actually, EBS was used as the root volume, so iptables was logging over network-attached storage. Kernel threads blocked while waiting for confirmation. Do this for tons of drop logs, BOOM 💥

#KubeCon
Since they are running on AWS, they suspected that the AWS networking stack might be the problem. Nope, everything was good on the @awscloud side (with apparently a quick response from AWS support, guess not everyone has the same priority 😉)
They finally found out that iptables logging was exhausting all kernel threads, and the metrics correlated with this. So they had to abandon logging, but didn't want to lose the really precious and useful information the logs contained!

They needed to find some balance.

#KubeCon
At this point they were running fully on Calico, and approached the problem in two phases:

- They built a custom (open source ❤️) tool to take the Calico metrics and put them into Prometheus: github.com/monzo/calico-a…
- Once the rate of packet drops went down, they turned log mode back on
They *really* wanted the raw logs because they gave much richer information about the source of traffic than just Calico metrics.

#KubeCon #CloudNativeCon
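
The second phase boils down to "watch the drop rate, re-enable logging once it is low enough". A minimal sketch of that decision, assuming drops are exposed as a monotonically increasing counter sampled over time: the function names, the sample values, and the threshold are invented for illustration, as the talk gave no numbers.

```python
def drop_rate(samples):
    """Per-second drop rate from (timestamp_seconds, counter_value) samples."""
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples with increasing timestamps")
    # Crude counter-reset handling: a decrease is treated as zero drops.
    return max(c1 - c0, 0) / (t1 - t0)


def safe_to_log(samples, max_rate_per_sec):
    """True when the drop rate is low enough to turn packet logging back on."""
    return drop_rate(samples) <= max_rate_per_sec


# Invented samples: 600 dropped packets over 60 seconds.
samples = [(0, 1000), (30, 1350), (60, 1600)]
print(drop_rate(samples))
print(safe_to_log(samples, max_rate_per_sec=50))
```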
Btw did I mention that it's the first talk at a #KubeCon #CloudNativeCon that mentions Arch Linux?

Mandatory "I use Arch btw" I guess 😂 (hey, don't look at my Twitter bio.)
So, that was a quick war story about what @Monzo is experiencing by running @kubernetesio in production. But there is more! The Q&A was filled with interesting bits that I'll try to cover in the rest of the thread (bear with me, not used to Twitter threads...)
Is Monzo totally in the Cloud (AWS)?

Nope, but almost. There are still some physical DCs where they handle things themselves because of the required interconnection with bank systems (I think), but those essentially route things to AWS. They have a direct physical connection from the DCs to AWS.
They use Calico OSS, not Tigera enterprise. And they seem totally satisfied with it 🙂
Regarding banking certifications: the regulators do not tell you what technology to use. Just the objectives in terms of security, isolation, etc. You "just" have to prove to them that you are achieving the spirit of what they need, whatever layers of abstraction you use! 💪
Databases on Kubernetes?

@Monzo's DB of choice is Cassandra. They don't run it on K8s, but part of the exciting work was to make Cassandra more K8s-aware! Unlike with many other DBs, a key constraint is that things such as IP addresses need to be mostly static....
.... which, you guessed it, makes things a bit more complicated when IP addresses under the hood are constantly changing but are still used to identify stuff. Anyway, they seemed really excited about the direction the Cassandra community is taking on this
I could NOT avoid mentioning this: their Disaster Recovery strategy. See, they are running their own Kubernetes clusters, but one of the big benefits of K8s is that there are now a lot of managed offerings and it's really easy to test against these.
Since EVERYTHING is done with declarative infrastructure (manifests & IaC), they have everything ready to be deployed in the same fashion on something else. They are exploring EKS for example, and can spin up the exact same microservices with the same net plugins, netpols, etc.
Btw, this is the graph of their microservices. And they have tooling to dynamically spin up a dependent microservice when it's needed, to make developers' lives sooo much easier.

Call me back when you have something as awesome as them. I'm waiting.
Let's wrap up with the main point, and maybe the most beautiful thing in their whole approach, mindset and result: all of this made them learn a ton of things, which will still be useful even on other platforms. Such maturity.

Congrats and thanks for the talk, it was awesome 🙂
It's really cool to see a bank using modern technologies with such a positive and innovative mindset. Can't wait for the next talk from @monzo!

@milesbxf @suhailpatel feel free to correct anything I got wrong, of course 😉
Thread by Alexis "Horgix" Chotard
