🧵 If you run @apachekafka in production, creating clusters, topics, connectors etc. by hand is tedious and error-prone. Better to rely on declarative configuration which you put into revision control and apply in an automated way, #GitOps-style (see the sketch after the list below). Some tools which help with that:
1⃣ JulieOps (github.com/kafka-ops/julie) by @purbon, which helps you to "automate the management of your things within Apache Kafka, from Topics, Configuration to Metadata but as well Access Control, Schemas". A nice intro in this post by Bruno Costa: medium.com/marionete/how-…
2⃣ topicctl (github.com/segmentio/topi…) by @segment: "Easy, declarative management of Kafka topics. Includes the ability to 'apply' topic changes from YAML as well as a repl for interactive exploration of brokers, topics, consumer groups, messages, and more"
3⃣ #Terraform is a popular tool amongst infrastructure-as-code adepts. No surprise that there is a TF provider for Kafka resources, too: github.com/Mongey/terrafo….
4⃣ If #Kubernetes is your thing, take a look at @strimziio (strimzi.io), an open-source operator for running Kafka on Kube, which lets you manage clusters, topics, users, Connect, connectors, MirrorMaker and others, via custom resources and kubectl.
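To make the declarative "apply" idea concrete, here's a minimal sketch of what such tools do under the hood, using the plain Kafka AdminClient; the topic names, settings and broker address are made up for illustration:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.stream.Collectors;

public class TopicApply {

    public static void main(String[] args) throws Exception {
        // Desired state -- in a real tool this would be parsed from a YAML/HCL file
        // kept under revision control
        List<NewTopic> desired = List.of(
            new NewTopic("orders", 6, (short) 3)
                .configs(Map.of("retention.ms", "604800000", "cleanup.policy", "delete")),
            new NewTopic("customers", 3, (short) 3)
                .configs(Map.of("cleanup.policy", "compact")));

        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Reconcile with the actual cluster state: create only what's missing
            Set<String> existing = admin.listTopics().names().get();
            List<NewTopic> missing = desired.stream()
                .filter(t -> !existing.contains(t.name()))
                .collect(Collectors.toList());
            admin.createTopics(missing).all().get();
        }
    }
}
```

The value of the dedicated tools above is that this desired state lives in files under revision control and gets applied by automation, rather than via ad-hoc CLI calls.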
Quick 🧵 on what's "Head-of-Line Blocking" in @apachekafka, why it is a problem, and what some mitigation strategies are.
Context: Records in Kafka are written to topic partitions, which are read sequentially by consumers. To parallelize processing, consumers can be organized in
2⃣ groups, with partitions being distributed equally amongst consumer group members.
The problem: if a consumer hits a record which is either slow to process (say, because a request to an external system takes a long time) or can't be processed at all (say, a record with
3⃣ an invalid format), that consumer can't make further progress with this partition. The reason: consumer offsets aren't committed on a per-message basis, but always up to a specific position in the partition. I.e. all further records in that partition are blocked by the one at the head.
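A minimal consumer-loop sketch of these mechanics, including one common mitigation (a dead-letter topic) as my own illustration; topic names, the process() call and the exception type are placeholders:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;

public class HeadOfLineDemo {

    void run(KafkaConsumer<String, String> consumer, KafkaProducer<String, String> dlqProducer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                try {
                    process(record); // if this hangs or keeps failing, everything behind it waits
                }
                catch (InvalidRecordFormatException e) {
                    // Mitigation: park the poison pill on a dead-letter topic, so the offset
                    // can move past it and the partition isn't blocked any longer
                    dlqProducer.send(new ProducerRecord<>("orders.dlq", record.key(), record.value()));
                }
            }
            // Offsets are acknowledged up to a position, not per message: nothing after
            // a stuck record can be committed until that record is dealt with
            consumer.commitSync();
        }
    }

    void process(ConsumerRecord<String, String> record) { /* application logic */ }

    static class InvalidRecordFormatException extends RuntimeException { }
}
```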
Project idea 1⃣: A stand-alone tool for compacting the schema history topic of Debezium connectors, allowing for faster start-up of connectors with large histories.
Project idea 2⃣: Porting the Debezium Cassandra connector to Debezium Server, allowing for a unified user experience across all the different connectors.
⏱️ Just ten more days until the release of @java 17, the next version with long-term support! To shorten the waiting time a bit, I'll do one tweet per day on a cool feature added since 11 (the previous LTS), introducing just some of the changes that make the upgrade worthwhile. Let's go 🚀!
🔟 Ambiguous null pointer exceptions were a true annoyance in the past. Not a problem any longer since Java 14: Helpful NPEs (JEP 358, openjdk.java.net/jeps/358) now show exactly which part of an expression was null. A very nice improvement to #OpenJDK, previously available only in SAP's JVM.
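A tiny made-up example of what that looks like:

```java
public class HelpfulNpeDemo {

    static class Address { String city; }
    static class Customer { Address address; }

    public static void main(String[] args) {
        Customer customer = new Customer(); // the address field stays null

        // Before Java 14: just "java.lang.NullPointerException" plus a stack trace line.
        // With JEP 358 (opt-in via -XX:+ShowCodeDetailsInExceptionMessages on 14,
        // on by default from 15), the message reads roughly:
        //   Cannot read field "city" because "customer.address" is null
        System.out.println(customer.address.city.length());
    }
}
```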
9⃣ Varying load, and new app instances must start up quickly? Check out class-data sharing (CDS), whose dev experience has improved a lot with JEP 350 (Dynamic CDS Archives, Java 13); also, way more classes are archivable since Java 15. More details here: morling.dev/blog/smaller-f…
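In its simplest form, dynamic CDS is a two-step affair (a sketch; app.jar and app.jsa are placeholder names):

```
# 1) Trial run: record the classes loaded by the app into an archive on exit
java -XX:ArchiveClassesAtExit=app.jsa -jar app.jar

# 2) Subsequent runs: start from the archive, skipping repeated class loading/parsing work
java -XX:SharedArchiveFile=app.jsa -jar app.jar
```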
#Postgres as an event store -- Thanks a lot for all the super-insightful answers 🙏! It looks like using a jsonb[] for modeling an event stream isn't ideal performance-wise, but there are several great pointers here to using #Postgres for event sourcing. Mentioned solutions include... 1/4
A short 🧵 on @apachekafka topic creation (triggered by @niko_nava, thanks!): who should create Kafka topics, how to make sure they have the right settings, how to avoid dependencies between producer and consumer(s)? Here's my take:
2⃣ Don't use broker-side topic auto-creation! You'll lack fine-grained control over different settings for different topics; merely polling a topic, or requesting its metadata, will trigger creation based on the global default settings. Plus, some cloud services don't expose auto-creation to begin with.
3⃣ Instead, the producer side should be in charge of creating topics. There you have the information and knowledge about the required settings (replication factor, no. of partitions, retention policy, etc.) for each topic. Depending on your requirements, different approaches...
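One simple approach is to create the topics from the producing application itself at start-up; a minimal sketch assuming the Java AdminClient, with made-up topic name and settings:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.errors.TopicExistsException;

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

public class TopicBootstrap {

    public static void main(String[] args) throws Exception {
        // The settings the owning (producing) application knows best
        NewTopic orders = new NewTopic("orders", 6, (short) 3)
            .configs(Map.of(
                "retention.ms", "604800000",   // 7 days
                "min.insync.replicas", "2"));

        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            try {
                admin.createTopics(List.of(orders)).all().get();
            }
            catch (ExecutionException e) {
                // Make start-up idempotent: the topic already existing is not an error
                if (!(e.getCause() instanceof TopicExistsException)) {
                    throw e;
                }
            }
        }
    }
}
```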
Agreed, the term is sub-par. But hear me out, the architecture is not. Let's talk about a few common misconceptions about Serverless!
1⃣ "Serverless means no servers"
There *are* servers involved, but it's not on you to run and operate them. Instead, the serverless provider manages the platform, scaling things up (and down) as needed. Fewer things to take care of, billed per use.
Myth: BUSTED!
2⃣ "Serverless is cheaper"
Pay-per-use makes low/medium-volume workloads really cheap. But pricing is complex: no. of requests, assigned RAM/CPU, API gateways, traffic, etc. Depending on your workload (e.g. high and sustained), other options like VMs can come out cheaper.