Data tweeps: I'm trying to get an overview of the players in the space of incrementally updated materialized database views. That field is absolutely exploding right now, and it's really hard to keep track. Here are the ones I'm aware of 👇:
1⃣ @MaterializeInc (materialize.com): Definitely the most prominent one, Postgres-compatible, based on the Timely/Differential Dataflow algorithms. Business Source License.
4⃣ @readysetio (readyset.io); specifically targeting caching use cases, but it's also incremental view materialization (based on Noria). Business Source License.
5⃣ @leap_db (leapdb.com). MySQL-compatible. Not quite clear on the license, seems to be SaaS exclusively?
6⃣ pgsql-ivm (github.com/sraoss/pgsql-i…); an extension for incremental view maintenance within Postgres itself. May become part of PG proper some day. Not clear on the license, I suppose PostgreSQL License?
7⃣ Besides all of the above, which are positioned as databases, there are multiple streaming SQL solutions, too; I think that's a separate solution space though, e.g. @ApacheFlink SQL (e.g. via @Decodableco), @ksqlDB, and @DeltaStreamInc.
Those are the ones I'm aware of right now; would love to learn about other view materialization solutions you may know of. Would be cool (but tons of work) to have a blog post with a thorough comparison, e.g. exploring the specific query capabilities and consistency guarantees. One day, perhaps :)
Got asked how stream processing platforms (e.g. Apache Flink, Kafka Streams, Spark Structured Streaming) compare to streaming databases (e.g. RisingWave, Materialize, PranaDB). There's some overlap and similarities, but also differences. Here are some aspects which may help 1/10
you to pick the right tool for the job. First, the commonalities: both kinds of tools let you do (potentially stateful) computations on (un-)bounded streams of data, such as click streams, IoT data streams, or CDC feeds from databases: e.g. projecting, filtering, mapping, 2/10
joining, grouping and aggregating, time/session-windowed computations, etc. A key value proposition is to give you deep insight into your live data by incrementally computing derived data views with a very low latency. E.g. think real-time analytics, fraud and anomaly 3/10
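To make that a bit more concrete, here's a minimal sketch of the kind of keyed, windowed aggregation described above, written against Flink's DataStream API (1.x); the job name, the page-view data, and the 10-second window are made up for illustration:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowedClickCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for a Kafka/Kinesis/CDC source, so the sketch stays self-contained
        DataStream<Tuple2<String, Long>> pageViews = env.fromElements(
                Tuple2.of("/home", 1L),
                Tuple2.of("/pricing", 1L),
                Tuple2.of("/home", 1L));

        pageViews
                .keyBy(view -> view.f0)                                      // one count per page
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10))) // 10 s tumbling windows
                .sum(1)                                                      // incrementally updated count
                .print();

        env.execute("windowed-click-count");
    }
}
```

A streaming database would let you express the same incrementally maintained count as a materialized view in SQL; which form fits better depends on where you want that logic to live.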
🧵 "How does Apache Flink compare to Kafka Streams?"
Both do stream processing, but differ in some important aspects. A few folks asked me about this recently, so I thought I'd share some thoughts. This is from a user's perspective, not touching on implementation details. 1/10
1⃣ Supported Streaming Platforms
Being part of the @apachekafka project, Kafka Streams exclusively supports stream processing of data in Kafka. @ApacheFlink is platform-agnostic and lets you process data in Kafka, AWS Kinesis, Google Cloud Pub/Sub, RabbitMQ, etc. 2/10
2⃣ Deployment Model
Kafka Streams is a library which you embed into your Java (or more generally, JVM-based) application. Flink can be used that way, too, but more typically it is run as a cluster of workers to which you upload your jobs. It comes with a web console for... 3/10
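To illustrate the "library you embed" model, here's a minimal sketch of a Kafka Streams application started from a plain Java main() method; the application id, topic names, and filter predicate are made-up examples:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class PaidOrdersApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "paid-orders-filter");   // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");              // hypothetical topic
        orders.filter((key, value) -> value.contains("\"status\":\"PAID\""))    // keep paid orders only
              .to("paid-orders");

        // The topology runs inside this very JVM process, no separate cluster needed
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The equivalent Flink job would typically be packaged as a jar and submitted to a running cluster rather than being started as part of your own process.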
🧵 Few things in a developer's life are as annoying as issues with their project's build tool. A build running just fine yesterday is suddenly failing? Your build is just so slooow? A quick thread with some practices I've come to value when using @ASFMavenProject 👇.
1⃣ Make sure the default build (`mvn verify`) passes after a fresh checkout. It's so frustrating to check out a code base and not be able to build it. If special tools need to be installed, have custom enforcer rules (see below) to verify and error out on this eagerly.
2⃣ Pin all dependencies and plug-ins to specific (non-snapshot) versions. In particular for plug-ins, that often gets forgotten, resulting in potential surprises, for instance when using a different Maven version.
🧵 If you run @apachekafka in production, creating clusters, topics, connectors etc. by hand is tedious and error-prone. Better rely on declarative configuration which you put into revision control and apply in an automated way, #GitOps-style. Some tools which help with that:
1⃣ JulieOps (github.com/kafka-ops/julie) by @purbon, which helps you to "automate the management of your things within Apache Kafka, from Topics, Configuration to Metadata but as well Access Control, Schemas". A nice intro in this post by Bruno Costa: medium.com/marionete/how-…
2⃣ topicctl (github.com/segmentio/topi…) by @segment: "Easy, declarative management of Kafka topics. Includes the ability to 'apply' topic changes from YAML as well as a repl for interactive exploration of brokers, topics, consumer groups, messages, and more"
Quick 🧵 on what's "Head-of-Line Blocking" in @apachekafka, why it is a problem, and what some mitigation strategies are.
Context: Records in Kafka are written to topic partitions, which are read sequentially by consumers. To parallelize processing, consumers can be organized in
2⃣ groups, with partitions being distributed equally amongst consumer group members.
The problem: if a consumer hits a record which is either slow to process (say, a request to an external system takes a long time while doing so) or can't be processed at all (say, a record with
3⃣ an invalid format), that consumer can't make further progress with this partition. The reason being that consumer offsets aren't committed on a per-message basis, but always up to a specific record. I.e. all further records in that partition are blocked by the one at the head.
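To make those offset semantics tangible, here's a minimal sketch of a plain consumer loop with manual commits; the topic name, group id, and process() helper are hypothetical:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class HeadOfLineDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-processor");          // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));                              // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // If this record is slow or keeps failing, we never get past this point,
                    // so the commit below is never reached and all later records of the
                    // same partition remain blocked behind it.
                    process(record);
                    consumer.commitSync(Map.of(
                            new TopicPartition(record.topic(), record.partition()),
                            new OffsetAndMetadata(record.offset() + 1)));       // commit *up to* this position
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // stand-in for parsing, calling an external system, etc.
        System.out.printf("processing %s-%d@%d%n", record.topic(), record.partition(), record.offset());
    }
}
```

Once process() hangs or keeps throwing for a single record, commitSync() for that partition is never reached again, which is exactly the head-of-line blocking described above.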
Project idea 1⃣: A stand-alone tool for compacting the schema history topic of Debezium connectors, allowing for faster start-up of connectors with large histories.
Project idea 2⃣: Porting the Debezium Cassandra connector to Debezium Server, allowing for a unified user experience across all the different connectors.