Are we going to have batch and streaming forever, or will they converge? @esammer says at the heart of systems lambda arch will go away and kappa will eventually win out. Once data is in the DW, though, batch will perhaps remain because it's familiar to analytics engineers.
@notamyfromdbt - Microbatching gets used to simulate streaming while keeping the same familiar toolset, but it doesn’t scale
What’s holding people back from streaming? @takidau says streaming answers everything, but the complexity is currently a challenge. Also, low latency isn’t always necessary enough to be worth the effort
Is streaming a superset? @esammer says in theory yes. However there are some optimisations that you can do in batch that are easier than in streaming (eg checkpoints aren’t needed in batch)
@notamyfromdbt notes that there isn’t a consensus or best practice on the streaming side of things yet, unlike the modern data stack on the batch side (echoing @bennstancil’s point from his talk earlier - make it boring)
• • •
Dan Sotolongo at #current22: RDBMS and SQL have stood the test of time. Sets the scene for stream processing by covering core concepts of tables and streams
#current22 handling event time joins in SQL using functions.
The next problem is making sure we have all the data. It’s watermarks, but not really
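For anyone who hasn't met watermarks before, here is a minimal sketch of the textbook idea in Python: the watermark trails the largest event time seen so far, and a window only emits its result once the watermark has passed the window's end. The window size, the lateness allowance, and every name here are assumptions made for illustration; this is not the mechanism described in the talk, which the tweet itself flags as "not really" watermarks.

```python
from collections import defaultdict

WINDOW = 10           # tumbling window size in seconds (assumed)
ALLOWED_LATENESS = 5  # how far the watermark trails the max event time (assumed)

def run(events):
    """Sum values per event-time window; events are (event_time, value) in arrival order."""
    open_windows = defaultdict(int)  # window start -> running sum
    results = []                     # (window_start, sum), emitted once the window is complete
    dropped = []                     # events that arrived after their window had closed
    watermark = float("-inf")
    for event_time, value in events:
        start = (event_time // WINDOW) * WINDOW
        if start + WINDOW <= watermark:
            # The window already closed, so this event is too late to count.
            dropped.append((event_time, value))
            continue
        open_windows[start] += value
        # The watermark only ever moves forward.
        watermark = max(watermark, event_time - ALLOWED_LATENESS)
        # Close every window whose end the watermark has now passed.
        for s in sorted(w for w in open_windows if w + WINDOW <= watermark):
            results.append((s, open_windows.pop(s)))
    return results, dropped, dict(open_windows)

# Out-of-order arrival: 3 arrives after 12 but is still counted; 2 arrives too late.
events = [(1, 1), (4, 1), (12, 1), (3, 1), (16, 1), (2, 1), (27, 1)]
results, dropped, still_open = run(events)
```

The trade-off is the one the thread keeps circling: a bigger lateness allowance means fewer dropped events but slower answers.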
Having watched @gwenshap and @ozkatz100 talk about "git for data" I would definitely say it is a serious idea.
However to the point at the end of the video, RTFM—it took reading docs.lakefs.io/using_lakefs/d… and some other pages subsequently to really grok the concept in practice.
Where I struggled at first with the git analogy alone was that data changes, and I couldn't see how branch/merge fitted into that outside of the idea of branching for throwaway testing alone. The 1PB accident was useful for illustrating the latter point for sure.
But then reading docs.lakefs.io/understand/roa… made me realise that I was thinking about the whole thing from a streaming PoV—when actually the idea of running a batch against a branch with a hook to validate and then merge is a freakin awesome idea
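To make the batch-against-a-branch idea concrete, here is a toy in-memory sketch in Python. It is not the lakeFS API: `Repo`, `pre_merge_hooks`, `no_null_ids`, and the paths are hypothetical names invented for illustration, and "merge" here simply replaces the destination with the source snapshot.

```python
import copy

class Repo:
    """Toy stand-in for a versioned data store (hypothetical, not the lakeFS API)."""

    def __init__(self):
        self.branches = {"main": {}}  # branch name -> {path: list of rows}
        self.pre_merge_hooks = []     # callables that raise to block a merge

    def branch(self, name, source="main"):
        # A branch starts as an isolated copy of its source.
        self.branches[name] = copy.deepcopy(self.branches[source])

    def write(self, branch, path, rows):
        self.branches[branch][path] = rows

    def merge(self, source, dest="main"):
        # Run every hook first; any exception leaves dest untouched.
        for hook in self.pre_merge_hooks:
            hook(self.branches[source])
        self.branches[dest] = copy.deepcopy(self.branches[source])
        del self.branches[source]

def no_null_ids(snapshot):
    """Example validation hook: reject any row with a missing user_id."""
    for path, rows in snapshot.items():
        for row in rows:
            if row.get("user_id") is None:
                raise ValueError(f"null user_id in {path}")

repo = Repo()
repo.pre_merge_hooks.append(no_null_ids)

# A good batch run: write to a branch, validate, merge into main.
repo.branch("etl-run-1")
repo.write("etl-run-1", "events/part-0", [{"user_id": 1}])
repo.merge("etl-run-1")

# A bad batch run: the hook blocks the merge, so main never sees the bad rows.
repo.branch("etl-run-2")
repo.write("etl-run-2", "events/part-1", [{"user_id": None}])
try:
    repo.merge("etl-run-2")
    merge_blocked = False
except ValueError:
    merge_blocked = True
```

The point of the pattern is that consumers of `main` only ever see batches that passed validation, and a failed run is just a branch to inspect or throw away.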