𝔻𝕒𝕧𝕚𝕕 𝔸𝕝𝕝𝕖𝕟

@mdavidallen

19 May, 20 tweets, 5 min read

@neo4j

Batch vs. streaming data ingest into #graph and .@neo4j

(mini thread)

So the main tradeoff is typically latency. Batch when you need to refresh data in larger volumes, say once per hour/day/week/month.

Stream when the time value of data is high/immediate and you can't afford to be more than minutes behind.

The overall event queue (so to speak) that's being ingested has a total velocity. Let's say it's

- 1M events/day
- ~42k events/hour
- ~694 events/min
- ~12 events/sec

Let's say 2 KB per event, or roughly 2 GB/day, ~23 KB/sec.
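
A quick sketch to work the conversion through, assuming only the 1M events/day and 2 KB/event figures from the example (the function and key names are made up for illustration):

```python
EVENTS_PER_DAY = 1_000_000  # assumed velocity from the example
EVENT_SIZE_KB = 2           # assumed average event size


def rates(events_per_day: int, event_size_kb: float) -> dict:
    """Convert a daily event count into hourly/minute/second rates and byte volumes."""
    per_hour = events_per_day / 24
    per_min = per_hour / 60
    per_sec = per_min / 60
    return {
        "events_per_hour": per_hour,
        "events_per_min": per_min,
        "events_per_sec": per_sec,
        "gb_per_day": events_per_day * event_size_kb / 1_000_000,  # decimal units
        "kb_per_sec": per_sec * event_size_kb,
    }


r = rates(EVENTS_PER_DAY, EVENT_SIZE_KB)
for name, value in r.items():
    print(f"{name}: {value:,.1f}")
```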

Dirty secret is that "streaming" is usually just "micro-batching". At its most granular, each event is getting written to a database as it happens.

(Yawn) this is just batch load, where batchsize=1.

The point here is that your time requirements (data has to be < 1 min old) act to *constrain your maximum batch size*. In our fake math example, 1 minute means batches can't be bigger than 694 events (on average).

If you had larger batch sizes, you'd probably be violating timeliness. You look too batchy and not streamy enough.
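
Roughly, the constraint looks like this. It's a sketch that assumes a steady average arrival rate and ignores the time the write itself takes; `max_batch_size` is a hypothetical helper, not anything from a library:

```python
import math


def max_batch_size(events_per_sec: float, max_staleness_sec: float) -> int:
    """Largest average batch that can accumulate within the staleness budget."""
    # A batch can never be smaller than one event, however tight the budget.
    return max(1, math.floor(events_per_sec * max_staleness_sec))


events_per_sec = 1_000_000 / 86_400  # 1M events/day from the example

for budget_sec in (3600, 60, 1, 0.01):
    print(f"data must be < {budget_sec}s old -> batch <= {max_batch_size(events_per_sec, budget_sec)}")
```

As the staleness budget shrinks, the ceiling on batch size falls toward batchsize=1.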

On the other hand: batchsize=1 is very streamy, not very batchy. But it also probably wastes resources (excess transactional overhead) without adding extra "time value".

The trick in this process is to pick a batch size that is small enough to preserve time value, while being large enough to minimize the relative contribution of transactional overhead and get good performance.
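
One common way to get both is to flush on whichever limit is hit first: batch size or batch age. A minimal sketch, assuming a hypothetical `MicroBatcher` class and a `write` callback that would wrap one database transaction; note that age is only checked when an event arrives, so a real version would also want a timer:

```python
import time
from typing import Callable, List


class MicroBatcher:
    """Buffer events; write them out when the batch is full or too old."""

    def __init__(self, write: Callable[[List[dict]], None], max_size: int,
                 max_age_sec: float, clock: Callable[[], float] = time.monotonic):
        self._write = write
        self._max_size = max_size
        self._max_age_sec = max_age_sec
        self._clock = clock
        self._buffer: List[dict] = []
        self._oldest = 0.0

    def add(self, event: dict) -> None:
        if not self._buffer:
            self._oldest = self._clock()  # arrival time of the oldest buffered event
        self._buffer.append(event)
        too_big = len(self._buffer) >= self._max_size
        too_old = self._clock() - self._oldest >= self._max_age_sec
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._write(batch)  # in a real sink: one transaction per batch


# Usage with a fake clock so the example is deterministic.
now = [0.0]
written: List[List[dict]] = []
batcher = MicroBatcher(written.append, max_size=3, max_age_sec=60, clock=lambda: now[0])

for i in range(7):
    batcher.add({"id": i})  # the size limit triggers two flushes of 3
now[0] = 61.0
batcher.add({"id": 7})      # the age limit triggers a flush of the remaining 2
print([len(b) for b in written])
```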

Example: if you take 10 million records and insert them one record per transaction into most databases, you will go very slowly. You'd be better off with 10 million records in 1 tx, if you had the resources to do it.
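
A toy cost model makes the point. The constants here (1 ms of overhead per transaction, 10 µs per record) are invented for illustration and are not measurements of any database:

```python
def load_time_us(n_records: int, batch_size: int,
                 tx_overhead_us: int = 1_000, per_record_us: int = 10) -> int:
    """Toy model: a fixed cost per transaction plus a fixed cost per record."""
    n_tx = -(-n_records // batch_size)  # ceiling division
    return n_tx * tx_overhead_us + n_records * per_record_us


N = 10_000_000
for batch_size in (1, 694, 10_000, N):
    seconds = load_time_us(N, batch_size) / 1_000_000
    print(f"batch_size={batch_size:>10,}: {seconds:,.1f} s")
```

Under these made-up numbers the per-transaction overhead dominates at batchsize=1 and all but disappears once batches get large. The model ignores memory pressure and lock contention, which is the "if you had the resources" caveat.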

So I derive this rule: the more real-time the data stream needs to be, the less *total throughput* you can achieve.

The more batchy and non-real-time you can be, the more you can pull out all the stops, come up with various optimizations, and maximize total volume throughput/time.

https://twitter.com/mdavidallen/status/1318616863610425344?s=20

In batch world, we use "Tables for Labels" to decompose a big dataset into many tables, loading them individually.

This is a great way to get good overall throughput on large volumes, but it's in sparky world, where we're very batchy, meaning "Tables for Labels" is a pretty poor fit for real-time streaming (imho)

Flip side: writing a complex Cypher pattern that takes a complex record and decomposes it, writing it into a multi-hop graph pattern, is very good for real-time stuff. But it sacrifices total throughput, because it creates locks all over the graph & operates on minibatches.
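
For a feel of what that looks like, here's a sketch. The Customer/Order/Product model, the property names, and `write_minibatch` are all hypothetical; the runner is a stub shaped like the Neo4j Python driver's `tx.run(query, **params)` so the example runs without a database:

```python
from typing import Callable, Dict, List

# Hypothetical model: (:Customer)-[:PLACED]->(:Order)-[:CONTAINS]->(:Product)
INGEST_ORDERS = """
UNWIND $rows AS row
MERGE (c:Customer {id: row.customer_id})
MERGE (o:Order {id: row.order_id})
SET o.total = row.total
MERGE (c)-[:PLACED]->(o)
WITH o, row
UNWIND row.skus AS sku
MERGE (p:Product {sku: sku})
MERGE (o)-[:CONTAINS]->(p)
"""


def write_minibatch(run: Callable[..., object], rows: List[Dict]) -> None:
    """Send one mini-batch of complex records through a single query call."""
    run(INGEST_ORDERS, rows=rows)


# Stub runner that just records what it was asked to execute.
calls = []


def fake_run(query, **params):
    calls.append((query, params))


write_minibatch(fake_run, [
    {"customer_id": "c1", "order_id": "o1", "total": 42.0, "skus": ["a", "b"]},
    {"customer_id": "c1", "order_id": "o2", "total": 7.5, "skus": ["b"]},
])
print(len(calls), "query call for", len(calls[0][1]["rows"]), "records")
```

With the real driver this would presumably be wired up as something like `session.execute_write(lambda tx: write_minibatch(tx.run, rows))`. Both sample records touch customer `c1` and product `b`; shared nodes like these are where concurrent mini-batches would contend for locks.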

It's just yet another case of TANSTAAFL (there ain't no such thing as a free lunch).

Lower latency
Higher throughput
Simpler

Pick 2

https://twitter.com/mdavidallen/status/1392546161421164547

Now let's bear in mind this point, it's all just A -> B

So "Batch vs. Streaming Data Ingest" to a database is the front-end data movement problem with an exactly corresponding "back-end data movement problem" of batch or streaming data egress from the database to some other downstream component of the architecture

There, all the same issues repeat themselves. The "events" are transactions committing to the database. You literally can't publish *every record that changes*; you can only publish *every TX that commits*, each of which is composed of many records that changed.
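
In other words, the unit of egress is the commit. A small sketch, assuming a hypothetical change-record shape with a `tx_id` field (not any particular CDC format):

```python
from typing import Dict, List


def events_per_commit(changes: List[Dict]) -> List[Dict]:
    """Group record-level changes into one outbound event per committed transaction."""
    by_tx: Dict[int, List[Dict]] = {}
    for change in changes:
        by_tx.setdefault(change["tx_id"], []).append(change)
    # Plain dicts keep insertion order, so events come out in first-seen tx order.
    return [{"tx_id": tx_id, "changes": recs} for tx_id, recs in by_tx.items()]


changes = [
    {"tx_id": 1, "op": "create", "node": "Customer:c1"},
    {"tx_id": 1, "op": "create", "node": "Order:o1"},
    {"tx_id": 2, "op": "update", "node": "Order:o1"},
]
events = events_per_commit(changes)
print([(e["tx_id"], len(e["changes"])) for e in events])
```

So the same batch size question shows up on the way out: a big ingest transaction becomes one big downstream event.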

In bigger data pipelines, what you see is that it's all a bunch of bits moving in and out of pipes & storage sinks (databases), which is expressible as a set of couplings between producers & consumers, or different teams.

At each stage, data ought to get enriched or some value added

this is in contrast to the "code first" view of architectures, which tends to de-emphasize movement of bits, in favor of emphasizing the nature of the value add at each step. "Fancy algo analysis done here", or "Summary reporting done there"

at some point in the future maybe we do a crazier thread about how data & code are really two sides of the same coin (in a sense) -- and also about techniques we can use to transmogrify one into the other en.wikipedia.org/wiki/Homoiconi…

https://twitter.com/nathanwpyle/status/1395036580491014144?s=20

aaaaaaand /thread
