Ismael Ghalimi
Jan 16 · 15 tweets · 3 min read
Now that we have a pretty good idea of what the UI will look like, let's focus on the runtime side of things, especially when data gets too large to fit within a single host (2 TB on a host with 4 TB of RAM). In other words, how will the distributed query planner work?
The assumption is that source data is partitioned as a set of Parquet files (with or without Iceberg). From there, if your query is map-reducible, you can use @DuckDB's COMBINE to map it on a fleet of Lambdas, then reduce it on an EC2 VM. That's what #STOIC does today.
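The map/reduce split can be sketched in plain Python. This is a minimal illustration, not STOIC's actual pipeline: the in-memory lists stand in for Parquet partitions, `map_partition` for the work done on each Lambda, and `reduce_states` for the COMBINE step on the EC2 VM.

```python
def map_partition(rows):
    # Lambda side: compute a partial aggregate state for
    # COUNT(*) and SUM(x) over one partition.
    return {"count": len(rows), "sum": sum(rows)}

def reduce_states(states):
    # Monostore side: combine partial states into the final result
    # (the role DuckDB's COMBINE plays in the real pipeline).
    return {
        "count": sum(s["count"] for s in states),
        "sum": sum(s["sum"] for s in states),
    }

# Hypothetical in-memory partitions standing in for Parquet files.
partitions = [[1, 2, 3], [4, 5], [6]]
result = reduce_states([map_partition(p) for p in partitions])
# result == {"count": 6, "sum": 21}
```

The key property that makes a query map-reducible is that the partial states can be merged in any order without changing the final answer.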
Things start getting more complicated if you have a SELECT within a FROM clause, a HAVING clause, or inner queries (correlated or uncorrelated). Let's review these scenarios one by one.
SELECT in FROM Clause

In that case, you want to distribute the inner SELECT across Lambda functions, and push down into the Lambdas as many parts of the outer query as possible. After that, you reduce everything on the monostore (single EC2 VM). No big deal.
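As a toy illustration of that pushdown (invented projection and predicate, nothing STOIC-specific): the inner SELECT runs per partition with the outer query's filter pushed into it, and the monostore only gathers and finishes.

```python
def inner_select(rows, pushed_predicate):
    # Inner SELECT on a Lambda: here it projects x -> 2*x, with the
    # outer query's WHERE pushed down so each Lambda ships fewer rows.
    return [r * 2 for r in rows if pushed_predicate(r * 2)]

def outer_query(partitions):
    # Monostore: gather the Lambda outputs and finish the outer
    # query (here, a final ORDER BY).
    pushed = lambda v: v > 4  # predicate lifted from the outer WHERE
    gathered = [row for p in partitions for row in inner_select(p, pushed)]
    return sorted(gathered)

partitions = [[1, 2, 3], [4, 5]]
result = outer_query(partitions)
# result == [6, 8, 10]
```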
HAVING in GROUP BY

In that case, you want to execute the inner query within the Lambdas executing the map parts of the GROUP BY (if you can), before sending filtered binary results to the monostore for reduction (with COMBINE).
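Sketched in the same toy style (made-up keys and values): each Lambda emits partial per-group states, the monostore combines them, and the HAVING predicate filters the combined groups. Pushing the filter into the Lambdas is only safe when the predicate is decidable on partial state, so the safe default shown here filters after the reduce.

```python
from collections import defaultdict

def map_groups(rows):
    # Lambda side: partial SUM per group key over one partition.
    state = defaultdict(int)
    for key, value in rows:
        state[key] += value
    return state

def reduce_groups(states, having):
    # Monostore side: combine partial states (COMBINE), then apply
    # the HAVING predicate on the fully reduced aggregates.
    totals = defaultdict(int)
    for state in states:
        for key, value in state.items():
            totals[key] += value
    return {k: v for k, v in totals.items() if having(v)}

partitions = [[("a", 1), ("b", 5)], [("a", 2), ("b", 1)]]
states = [map_groups(p) for p in partitions]
result = reduce_groups(states, having=lambda total: total > 3)  # HAVING SUM(v) > 3
# result == {"b": 6}
```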
Uncorrelated Query

This one is simple: run it once on a Lambda if it's dealing with small data, or on the monostore if it's not, then make its result available to all Lambdas running the map part of the outer query.
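The broadcast pattern, with invented numbers: the uncorrelated subquery runs exactly once, and its scalar result is handed to every Lambda running the outer map.

```python
def run_subquery():
    # Uncorrelated inner query, executed exactly once
    # (e.g. SELECT avg(price) FROM small_table).
    prices = [10, 20, 30]
    return sum(prices) / len(prices)

def map_outer(rows, broadcast_value):
    # Outer map on each Lambda; the subquery result arrives as a
    # broadcast scalar, so no Lambda re-executes the inner query.
    return [r for r in rows if r > broadcast_value]

threshold = run_subquery()  # computed once, value 20.0
partitions = [[5, 25], [19, 40]]
result = [r for p in partitions for r in map_outer(p, threshold)]
# result == [25, 40]
```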
Correlated Query

This one is where things get really interesting. But before we dive deep into it, let's try to set some boundaries: in some extreme cases, it simply won't work in a distributed fashion, no matter what you try.
Here is an example of such an extreme case: if the denormalized table for the inner query is large (or very large), cannot be partitioned across the same dimension(s) as the denormalized table for the outer query, and the outer query produces a large result, you're out of luck.
In that case, no matter what you do, each inner query will be expensive, and you'll have to execute as many of them as you have rows in the result of the outer query.
What that means is that if each inner query takes 1s on your fleet of Lambdas and your outer query has 1B rows, the full query will take at least 1 billion seconds, or nearly 32 years...
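The worst-case arithmetic, with the assumed numbers (1 s per inner query, 1B outer rows, serial execution):

```python
# Back-of-the-envelope cost of the worst case: one expensive inner
# query per outer row, executed one after the other.
inner_query_seconds = 1.0
outer_rows = 1_000_000_000
total_seconds = inner_query_seconds * outer_rows  # 1e9 seconds
total_years = total_seconds / (365 * 24 * 3600)   # roughly 31.7 years
```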
With that out of the way, let's review some scenarios that will work a lot better. For example, if the result of your outer query is small, things are easy. And if the dataset of the inner query is small, you can replicate it on the Lambdas running the outer query's map part.
If the outer query generates a large result and the inner query is based on an equally large dataset, your only option is to align their respective partitions. Unfortunately, this is not always possible, and even if it is, it might not be easy.
But there is one more trick that you can pull: you can try to include in the results of the outer query the dimension(s) used for partitioning the inner query's dataset. If you can do that, you can then partition the outer query's result by this/these dimension(s).
From there, you pass each outer result partition to the Lambda handling the corresponding inner dataset partition, hoping that the former are not too skewed. And if they are, you just assign multiple Lambdas to the same inner dataset partition.
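That repartition-and-route step, sketched with hypothetical names (the partitioning key, row shapes, and per-Lambda budget are all made up for illustration):

```python
from collections import defaultdict
from math import ceil

def partition_outer_result(rows, key):
    # Repartition the outer query's result by the dimension that
    # also partitions the inner query's dataset.
    buckets = defaultdict(list)
    for row in rows:
        buckets[key(row)].append(row)
    return buckets

def assign_lambdas(buckets, max_rows_per_lambda):
    # Skew handling: a bucket bigger than the per-Lambda budget is
    # split across several Lambdas sharing the same inner partition.
    return {
        part: ceil(len(rows) / max_rows_per_lambda)
        for part, rows in buckets.items()
    }

outer_result = [("us", 1), ("us", 2), ("us", 3), ("eu", 4)]
buckets = partition_outer_result(outer_result, key=lambda row: row[0])
plan = assign_lambdas(buckets, max_rows_per_lambda=2)
# plan == {"us": 2, "eu": 1}
```

Each entry in the plan says how many Lambdas get a copy of (or shared access to) that inner dataset partition.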
Those, in a nutshell, are the strategies that will be implemented by the first version of our distributed query planner. It will handle all 22 TPC-H queries fairly well, and many more. Not covered in this design are things like CUBE, GROUPING SETS, and WINDOW, so stay tuned...

More from @ghalimi

Jan 16
Rules of engagement: this Twitter account is my personal account. I use it to liberally share many of the things that we do at #STOIC. While this leaks some of our IP, I tend to think that we gain more than we lose by doing so, because it helps us engage with the community.
I use my long threads to capture my stream of consciousness. Writing my ideas down helps me think them through, even if nobody reads them on the other end. This makes for a very painful account to follow, and most followers end up tuning out eventually, but I don't mind.
I do my best to answer any questions, but I can't share much code, because #STOIC's codebase is not open source. Our contributions to open source are done through sponsorships of critical projects like @DuckDB.
Jan 16
Here is how #STOIC intends to use this feature. Obviously, there are many applications for a native SQL parser/serializer, but this particular one might give you some ideas for your own projects.
Here is a screenshot of our current UI. For us, SELECT is a transform that can be used in a data journey to generate a new table, either directly from an import, or downstream of other transforms.
Right now, the parameters of this transform (the clauses of the SELECT statement) must be coded by hand. This is fine if you're a SQL expert, but it's a non-starter if you're a casual Excel user. We want our product to be usable by both, and the latter needs something more.
Jan 16
Public announcement: if your company is using @DuckDB, you should consider sponsoring @DuckDBLabs (these folks are great to work with). And if you do and your needs are aligned with #STOIC's, we should have a chat about priorities and design requirements.
If we pool resources together, we might be able to fund things that would be out of reach for #STOIC on its own, or that would take a lot longer to develop for lack of sufficient resources.
And for the record, 100% of the @DuckDBLabs work funded by #STOIC goes straight to the open source codebase, and we have no intention of changing that anytime soon.
Jan 15
Yesterday, I described a version of our upcoming SQL query designer that focused on making it easier to compose SQL queries, while preserving SQL's hierarchical structure. Today, I want to explore an alternative path.
Instead of taking a SQL-oriented approach, I want to take a task-oriented approach. And I want to free myself from SQL's hierarchical structure, while still producing a well-formed SQL query in the end.
The idea is similar to what Excel's pivot table is doing: give the user a relatively simple graphical user interface to pivot a table across multiple dimensions, and produce a query from it (be it a SQL or a DAX query).
Jan 15
One of #STOIC's most useful features is its signature Summary Charts, which are these bar charts displayed at the top of every column in a table. They work really well, unless your table has very few rows. Here is how we'll improve them for some interesting corner cases.
Relative Baseline in Bar Plots

When a table has fewer than 50 rows, we replace the Histogram displayed for a numerical column with a Bar Plot visualizing discrete values, but we keep 0 as the baseline. We should use MIN or MAX instead, as we do for cell summaries below.
Bar Plot with All Equal Values

When a Bar Plot is produced for a set of values that are all equal, we would want to know that at a glance. To do so, the length of bars will be reduced by 50%, while keeping the plot horizontally centered.
Jan 15
Things that I love about @berkeleygfx's design philosophy:

Emergent over prescribed aesthetics.
Expose state and inner workings.
Dense, not sparse.
Performance is design.
Beauty emerges automatically without deliberation.
Do not infantilize users.
"Emergent over prescribed aesthetics."

Our core UI design was totally emergent. We did not follow any trends. We just tried to render as much qualitative and quantitative information as possible, in a well-organized fashion. Aesthetics were a mere by-product.
"Expose state and inner workings."

That's exactly what we do when we expose the "Runtime Boundary" on the left panel: Lambda functions upstream, EC2 monostore downstream.
