Ismael Ghalimi · Jan 16
Here is how #STOIC intends to use this feature. Obviously, there are many applications for a native SQL parser/serializer, but this particular one might give you some ideas for your own projects.
Here is a screenshot of our current UI. For us, SELECT is a transform that can be used in a data journey to generate a new table, either directly from an import, or downstream of other transforms.
Right now, the parameters of this transform (the clauses of the SELECT statement) must be coded by hand. That's fine if you're a SQL expert, but it's a non-starter for a casual Excel user. We want our product to be usable by both, and the latter needs something more.
What we want is for the UI to automatically generate the clauses of our SELECT statement, through UI components like this column menu.
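As a rough illustration, a column-menu action can be reduced to a pure edit on the transform's clause set. This is a hypothetical sketch; the names and shapes are assumptions, not STOIC's actual API:

```typescript
// Hypothetical sketch: a column-menu action expressed as a pure edit on the
// clause set of a SELECT transform (names and shapes are assumptions).
interface Clauses {
  select: string;
  from: string;
  where?: string;
  groupBy?: string;
  orderBy?: string;
}

// "Filter" menu entry: AND a new predicate into the WHERE clause.
function addFilter(c: Clauses, column: string, predicate: string): Clauses {
  const term = `${column} ${predicate}`;
  return { ...c, where: c.where ? `(${c.where}) AND ${term}` : term };
}

const q: Clauses = { select: "*", from: "lineitem" };
console.log(addFilter(q, "l_shipdate", "<= DATE '1998-09-02'").where);
// -> "l_shipdate <= DATE '1998-09-02'"
```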

Our goal is simple, yet very ambitious: we want a casual user who received less than one hour of training to be able to produce all 22 TPC-H queries through a point-and-click interface, with minimalist use of simple Excel-like expressions.
To do so, our UI must produce the SQL clauses, which is relatively straightforward, at least if you approach the problem the right way. The question then becomes: how do you serialize the UI's output into a document (a workbook) that can be reused later on?
There, you basically have two options: either you invent your own grammar, or you embrace SQL. We've decided to use plain SQL, for several reasons:

1. It's the only industry standard.
2. It's easy to edit for SQL developers.
3. It ensures that we don't re-invent the wheel.
The second point is critical: we want the SQL code produced by our UI to be editable by SQL developers, and we want our UI to be able to import existing SQL statements. Therefore, SQL is the way to go.
Now, there is one big problem with that: which dialect of SQL should we use? SQL is the industry standard, but it comes in many different flavors (essentially one per database engine). Therefore, we need to pick one.
While we offer direct connections to multiple cloud databases (@SnowflakeDB, @DataBricks), @DuckDB is our embedded engine, which we use both server-side (using Node.js) and client-side (using WASM). Therefore, it makes sense to use @DuckDB's SQL dialect.
But if we go down that path, we want to remain as faithful to the dialect as possible, and there is no better way of doing that than having @DuckDB itself do the parsing and serializing for us. Hence the need for this particular feature.
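For the curious, here is roughly what that round trip looks like with the json extension's json_serialize_sql and json_deserialize_sql functions (the extension ships with DuckDB), driven from the duckdb Node.js package. A minimal sketch, not production code:

```typescript
// Minimal sketch: let DuckDB itself parse SQL into a JSON AST and serialize
// it back, via the json extension's json_serialize_sql / json_deserialize_sql.
import duckdb from "duckdb";

const db = new duckdb.Database(":memory:");
const sql = "SELECT l_returnflag, sum(l_quantity) FROM lineitem GROUP BY l_returnflag";

// Parse: SQL -> JSON AST. DuckDB does the parsing, so the dialect is exact.
db.all("SELECT json_serialize_sql(?) AS ast", sql, (err, rows) => {
  if (err) throw err;
  const ast = JSON.parse(rows[0].ast);
  // ...the UI edits this AST interactively...

  // Serialize: JSON AST -> SQL, again by DuckDB itself.
  db.all("SELECT json_deserialize_sql(?::JSON) AS sql", JSON.stringify(ast), (err2, rows2) => {
    if (err2) throw err2;
    console.log(rows2[0].sql);
  });
});
```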
With that approach, the SELECT transforms used in the data journeys of our #STOIC workbooks will be serialized as plain SQL (actually, as a simple JSON document with one string per root clause). From there, we will use @DuckDB to generate an AST for it.
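Concretely, a persisted SELECT transform might look like this (key names are assumptions; the clause values are plain DuckDB SQL):

```typescript
// What a persisted SELECT transform could look like: a simple JSON document
// with one plain-SQL string per root clause (key names are assumptions).
const transform = {
  select: "l_returnflag, l_linestatus, sum(l_quantity) AS sum_qty",
  from: "lineitem",
  where: "l_shipdate <= DATE '1998-09-02'",
  groupBy: "l_returnflag, l_linestatus",
  orderBy: "l_returnflag, l_linestatus",
};
```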
When users use these column menus, the AST will be updated interactively, and we will use @DuckDB's serializer to convert it back into a SQL query that will be executed on our embedded DuckDB engine.
But if the query has to be executed on another database (@SnowflakeDB, @DataBricks), we will simply use a different serializer (writing a SQL serializer is a lot easier than writing a SQL dialect translator).
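To make that claim concrete, here is a toy emitter over a made-up AST node type where dialects genuinely diverge. Each target dialect only needs its own emitter; none of them needs to parse another engine's text:

```typescript
// Toy per-dialect serializer over a made-up AST node type. Each dialect gets
// its own emitter; none of them needs to parse another engine's SQL.
type Expr =
  | { kind: "column"; name: string }
  | { kind: "intDiv"; left: Expr; right: Expr };

function emit(e: Expr, dialect: "duckdb" | "snowflake"): string {
  switch (e.kind) {
    case "column":
      return e.name;
    case "intDiv":
      // DuckDB spells integer division as //; Snowflake has no such operator,
      // so we emit FLOOR(a / b) there instead.
      return dialect === "duckdb"
        ? `(${emit(e.left, dialect)} // ${emit(e.right, dialect)})`
        : `FLOOR(${emit(e.left, dialect)} / ${emit(e.right, dialect)})`;
  }
}

const expr: Expr = { kind: "intDiv", left: { kind: "column", name: "a" }, right: { kind: "column", name: "b" } };
console.log(emit(expr, "duckdb"));    // (a // b)
console.log(emit(expr, "snowflake")); // FLOOR(a / b)
```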
Having the parser available client-side (through @DuckDB WASM) will also be useful when we let SQL developers modify the SQL clauses of SELECT transforms directly: it will allow us to feed our auto-completion engine with highly contextual suggestions.
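A sketch of what those suggestions could be built from: once the parser has resolved the FROM clause, a DESCRIBE against the embedded engine yields the candidate column names. The helper below is hypothetical; DESCRIBE is real DuckDB syntax, shown here with the Node.js package (the same query works with DuckDB WASM):

```typescript
// Hypothetical completion helper: resolve the FROM clause, then ask DuckDB
// for the table's columns with DESCRIBE and filter by the typed prefix.
import duckdb from "duckdb";

const db = new duckdb.Database(":memory:");

function suggestColumns(table: string, prefix: string, cb: (names: string[]) => void) {
  db.all(`DESCRIBE ${table}`, (err, rows) => {
    if (err) return cb([]);
    cb(rows.map((r) => String(r.column_name)).filter((n) => n.startsWith(prefix)));
  });
}

// With a `lineitem` table loaded:
// suggestColumns("lineitem", "l_ship", console.log)
// -> [ 'l_shipdate', 'l_shipinstruct', 'l_shipmode' ]
```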
To make this UI as interactive as possible, we will automatically materialize and denormalize all tables of the FROM clause in memory (using a CREATE TABLE). As long as the data is smaller than 2 TB, this will work just fine.
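Under some assumptions about the schema (TPC-H-style placeholder tables below), that materialization step is a single CREATE TABLE ... AS over the joined FROM clause:

```typescript
// Sketch: denormalize the FROM clause once into an in-memory table, so every
// subsequent interaction queries a single flat table. Schemas are TPC-H-style
// placeholders.
import duckdb from "duckdb";

const db = new duckdb.Database(":memory:");

db.exec(
  `
  CREATE TABLE customer(c_custkey INT, c_name TEXT, c_mktsegment TEXT);
  CREATE TABLE orders(o_orderkey INT, o_custkey INT, o_totalprice DOUBLE);

  -- One-time materialization of the (joined) FROM clause:
  CREATE TABLE denormalized AS
  SELECT o.*, c.c_name, c.c_mktsegment
  FROM orders AS o
  JOIN customer AS c ON o.o_custkey = c.c_custkey;
  `,
  (err) => {
    if (err) throw err;
    // Interactive queries now run against `denormalized` alone, join-free.
  }
);
```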
We will then encourage users to build queries through our column menus in this order (a sketch follows the list):

1. Filter the data as aggressively as you can
2. Set your GROUP BY column (if any)
3. Add your columns (feeding the SELECT clause)
4. Define your inner queries (recursively)
5. Sort rows
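Here is the sketch referenced above: the same five steps applied one by one to a clause set. Shapes and names are assumptions, for illustration only:

```typescript
// The five steps above, applied one by one to a clause set (illustrative).
const q: Record<string, string> = { select: "*", from: "denormalized" };

q.where = "o_totalprice > 1000";                         // 1. filter aggressively
q.groupBy = "c_mktsegment";                              // 2. set the GROUP BY column
q.select = "c_mktsegment, sum(o_totalprice) AS revenue"; // 3. add columns
// 4. inner queries would be defined recursively, as nested clause sets
q.orderBy = "revenue DESC";                              // 5. sort rows

const sql =
  `SELECT ${q.select} FROM ${q.from} WHERE ${q.where} ` +
  `GROUP BY ${q.groupBy} ORDER BY ${q.orderBy}`;
```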
The resulting SQL query will be re-executed at every interaction, giving users real-time feedback on their work. And if re-execution starts getting slow, users can always configure sampling options that make it a lot faster.
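One way such a sampling option could be spliced in: USING SAMPLE is real DuckDB syntax, while the wrapper and its string rewrite are a simplification (a real implementation would rewrite the AST):

```typescript
// Sketch: speed up interactive feedback by sampling the source table.
// USING SAMPLE is DuckDB syntax; rewriting the string is a stand-in for the
// AST rewrite a real implementation would do.
function withSampling(sql: string, percent: number): string {
  return sql.replace(/\bFROM\s+(\w+)/i, `FROM $1 USING SAMPLE ${percent}%`);
}

console.log(withSampling("SELECT avg(l_quantity) FROM lineitem", 1));
// -> SELECT avg(l_quantity) FROM lineitem USING SAMPLE 1%
```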
Now, here is a very important detail: since we're producing all this SQL ourselves from basic user interactions, we can hope to understand what's going on, and we can feed this knowledge to our own distributed query planner.
Therefore, the SQL query that you will see in the SELECT transform isn't necessarily the actual SQL query that will be executed by @DuckDB running on our EC2 monostore (single VM).
Instead, it's an abstract query that might be broken down into multiple queries. The main query will run on the monostore, but some upstream queries might run on a fleet of Lambda functions (especially when adding more tables to the FROM clause).
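For a map-reducible aggregate, that break-down could follow the classic partial/final aggregation split: one partial query per Lambda over its own Parquet partition, and one merge query on the monostore. Everything below is an illustrative assumption, not our actual planner:

```typescript
// Illustrative partial/final aggregation split: partial sums per partition on
// Lambda workers, final merge on the EC2 monostore. Not the actual planner.
function planDistributedSum(key: string, value: string, partitions: string[]) {
  // One query per Lambda, each over its own Parquet partition.
  const mapQueries = partitions.map(
    (p) =>
      `SELECT ${key}, sum(${value}) AS partial_sum ` +
      `FROM read_parquet('${p}') GROUP BY ${key}`
  );
  // One merge query on the monostore over the Lambdas' combined output.
  const reduceQuery =
    `SELECT ${key}, sum(partial_sum) AS total FROM partials GROUP BY ${key}`;
  return { mapQueries, reduceQuery };
}
```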
Similarly, correlated inner queries might also be deployed on a fleet of Lambda functions, themselves interacting with an EC2 cluster holding large partitions of the denormalized table used in the inner queries.
Bottom line: not only will this UI help SQL-illiterate users implement all 22 TPC-H queries, it will also let them distribute those queries across large fleets of Lambda functions and large clusters of EC2 VMs, automatically.

