Now that we have tested all TPC-H queries with our new SELECT transform powered by @duckdb, let's review how #STOIC will produce high-fidelity schemas for their results.
One of the most critical design elements that make #STOIC work the way it does is its rich data typing system (Principia Data): each and every column is defined with a precise datatype, which conditions how it is serialized, processed, and analyzed.
Whenever we use a Tabula transform other than SELECT, we automatically produce a high-fidelity schema for the resulting table. This is really effective, and one of the main reasons why Tabula is so attractive (alongside its ease of distribution).
STOIC also distinguishes column identifiers from column names. For example, a column might be identified as "o_totalprice", but displayed as "Total Price". This keeps SQL code as clean as possible, while making table outputs as user-friendly as possible.
Currently, outputs of the SELECT transform benefit from neither: columns are typed automatically by @duckdb, and identifiers and names are the same. As a result, we get this kind of table:
There really is nothing wrong with this table, but it's not as good as it could be. Instead, I would like richer datatypes and user-friendly column names. Thankfully, there is a simple solution to fix all this.
If we look at the SELECT clause of the SELECT transform, we see that 2 out of 10 columns could have their names and datatypes automatically inherited from upstream tables (l_returnflag, l_linestatus).
And for the 8 others, we can add inline comments that will give columns proper names and datatypes. In this case, all the datatypes inferred by @duckdb are correct, but we could override them with something like type:level.
What is the "level" datatype, you might ask? It's the datatype for a continuous numerical variable that cannot be summed, such as a temperature. And by default, an average should always be typed as a level.
And as I was writing these lines, I realized that I forgot to properly type my averages. Here is the fix...
Using inline comments in the SQL code will allow data analysts to provide this information directly within the SELECT transform, instead of having to use a CONFIGURE transform downstream of the SELECT transform. This will keep the data journey as lean as possible.
It will also make the SELECT transform fully self-contained, which will become very important when we add support for standalone Queries packaged as JSON files, using our upcoming File page type. Currently, we only have Journeys, Sheets, and Notebooks:
Metadata added as inline comments within the SQL code will use a very simple syntax that is trivial to parse: key-value pairs separated by a comma and an optional space, with key and value separated by a colon and optional spaces.
Keys will be user-defined, and we will support the following ones (see the example after this list):
- name
- type
- desc
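To make this concrete, here is a sketch of what the SELECT clause of a TPC-H Q1-style query could look like with inline comments. The column names and descriptions are illustrative, and level is the only Principia Data type shown, since it's the only one mentioned here:

```sql
SELECT
  l_returnflag,  -- name and datatype inherited from the upstream lineitem table
  l_linestatus,  -- name and datatype inherited from the upstream lineitem table
  SUM(l_quantity) AS sum_qty /* name: Total Quantity, desc: Sum of ordered quantities */,
  AVG(l_quantity) AS avg_qty /* name: Average Quantity, type: level */,
  AVG(l_extendedprice) AS avg_price /* name: Average Price, type: level */
FROM lineitem
GROUP BY l_returnflag, l_linestatus;
```

Because the metadata lives in comments, the SQL remains valid for @duckdb or any upstream engine, which simply ignores it.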
With inline comments, we make this technique database-agnostic, which means that it will work with any SQL dialect instead of being limited to @duckdb's. This is important, because the SQL we generate can be pushed down to an upstream database like @SnowflakeDB or @databricks.
The only technical challenge is that we need to parse the contents of the SELECT clause. Fortunately, SELECT clauses are relatively simple, and we're just looking for patterns like:
AS alias /* ... */
Therefore, we can do it without having to rely on a full SQL parser.
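Just to illustrate the kind of pattern matching involved (the real implementation will live in our own code, and the regular expressions below are only a sketch), @duckdb's regexp_extract can pull out the alias and the metadata from a hypothetical column expression:

```sql
SELECT
  regexp_extract(expr, 'AS\s+(\w+)\s*/\*', 1)  AS alias,
  regexp_extract(expr, '/\*\s*(.*?)\s*\*/', 1) AS metadata
FROM (VALUES ('AVG(l_quantity) AS avg_qty /* name: Average Quantity, type: level */')) t(expr);
-- alias    → avg_qty
-- metadata → name: Average Quantity, type: level
```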
Producing proper schemas for SELECT results will be one of our main projects for the upcoming week. In parallel, we will add support for auto-completion, at least for upstream table and column identifiers.
After that, we will make the UI for SELECT more point-and-click, with nice selectors for tables and columns, as well as graphical editors for the predicates of WHERE clauses. But we'll still preserve the ability to write SQL code by hand.
Graphical clause editors will require a proper SQL parser though.
@haro_ca_ had some interesting questions about OLTP and @DuckDB. This is a vast topic that will require a lot more than a single thread, but let me take a first crack at it.
The main question is how to handle OLAP queries on mutable data. OLAP queries on immutable data are pretty straightforward, but as soon as the data can mutate, things start getting interesting. That's what I'm going to focus on in this thread.
When talking about mutable data, one has to distinguish different use cases:
1. Inserting new rows of immutable data (e.g. logs).
2. Updating existing rows.
3. Deeply transforming the data.
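In SQL terms (table and file names are hypothetical), these three cases look roughly like this:

```sql
-- 1. Appending immutable rows (e.g. logs): no existing row is ever touched
INSERT INTO events SELECT * FROM read_parquet('new_events.parquet');

-- 2. Updating existing rows: previous versions must remain visible to in-flight queries
UPDATE orders SET o_orderstatus = 'F' WHERE o_orderkey = 42;

-- 3. Deeply transforming the data: the table is effectively rebuilt
CREATE OR REPLACE TABLE orders_by_month AS
SELECT date_trunc('month', o_orderdate) AS month, SUM(o_totalprice) AS revenue
FROM orders
GROUP BY 1;
```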
When I first came to the Bay Area in 1999, the commercial open source company I founded was the second contributor to the Apache Software Foundation after IBM in terms of number of top-level projects. It failed because of my inexperience and stupidity.
For close to 20 years, I was not involved in any major open source project, because I did not think that I could contribute anything of value. I really hope to change that moving forward.
My main problem with open source is that it is really difficult to create a sustainable business model around it. You start with a liberal license and good intentions, then you realize that you need to make money and switch to AGPL, and the community ends up fragmenting.
@DuckDB calls itself "an in-process SQL OLAP database management system", but this tagline can be quite misleading. Because the @DuckDBLabs team is modest (and smart), they omit the fact that it has everything needed to make it an OLTP database as well...
When you read this page, you learn that "DuckDB provides transactional guarantees (ACID properties) through our custom, bulk-optimized Multi-Version Concurrency Control (MVCC)."
And if you know anything about transactions, you know that MVCC is a really big deal. And doing that as part of an in-process vectorized SQL engine is... well... groundbreaking. There is nothing like that out there...
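A minimal sketch of what MVCC buys you in practice (table and values are hypothetical): a writer can update rows inside a transaction while concurrent readers keep seeing the pre-update snapshot until the transaction commits.

```sql
BEGIN TRANSACTION;
UPDATE lineitem SET l_quantity = l_quantity + 1 WHERE l_orderkey = 1;
-- At this point, a concurrent connection still reads the old l_quantity values.
COMMIT;
-- Only now does the new version become visible to other transactions.
```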
I keep mentioning @RedisInc in the context of @PuffinDB, but I've never taken the time to really explain the role it is going to play, and why we have decided to use it.
In my opinion, Redis is one of the best pieces of software technology out there. It's really fast, really solid, really well designed, and it has aged really well. It's really, really good. For the longest time, it was the most well-liked database, and for good reasons.
What's less well understood is how (and why) Redis could (and should) be used for building another database on top of it, or a low-latency alternative to Apache Iceberg.
We had our bi-weekly call with @duckdblabs this morning. Good news all around: @duckdb's SQL parser will soon be exposed through the SQL API, generating an Abstract Syntax Tree (AST) from a SQL query as JSON. Next, they'll do the same for generating a full relational tree.
These two features will be critical for implementing our distributed query planner. And we confirmed that we'll be able to use @substrait_io for exchanging physical query plans from the distributed query engine to @duckdb nodes.
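For reference, this is roughly the shape of the AST-as-JSON feature: DuckDB's json extension exposes a json_serialize_sql function that turns a query string into its parsed syntax tree.

```sql
-- Returns the parsed AST of the query as a JSON document (requires the json extension)
SELECT json_serialize_sql('SELECT l_returnflag, AVG(l_quantity) FROM lineitem GROUP BY l_returnflag');
```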
We discussed our ideas regarding the distributed query planner (YAML DSL with TypeScript, dynamically-injected optimizer rules, distribution of planning across multiple Lambda functions), and no particular issues were raised. The project is ambitious for sure, but seems doable.
Now that we have a pretty good idea of what the UI will look like, let's focus on the runtime side of things, especially when data gets too large to fit within a single host (say, more than 2 TB of data on a host with 4 TB of RAM). In other words, how will the distributed query planner work?
The assumption is that source data is partitioned as a set of Parquet files (with or without Iceberg). From there, if your query is map-reducible, you can use @DuckDB's COMBINE to map it on a fleet of Lambdas, then reduce it on an EC2 VM. That's what #STOIC does today.
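To give an idea of the decomposition (this is a generic sketch of the map/reduce pattern, not #STOIC's actual COMBINE machinery; file and table names are hypothetical): each Lambda computes partial aggregates over its own Parquet partition, and the EC2 VM merges them.

```sql
-- Map step, executed by each Lambda on its own Parquet partition
SELECT l_returnflag, SUM(l_quantity) AS sum_qty, COUNT(*) AS row_count
FROM read_parquet('lineitem/part-0001.parquet')
GROUP BY l_returnflag;

-- Reduce step, executed on the EC2 VM over the collected partial results
SELECT l_returnflag, SUM(sum_qty) AS sum_qty, SUM(row_count) AS row_count
FROM partials
GROUP BY l_returnflag;
```

Averages work the same way: carry SUM and COUNT through the map step, then divide in the reduce step.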
Things start getting more complicated if you have a SELECT within a FROM clause, a HAVING clause, or inner queries (correlated or uncorrelated). Let's review these scenarios one by one.
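Concretely, these are the query shapes in question (illustrative TPC-H examples):

```sql
-- SELECT within a FROM clause (derived table)
SELECT t.l_returnflag, t.sum_qty
FROM (SELECT l_returnflag, SUM(l_quantity) AS sum_qty FROM lineitem GROUP BY l_returnflag) t
WHERE t.sum_qty > 1000;

-- HAVING clause (filter applied after aggregation)
SELECT l_returnflag, SUM(l_quantity) AS sum_qty
FROM lineitem
GROUP BY l_returnflag
HAVING SUM(l_quantity) > 1000;

-- Correlated inner query (re-evaluated for each outer row)
SELECT o_orderkey
FROM orders o
WHERE o_totalprice > (SELECT AVG(o2.o_totalprice) FROM orders o2 WHERE o2.o_custkey = o.o_custkey);
```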