Ismael Ghalimi Profile picture
Renaissance steampunk toolmaker

Jan 14, 2023, 21 tweets

Now that we have tested all TPC-H queries with our new SELECT transform powered by @duckdb, let's review how #STOIC will produce high-fidelity results for these query results.

One of the most critical design elements that make #STOIC work the way it does is its rich data typing system (Principia Data): each and every column is defined with a precise datatype, which conditions how it is serialized, processed, and analyzed.

stoic.com/principia/data…

Here is the full list of datatypes:

github.com/stoic-doc/Comm…

Whenever we use a Tabula transform other than SELECT, we automatically produce a high-fidelity schema for the resulting table. This is really effective, and one of the main reasons why Tabula is so attractive (alongside its ease of distribution).

STOIC also distinguishes column identifiers from column names. For example, a column might be identified as "o_totalprice", but displayed as "Total Price". This keeps SQL code as clean as possible, while making table outputs as user-friendly as possible.

Currently, outputs of the SELECT transform benefit from neither. Columns are typed by @duckdb automatically, and identifiers and names are the same.. As a result, we get this kind of table:

There really is nothing wrong about this table, but it's not as good as it could be. Instead, I would like richer datatypes, and user-friendly column names. Thankfully, there is a simple solution to fix all this.

If we look at the SELECT clause of the SELECT transform, we see that 2 out of 10 columns could have their names and datatypes automatically inherited from upstream tables (l_returnflag, l_linestatus).

And for the 8 others, we can add inline comments that will give columns proper names and datatypes. In this case, all the datatypes inferred by @duckdb are correct, but we could override them with something like type:level.

What is the "level" datatype you might ask? It's the datatype for a continuous numerical variable that cannot be summed. For example, a temperature. And by default, an average should always be typed as a level.

github.com/stoic-doc/Comm…

And its discrete counterpart is an index:

github.com/stoic-doc/Comm…

And as I was writing these lines, I realized that I forgot to properly type my averages. Here is the fix...

Using inline comments in the SQL code will allow data analysts to provide this information directly within the SELECT transform, instead of having to use a CONFIGURE transform downstream of the SELECT transform. This will keep the data journey as lean as possible.

It will also make the SELECT transform fully self-contained, which will become very important when we add support for standalone Queries packaged as JSON files, using our upcoming File page type. Currently, we only have Journeys, Sheets, and Notebooks:

Meta data added as inline comments within the SQL code will use a very simple syntax that will be trivial to parse: key-value pairs separated with a comma and an optional space, with key and value separated by a colon and optional spaces.

Keys will be user-defined, and we will support the following ones:

- name
- type
- desc

With inline comments, we make this technique database agnostic, which means that it will work with any SQL dialect, instead of being limited to @duckdb's. This is important, because the SQL we generate can be pushed down to an upstream database like @SnowflakeDB or @databricks.

The only technical challenge is that we need to parse the contents of the SELECT clause. Fortunately, SELECT clauses are relatively simple, and we're just looking for patterns like:

AS alias /* ... */

Therefore, we can do it without having to rely on a full SQL parser.

Producing proper schemas for SELECT results will be one of our main projects for the upcoming week. In parallel, we will add support for auto-completion, at least for upstream table and column identifiers.

After that, we will make the UI for SELECT more point-and-click, with nice selectors for tables and columns, as well as graphical editors for editing the predicates of WHERE clauses. But we'll still preserve the ability to write SQL code by hand.

Graphical clause editors will require a proper SQL parser though.

Share this Scrolly Tale with your friends.

A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.

Keep scrolling