Now that we have tested all TPC-H queries with our new SELECT transform powered by @duckdb, let's review how #STOIC will produce high-fidelity results for these queries.
One of the most critical design elements that make #STOIC work the way it does is its rich data typing system (Principia Data): each and every column is defined with a precise datatype, which conditions how it is serialized, processed, and analyzed.
Whenever we use a Tabula transform other than SELECT, we automatically produce a high-fidelity schema for the resulting table. This is really effective, and one of the main reasons why Tabula is so attractive (alongside its ease of distribution).
STOIC also distinguishes column identifiers from column names. For example, a column might be identified as "o_totalprice", but displayed as "Total Price". This keeps SQL code as clean as possible, while making table outputs as user-friendly as possible.
Currently, outputs of the SELECT transform benefit from neither: columns are typed automatically by @duckdb, and identifiers and names are the same. As a result, we get this kind of table:
There really is nothing wrong with this table, but it's not as good as it could be. I would like richer datatypes and user-friendly column names. Thankfully, there is a simple solution to fix all this.
If we look at the SELECT clause of the SELECT transform, we see that 2 out of 10 columns could have their names and datatypes automatically inherited from upstream tables (l_returnflag, l_linestatus).
And for the 8 others, we can add inline comments that will give columns proper names and datatypes. In this case, all the datatypes inferred by @duckdb are correct, but we could override them with something like type:level.
What is the "level" datatype you might ask? It's the datatype for a continuous numerical variable that cannot be summed. For example, a temperature. And by default, an average should always be typed as a level.
And as I was writing these lines, I realized that I forgot to properly type my averages. Here is the fix...
Using inline comments in the SQL code will allow data analysts to provide this information directly within the SELECT transform, instead of having to use a CONFIGURE transform downstream of it. This will keep the data journey as lean as possible.
It will also make the SELECT transform fully self-contained, which will become very important when we add support for standalone Queries packaged as JSON files, using our upcoming File page type. Currently, we only have Journeys, Sheets, and Notebooks:
Metadata added as inline comments within the SQL code will use a very simple syntax that will be trivial to parse: key-value pairs separated by a comma and an optional space, with key and value separated by a colon and optional spaces.
Keys will be user-defined, and we will support the following ones:
- name
- type
- desc
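To make this concrete, here is a sketch of what an annotated SELECT clause could look like, loosely modeled on TPC-H Q1 (the expressions are abridged and the display names and descriptions are illustrative, not taken from an actual #STOIC journey). The averages are overridden with type: level, as discussed above, while the other datatypes inferred by @duckdb are left as is:

SELECT
  l_returnflag,
  l_linestatus,
  SUM(l_quantity)      AS sum_qty        /* name: Total Quantity */,
  SUM(l_extendedprice) AS sum_base_price /* name: Total Base Price */,
  AVG(l_extendedprice) AS avg_price      /* name: Average Price, type: level */,
  AVG(l_discount)      AS avg_disc       /* name: Average Discount, type: level, desc: Average line item discount */,
  COUNT(*)             AS count_order    /* name: Order Count */
FROM lineitem
GROUP BY l_returnflag, l_linestatus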
With inline comments, we make this technique database agnostic, which means that it will work with any SQL dialect, instead of being limited to @duckdb's. This is important, because the SQL we generate can be pushed down to an upstream database like @SnowflakeDB or @databricks.
The only technical challenge is that we need to parse the contents of the SELECT clause. Fortunately, SELECT clauses are relatively simple, and we're just looking for patterns like:
AS alias /* ... */
Therefore, we can do it without having to rely on a full SQL parser.
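To illustrate how light this extraction can be, here is a minimal sketch in TypeScript, assuming a regex-based approach; the function and type names are hypothetical, and this is not the actual #STOIC implementation:

// Extract column metadata from `AS alias /* key: value, ... */` patterns
// found in a SELECT clause, without a full SQL parser.
interface ColumnMeta {
  alias: string;
  name?: string;
  type?: string;
  desc?: string;
}

function extractColumnMeta(selectClause: string): ColumnMeta[] {
  // Capture the alias (quoted or bare) and the body of the comment that follows it.
  const pattern = /\bAS\s+("[^"]+"|\w+)\s*\/\*(.*?)\*\//gis;
  const results: ColumnMeta[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(selectClause)) !== null) {
    const meta: ColumnMeta = { alias: match[1].replace(/^"|"$/g, "") };
    // Split the comment body into `key: value` pairs separated by commas.
    // (Values containing commas would need escaping; ignored in this sketch.)
    for (const pair of match[2].split(",")) {
      const [key, ...rest] = pair.split(":");
      const value = rest.join(":").trim();
      if (!value) continue;
      switch (key.trim().toLowerCase()) {
        case "name": meta.name = value; break;
        case "type": meta.type = value; break;
        case "desc": meta.desc = value; break;
      }
    }
    results.push(meta);
  }
  return results;
}

Applied to a column like AVG(l_extendedprice) AS avg_price /* name: Average Price, type: level */, this would yield { alias: "avg_price", name: "Average Price", type: "level" }.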
Producing proper schemas for SELECT results will be one of our main projects for the upcoming week. In parallel, we will add support for auto-completion, at least for upstream table and column identifiers.
After that, we will make the UI for SELECT more point-and-click, with nice selectors for tables and columns, as well as graphical editors for the predicates of WHERE clauses. But we'll still preserve the ability to write SQL code by hand.
Graphical clause editors will require a proper SQL parser though.
Yesterday, I described a version of our upcoming SQL query designer that focused on making it easier to compose SQL queries, while preserving SQL's hierarchical structure. Today, I want to explore an alternative path.
Instead of taking a SQL-oriented approach, I want to take a task-oriented approach. And I want to free myself from SQL's hierarchical structure, while still producing a well-formed SQL query in the end.
The idea is similar to what Excel's pivot tables do: give the user a relatively simple graphical user interface to pivot a table across multiple dimensions, and produce a query from it (be it a SQL or a DAX query).
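As a purely illustrative sketch (not the actual designer, and all names below are hypothetical), a task-oriented specification could be compiled into a well-formed SQL query roughly like this:

// A task-oriented pivot specification, independent of SQL's hierarchical structure.
interface PivotSpec {
  table: string;
  groupBy: string[];                                            // dimensions to pivot on
  measures: { column: string; agg: "SUM" | "AVG" | "COUNT" }[]; // values to aggregate
}

// Compile the specification into a well-formed SQL query.
function pivotToSql(spec: PivotSpec): string {
  const dims = spec.groupBy.join(", ");
  const measures = spec.measures
    .map(m => `${m.agg}(${m.column}) AS ${m.agg.toLowerCase()}_${m.column}`)
    .join(", ");
  return `SELECT ${dims}, ${measures}\nFROM ${spec.table}\nGROUP BY ${dims}`;
}

// Example: pivot lineitem by return flag and line status, averaging extended price.
const sql = pivotToSql({
  table: "lineitem",
  groupBy: ["l_returnflag", "l_linestatus"],
  measures: [{ column: "l_extendedprice", agg: "AVG" }],
});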
One of #STOIC's most useful features is its signature Summary Charts, which are these bar charts displayed at the top of every column in a table. They work really well, unless your table has very few rows. Here is how we'll improve them for some interesting corner cases.
Relative Baseline in Bar Plots
When a table has fewer than 50 rows, we replace the Histogram displayed for a numerical column with a Bar Plot visualizing discrete values, but we keep 0 as the baseline. We should use MIN or MAX instead, as we do for cell summaries below.
Bar Plot with All Equal Values
When a Bar Plot is produced for a set of values that are all equal, we would want to know that at a glance. To do so, the length of bars will be reduced by 50%, while keeping the plot horizontally centered.
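Here is an illustrative sketch of both corner cases in TypeScript (hypothetical names and heuristics, not the actual #STOIC rendering code), assuming bars are laid out on a horizontal axis of width plotWidth:

// Corner case 1: for small tables, anchor bars to MIN (or MAX when all values
// are negative) instead of 0, so that small relative differences stay visible.
function barBaseline(values: number[]): number {
  const min = Math.min(...values);
  const max = Math.max(...values);
  if (min >= 0) return min;   // all non-negative: grow bars from MIN
  if (max <= 0) return max;   // all non-positive: grow bars from MAX
  return 0;                   // mixed signs: keep 0 as the baseline
}

// Corner case 2: when all values are equal, draw half-length bars centered
// horizontally, so equality is visible at a glance.
function barExtent(value: number, values: number[], plotWidth: number): [number, number] {
  if (values.every(v => v === values[0])) {
    return [plotWidth * 0.25, plotWidth * 0.75];   // 50% length, centered
  }
  const base = barBaseline(values);
  const span = Math.max(...values) - Math.min(...values);
  const length = (Math.abs(value - base) / span) * plotWidth;
  return [0, length];
}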
Things that I love about @berkeleygfx's design philosophy:
Emergent over prescribed aesthetics.
Expose state and inner workings.
Dense, not sparse.
Performance is design.
Beauty emerges automatically without deliberation.
Do not infantilize users.
"Emergent over prescribed aesthetics."
Our core UI design was totally emergent. We did not follow any trends. We just tried to render as much qualitative and quantitative information as possible, in a well-organized fashion. Aesthetics were a mere by-product.
"Expose state and inner workings."
That's exactly what we do when we expose the "Runtime Boundary" on the left panel: Lambda functions upstream, EC2 monostore downstream.
One of the main factors holding back the design of good data tools for less technical users is our lack of trust in their ability to understand complex mathematical concepts like logarithmic scales or percentiles.
While it is true that the vast majority of such users are not yet familiar or comfortable with these concepts, they could learn them over time, and we need to design user interfaces that facilitate this learning.
This is why I am such a strong proponent of what I call progressive user interfaces. These are user interfaces that use advanced concepts surreptitiously. These concepts are there right in front of the user, yet might not be noticed at first, and don't get in the way.
This thread introduced our upcoming SQL query designer and its two modes: graphical and textual. This other thread will cover the graphical mode in more detail, reviewing each clause of SQL's SELECT statement.
This thread introduced the SQL vs. Tabula discussion. This other thread will cover our SQL query designer. This tool will focus on @duckdb's SQL dialect, but it will be designed to work with virtually any SQL cloud database (@SnowflakeDB, @databricks, etc.).
There are many ways to build a SQL query designer, and to build a good one, two critical questions must be answered first: who is the audience, and what is their working environment?
For #STOIC, we must serve two very different user groups:
1. Users who are new to SQL
2. Users who are SQL experts