The implementation of our new SELECT transform triggered many fascinating discussions at #STOIC. Among them, one was trying to answer the following question: should we encourage users to use SELECT as much as possible, or should we steer them toward simpler Tabula transforms?
For reference purposes, here are Tabula transforms:
By design, Tabula transforms are easier to use and easier to distribute. On the flip side, SQL is an industry standard, and much, much more powerful from a functional and semantic standpoint. But it's also much more difficult to use, and very challenging to distribute.
This creates an interesting dilemma. Here is how we think about it: on one hand, users who are new to SQL will start with Tabula transforms, because they are simpler, and benefit from a point-and-click UI. On the other hand, SQL developers will naturally gravitate toward SELECT.
Therefore, we need to invest in both the UI for our Tabula transforms and our graphical SQL query editor (the UI for the SELECT transform). And we should establish a very progressive path from the former to the latter.
In other words, we must turn #STOIC into the very best SQL learning tool that has ever been brought to market. And to do so, we must create the very best SQL query designer.
That starts with reducing the gap between Tabula and SQL. This is something that we have been working on for about two years now, and we have developed a pretty good idea of what needs to be done.
For example, our transform used to select columns in a table is called PICK, and it has two modes: DROP and KEEP. The former is the SQL keyword used to drop columns (as in ALTER TABLE … DROP COLUMN), therefore we should put more emphasis on it.
By the same token, we do not want to multiply the number of transforms that we present to our end users. We now have 23 of them, and we probably want to keep the count at that level.
A solution was recently suggested by @olivier_gaucher: use PICK as the main transform, but display it as DROP or KEEP when used in the data journey. That way, we get the best of both worlds.
And the same can be done with the FILTER transform, which is used to filter rows in or out. In the data journey, we could display it as PRESERVE or DELETE, depending on whether data is filtered in (PRESERVE) or out (DELETE), knowing that DELETE is a SQL statement to delete rows.
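The mapping above can be sketched as a simple lookup: internally, each transform keeps one canonical name, and the data journey displays a SQL-flavored alias based on its mode. This is a hypothetical sketch — the names and function are illustrative, not STOIC's actual API.

```python
# Hypothetical sketch: canonical transform + mode -> display label.
# None of these identifiers come from STOIC's actual codebase.
DISPLAY_NAMES = {
    ("PICK", "drop"): "DROP",      # DROP is the SQL keyword for removing columns
    ("PICK", "keep"): "KEEP",      # keep only the listed columns
    ("FILTER", "in"): "PRESERVE",  # rows filtered in
    ("FILTER", "out"): "DELETE",   # DELETE is the SQL statement for removing rows
}

def display_name(transform: str, mode: str) -> str:
    """Return the SQL-flavored label shown in the data journey,
    falling back to the canonical transform name."""
    return DISPLAY_NAMES.get((transform, mode), transform)
```

The point of the fallback is that transforms without a SQL-flavored alias simply keep their canonical name, so the scheme can be rolled out one transform at a time.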
Doing so will turn most of our Tabula transforms into proto-SQL statements, most of them using a single clause. Most importantly, it will introduce our SQL-illiterate users to many of the core concepts of SQL, in a very safe and progressive manner.
From there, we need to give these users a gentle on-ramp to the infinite power of SQL, and to do so, we need to show them that they're actually producing SQL without knowing it.
A naive approach to this challenge would be some kind of "View SQL", which would show the user the SQL code produced by a single transform or a complete data journey. Unfortunately, that would not work so well, because it would show a very complex and deeply-nested SQL statement.
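To see why the naive "View SQL" falls short, consider what happens when each transform is wrapped around the previous one as a subquery: the nesting depth grows with every step. A minimal sketch (purely illustrative, not STOIC's code):

```python
def naive_view_sql(steps, source):
    """Naive "View SQL": wrap each transform around the previous
    result as a subquery. Depth grows with every step, which is
    why the output quickly becomes unreadable."""
    sql = source
    for projection, condition in steps:
        where = f" WHERE {condition}" if condition else ""
        # Each step nests the entire previous query as a subquery.
        sql = f"SELECT {projection} FROM ({sql}){where}"
    return sql
```

Even a two-step journey already yields `SELECT id, price FROM (SELECT * FROM (sales) WHERE price > 0)` — far from the flat SELECT a learner should be shown.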
A better approach would be to outline a set of consecutive transforms in a data journey that could be collapsed into a single SELECT transform.
For example, if you have a FILTER transform followed by a few COMPUTE transforms, then complemented with a GROUPBY transform, you could easily collapse all these into a single SELECT transform with SELECT, FROM, WHERE, and GROUP BY clauses.
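The collapse described above can be sketched in a few lines: the FILTER becomes a WHERE clause, the GROUPBY becomes a GROUP BY clause, and the output expressions form the select list. This is a hedged illustration — the function name and parameters are assumptions, not STOIC's actual implementation.

```python
def collapse_to_select(table, where=None, group_by=(), select_exprs=("*",)):
    """Collapse a FILTER + COMPUTE + GROUPBY run into one flat SELECT.
    Hypothetical sketch: parameter names are illustrative only."""
    sql = f"SELECT {', '.join(select_exprs)} FROM {table}"
    if where:
        sql += f" WHERE {where}"          # the FILTER transform
    if group_by:
        sql += f" GROUP BY {', '.join(group_by)}"  # the GROUPBY transform
    return sql
```

For instance, filtering `sales` on `price > 0` and grouping by `category` collapses into one statement with SELECT, FROM, WHERE, and GROUP BY clauses — exactly the shape the outline would preview.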
This outline could be displayed as a colored vertical margin on the left side of the transforms in the data journey, with a prompt at the bottom inviting the user to collapse these transforms into a single SELECT.
Mousing over the prompt would preview the resulting SELECT transform, making the link between the collapsed transform and the SELECT transform's clauses quite obvious.
This design would serve two main purposes: first, it would introduce our SQL-illiterate user to the power of SELECT; second, it would allow our user to design SELECT transforms incrementally, by using simpler Tabula transforms step by step.
Initially, our user will probably keep using Tabula transforms before collapsing them into a single SELECT transform. But after a while, there is a good chance that our user will start using the SELECT transform directly, because it will simplify and accelerate their workflow.
In other words, this progressive design will create two opportunities for epiphany: first when our users discover what SELECT does; second when they decide to use SELECT directly.
Good product design creates progressive paths to epiphanies.
With that in mind, let's review where we are today, and where we want to go next. On the Tabula front, we have many years of experience refining our data journey. In its current state, it is working really well, and very little work has to be done there.
All we have to do is fine-tune the point-and-click forms for our 22 transforms other than SELECT. This is very incremental work that will happen on its own, without too much effort.
Things are quite different on the SELECT front though. There, we're just getting started, and we have a lot to learn. But we can also stand on the shoulders of giants, because the topic of SQL query design is a well-researched one.
But before I cover this topic in a separate thread, let me thank @bbonnotte_ for his seminal ideas about Tabula vs. SQL. What was presented in this thread is a direct result of his awesome contributions.
Yesterday, I described a version of our upcoming SQL query designer that focused on making it easier to compose SQL queries, while preserving SQL's hierarchical structure. Today, I want to explore an alternative path.
Instead of taking a SQL-oriented approach, I want to take a task-oriented approach. And I want to free myself from SQL's hierarchical structure, while still producing a well-formed SQL query in the end.
The idea is similar to what Excel's pivot table is doing: give the user a relatively simple graphical user interface to pivot a table across multiple dimensions, and produce a query from it (be it a SQL or a DAX query).
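The pivot-table analogy can be made concrete: the user manipulates a small task-oriented spec (row dimensions plus aggregated measures), and the tool derives a well-formed GROUP BY query from it. A minimal sketch, with hypothetical names throughout:

```python
def pivot_to_sql(table, rows, values):
    """Translate a minimal pivot spec into a GROUP BY query.
    rows: list of dimension columns; values: list of (column, aggregate).
    Hypothetical sketch -- not an actual STOIC or Excel API."""
    dims = ", ".join(rows)
    measures = ", ".join(
        f"{agg}({col}) AS {agg.lower()}_{col}" for col, agg in values
    )
    return f"SELECT {dims}, {measures} FROM {table} GROUP BY {dims}"
```

The user never sees the hierarchical structure of the query; they only pick dimensions and measures, and the SQL falls out.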
One of #STOIC's most useful features is its signature Summary Charts, which are these bar charts displayed at the top of every column in a table. They work really well, unless your table has very few rows. Here is how we'll improve them for some interesting corner cases.
Relative Baseline in Bar Plots
When a table has fewer than 50 rows, we replace the Histogram displayed for a numerical column with a Bar Plot visualizing discrete values, but we keep 0 as the baseline. We should use MIN or MAX instead, as we do for cell summaries below.
Bar Plot with All Equal Values
When a Bar Plot is produced for a set of values that are all equal, we want to see that at a glance. To do so, bar lengths will be reduced by 50%, while keeping the plot horizontally centered.
Things that I love about @berkeleygfx's design philosophy:
Emergent over prescribed aesthetics.
Expose state and inner workings.
Dense, not sparse.
Performance is design.
Beauty emerges automatically without deliberation.
Do not infantilize users.
"Emergent over prescribed aesthetics."
Our core UI design was totally emergent. We did not follow any trends. We just tried to render as much qualitative and quantitative information as possible, in a well-organized fashion. Aesthetics were a mere by-product.
"Expose state and inner workings."
That's exactly what we do when we expose the "Runtime Boundary" on the left panel: Lambda functions upstream, EC2 monostore downstream.
One of the main factors that is holding back the design of good data tools for less technical users is our lack of trust in their ability to understand complex mathematical concepts like logarithmic scales or percentiles.
While it is true that the vast majority of such users are not yet familiar or comfortable with these concepts today, they could learn about them over time, and we need to design user interfaces that can facilitate this learning.
This is why I am such a strong proponent of what I call progressive user interfaces. These are user interfaces that use advanced concepts surreptitiously. These concepts are there right in front of the user, yet might not be noticed at first, and don't get in the way.
This thread introduced our upcoming SQL query designer and its two modes: graphical and textual. This other thread will cover the graphical mode in more detail, reviewing each clause of SQL's SELECT statement.
This thread introduced the SQL vs. Tabula discussion. This other thread will cover our SQL query designer. This tool will focus on @duckdb's SQL dialect, but it will be designed to work with virtually any SQL cloud database (@SnowflakeDB, @databricks, etc).
There are many ways to design a SQL query designer, and to design a good one, two critical questions must be answered first: who is the audience, and what is their working environment?
For #STOIC, we must serve two very different user groups:
1. Users who are new to SQL
2. Users who are SQL experts