This thread introduced the SQL vs. Tabula discussion. This other thread will cover our SQL query designer. This tool will focus on @duckdb's SQL dialect, but it will be designed to work with virtually any SQL cloud database (@SnowflakeDB, @databricks, etc).
There are many ways to design a SQL query designer, and to design a good one, two critical questions must be answered first: who is the audience, and what is their working environment?
For #STOIC, we must serve two very different user groups:
1. Users who are new to SQL
2. Users who are SQL experts
These two user groups will have fundamentally different sets of needs. Therefore, our SQL designer will probably offer two different modes, or will be used in two fundamentally different ways by these two groups of users.
But before we dive into that, let's review their working contexts: #STOIC is a fully integrated tool. Therefore, our query designer is not a tool aimed at designing arbitrary and standalone SQL queries.
Instead, it's a tool aimed at designing SQL queries tightly integrated with the dependency graph produced from a #STOIC workbook. That being said, while most of these queries will be executed by our embedded @duckdb engine, many will run upstream, on external cloud databases.
Another critical assumption is that such queries are executed either against external databases that have arbitrary schemas, or against #STOIC's own distributed database engine, using snowflake schemas with hierarchical denormalization. And the latter is our #1 priority.
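To make the second case concrete, here is a minimal sketch of what such a snowflake schema might look like, using hypothetical sales tables (all names below are illustrative, not taken from an actual #STOIC workbook):

```sql
-- Hypothetical snowflake schema: a fact table surrounded by dimension
-- tables, which are themselves normalized one level deeper.
CREATE TABLE regions  (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE stores   (store_id INTEGER PRIMARY KEY, store_name TEXT,
                       region_id INTEGER REFERENCES regions (region_id));
CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE sales    (sale_id INTEGER PRIMARY KEY,
                       store_id INTEGER REFERENCES stores (store_id),
                       product_id INTEGER REFERENCES products (product_id),
                       sale_date DATE,
                       amount DECIMAL(10, 2));
```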
Yet another critical assumption is that such queries will serve a large and diversified user base made of dozens of Creators, hundreds of Explorers, and thousands to millions of Viewers. As a result, caching is really, really important.
In other words, this SQL query designer must help our users take full advantage of the awesome power offered by our multi-level Reactive Caching Service (RCS). To a very large extent, this is its #1 raison d'être.
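As a hypothetical illustration of why query shape matters here: a query whose predicate depends on when it runs defeats caching, while the same intent expressed with an explicit literal remains cacheable (reusing the illustrative sales table from above):

```sql
-- Non-deterministic: results change from one execution to the next,
-- so cached results can never be reused.
SELECT COUNT(*) FROM sales
WHERE sale_date > CURRENT_DATE - INTERVAL 7 DAY;

-- Deterministic: the boundary is an explicit literal, so identical
-- queries return identical results and cache entries stay valid.
SELECT COUNT(*) FROM sales
WHERE sale_date > DATE '2023-06-01';
```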
With that in mind, when designing this SQL query designer, we must embrace a very particular workflow, which I would describe as table-oriented: everything starts from a table of data, and everything is driven by the journey upon which this table embarks.
Concretely, this means that our user does not start from an Entity-Relationship Model (ERM). Instead, our user starts from a set of tables, then establishes relationships between them by properly typing certain columns of these tables with our "relation" datatype.
This is a much more natural way of doing things, one validated by the tens of millions of Excel users who rely on VLOOKUP. And these relations allow us to produce this type of ERM automatically:
From this ERM, we can help our user adjust the schema of their multi-table dataset, bringing it closer and closer to a proper snowflake schema, and we can automate the process of denormalizing its tables in a hierarchical manner.
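Continuing with the hypothetical schema above, hierarchical denormalization could fold the dimension chain one level at a time, eventually producing a single wide table:

```sql
-- Hypothetical hierarchical denormalization of the snowflake schema:
-- first fold regions into stores, then fold the result into sales.
CREATE TABLE stores_denorm AS
SELECT s.store_id, s.store_name, r.region_name
FROM stores AS s
JOIN regions AS r USING (region_id);

CREATE TABLE sales_denorm AS
SELECT f.sale_id, f.sale_date, f.amount,
       d.store_name, d.region_name, p.product_name
FROM sales AS f
JOIN stores_denorm AS d USING (store_id)
JOIN products AS p USING (product_id);
```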
Finally, we must remember that our SQL query designer must facilitate the distribution of SQL queries across clusters of EC2 VMs/Fargates and across fleets of Lambda functions. And distributing SQL queries is, by far, the hardest problem that we will have to solve.
But one must keep in mind that this problem cannot be solved entirely. SQL is simply too rich. If queries become a bit too complex and are made against datasets that are a bit too large, there simply is no way to execute them efficiently on a distributed platform.
Therefore, the SQL query designer should steer users in a direction that will make this distribution as efficient as possible, for datasets that are as large as possible.
This will be achieved using many complementary techniques. Among them: denormalizing datasets whenever possible, partitioning large tables along multiple partitioning schemes, aligning partitions across multiple tables, and breaking complex queries into multiple simpler ones (sketched below).
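To sketch the last technique under the same hypothetical schema (the partition table names are invented for illustration): a global aggregation can be rewritten as independent per-partition aggregations, each distributable to its own worker, followed by a cheap final merge:

```sql
-- One global aggregation ...
SELECT region_name, SUM(amount) AS total
FROM sales_denorm
GROUP BY region_name;

-- ... rewritten as partial aggregates computed independently on each
-- partition (potentially on different workers), merged at the end.
WITH partials AS (
  SELECT region_name, SUM(amount) AS partial_total
  FROM sales_denorm_part_0 GROUP BY region_name
  UNION ALL
  SELECT region_name, SUM(amount) AS partial_total
  FROM sales_denorm_part_1 GROUP BY region_name
)
SELECT region_name, SUM(partial_total) AS total
FROM partials
GROUP BY region_name;
```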
With all that in mind, let's review what this tool will look like. First and foremost, it will have two complementary operating modes: graphical and textual. The graphical mode will be used by novice users, while the textual mode will be used by expert users.
Most importantly, these two modes will be available at all times, with full roundtrip engineering. What this means is that you will be able to go from graphical to textual and back to graphical, at any time, for any part of a query.
This is possible because SQL is a relatively small and simple language from a syntax and structure standpoint. Its complexity lies at the topological level, unlike that of programming languages like C++ or Java.
For these languages, past experiences with UML have clearly demonstrated that roundtrip engineering does not work in the real world. It can work within certain environments for expert users demonstrating an unusual level of discipline, but it breaks down everywhere else.
SQL is different though. First, it has relatively few primitives, and these are assembled in relatively rigid ways. For example, a SELECT statement contains a single WHERE clause at the root.
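That rigidity is easy to see in the skeleton of a SELECT statement: every clause is optional, but each appears at most once at a given nesting level, always in the same order (illustrated with the hypothetical schema above):

```sql
-- The fixed clause skeleton of a SELECT statement: each clause is
-- optional, but appears at most once and always in this order.
SELECT   product_name, SUM(amount) AS total
FROM     sales_denorm
WHERE    region_name = 'EMEA'
GROUP BY product_name
HAVING   SUM(amount) > 1000
ORDER BY total DESC
LIMIT    10;
```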
Second, a SELECT statement is a relatively standalone entity, blessed with a very hierarchical structure. This is unlike traditional code for which you have many functions scattered across many different files, referencing each other through very complex dependency graphs.
For these reasons (and many others), roundtrip engineering with SQL is within reach, and should therefore be taken advantage of. The graphical representation of a SELECT statement will be topologically isomorphic to its textual representation, which simplifies a ton of things.
At this point, I need to outline a critical design element: for SQL experts to be fully productive with the textual mode of this SQL query designer, proper auto-completion must be offered, across all clauses, for things like tables, columns, functions, and keywords.
But this auto-completion should not be confused with a fully graphical clause editor. For example, if we look at a FROM clause, the novice user will click on some "Add table" link to add a table, while the expert user will type a comma to do the same.
And yet both interactions should display the same table picker. This commonality of UI controls is what will make this tool truly progressive. Yet one should not consider clicking on "Add table" and typing a comma to be equivalent: to novice users, the latter is out of reach.
Of the two, the textual mode is obviously the easier one to build. In a nutshell, all that is needed is a solid auto-completion feature. It is by no means trivial to implement well, but it is a relatively well-bounded effort. The graphical mode is where the challenge lies.
The graphical mode is complex because each clause has a different syntax. Therefore, the UI must take that into account, and generate a proper UI flow that won't get our user too confused too fast, especially when SELECT statements get deeply nested.
For that part, we should take advantage of a critical design element of the #STOIC UI: the use of Miller columns for the left panels. With our upcoming editor, the SELECT nested within the WHERE clause will be displayed within a separate Miller column.
This design will provide a much better hierarchical structuring of complex queries, while giving us the ability to preview the results of nested SELECT statements through the table displayed on the right.
This will create some really interesting challenges when these nested SELECT statements reference columns defined by parent SELECT statements, but it will make it a lot easier to construct and debug complex queries with deep nesting and complex joins.
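Here is the kind of nesting those Miller columns will have to handle, again with the hypothetical schema: the inner SELECT references a column of its parent, so previewing it in isolation requires binding that outer column first:

```sql
-- A nested SELECT inside a WHERE clause, correlated with its parent:
-- the inner query references o.region_name from the outer query.
SELECT o.sale_id, o.amount
FROM sales_denorm AS o
WHERE o.amount > (
  SELECT AVG(i.amount)           -- would live in its own Miller column
  FROM sales_denorm AS i
  WHERE i.region_name = o.region_name
);
```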
In the next thread, we will review all clauses one by one.
Yesterday, I described a version of our upcoming SQL query designer that focused on making it easier to compose SQL queries, while preserving SQL's hierarchical structure. Today, I want to explore an alternative path.
Instead of taking a SQL-oriented approach, I want to take a task-oriented approach. And I want to free myself from SQL's hierarchical structure, while still producing a well-formed SQL query in the end.
The idea is similar to what Excel's pivot table is doing: give the user a relatively simple graphical user interface to pivot a table across multiple dimensions, and produce a query from it (be it a SQL or a DAX query).
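As a hypothetical sketch of what such a task-oriented interface could emit: pivoting sales by region and product reduces to a grouped aggregation, or to DuckDB's PIVOT statement when one dimension should be spread across columns:

```sql
-- What a pivot-style UI could emit: a plain grouped aggregation ...
SELECT region_name, product_name, SUM(amount) AS total
FROM sales_denorm
GROUP BY region_name, product_name;

-- ... or, in DuckDB's dialect, an actual PIVOT statement that spreads
-- one dimension across columns, as a pivot table would.
PIVOT sales_denorm ON product_name USING SUM(amount) GROUP BY region_name;
```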
One of #STOIC's most useful features is its signature Summary Charts, which are these bar charts displayed at the top of every column in a table. They work really well, unless your table has very few rows. Here is how we'll improve them for some interesting corner cases.
Relative Baseline in Bar Plots
When a table has fewer than 50 rows, we replace the Histogram displayed for a numerical column with a Bar Plot visualizing discrete values, but we keep 0 as the baseline. We should use MIN or MAX instead, as we do for cell summaries below.
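One hypothetical way to express that rescaling: with MIN as the baseline, each bar's length becomes proportional to (value - MIN) / (MAX - MIN) rather than value / MAX, which SQL window functions can compute directly:

```sql
-- Hypothetical bar-length computation with a MIN baseline: bars span
-- the observed range instead of starting at zero.
SELECT amount,
       (amount - MIN(amount) OVER ()) /
       NULLIF(MAX(amount) OVER () - MIN(amount) OVER (), 0) AS bar_length
FROM sales_denorm;
```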
Bar Plot with All Equal Values
When a Bar Plot is produced for a set of values that are all equal, we would want to know that at a glance. To do so, the length of bars will be reduced by 50%, while keeping the plot horizontally centered.
Things that I love about @berkeleygfx's design philosophy:
Emergent over prescribed aesthetics.
Expose state and inner workings.
Dense, not sparse.
Performance is design.
Beauty emerges automatically without deliberation.
Do not infantilize users.
"Emergent over prescribed aesthetics."
Our core UI design was totally emergent. We did not follow any trends. We just tried to render as much qualitative and quantitative information as possible, in a well-organized fashion. Aesthetics were a mere by-product.
"Expose state and inner workings."
That's exactly what we do when we expose the "Runtime Boundary" on the left panel: Lambda functions upstream, EC2 monostore downstream.
One of the main factors that is holding back the design of good data tools for less technical users is our lack of trust in their ability to understand complex mathematical concepts like logarithmic scales or percentiles.
While it is true that the vast majority of such users are not familiar or comfortable with these concepts today, they could learn about them over time, and we need to design user interfaces that facilitate this learning.
This is why I am such a strong proponent of what I call progressive user interfaces. These are user interfaces that use advanced concepts surreptitiously. These concepts are there right in front of the user, yet might not be noticed at first, and don't get in the way.
This thread introduced our upcoming SQL query designer and its two modes: graphical and textual. This other thread will cover the graphical mode in more detail, reviewing each clause of SQL's SELECT statement.
The implementation of our new SELECT transform triggered many fascinating discussions at #STOIC. Among them, one was trying to answer the following question: should we encourage users to use SELECT as much as possible, or should we steer them toward simpler Tabula transforms?
For reference purposes, here are Tabula transforms:
By design, Tabula transforms are easier to use and easier to distribute. On the flip side, SQL is an industry standard, and much, much more powerful from a functional and semantic standpoint. But it's also much more difficult to use, and very challenging to distribute.