Tweet

Ismael Ghalimi

Jan 14 • 21 tweets • 8 min read

@duckdb

Now that we have tested all TPC-H queries with our new SELECT transform powered by @duckdb, let's review how #STOIC will produce high-fidelity results for these query results.

One of the most critical design elements that make #STOIC work the way it does is its rich data typing system (Principia Data): each and every column is defined with a precise datatype, which conditions how it is serialized, processed, and analyzed.

stoic.com/principia/data…

Here is the full list of datatypes:

github.com/stoic-doc/Comm…

Whenever we use a Tabula transform other than SELECT, we automatically produce a high-fidelity schema for the resulting table. This is really effective, and one of the main reasons why Tabula is so attractive (alongside its ease of distribution).

STOIC also distinguishes column identifiers from column names. For example, a column might be identified as "o_totalprice", but displayed as "Total Price". This keeps SQL code as clean as possible, while making table outputs as user-friendly as possible.

@duckdb

Currently, outputs of the SELECT transform benefit from neither. Columns are typed by @duckdb automatically, and identifiers and names are the same.. As a result, we get this kind of table:

There really is nothing wrong about this table, but it's not as good as it could be. Instead, I would like richer datatypes, and user-friendly column names. Thankfully, there is a simple solution to fix all this.

If we look at the SELECT clause of the SELECT transform, we see that 2 out of 10 columns could have their names and datatypes automatically inherited from upstream tables (l_returnflag, l_linestatus).

@duckdb

And for the 8 others, we can add inline comments that will give columns proper names and datatypes. In this case, all the datatypes inferred by @duckdb are correct, but we could override them with something like type:level.

What is the "level" datatype you might ask? It's the datatype for a continuous numerical variable that cannot be summed. For example, a temperature. And by default, an average should always be typed as a level.

github.com/stoic-doc/Comm…

And its discrete counterpart is an index:

github.com/stoic-doc/Comm…

And as I was writing these lines, I realized that I forgot to properly type my averages. Here is the fix...

Using inline comments in the SQL code will allow data analysts to provide this information directly within the SELECT transform, instead of having to use a CONFIGURE transform downstream of the SELECT transform. This will keep the data journey as lean as possible.

It will also make the SELECT transform fully self-contained, which will become very important when we add support for standalone Queries packaged as JSON files, using our upcoming File page type. Currently, we only have Journeys, Sheets, and Notebooks:

Meta data added as inline comments within the SQL code will use a very simple syntax that will be trivial to parse: key-value pairs separated with a comma and an optional space, with key and value separated by a colon and optional spaces.

Keys will be user-defined, and we will support the following ones:

- name
- type
- desc

@duckdb

With inline comments, we make this technique database agnostic, which means that it will work with any SQL dialect, instead of being limited to @duckdb's. This is important, because the SQL we generate can be pushed down to an upstream database like @SnowflakeDB or @databricks.

The only technical challenge is that we need to parse the contents of the SELECT clause. Fortunately, SELECT clauses are relatively simple, and we're just looking for patterns like:

AS alias /* ... */

Therefore, we can do it without having to rely on a full SQL parser.

Producing proper schemas for SELECT results will be one of our main projects for the upcoming week. In parallel, we will add support for auto-completion, at least for upstream table and column identifiers.

After that, we will make the UI for SELECT more point-and-click, with nice selectors for tables and columns, as well as graphical editors for editing the predicates of WHERE clauses. But we'll still preserve the ability to write SQL code by hand.

https://twitter.com/ghalimi/status/1612610728359493632

Graphical clause editors will require a proper SQL parser though.

https://twitter.com/ghalimi/status/1612610728359493632

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @ghalimi

Ismael Ghalimi

@ghalimi

Jan 15

https://twitter.com/ghalimi/status/1614266907666055169

Yesterday, I described a version of our upcoming SQL query designer that focused on making it easier to compose SQL queries, while preserving SQL's hierarchical structure. Today, I want to explore an alternative path.

https://twitter.com/ghalimi/status/1614266907666055169

Instead of taking a SQL-oriented approach, I want to take a task-oriented approach. And I want to free myself from SQL's hierarchical structure, while still producing a well-formed SQL query in the end.

The idea is similar to what Excel's pivot table is doing: give the user a relatively simple graphical user interface to pivot a table across multiple dimensions, and produce a query from it (be it a SQL or a DAX query).

Read 61 tweets

Ismael Ghalimi

@ghalimi

Jan 15

One of #STOIC's most useful features is its signature Summary Charts, which are these bar charts displayed at the top of every column in a table. They work really well, unless your table has very few rows. Here is how we'll improve them for some interesting corner cases.

Relative Baseline in Bar Plots

When a table has less than 50 rows, we replace the Histogram displayed for a numerical column with a Bar Plot visualizing discrete values, but we keep 0 as baseline. We should use MIN or MAX instead, as we do for cell summaries below.

Bar Plot with All Equal Values

When a Bar Plot is produced for a set of values that are all equal, we would want to know that at a glance. To do so, the length of bars will be reduced by 50%, while keeping the plot horizontally centered.

Read 18 tweets

Ismael Ghalimi

@ghalimi

Jan 15

@berkeleygfx

Things that I love about @berkeleygfx's design philosophy:

Emergent over prescribed aesthetics.
Expose state and inner workings.
Dense, not sparse.
Performance is design.
Beauty emerges automatically without deliberation.
Do not infantilize users.

"Emergent over prescribed aesthetics."

Our core UI design was totally emergent. We did not follow any trends. We just tried to render as much qualitative and quantitative information as possible, in a well-organized fashion. Aesthetics were a mere by-product.

"Expose state and inner workings."

That's exactly what we do when we expose the "Runtime Boundary" on the left panel: Lambda functions upstream, EC2 monostore downstream.

Read 7 tweets

Ismael Ghalimi

@ghalimi

Jan 15

One of the main factors that is holding back the design of good data tools for less technical users is our lack of trust in their ability to understand complex mathematical concepts like logarithmic scales or percentiles.

While it is true that the vast majority of such users are not yet familiar or comfortable with these concepts today, they could learn about them over time, and we need to design user interfaces that can facilitate this learning.

This is why I am such a strong proponent of what I call progressive user interfaces. These are user interfaces that use advanced concepts surreptitiously. These concepts are there right in front of the user, yet might not be noticed at first, and don't get in the way.

Read 5 tweets

Ismael Ghalimi

@ghalimi

Jan 14

https://twitter.com/ghalimi/status/1614244954657927169

This thread introduced our upcoming SQL query designer and its two modes: graphical and textual. This other thread will cover the graphical mode in more detail, reviewing each clause of SQL's SELECT statement.

https://twitter.com/ghalimi/status/1614244954657927169

@duckdb

As a baseline, we will use this @duckdb documentation for SQL's SELECT:

duckdb.org/docs/sql/state…

This gives us the following clauses:

WITH
SELECT
FROM
USING
WHERE
GROUP BY
WINDOW
ORDER BY
LIMIT

Read 53 tweets

Ismael Ghalimi

@ghalimi

Jan 14

@duckdb

This thread introduced the SQL vs. Tabula discussion. This other thread will cover our SQL query designer. This tool will focus on @duckdb's SQL dialect, but it will be designed to work with virtually any SQL cloud database (@SnowflakeDB, @databricks, etc).

https://twitter.com/ghalimi/status/1614234050981658624

There are many ways to design a SQL query designer, and to design a good one, two critical questions must be answered first: who is the audience, and what is their working environment?

For #STOIC, we must serve two very different user groups:

1. Users who are new to SQL
2. Users who are SQL experts

Read 34 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Ismael Ghalimi

People who liked this thread also liked...

Try unrolling a thread yourself!

More from @ghalimi

Ismael Ghalimi

Ismael Ghalimi

Ismael Ghalimi

Ismael Ghalimi

Ismael Ghalimi

Ismael Ghalimi

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!