Ismael Ghalimi
Jan 15, 2023 · 18 tweets · 6 min read
One of #STOIC's most useful features is its signature Summary Charts: the bar charts displayed at the top of every column in a table. They work really well unless your table has very few rows. Here is how we'll improve them for some interesting corner cases.
Relative Baseline in Bar Plots

When a table has fewer than 50 rows, we replace the Histogram displayed for a numerical column with a Bar Plot visualizing discrete values, but we keep 0 as the baseline. We should use MIN or MAX instead, as we do for cell summaries below.
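As a sketch of that rule, here is one way the relative baseline could be computed. Function names, signatures, and the pixel math are illustrative assumptions, not STOIC's actual code:

```python
def bar_lengths(values, width):
    """Bar lengths (in pixels) for a small-table Bar Plot.

    Instead of always measuring bars from 0, use MIN as the baseline
    for all-positive columns and MAX for all-negative ones, so that
    small differences between values stay visible.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0] * len(values)  # all-equal columns handled separately
    if lo >= 0:
        baseline = lo               # MIN baseline for positive values
    elif hi <= 0:
        baseline = hi               # MAX baseline for negative values
    else:
        baseline = 0                # mixed signs: keep 0 as the baseline
        span = max(hi, -lo)
    return [abs(v - baseline) / span * width for v in values]
```

With values 10, 15, 20 over a 100px column, a zero baseline would render bars of 10, 15, and 20 pixels (barely distinguishable); the MIN baseline renders 0, 50, and 100 pixels instead.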
Bar Plot with All Equal Values

When a Bar Plot is produced for a set of values that are all equal, we want that to be obvious at a glance. To do so, the length of the bars will be reduced to 50% of the full range, while keeping the plot horizontally centered.
Bar Plot with All Values Equal to 0

When all values are equal to 0, the length of the bars should be 2px (instead of 1px as we do here). This communicates both that all values are equal and that they are equal to 0 (a very common case).
Bar Plot with All Values Equal to 1

When all values are equal to 1, the bars should become squares (as we do here). This communicates both that all values are equal and that they are equal to 1 (also a very common case).
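The three all-equal cases above can be sketched as a single geometry rule. The pixel sizes and the centered alignment of the 0 and 1 cases are assumptions on my part:

```python
def degenerate_bar(values, width, bar_height):
    """Bar geometry when every value in the column is equal.

    - all equal to 0: a fixed 2px sliver (there is data, and it is 0)
    - all equal to 1: a square (length equal to the bar height)
    - all equal to anything else: 50% of the full range
    Bars are horizontally centered. Returns (x_offset, length) in px.
    """
    v = values[0]
    assert all(x == v for x in values), "only for all-equal columns"
    if v == 0:
        length = 2.0
    elif v == 1:
        length = float(bar_height)
    else:
        length = width * 0.5
    x_offset = (width - length) / 2  # keep the plot centered
    return x_offset, length
```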
Single Value

When you have a single value, you have very little information to visualize, and we want to communicate that fact clearly. Therefore, we'll replace the single bar with a smaller square, horizontally and vertically centered.
Furthermore, the cell summary shown below contains zero information, and we want to communicate that as well. Therefore, we'll replace it with a 2px-wide tick: left-aligned for positive values, centered for 0, and right-aligned for negative values.
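The tick placement can be sketched as follows. The 2px width is from the thread; the coordinate convention is an assumption:

```python
def single_value_tick(value, width, tick=2):
    """x position (in px, from the left edge) of the 2px tick that
    replaces a single-row cell summary: left-aligned for positive
    values, centered for 0, right-aligned for negative values."""
    if value > 0:
        return 0.0
    if value == 0:
        return (width - tick) / 2
    return float(width - tick)
```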
Single Bin

Whenever a histogram has a single bin, its single bar should have half the width of the column, to make clear that it is a vertical bar, not a horizontal one.

PS: Don't pay attention to the "Single value" title (bug).
Bar Plot with Two Values

When a Bar Plot renders two values, things get interesting. With a non-zero baseline, the top bar will show 100% and the bottom one 0% (on a relative scale), so the Bar Plot ends up visualizing no information at all.
As a result, we must keep 0 as the baseline, as we did on the previous screenshot. But when doing so, we should remove the range bar: it is superfluous, because by design it always starts where the bottom bar stops.
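A quick illustration of why a MIN baseline degenerates with exactly two values, while a zero baseline preserves their actual ratio (hypothetical helper names):

```python
def min_baseline_lengths(values, width):
    """Bars measured from MIN: with exactly two distinct values, the
    result is always 0% and 100%, regardless of the values."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * width for v in values]

def zero_baseline_lengths(values, width):
    """Bars measured from 0: the two bars keep their actual ratio."""
    m = max(abs(v) for v in values)
    return [abs(v) / m * width for v in values]
```

Both (40, 50) and (499, 500) collapse to the same (0%, 100%) pair under a MIN baseline, whereas a zero baseline distinguishes them.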
Baseline Value Cell Summary

Whenever we use a non-zero baseline for cell summaries, cells with values equal to the baseline (MIN for positive columns, MAX for negative ones) are shown with no bar. They should be shown with a 2px-wide bar instead, to show that data is present.
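This amounts to a minimum-length clamp on the cell-summary bar. The 2px floor comes from the thread; the rest of this sketch is assumed, and it only covers single-sign columns (where MIN or MAX baselines apply):

```python
def cell_bar_length(value, lo, hi, width, min_px=2.0):
    """Cell-summary bar length with a non-zero baseline (MIN for
    all-positive columns, MAX for all-negative ones). A value equal
    to the baseline still gets a 2px bar, so the cell never looks
    empty when data is present."""
    baseline = lo if lo >= 0 else hi
    span = hi - lo
    length = abs(value - baseline) / span * width if span else 0.0
    return max(length, min_px)
```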
Percentile Ticks

Percentile ticks in cell summaries should be displayed with a darker color to be more readable. But most importantly, they should have a thickness equal to 1% of the full range's width (100%), because they represent a 1% bin.
Others in Frequency of Frequency Charts

Frequency of Frequency Charts with more than 25 frequencies should display an "Others" bar at the bottom, as we do for all other summary charts.
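The "Others" bucket can be sketched as follows. The 25-entry cutoff comes from the thread; the function itself is an illustrative assumption:

```python
from collections import Counter

def top_frequencies(values, limit=25):
    """Frequency table capped at `limit` entries: keep the `limit`
    most frequent values and collapse the rest into a trailing
    ("Others", count) bucket."""
    counts = Counter(values).most_common()
    if len(counts) <= limit:
        return counts
    head = counts[:limit]
    others = sum(n for _, n in counts[limit:])
    return head + [("Others", others)]
```

For example, with limit=2, the values a, a, a, b, b, c collapse to three bars: ("a", 3), ("b", 2), and ("Others", 1).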
Frequency Chart with All Equal Values

Frequency Charts with all equal values should have bars with length equal to 50% of the full range, while the bar chart remains horizontally centered (same as what should be done for Bar Plots).
Bottom line: univariate summary charts are devilishly tricky. They're all bar charts, but bar charts of many different kinds (bar plots, histograms, frequency charts, frequency of frequency charts), applied either to the data itself or to metadata about the data (e.g. the length of strings).
And depending on the number of rows in the table, the number of bars in the chart, and whether numerical values are signed or not, many different rules must be applied for the chart to make any sense at all. Some of these rules are obvious, but many are not.
Most importantly, some of these rules reflect common practices, but many are innovative and have yet to stand the test of time (e.g. range bars for non-zero baselines). This makes for a very challenging project.
Fortunately, we're finally seeing light at the end of the tunnel...


