Latest Twitter Threads by @ghalimi on Thread Reader App

Feb 1, 2023 • 18 tweets • 5 min read

@haro_ca_ had some interesting questions about OLTP and @DuckDB. This is a vast topic that will require a lot more than a single thread, but let me take a first crack at it.

https://twitter.com/haro_ca_/status/1620812999266287616

The main idea is how to handle OLAP queries on mutable data. OLAP queries on immutable data are pretty straightforward. But as soon as the data can mutate, things start getting interesting. That's what I'm going to focus on with this thread.

Jan 31, 2023 • 7 tweets • 2 min read

When I first came to the Bay Area in 1999, the commercial open source company I founded was the second contributor to the Apache Software Foundation after IBM in terms of number of top-level projects. It failed because of my inexperience and stupidity. For close to 20 years, I was not involved in any major open source project, because I did not think that I could contribute anything of value. I really hope to change that moving forward.

Jan 31, 2023 • 6 tweets • 2 min read

@DuckDB calls itself "an in-process
SQL OLAP database management system", but this tagline can be quite misleading. Because the @DuckDBLabs are modest (and smart), they omit the fact that it has everything needed to make it an OLTP database as well... When you read this page, you learn that "DuckDB provides transactional guarantees (ACID properties) through our custom, bulk-optimized Multi-Version Concurrency Control (MVCC)."

duckdb.org/why_duckdb

Jan 31, 2023 • 8 tweets • 2 min read

I keep mentioning @RedisInc in the context of @PuffinDB, but I've never taken the time to really explain the role it is going to play, and why we have decided to use it. In my opinion, Redis is one of the best pieces of software technology out there. It's really fast, really solid, really well designed, and really well aged. It's really, really good. For the longest time, and it was the most well-liked database, for good reasons.

Jan 30, 2023 • 5 tweets • 2 min read

We had our bi-weekly call with @duckdblabs this morning. Good news all around: @duckdb's SQL parser will soon be exposed through the SQL API, generating an Abstract Syntax Tree (AST) from a SQL query as a JSON. Next, they'll do the same for generating a full relational tree. These two features will be critical for implementing our distributed query planner. And we confirmed that we'll be able to use @substrait_io for exchanging physical query plans from the distributed query engine to @duckdb nodes.

Jan 16, 2023 • 15 tweets • 3 min read

Now that we have a pretty good idea of what the UI will look like, let's focus on the runtime side of things, especially when data gets too large to fit within a single host (2 TB on a host with 4 TB of RAM). In other words, how will the distributed query planner work?

https://twitter.com/ghalimi/status/1615065408599654401

The assumption is that source data is partitioned as a set of Parquet files (with or without Iceberg). From there, if your query is map-reducible, you can use @DuckDB's COMBINE to map it on a fleet of Lambdas, then reduce it on an EC2 VM. That's what #STOIC does today.

Jan 16, 2023 • 4 tweets • 2 min read

Rules of engagement: this Twitter account is my personal account. I use it to liberally share many of the things that we do at #STOIC. While this leaks some of our IP, I tend to think that we gain more than we lose by doing so, because it helps us engage with the community. I use my long threads to capture my stream of consciousness. Writing my ideas down helps me think them through, even if nobody reads them on the other end. This makes for a very painful account to follow, and most followers end up tuning out eventually, but I don't mind.

Jan 16, 2023 • 23 tweets • 8 min read

Here is how #STOIC intends to use this feature. Obviously, there are many applications for a native SQL parser/serializer, but this particular one might give you some ideas for your own projects.

https://twitter.com/ghalimi/status/1615028240615370757

Here is a screenshot of our current UI. For us, SELECT is a transform that can be used in a data journey to generate a new table, either directly from an import, or downstream of other transforms.

Jan 16, 2023 • 4 tweets • 2 min read

Public announcement: if your company is using @DuckDB, you should consider sponsoring @DuckDBLabs (these folks are great to work with). And if you do and your needs are aligned with #STOIC's, we should have a chat about priorities and design requirements. If we pool resources together, we might be able to fund things that would be out of reach for #STOIC on its own, or will take a lot longer to develop, for lack of sufficient resources.

Jan 15, 2023 • 61 tweets • 16 min read

Yesterday, I described a version of our upcoming SQL query designer that focused on making it easier to compose SQL queries, while preserving SQL's hierarchical structure. Today, I want to explore an alternative path.

https://twitter.com/ghalimi/status/1614266907666055169

Instead of taking a SQL-oriented approach, I want to take a task-oriented approach. And I want to free myself from SQL's hierarchical structure, while still producing a well-formed SQL query in the end.

Jan 15, 2023 • 18 tweets • 6 min read

One of #STOIC's most useful features is its signature Summary Charts, which are these bar charts displayed at the top of every column in a table. They work really well, unless your table has very few rows. Here is how we'll improve them for some interesting corner cases.

Relative Baseline in Bar Plots

When a table has less than 50 rows, we replace the Histogram displayed for a numerical column with a Bar Plot visualizing discrete values, but we keep 0 as baseline. We should use MIN or MAX instead, as we do for cell summaries below.

Jan 15, 2023 • 7 tweets • 2 min read

Things that I love about @berkeleygfx's design philosophy:

Emergent over prescribed aesthetics.
Expose state and inner workings.
Dense, not sparse.
Performance is design.
Beauty emerges automatically without deliberation.
Do not infantilize users. "Emergent over prescribed aesthetics."

Our core UI design was totally emergent. We did not follow any trends. We just tried to render as much qualitative and quantitative information as possible, in a well-organized fashion. Aesthetics were a mere by-product.

Jan 15, 2023 • 5 tweets • 1 min read

One of the main factors that is holding back the design of good data tools for less technical users is our lack of trust in their ability to understand complex mathematical concepts like logarithmic scales or percentiles. While it is true that the vast majority of such users are not yet familiar or comfortable with these concepts today, they could learn about them over time, and we need to design user interfaces that can facilitate this learning.

Jan 14, 2023 • 53 tweets • 14 min read

This thread introduced our upcoming SQL query designer and its two modes: graphical and textual. This other thread will cover the graphical mode in more detail, reviewing each clause of SQL's SELECT statement.

https://twitter.com/ghalimi/status/1614244954657927169

As a baseline, we will use this @duckdb documentation for SQL's SELECT:

duckdb.org/docs/sql/state…

Jan 14, 2023 • 34 tweets • 8 min read

This thread introduced the SQL vs. Tabula discussion. This other thread will cover our SQL query designer. This tool will focus on @duckdb's SQL dialect, but it will be designed to work with virtually any SQL cloud database (@SnowflakeDB, @databricks, etc).

https://twitter.com/ghalimi/status/1614234050981658624

There are many ways to design a SQL query designer, and to design a good one, two critical questions must be answered first: who is the audience, and what is their working environment?

Jan 14, 2023 • 26 tweets • 6 min read

The implementation of our new SELECT transform triggered many fascinating discussions at #STOIC. Among them, one was trying to answer the following question: should we encourage users to use SELECT as much as possible, or should we steer them toward simpler Tabula transforms?

For reference purposes, here are Tabula transforms:

Jan 14, 2023 • 21 tweets • 8 min read

Now that we have tested all TPC-H queries with our new SELECT transform powered by @duckdb, let's review how #STOIC will produce high-fidelity results for these query results.

One of the most critical design elements that make #STOIC work the way it does is its rich data typing system (Principia Data): each and every column is defined with a precise datatype, which conditions how it is serialized, processed, and analyzed.

stoic.com/principia/data…

Jan 13, 2023 • 4 tweets • 3 min read

Question for SQL experts: #STOIC has a much richer typing system than @duckdb. Therefore, when we create a column like "revenue", we want to type it (Number) and give it a full name (Total Revenue). What would be the best way to add that information directly in the SQL code?

To be clear, we need a solution that would be compatible with @duckdb's SQL parser and would be as idiomatic as possible. Otherwise, we'll have to invent our own syntax and do some pre-processing. That's always an option, but I'd like to avoid it if possible.

Jan 13, 2023 • 26 tweets • 7 min read

We're now ready to test all 22 queries defined by TPC-H.

github.com/electrum/tpch-…

Q1

Looking good...

Q2

I love nested SELECT statements!

Nov 28, 2022 • 19 tweets • 6 min read

Of all the architecture decisions we made for STOIC, the riskiest was probably the adoption of @github as underlying collaboration engine. After all, most of our users are business people, and most business people have yet to sign up for the service.

Fortunately, it worked. To better understand why things worked out the way they did, let me show you how we integrated with GitHub. Here is what a STOIC workbook looks like. Very similar to Excel, with sheets (we call them pages) at the bottom of the screen.

Nov 28, 2022 • 5 tweets • 2 min read

To our surprise, the idea of deploying our software on the customer's VPC as a CloudFormation template, and charging for a multiple of whatever the customer is charged by @awscloud for compute workloads generated by our software is relatively unusual. So much so that it has taken us over three months and half a dozen meetings with various @awscloud teams to figure out the right way to implement this pricing model using existing infrastructure. But I am willing to bet that it will become more and more popular...

Share this page!

Enter URL or ID to Unroll