Ismael Ghalimi Profile picture
Renaissance steampunk toolmaker
Feb 1, 2023 18 tweets 5 min read
@haro_ca_ had some interesting questions about OLTP and @DuckDB. This is a vast topic that will require a lot more than a single thread, but let me take a first crack at it. The main idea is how to handle OLAP queries on mutable data. OLAP queries on immutable data are pretty straightforward. But as soon as the data can mutate, things start getting interesting. That's what I'm going to focus on with this thread.
Jan 31, 2023 7 tweets 2 min read
When I first came to the Bay Area in 1999, the commercial open source company I founded was the second contributor to the Apache Software Foundation after IBM in terms of number of top-level projects. It failed because of my inexperience and stupidity. For close to 20 years, I was not involved in any major open source project, because I did not think that I could contribute anything of value. I really hope to change that moving forward.
Jan 31, 2023 6 tweets 2 min read
@DuckDB calls itself "an in-process
SQL OLAP database management system", but this tagline can be quite misleading. Because the @DuckDBLabs are modest (and smart), they omit the fact that it has everything needed to make it an OLTP database as well... When you read this page, you learn that "DuckDB provides transactional guarantees (ACID properties) through our custom, bulk-optimized Multi-Version Concurrency Control (MVCC)."

duckdb.org/why_duckdb
Jan 31, 2023 8 tweets 2 min read
I keep mentioning @RedisInc in the context of @PuffinDB, but I've never taken the time to really explain the role it is going to play, and why we have decided to use it. In my opinion, Redis is one of the best pieces of software technology out there. It's really fast, really solid, really well designed, and really well aged. It's really, really good. For the longest time, and it was the most well-liked database, for good reasons.
Jan 30, 2023 5 tweets 2 min read
We had our bi-weekly call with @duckdblabs this morning. Good news all around: @duckdb's SQL parser will soon be exposed through the SQL API, generating an Abstract Syntax Tree (AST) from a SQL query as a JSON. Next, they'll do the same for generating a full relational tree. These two features will be critical for implementing our distributed query planner. And we confirmed that we'll be able to use @substrait_io for exchanging physical query plans from the distributed query engine to @duckdb nodes.
Jan 16, 2023 15 tweets 3 min read
Now that we have a pretty good idea of what the UI will look like, let's focus on the runtime side of things, especially when data gets too large to fit within a single host (2 TB on a host with 4 TB of RAM). In other words, how will the distributed query planner work? The assumption is that source data is partitioned as a set of Parquet files (with or without Iceberg). From there, if your query is map-reducible, you can use @DuckDB's COMBINE to map it on a fleet of Lambdas, then reduce it on an EC2 VM. That's what #STOIC does today.
Jan 16, 2023 4 tweets 2 min read
Rules of engagement: this Twitter account is my personal account. I use it to liberally share many of the things that we do at #STOIC. While this leaks some of our IP, I tend to think that we gain more than we lose by doing so, because it helps us engage with the community. I use my long threads to capture my stream of consciousness. Writing my ideas down helps me think them through, even if nobody reads them on the other end. This makes for a very painful account to follow, and most followers end up tuning out eventually, but I don't mind.
Jan 16, 2023 23 tweets 8 min read
Here is how #STOIC intends to use this feature. Obviously, there are many applications for a native SQL parser/serializer, but this particular one might give you some ideas for your own projects. Here is a screenshot of our current UI. For us, SELECT is a transform that can be used in a data journey to generate a new table, either directly from an import, or downstream of other transforms.
Jan 16, 2023 4 tweets 2 min read
Public announcement: if your company is using @DuckDB, you should consider sponsoring @DuckDBLabs (these folks are great to work with). And if you do and your needs are aligned with #STOIC's, we should have a chat about priorities and design requirements. If we pool resources together, we might be able to fund things that would be out of reach for #STOIC on its own, or will take a lot longer to develop, for lack of sufficient resources.
Jan 15, 2023 61 tweets 16 min read
Yesterday, I described a version of our upcoming SQL query designer that focused on making it easier to compose SQL queries, while preserving SQL's hierarchical structure. Today, I want to explore an alternative path. Instead of taking a SQL-oriented approach, I want to take a task-oriented approach. And I want to free myself from SQL's hierarchical structure, while still producing a well-formed SQL query in the end.
Jan 15, 2023 18 tweets 6 min read
One of #STOIC's most useful features is its signature Summary Charts, which are these bar charts displayed at the top of every column in a table. They work really well, unless your table has very few rows. Here is how we'll improve them for some interesting corner cases. Relative Baseline in Bar Plots

When a table has less than 50 rows, we replace the Histogram displayed for a numerical column with a Bar Plot visualizing discrete values, but we keep 0 as baseline. We should use MIN or MAX instead, as we do for cell summaries below.
Jan 15, 2023 7 tweets 2 min read
Things that I love about @berkeleygfx's design philosophy:

Emergent over prescribed aesthetics.
Expose state and inner workings.
Dense, not sparse.
Performance is design.
Beauty emerges automatically without deliberation.
Do not infantilize users. "Emergent over prescribed aesthetics."

Our core UI design was totally emergent. We did not follow any trends. We just tried to render as much qualitative and quantitative information as possible, in a well-organized fashion. Aesthetics were a mere by-product.
Jan 15, 2023 5 tweets 1 min read
One of the main factors that is holding back the design of good data tools for less technical users is our lack of trust in their ability to understand complex mathematical concepts like logarithmic scales or percentiles. While it is true that the vast majority of such users are not yet familiar or comfortable with these concepts today, they could learn about them over time, and we need to design user interfaces that can facilitate this learning.
Jan 14, 2023 53 tweets 14 min read
This thread introduced our upcoming SQL query designer and its two modes: graphical and textual. This other thread will cover the graphical mode in more detail, reviewing each clause of SQL's SELECT statement. As a baseline, we will use this @duckdb documentation for SQL's SELECT:

duckdb.org/docs/sql/state…
Jan 14, 2023 34 tweets 8 min read
This thread introduced the SQL vs. Tabula discussion. This other thread will cover our SQL query designer. This tool will focus on @duckdb's SQL dialect, but it will be designed to work with virtually any SQL cloud database (@SnowflakeDB, @databricks, etc). There are many ways to design a SQL query designer, and to design a good one, two critical questions must be answered first: who is the audience, and what is their working environment?
Jan 14, 2023 26 tweets 6 min read
The implementation of our new SELECT transform triggered many fascinating discussions at #STOIC. Among them, one was trying to answer the following question: should we encourage users to use SELECT as much as possible, or should we steer them toward simpler Tabula transforms? Image For reference purposes, here are Tabula transforms: ImageImage
Jan 14, 2023 21 tweets 8 min read
Now that we have tested all TPC-H queries with our new SELECT transform powered by @duckdb, let's review how #STOIC will produce high-fidelity results for these query results. Image One of the most critical design elements that make #STOIC work the way it does is its rich data typing system (Principia Data): each and every column is defined with a precise datatype, which conditions how it is serialized, processed, and analyzed.

stoic.com/principia/data… Image
Jan 13, 2023 4 tweets 3 min read
Question for SQL experts: #STOIC has a much richer typing system than @duckdb. Therefore, when we create a column like "revenue", we want to type it (Number) and give it a full name (Total Revenue). What would be the best way to add that information directly in the SQL code? Image To be clear, we need a solution that would be compatible with @duckdb's SQL parser and would be as idiomatic as possible. Otherwise, we'll have to invent our own syntax and do some pre-processing. That's always an option, but I'd like to avoid it if possible.
Jan 13, 2023 26 tweets 7 min read
We're now ready to test all 22 queries defined by TPC-H.

github.com/electrum/tpch-…

Q1

Looking good... Image Q2

I love nested SELECT statements! ImageImage
Nov 28, 2022 19 tweets 6 min read
Of all the architecture decisions we made for STOIC, the riskiest was probably the adoption of @github as underlying collaboration engine. After all, most of our users are business people, and most business people have yet to sign up for the service.

Fortunately, it worked. To better understand why things worked out the way they did, let me show you how we integrated with GitHub. Here is what a STOIC workbook looks like. Very similar to Excel, with sheets (we call them pages) at the bottom of the screen.
Nov 28, 2022 5 tweets 2 min read
To our surprise, the idea of deploying our software on the customer's VPC as a CloudFormation template, and charging for a multiple of whatever the customer is charged by @awscloud for compute workloads generated by our software is relatively unusual. So much so that it has taken us over three months and half a dozen meetings with various @awscloud teams to figure out the right way to implement this pricing model using existing infrastructure. But I am willing to bet that it will become more and more popular...