.@alexbdebrie has another excellent post that details the benefits and downsides of using a Single-Table design with @DynamoDB. While I completely agree with him on the “benefits”, I have some thoughts on his “downsides” that I’d like to address. 🧵 alexdebrie.com/posts/dynamodb…
Downside #1: “The steep learning curve to understand single-table design”

There is no doubt that “thinking in #NoSQL” is a complete departure from traditional #RDBMS, but understanding how to correctly denormalize data is applicable to both single- AND multi-table designs.
If you are using a multi-table design in @DynamoDB that implements 3NF, then just STOP! Seriously, this is *beyond* wrong (I think presidents have been impeached for this). This is not what #NoSQL was designed for and you will get ZERO benefit from doing this. Spin up an RDBMS.
So if you are using a multi-table design and are denormalizing your data, then the same principles of SINGLE-table design apply. Every item in your table *MUST* have a PARTITION KEY and a SORT KEY! (If you don’t have an SK, then DynamoDB becomes a simple K/V store.)
Think of it this way: your PARTITION KEY is the FOLDER, and your SORT KEY is the ITEM within that folder. Different entity types are already *logically* separated in a single table, using multiple tables to *physically* separate them is like storing data on another hard drive.
With any wide column #NoSQL database, the attributes of each ITEM are *not* bound by any schema or forced set of columns (other than PK and SK). This means that different ENTITY types *DO NOT* even need to share attribute names (and probably shouldn’t).
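To make that concrete, here’s a rough sketch in Python with boto3 (the table name, key names, and entity prefixes are all illustrative, not prescriptive):

```python
import boto3

# Hypothetical single table "AppTable" with generic key names PK and SK.
table = boto3.resource("dynamodb").Table("AppTable")

# Two different entity types living side by side. The PK acts like the
# "folder" (everything for customer c123), the SK identifies the item
# within it. Beyond PK/SK, the attribute sets don't need to match.
table.put_item(Item={
    "PK": "CUSTOMER#c123",
    "SK": "PROFILE",
    "EntityType": "Customer",  # handy later for streams and analytics
    "Name": "Ada Lovelace",
    "Email": "ada@example.com",
})
table.put_item(Item={
    "PK": "CUSTOMER#c123",          # same "folder"...
    "SK": "ORDER#2020-02-06#o456",  # ...different item
    "EntityType": "Order",
    "Status": "SHIPPED",
    "Total": 42,
})
```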
Therefore, physically separating data into tables with homogeneous schemas is not only unnecessary, it’s inefficient: it requires more configuration & capacity planning (on-demand pricing aside), adds complexity to the application layer, and, IMO, is just plain wrong.
Even if you are not taking advantage of relational modeling in #NoSQL, you can still store all your entities in a single table. IF you start with this approach, you can likely evolve to a relational model by adding a few GSIs. Which takes me to the next “downside”…
Downside #2: “The inflexibility of adding new access patterns”

For the longest time I was firmly in this camp. I'd pore over single-table designs trying to reuse GSIs w/ generic attributes across entities, utilize inverted indexes, and implement composite keys. Then came re:Invent 2019.
During @houlihan_rick's 2019 talk on Advanced Design Patterns for DynamoDB, I picked up on an evolution of #NoSQL modeling that completely changed the way I approach single-table designs: “denormalization *WITHIN* an item.”
Let me explain. There are a number of tricks and techniques (like inverted indexes, adjacency lists, etc.) that have been used to build some spectacular relational models. The problem is that the normal implementation of these methods is *very inflexible*.
This is typically because we REUSE common attributes. With inverted indexes, for example, we create a GSI that flips the PK and the SK. Since every item has a PK and an SK, we are replicating every item to the index, leaving us with lots of partitions that have little value.
An SK value of “1581006801” that becomes our PARTITION KEY on our GSI is likely useless to us. We are wasting storage space and WCUs. This often leads us down the path of trying to make *ALL* SKs meaningful keys on the GSI. This is hard, confusing, and ultimately a fool’s errand.
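For reference, a plain inverted index looks something like this (a sketch with boto3, assuming an on-demand table so no ProvisionedThroughput block is needed; names are illustrative):

```python
import boto3

client = boto3.client("dynamodb")

# A classic inverted index: a GSI that simply swaps PK and SK. Because
# *every* item has both keys, *every* item gets replicated into it,
# useful or not (exactly the waste described above).
client.update_table(
    TableName="AppTable",  # hypothetical table from the earlier sketch
    AttributeDefinitions=[
        {"AttributeName": "PK", "AttributeType": "S"},
        {"AttributeName": "SK", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[{
        "Create": {
            "IndexName": "InvertedIndex",
            "KeySchema": [
                {"AttributeName": "SK", "KeyType": "HASH"},
                {"AttributeName": "PK", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    }],
)
```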
But there’s a better way! Instead of adding a GSI to an attribute that’s trying to serve multiple purposes, add it to a *single purpose attribute* and populate it with a value that represents a meaningful PARTITION KEY on the GSI. And do the same thing for the SORT KEY!
In that talk, you’ll see that Rick has attributes named GSI1pk, GSI1sk, GSI2pk, GSI2sk, etc. He then copies data from the *same* item into those attributes so that he has FULL CONTROL over the items in those indexes. This also allows for these indexes to be SPARSE INDEXES.
SPARSE INDEXES *only* copy items that have values in the attributes defined as the GSI’s PARTITION and SORT keys. This means we can be very efficient with our indexes, replicating only the items we actually need. When we add items to the table, we populate these attributes.
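A sketch of what that might look like (continuing the hypothetical AppTable from earlier; the attribute and index names are illustrative):

```python
import boto3

table = boto3.resource("dynamodb").Table("AppTable")

# Only items that should appear in GSI1 get the GSI1pk/GSI1sk
# attributes; items without them never land in the index, which is
# what makes it sparse. The values are copied from the item itself,
# so we fully control how it shows up in the index.
table.put_item(Item={
    "PK": "CUSTOMER#c123",
    "SK": "ORDER#2020-02-06#o456",
    "EntityType": "Order",
    "Status": "SHIPPED",
    "Total": 42,
    # Denormalized *within* the item: GSI1 groups orders by status.
    "GSI1pk": "ORDERSTATUS#SHIPPED",
    "GSI1sk": "2020-02-06#o456",
})
# The PROFILE item has no GSI1pk/GSI1sk, so it stays out of GSI1.
```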
What about flexibility? Well, we know we can easily add and remove GSIs from our table. So using this technique, we can run batch jobs that decorate items with additional attributes that will add them to (or remove them from) indexes. These could be existing GSIs, or new ones.
If we think about this as simply an exercise in adding items to an index, then this approach becomes straightforward. Is it a pain to run a batch job to decorate a bunch of items? Sure. But isn’t it also a pain to add indexes to an RDBMS table and wait hours for it to complete?
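A minimal sketch of such a batch job (assuming the hypothetical AppTable and EntityType attribute from earlier; the new GSI2 keys are illustrative):

```python
import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("AppTable")

# One-off batch job: decorate existing Order items with GSI2 keys so
# they show up in a newly added index.
scan_kwargs = {"FilterExpression": Attr("EntityType").eq("Order")}
while True:
    page = table.scan(**scan_kwargs)
    for item in page["Items"]:
        order_date = item["SK"].split("#")[1]  # e.g. "2020-02-06"
        table.update_item(
            Key={"PK": item["PK"], "SK": item["SK"]},
            UpdateExpression="SET GSI2pk = :pk, GSI2sk = :sk",
            ExpressionAttributeValues={
                ":pk": f"ORDERDATE#{order_date}",
                ":sk": item["SK"],
            },
        )
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```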
I will say that it is decidedly “less easy” to change values on the primary index (because you need to delete and reinsert the item), but it is NOT impossible. Plus, changing a data model is never easy, so if you choose meaningful PKs and SKs from the start, you should be good.
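Here’s a sketch of that delete-and-reinsert, made atomic with a transaction (the new key scheme is purely illustrative; the low-level client API uses the typed wire format shown here):

```python
import boto3

client = boto3.client("dynamodb")

# Changing an item's PK/SK means writing the item under its new key
# and deleting the old one; a transaction keeps the swap atomic.
client.transact_write_items(TransactItems=[
    {"Delete": {
        "TableName": "AppTable",
        "Key": {"PK": {"S": "CUSTOMER#c123"}, "SK": {"S": "PROFILE"}},
    }},
    {"Put": {
        "TableName": "AppTable",
        "Item": {
            "PK": {"S": "USER#c123"},  # hypothetical new key scheme
            "SK": {"S": "PROFILE"},
            "EntityType": {"S": "Customer"},
            "Name": {"S": "Ada Lovelace"},
        },
    }},
])
```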
The other thing about this approach is that we can still implement all those fancy relational modeling techniques. Not only do we have *more* control over them, but we can even implement them *after the fact*. Plus, a few extra attributes are cheaper than copying the whole table.
And by the way, the vast majority of this flexibility is lost if you take the multi-table approach. Store all your entities in ONE SINGLE TABLE, and you open yourself up to lots of really cool possibilities.

And finally…
Downside #3: “The difficulty of exporting your tables for analytics”

Hmm, how about adding a field to each item that identifies its ENTITY TYPE. Problem solved. 😉

But seriously, I think it’s MORE difficult if you have multiple tables. Here’s why…
First of all, there are several ways to get data out of DynamoDB and into some other purpose-built tool for analytics. If you have low-latency requirements, DynamoDB streams are the obvious choice. And the easiest way to consume DynamoDB streams is with Lambda functions…
Every DynamoDB table has its own STREAM, which means you need to attach a Lambda function to each one. That means either creating, maintaining, and paying for a separate Lambda function per stream, or building a single Lambda function that can discern between streams.
But if I follow the latter approach, I can easily use that ENTITY TYPE attribute to discern records from a SINGLE stream, rather than adding the complexity of additional streams. Plus, if I add NEW entity types, I don’t need to create a new table, a new stream, and a new Lambda.
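Roughly what that single stream handler might look like (the EntityType attribute and the downstream sinks are illustrative assumptions, not a prescribed setup):

```python
def ship_to_analytics(image):
    ...  # hypothetical sink, e.g. a delivery stream or bucket

def update_search_index(image):
    ...  # hypothetical sink, e.g. a search cluster

# One Lambda on the single table's stream, routing records by the
# EntityType attribute instead of by table/stream.
def handler(event, context):
    for record in event["Records"]:
        image = record["dynamodb"].get("NewImage") or {}
        entity_type = image.get("EntityType", {}).get("S")
        if entity_type == "Order":
            ship_to_analytics(image)
        elif entity_type == "Customer":
            update_search_index(image)
        # New entity type? Add a branch; no new table, stream, or Lambda.
```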
Now, there *may* be cases where the throughput of a single stream might not be enough to handle the record velocity. If that’s truly the case, then we are likely talking about a highly specialized app that goes well beyond typical modeling. There are other ways to handle these.
So, in conclusion, I want to reiterate that I absolutely love the work that @alexbdebrie is doing for both @DynamoDB and the #serverless community. Definitely check out his blog, buy his book when it comes out, and follow him on Twitter. But… 😬
I don’t think these are “downsides”, I think they’re excuses that developers are using to implement #NoSQL incorrectly. There are absolutely times when separate tables (like for time series data) make sense. But in the VAST majority of cases, MULTI-TABLE is just plain WRONG. 🤷‍♂️