A few things I think were done right in that Snowflake paper in 2016 (reporting on 4-year old work!):
1. They got the ergonomics of the cloud early (for the space). Elasticity is a core feature. Storage is no longer a 'you' problem. Cost is compute-as-you go.
2. They had ACID out-of-the-box. This was a huge step forward over the state-of-the-art MPP analytical engines. It helps to have some of the best database engineers in the world on your founding team.
3. They understood that SaaS was the right model and weren't constrained trying to build a solution that worked well for on-prem, unmanaged cloud and managed cloud customers (subtweeting myself of five-plus years ago).
4. They embraced object storage, and invested in what was missing to make it work for a DB engine, rather than trying to make HDFS happen in the cloud:
(4a) I remember sitting in on a meeting with a company that were working on an S3 cache a few years ago. If we had been laser-focused on the cloud and on object storage I think I might have 'got it'.
5. They had some really nice ideas around nested data, and taking advantage of cheap storage to make it more efficient in S3:
(5a) I remember it being pretty obvious that cheap storage should be taken advantage of - just like Vertica materialises > 1 projections - but something about HDFS storage being on-prem and shared meant that customers were often surprisingly reluctant to let the DB use the space.
last tweet: I think we did a lot of things right in Impala - and it's still processing millions of queries worldwide as far as I know - but I think it's fair to say that being more <shudder> cloud native would have been a strategic advantage.
• • •
Missing some Tweet in this thread? You can try to
force a refresh
Buried in the "deploy on Friday!!" argument is an interesting generalisation of 'crash-only' software; the idea that you should design features to fail *fast* so that you can detect and react close to deploy time.
I don't think this is actually possible in the general case.
Infrastructure software can go wrong sloooowwwwly, in ways that are triggered only by prod traffic. Wiithout any data, I suggest that for this class of bugs maybe 80% of them could be caught within three days, whereas only 30% of them might show up in the first 8-10 hours.
That's not a systematic problem, so much as it's endemic to software that manages resources. Not deploying on Friday just means that you want to keep your expected number of weekend bugs below a minimum, because weekends are _different_, which seems fine to me.
This turned into a great discussion. I was looking for resources I could give to someone designing their first few systems; advice and suggestions about how to structure their thoughts, critique their designs, and iterate.
Ok, perhaps let's pump the brakes for a second. Systems and database engineering isn't dead; this paper didn't just replace Postgres with a GPU and a copy of the deep learning book. (1/n)
I mean, what's been done here is fascinating (partly in its simplicity). Replacing B-Trees - which are functions from keys to ranges on disks - with a learned function is kind of cool.
The fact that the learned representation is more compact is very neat. But also it's not really a surprise that, given the entire dataset, we can construct a more compact function than a B-tree which is *designed* to support efficient updates.