1/ Today we at Elementl are excited to launch an early release of Dagster, an open-source Python library for building data applications. Here's a post about what Dagster is, why I moved to data infra, why data is hard, and why we need a new system. medium.com/p/dbd28442b2b7
2/ We also did a talk at Data Council a couple of months ago. The code samples in the talk are out of date (we’ve been working hard!) but the core themes and tooling demos are still relevant.
3/ What is Dagster? A library for building a logical graph of functional computations that consume and produce data assets. This is how we define "data application". Computations themselves can be anything: Spark, SQL, Python, etc

GitHub: github.com/dagster-io/dag…
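To make "a logical graph of functional computations" concrete, here is a minimal sketch in plain Python. It is not Dagster's API: the graph format, the `execute` helper, and the order data are all hypothetical and exist only for illustration. Each node is a function, and the graph records which upstream outputs each node consumes.

```python
from graphlib import TopologicalSorter

# Hypothetical mini-framework, NOT Dagster's API: every node is a plain
# function, and the graph maps a node name to (function, upstream names).

def load_orders():
    # Stand-in for reading a raw data asset (made-up sample rows).
    return [
        {"id": 1, "amount": 30.0},
        {"id": 2, "amount": -5.0},
        {"id": 3, "amount": 12.5},
    ]

def drop_invalid(orders):
    # Consumes one asset and produces a cleaned one.
    return [o for o in orders if o["amount"] > 0]

def total_revenue(orders):
    # Consumes the cleaned asset and produces a single number.
    return sum(o["amount"] for o in orders)

GRAPH = {
    "load_orders": (load_orders, []),
    "drop_invalid": (drop_invalid, ["load_orders"]),
    "total_revenue": (total_revenue, ["drop_invalid"]),
}

def execute(graph):
    """Run every node in dependency order and return all produced assets."""
    dependencies = {name: set(deps) for name, (_, deps) in graph.items()}
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        fn, deps = graph[name]
        results[name] = fn(*(results[d] for d in deps))
    return results

results = execute(GRAPH)
print(results["total_revenue"])
```

Because the dependencies are declared as data rather than buried in function bodies, the same description can be inspected, visualized, or run by different tools.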
4/ By adopting this library, builders and operators gain access to new tooling, built on an API. These tools are meant for visualization, configuration, local development, testing, monitoring, etc. Because it's an open, well-defined API, it's also a tool-building platform.
5/ These computational graphs are (a) abstract and (b) queryable and operable over an API. They can be deployed to arbitrary compute targets, e.g. Airflow, Dask, FaaS, k8s-based engines. Dagster tools are shared regardless of physical compute substrates.
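A rough sketch of what "abstract, queryable, and deployable to different compute targets" could look like, using only the Python standard library. Nothing here is Dagster's API: `describe`, `run_in_process`, and `run_threaded` are hypothetical names, and a thread pool merely stands in for a real target such as Airflow or Dask.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Hypothetical graph description: node name -> (function, upstream names).
def extract():
    return [1, 2, 3, 4]

def double(xs):
    return [x * 2 for x in xs]

def square(xs):
    return [x * x for x in xs]

def combine(a, b):
    return sum(a) + sum(b)

GRAPH = {
    "extract": (extract, []),
    "double": (double, ["extract"]),
    "square": (square, ["extract"]),
    "combine": (combine, ["double", "square"]),
}

def _dependencies(graph):
    return {name: set(deps) for name, (_, deps) in graph.items()}

def describe(graph):
    """Return the graph structure as JSON so tools can query it without running it."""
    return json.dumps({name: deps for name, (_, deps) in graph.items()}, sort_keys=True)

def run_in_process(graph):
    """Target 1: run nodes one at a time in topological order."""
    results = {}
    for name in TopologicalSorter(_dependencies(graph)).static_order():
        fn, deps = graph[name]
        results[name] = fn(*(results[d] for d in deps))
    return results

def run_threaded(graph, max_workers=4):
    """Target 2: submit every node whose inputs are ready to a thread pool."""
    sorter = TopologicalSorter(_dependencies(graph))
    sorter.prepare()
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            futures = {
                name: pool.submit(graph[name][0], *(results[d] for d in graph[name][1]))
                for name in sorter.get_ready()
            }
            for name, future in futures.items():
                results[name] = future.result()
                sorter.done(name)
    return results

print(describe(GRAPH))
print(run_in_process(GRAPH)["combine"], run_threaded(GRAPH)["combine"])
```

The graph definition never mentions how it is executed, which is what lets both runners, and any tooling built on `describe`, share it unchanged.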
6/ Origin: I left FB in Feb ‘17 and started looking for my next challenge. I kept on hearing from people that their biggest tech problem was "their data is totally broken". I didn't understand what that meant initially.
7/ The most direct expression of this is when people say “I spent 80% of my time cleaning the data, and 20% of my time doing my job.” While they say that, they are actually describing deeper pathologies.
8/ Taking this statement literally, one would work exclusively on making data cleaning faster. However, that is not what people *mean*. They mean they waste lots of time: building one-off infra, doing systemically repetitive things, being unable to truly build on others' work, etc.
9/ This reminded me of the frontend ecosystem circa a decade ago. Back then engineers would say they spent "80% of their time fighting the browser, and 20% of their time building their app". Again, they said one thing and meant another. The problem was primarily software abstraction.
10/ Fast forward 10 years, and no one says that anymore in frontend. Browsers got better but it is the software abstractions that proved decisive, especially but not exclusively React. People still complain, but no one really says they waste 80% of their time.
11/ React got a lot of things right. It defined its domain well, nailed the abstraction for that domain, brought formal comp sci constructs to frontend and made them approachable, and was both a step-function improvement and incrementally adoptable.
12/ React also respected the discipline. Devs were not scripting web pages; they were building full apps. React acknowledged the *essential* complexity of this domain and built constructs to match that complexity. JS used to be considered eng backwater. No longer true.
13/ We believe the data domain is on the cusp of a similar transition, and we want to help drive that. Data engs/scientists should no longer be stitching together disconnected jobs. They should be building full data applications.
14/ We believe that ETL, ELT, ML pipelines, data integration, etc. are a single category of software. ETL produces a file/table; an ML pipeline produces a model. Other than that, they are structurally similar or identical: they are data applications.
15/ We define data applications as graphs of functional computations that produce and consume data assets. They are increasingly complex and mission-critical to businesses today. They also require unique approaches because they have unique properties.
16/ First, data apps don’t control their inputs. A normal app can reject invalid input from users. Not true with data apps. Incoming data changes all the time. You can't update the data, so you have to update the code. Data apps must account for this unfortunate reality.
17/ Data apps are multi-tool and multi-persona. Often you have analysts, eng, and data eng/science all collaborating on the same logical app. They use a variety of tools (Spark, data warehouses, notebooks, Python, etc.). A massive amount of context is lost as data flows across tool boundaries.
18/ Data apps are really hard to test. They have dependencies on external, hosted services (e.g. Redshift, Snowflake) or heavyweight runtimes (e.g. Spark). Business logic is encoded in these systems. You cannot faithfully mock or fake them; doing so is too much effort.
19/ High latency and computationally intensive workloads make for extraordinarily long developer feedback loops. They can take hours when they should ideally take seconds. Changing the system carries a very high cost. This can easily result in poorly structured systems with low code quality and low productivity.
20/ We’re not claiming to “solve” testability, but providing a software structure makes it more possible. We’re not claiming to make the impossible easy; we are claiming that we can make the impossible possible.
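One way structure can make testing "more possible", sketched in plain Python under stated assumptions: the `Context` class, the `read_table` resource, and the latency data are hypothetical and not taken from Dagster. The idea is to isolate the one step that touches an external system so the remaining business logic can run against small in-memory data. It does not faithfully fake a warehouse; it only shrinks the part that needs one.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, float]

@dataclass
class Context:
    # Hypothetical container for environment-specific resources.
    # In production, read_table would query a real warehouse.
    read_table: Callable[[str], List[Row]]

def fetch_events(context: Context) -> List[Row]:
    # The only step that touches the external system.
    return context.read_table("events")

def average_latency(rows: List[Row]) -> float:
    # Pure business logic: testable without any warehouse or cluster.
    if not rows:
        return 0.0
    return sum(r["latency_ms"] for r in rows) / len(rows)

def run(context: Context) -> float:
    # The "application": the same wiring in every environment.
    return average_latency(fetch_events(context))

def in_memory_reader(table: str) -> List[Row]:
    # Test double backed by a tiny, made-up fixture.
    fixtures = {"events": [{"latency_ms": 100.0}, {"latency_ms": 300.0}]}
    return fixtures[table]

test_context = Context(read_table=in_memory_reader)
print(run(test_context))
```

With this shape, a feedback loop over `average_latency` takes milliseconds, and only `fetch_events` still needs an integration test against the real system.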
21/ We believe that these issues are best addressed with a software abstraction. In this case, we believe there should be a layer that can describe and model a data app regardless of programming language, computational runtime, orchestration engine etc.
22/ For the GraphQL-aware: Structurally this serves a similar role in data as GraphQL does in the API domain. It is a software abstraction backed by arbitrary compute that one can build shared tooling on top of and deploy to any infrastructure. The type system, metadata, etc. are software-defined.
23/ We are early with this project and looking for just a few additional design partners/adopters to work with. The idea is to directly work/embed with your team and get into a fast feedback cycle etc to ensure that you are successful. DMs open or email hello at elementl dot com
24/ We are also looking for additional founding team members! All the way from full stack and dev tools/PL folks to data eng/science. Must have a passion for tools and a belief that abstractions can reshape not just dev workflows, but orgs and industries. DMs open or email (see above)
Subscribe to Nick Schrock