there's so much content on how to build AI agents, but no one ever talks about the data engineering pipelines that support them
here's a thread going over the basics of data engineering:
the goal of data engineering:
- extract data from various sources
- transform it into structured format
- load into a data warehouse like Snowflake
and this structured data is often used as context for AI systems to make personalized recommendations
why this matters for AI:
to build personalized AI systems, you need clean, structured data
the more structured and labeled it is, the more granular and accurate the context we can retrieve for an AI system
step 1: extract data from sources
where is your data coming from?
- Google Sheets (pull via API)
- websites (web scraping)
- third-party APIs
- existing databases
this part depends entirely on what data you're trying to collect
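for example, here's a minimal extract sketch using the gspread library - the service-account file and the "contracts" sheet name are placeholders I made up, not a real setup:

```python
# extract: pull rows out of a Google Sheet via the gspread library
# assumes a service-account key at service_account.json and a sheet
# named "contracts" shared with that account (both are placeholders)
import gspread

gc = gspread.service_account(filename="service_account.json")
sheet = gc.open("contracts").sheet1

# get_all_records() turns each row into a dict keyed by the header row
raw_rows = sheet.get_all_records()
print(raw_rows[:3])
```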
step 2: transform the raw data
raw data is messy and unstructured - contract text, website content, whatever
you need to:
- handle missing values
- add structure to make it consistent
- standardize formats across different sources
AI is often used in transformation pipelines as well
an example I’ve done: extracting entities from contract text like specific clauses, start/end dates, party names
raw text obviously isn't in a clean tabular format
so we use AI to pull information from unstructured documents and put it in a tabular format
(think of a google sheet with a column for each clause and key entity we want to collect)
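here's a rough sketch of what that extraction step can look like - the model name, prompt, and JSON keys are all assumptions for illustration, not the exact pipeline I built:

```python
# transform: use an LLM to turn unstructured contract text into a row
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_entities(contract_text: str) -> dict:
    # the keys below are an assumed schema - one column per entity
    prompt = (
        "Extract the following from this contract as JSON with keys "
        "party_names (list), start_date, end_date, termination_clause:\n\n"
        + contract_text
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# each contract becomes one tabular row, ready for a warehouse table
row = extract_entities("This agreement between Acme Corp and ...")
```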
step 3: load the data into a data warehouse
push the cleaned, structured data into Snowflake or similar warehouse
and now you have tables of organized data that you can easily query and use as context
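a minimal load sketch with the snowflake-connector-python package - every connection parameter and the contracts table are placeholders:

```python
# load: push structured rows into a Snowflake table
import snowflake.connector

conn = snowflake.connector.connect(
    user="YOUR_USER", password="YOUR_PASSWORD",
    account="YOUR_ACCOUNT", warehouse="COMPUTE_WH",
    database="ANALYTICS", schema="PUBLIC",
)
cur = conn.cursor()
cur.executemany(
    "INSERT INTO contracts (party_names, start_date, end_date) "
    "VALUES (%s, %s, %s)",
    [("Acme Corp; Beta LLC", "2024-01-01", "2025-01-01")],  # example row
)
cur.close()
conn.close()
```

executemany is fine for small batches - at real volume you'd typically stage files or use the connector's write_pandas helper instead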
the tools involved:
- Python for scripting
- AWS for cloud infrastructure
- Airflow for workflow orchestration
- Snowflake for data warehousing
- DBT for data transformations
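to show how these fit together, here's a toy Airflow DAG - the task bodies are stubs, the point is the daily extract -> transform -> load chain:

```python
# a toy Airflow DAG chaining the three steps (Airflow 2.4+ TaskFlow API)
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def contract_pipeline():
    @task
    def extract() -> list[dict]:
        return [{"text": "..."}]  # pull from Sheets/APIs/scrapers here

    @task
    def transform(rows: list[dict]) -> list[dict]:
        return rows  # clean, standardize, entity-extract here

    @task
    def load(rows: list[dict]) -> None:
        pass  # write to Snowflake here

    load(transform(extract()))

contract_pipeline()
```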
there's an entire tech stack that comes with this stuff and that's why data engineers get paid so well
note as well: reliability is critical with data pipelines
if it breaks anywhere, every AI system built on top of it is cooked
bad data here = broken systems downstream
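here's the kind of defensive code that keeps bad data out - the retry counts and required fields are made-up examples:

```python
# reliability: retry flaky sources and refuse to load invalid rows
import time

def fetch_with_retry(fetch, retries=3, backoff=2.0):
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # surface to monitoring/alerting, don't fail silently
            time.sleep(backoff * (attempt + 1))

def validate(rows):
    # fail loudly rather than load rows missing required fields
    required = {"party_names", "start_date", "end_date"}
    bad = [r for r in rows if not required <= r.keys()]
    if bad:
        raise ValueError(f"{len(bad)} rows failed validation")
    return rows
```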
to recap data engineering basics:
- extract data from various sources (APIs, scraping, databases)
- transform raw data into structured format (clean, standardize, extract entities)
- load into data warehouse
- build reliably with error handling and monitoring
how to reverse engineer any successful AI product:
step 1: understand the manual process
before diving into a technical analysis, figure out what human task this AI product is automating
> what would someone do manually to achieve the same result?
> what decisions need to be made?
> what data is required at each step?
> what is the most painful part of this task that people are paying to automate?
step 2: create your own technical hypothesis
based on your knowledge of AI fundamentals (embeddings, RAG, APIs, etc.)
sketch out how YOU would build this
don't overthink it - focus on the core workflow and data flow
how to build your first AI agent (complete roadmap):
step 1: find a real problem worth solving
forget about AI for a second and think about tasks that:
> take up hours of someone's time every week
> are repetitive and monotonous
> cost the business real money when delayed
> currently require employees to do manually
classic example: customer support tickets
responding to the same questions over and over again eats up tons of time
but it's critical for keeping customers happy
this is the type of problem where an AI agent can actually provide real value