A key decision you have to make for each ML problem is whether to: (1) buy a vendor's pre-built solution, or (2) build your own.
Make this decision based on whether you have access to more data than the vendor.
This is also a handy rule for choosing between vendors.
When building your own ML solution, avoid the temptation to build from scratch.
The best return on investment early in your projects is going to come from collecting more data (both more "rows" and more "columns", by breaking down data silos)
Use standard and/or low-code models
The tech stack depends on the type of ML application:
1. Predictive analytics
2. Unstructured data
3. Automation
4. Recommendations
I'll quickly summarize my recommendations for each, but read the article linked from the headline tweet for more details.
For predictive analytics, the key thing is to use a tech stack where you can keep growing your data and train ML models without data movement
Build an enterprise data warehouse (EDW).
Train BigQuery ML models
When improvements due to data size plateau, build TensorFlow/Keras models that read directly from BigQuery.
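To make "no data movement" concrete, here is a minimal sketch, assuming hypothetical dataset, table, and column names, of training and evaluating a BigQuery ML model from Python; the model type and features would of course depend on your problem.

```python
# Minimal sketch (hypothetical dataset/table/column names) of training a
# BigQuery ML model without moving data out of the warehouse.
from google.cloud import bigquery

client = bigquery.Client()

# Train a logistic regression model directly where the data lives.
client.query("""
    CREATE OR REPLACE MODEL `mydataset.churn_model`
    OPTIONS(model_type='logistic_reg', input_label_cols=['churned']) AS
    SELECT churned, tenure_months, monthly_spend, num_support_calls
    FROM `mydataset.customers`
""").result()

# Evaluate the model with plain SQL, again without data movement.
for row in client.query(
        "SELECT * FROM ML.EVALUATE(MODEL `mydataset.churn_model`)").result():
    print(dict(row))
```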
For unstructured data, the ROI of AutoML is hard to beat for small to medium data sizes.
For large data sizes, use pre-built models that have already been written to use TPUs efficiently. Start with transfer learning, then fine-tuning, and only then train from scratch.
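Here is a minimal Keras sketch of that transfer-learning-then-fine-tuning progression; the backbone, image size, class count, and dataset are illustrative assumptions, not part of the original thread.

```python
# Minimal Keras sketch of transfer learning followed by fine-tuning
# (backbone, input size, and number of classes are assumptions).
import tensorflow as tf

base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
base.trainable = False  # Step 1: transfer learning -- freeze the backbone.

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # 5 example classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, epochs=5)   # train only the new classification head

# Step 2: fine-tuning -- unfreeze the backbone and use a small learning rate.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, epochs=5)
```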
For automation, you will be training several models and orchestrating them.
Some of them will be pre-built (e.g., Document AI), others low-code (e.g., BigQuery ML), and still others no-code (e.g., AutoML Video Intelligence).
You need a unified AI platform. Use Vertex AI Pipelines.
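As a rough sketch of what that orchestration looks like, here is a toy Kubeflow Pipelines (kfp) definition of the kind Vertex AI Pipelines can run; the component names and bodies are placeholders, not a prescribed design.

```python
# Toy sketch of orchestrating a few steps as a pipeline for Vertex AI
# Pipelines (component names and bodies are hypothetical placeholders).
from kfp import dsl, compiler

@dsl.component
def extract_documents() -> str:
    # e.g. call a pre-built service such as Document AI here
    return "gs://my-bucket/parsed/"   # hypothetical output location

@dsl.component
def train_model(parsed_path: str) -> str:
    # e.g. kick off a BigQuery ML or AutoML training job here
    return "my-trained-model"         # hypothetical model reference

@dsl.pipeline(name="automation-pipeline")
def automation_pipeline():
    parsed = extract_documents()
    train_model(parsed_path=parsed.output)

# Compile to a spec that can be submitted to Vertex AI Pipelines.
compiler.Compiler().compile(automation_pipeline, "pipeline.json")
```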
For recommendations, you again need an enterprise data warehouse (EDW).
You need one that is well integrated with your transactional databases.
Use Datastream to do change data capture (CDC) into BigQuery.
Start with BigQuery ML. Move to Recommendations AI. Once improvements plateau, train from scratch.
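A minimal sketch of the "start with BigQuery ML" step, assuming hypothetical table and column names: a matrix factorization model trained over transactions replicated into BigQuery by Datastream.

```python
# Minimal sketch (hypothetical table/column names) of a BigQuery ML
# matrix factorization recommender over CDC'd transaction data.
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
    CREATE OR REPLACE MODEL `mydataset.product_recs`
    OPTIONS(model_type='matrix_factorization',
            user_col='customer_id',
            item_col='product_id',
            rating_col='rating',
            feedback_type='implicit') AS
    SELECT customer_id, product_id, COUNT(*) AS rating
    FROM `mydataset.transactions`        -- replicated via Datastream CDC
    GROUP BY customer_id, product_id
""").result()

# Generate recommendations per user, still inside the warehouse.
recs = client.query(
    "SELECT * FROM ML.RECOMMEND(MODEL `mydataset.product_recs`)").result()
```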
And as always, reach out to your Google Cloud account team if you want to talk through your options and brainstorm which approach to start with.
Many data engineers and CIOs tend to underestimate an ironic aspect of a dramatic increase in data volumes.
The larger the data volume gets, the more sense it makes to process the data *more* frequently!
🧵
To see why, say a business creates a daily report based on its website traffic, and this report takes 2 hours to create.
If website traffic grows 4x, the report will take 8 hours to create. So the tech team 4x's the number of machines.
This is wrong-headed!
2/
Instead, consider an approach that makes the reports more timely:
* Compute statistics on 6 hours of data, 4 times a day
* Aggregate four of these 6-hour reports to create the daily report
* You can update your "daily" report four times a day.
* Data in the report is at most 6 hrs old!
3/
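A toy pandas sketch of the same idea, with made-up page-view events: compute partial aggregates per 6-hour window, then roll them up into the daily report.

```python
# Toy sketch: 6-hour partial aggregates rolled up into a daily report
# (made-up data; in practice this would run over warehouse tables).
import pandas as pd

# Hypothetical page-view events with timestamps.
events = pd.DataFrame({
    "ts": pd.to_datetime(["2021-06-01 01:00", "2021-06-01 07:30",
                          "2021-06-01 13:15", "2021-06-01 20:45"]),
    "page": ["/home", "/home", "/pricing", "/home"],
})

# Partial aggregate: page views per page per 6-hour window.
six_hourly = (events
              .groupby([pd.Grouper(key="ts", freq="6h"), "page"])
              .size()
              .rename("views")
              .reset_index())

# Daily report = roll-up of the four 6-hour partial aggregates.
daily = (six_hourly
         .groupby([pd.Grouper(key="ts", freq="1D"), "page"])["views"]
         .sum()
         .reset_index())
print(daily)
```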
Five months later, our ML patterns book is #3 in AI, behind only the top ML intro book and the top research one. Very grateful for the validation ... W/ @SRobTweets amazon.com/Machine-Learni…
Like most authors, we keep hitting F5 to read the reviews 😁 My favorites 🧵👇
"When I was learning C++, I found the Gang of Four book "Design Patterns" accomplished a similar goal to help bridge the gap between academic knowledge and practical software engineering. Much like with the GoF book I suspect I may be re-reading parts of this book in the future"
"must-read for scientists and practitioners looking to apply machine learning theory to real life problems. I foresee this book becoming a classical of the discipline’s literature."