I'm updating the @OReillyMedia "Data Science on GCP" book to the 2nd ed.
amazon.com/Data-Science-G…
It's been 5 years since I wrote the first version. Best practices have changed. @googlecloud has gotten broader, deeper, easier.

As I update each chapter, I will note key changes.🧵
1. Data roles have become specialized.

In Ch 1, I predicted that data roles would converge -- that data analysts, data scientists, and data engineers would not remain 3 separate roles. While that's happened in startups and some tech companies, many enterprises have instead built specialized teams.
2. ELT is now best practice.

In Ch 2, I used to do ETL on the data (downloaded from the BTS server) before loading it into Cloud Storage.

Now, I load the *raw* data into BigQuery, using it as a data lake, and then do the transformation/cleanup in SQL using views. So, ELT instead of ETL.
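
Roughly, the new pattern looks like this. A sketch only, not the book's actual code -- the bucket, dataset, and column names are placeholders:

```python
# ELT sketch: load the raw BTS CSV into BigQuery untouched, then put the
# transformation/cleanup into a SQL view. All names below are illustrative.
from google.cloud import bigquery

client = bigquery.Client()

# E + L: ingest the raw file as-is, with an autodetected schema
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
client.load_table_from_uri(
    "gs://my-bucket/raw/flights_202101.csv",   # placeholder path
    "dsongcp.flights_raw",                     # placeholder dataset.table
    job_config=job_config,
).result()

# T: the cleanup lives in a view, so the raw data stays untouched
client.query("""
CREATE OR REPLACE VIEW dsongcp.flights AS
SELECT
  FlightDate AS flight_date,
  Reporting_Airline AS airline,
  CAST(DepDelay AS FLOAT64) AS dep_delay,
  CAST(ArrDelay AS FLOAT64) AS arr_delay
FROM dsongcp.flights_raw
WHERE DepDelay IS NOT NULL
""").result()
```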
3. Cloud Run rather than Cloud Functions

In Ch 2, I scheduled monthly updates using Cloud Scheduler & Cloud Functions. Now, instead of Cloud Functions, I use Cloud Run, which is far more flexible.

This code has changed from App Engine (ed1) to Cloud Functions (ed1tf2) to Cloud Run (ed2).
linkedin.com/posts/aaronnbr…
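
The shape of the Cloud Run service, if you're curious -- a tiny Flask app that Cloud Scheduler POSTs to once a month. The /ingest endpoint and ingest_flights() are placeholders, not the repo's actual code:

```python
# Minimal Cloud Run service: Cloud Scheduler sends an HTTP POST, the handler
# runs the monthly ingest. ingest_flights() stands in for the real logic.
import os
from flask import Flask, request

app = Flask(__name__)

def ingest_flights(year: str, month: str) -> str:
    # placeholder for: download the month's data from BTS, load into BigQuery
    return f"ingested {year}-{month}"

@app.route("/ingest", methods=["POST"])
def ingest():
    payload = request.get_json(silent=True) or {}
    return ingest_flights(payload.get("year", "2021"), payload.get("month", "01")), 200

if __name__ == "__main__":
    # Cloud Run passes the port to listen on via the PORT environment variable
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```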
4. PostgreSQL instead of MySQL

In Ch 3, I used to employ Cloud SQL with MySQL. I changed it to Cloud SQL with PostgreSQL.

PostgreSQL is #4 on the DB-Engines ranking while MySQL is #2, but one is on the way up and the other is on the way down.

db-engines.com/en/ranking
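
In the code, the swap is mostly a driver change. A sketch using the Cloud SQL Python Connector -- the instance name, credentials, and query are placeholders:

```python
# Connecting to Cloud SQL for PostgreSQL with the Cloud SQL Python Connector.
import os
import sqlalchemy
from google.cloud.sql.connector import Connector

connector = Connector()

def getconn():
    return connector.connect(
        "my-project:us-central1:flights",  # placeholder instance connection name
        "pg8000",                          # PostgreSQL driver (was pymysql for MySQL)
        user="postgres",
        password=os.environ["DB_PASS"],
        db="bts",
    )

pool = sqlalchemy.create_engine("postgresql+pg8000://", creator=getconn)
with pool.connect() as conn:
    for row in conn.execute(sqlalchemy.text("SELECT COUNT(*) FROM flights")):
        print(row)
```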
5. Exploration in BigQuery

Ch 3 now includes a new section on using the BigQuery UI for data exploration -- Preview, Table Explorer, etc. make the ELT workflow easier.
6. BI Engine makes connections between Data Studio and BigQuery fast

With BI Engine, Data Studio calls to BigQuery have become snappy, so I changed the dashboard to use BigQuery rather than Cloud SQL as its backend.

Data Studio defaults have also gotten better, so less text is needed.
7. Arrays in SQL

BigQuery SQL supports arrays and UNNEST, so a lot of the bash scripting I had to do in order to create a contingency table has now gone away.

It's easy to create the contingency table with a single SQL call now, instead of doing hacky string replacement.
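
The flavor of the query (not the chapter's exact SQL; the view and columns follow the earlier ELT sketch): sweep a whole array of candidate thresholds in one shot with GENERATE_ARRAY and UNNEST.

```python
# One query instead of a bash loop: UNNEST an array of thresholds and compute
# the contingency-table cells for each. Table and column names are illustrative.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT
  threshold,
  COUNTIF(dep_delay <  threshold AND arr_delay <  15) AS true_negatives,
  COUNTIF(dep_delay <  threshold AND arr_delay >= 15) AS false_negatives,
  COUNTIF(dep_delay >= threshold AND arr_delay <  15) AS false_positives,
  COUNTIF(dep_delay >= threshold AND arr_delay >= 15) AS true_positives
FROM dsongcp.flights, UNNEST(GENERATE_ARRAY(5, 30, 5)) AS threshold
GROUP BY threshold
ORDER BY threshold
"""
for row in client.query(sql).result():
    print(dict(row))
```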
8. JSON rather than CSV

JSON makes code easier to write and easier to understand. Use JSON rather than CSV, both for I/O and for passing a bunch of variables around as strings.

Before and after ...
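
A made-up illustration of the kind of change (the real before/after is in the chapter; field names here are invented):

```python
# Passing a handful of fields between steps: positional CSV vs. named JSON.
import json

# Before: CSV -- order matters, types are lost, adding a field breaks readers
as_csv = "DEN,12.0,2021-01-15T10:32:00Z"
airport, arr_delay, timestamp = as_csv.split(",")   # arr_delay is now a string

# After: JSON -- named fields, types preserved, extra fields don't break anything
as_json = json.dumps({"airport": "DEN", "arr_delay": 12.0,
                      "timestamp": "2021-01-15T10:32:00Z"})
event = json.loads(as_json)
print(event["airport"], event["arr_delay"] + 3)     # arr_delay is still a float
```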
9. Use Beam Python even for streaming.

Beam Python is more concise, and it now has similar performance in Dataflow.

This is the real-time streaming pipeline in Chapter 4 that computes average delays at airports.

github.com/GoogleCloudPla…
vs
github.com/GoogleCloudPla…

Java vs. Python
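
The shape of the Python pipeline, heavily condensed, with placeholder topic names and fields -- see the repo links above for the real code:

```python
# Streaming average departure delay per airport, sketched in Beam Python.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import SlidingWindows

opts = PipelineOptions(streaming=True)

with beam.Pipeline(options=opts) as p:
    (p
     | "read"   >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/flights")
     | "parse"  >> beam.Map(json.loads)
     | "kv"     >> beam.Map(lambda e: (e["airport"], float(e["dep_delay"])))
     | "window" >> beam.WindowInto(SlidingWindows(60 * 60, 5 * 60))  # 1-hr window every 5 min
     | "mean"   >> beam.combiners.Mean.PerKey()
     | "fmt"    >> beam.Map(lambda kv: json.dumps(
                       {"airport": kv[0], "avg_dep_delay": kv[1]}).encode("utf-8"))
     | "write"  >> beam.io.WriteToPubSub(topic="projects/my-project/topics/delays"))
```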
10. DML in BigQuery!

Between two iterations of the streaming pipeline, I need to clear out the contents of the destination table. Before, we'd have to use the command-line tool, or use the UI and put up with the annoyance of typing in "delete" to confirm.

Now, just use TRUNCATE.
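
It's a one-liner now, from the console or from code (the table name is a placeholder):

```python
# Clearing the destination table between streaming runs, via a DML statement.
from google.cloud import bigquery

client = bigquery.Client()
client.query("TRUNCATE TABLE dsongcp.streaming_delays").result()
```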

