I'm updating the @OReillyMedia "Data Science on GCP" book to the 2nd ed.
amazon.com/Data-Science-G…
It's been 5 years since I wrote the first version. Best practices have changed. @googlecloud has gotten broader, deeper, easier.
As I update each chapter, I will note key changes.🧵
1. Data roles have become specialized.
In Ch 1, I predicted that data roles would converge: data analysts, data scientists, and data engineers would not be 3 separate roles. While that's happened in startups and some tech companies, many enterprises have built specialized teams.
2. ELT is now best practice.
In Ch 2, I used to do ETL of the data (using the BTS server) before loading the data to Cloud Storage.
Now, I load the *raw* data into BigQuery, using it as a Data Lake. And then do transformation/cleanup in SQL using views. So, ELT instead of ETL.
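A minimal sketch of the ELT pattern described above. It uses Python's built-in sqlite3 as a local stand-in for BigQuery so that it runs anywhere; the table name, columns, and cleanup rules are hypothetical, not the book's actual schema.

```python
import sqlite3


def build_elt_demo():
    """Load raw rows as-is, then expose a cleaned-up view on top of them."""
    conn = sqlite3.connect(":memory:")

    # Extract + Load: keep the raw data untouched, every field stored as text.
    conn.execute(
        "CREATE TABLE flights_raw "
        "(fl_date TEXT, origin TEXT, dep_delay TEXT, cancelled TEXT)"
    )
    raw_rows = [
        ("2015-01-01", "JFK", "14.00", "0.00"),
        ("2015-01-01", "LAX", "", "1.00"),  # cancelled flight, no delay value
        ("2015-01-02", "JFK", "-3.00", "0.00"),
    ]
    conn.executemany("INSERT INTO flights_raw VALUES (?, ?, ?, ?)", raw_rows)

    # Transform: the view does the casting and cleanup at query time.
    conn.execute(
        """
        CREATE VIEW flights AS
        SELECT fl_date,
               origin,
               CAST(NULLIF(dep_delay, '') AS REAL) AS dep_delay,
               CAST(cancelled AS REAL) > 0 AS cancelled
        FROM flights_raw
        """
    )
    return conn


conn = build_elt_demo()
rows = conn.execute(
    "SELECT origin, dep_delay, cancelled FROM flights ORDER BY fl_date, origin"
).fetchall()
```

Because the raw table is never modified, a cleanup rule can be changed later by redefining the view, without reloading anything.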
3. Cloud Run rather than Cloud Functions
In Ch 2, I scheduled monthly updates using Cloud Scheduler & Functions. Now, instead of CF, I use Cloud Run which is far more flexible.
This code has changed from AppEngine (ed1) to CF (ed1tf2) to Cloud Run (ed2)
linkedin.com/posts/aaronnbr…
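A rough sketch of what a Cloud Run service triggered by Cloud Scheduler might look like, written with only the Python standard library. The `ingest_month` function and the JSON payload shape are hypothetical placeholders, not the book's code; the one platform detail relied on is that Cloud Run tells the container which port to listen on through the `PORT` environment variable.

```python
import json
import os
from wsgiref.simple_server import make_server


def ingest_month(year, month):
    """Hypothetical placeholder for the real monthly ingest job."""
    return {"status": "ok", "year": year, "month": month}


def app(environ, start_response):
    """WSGI app. Assumes the scheduler POSTs JSON like {"year": 2015, "month": 3}."""
    try:
        size = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(size) or b"{}")
        result = ingest_month(int(payload["year"]), int(payload["month"]))
        status = "200 OK"
    except (KeyError, ValueError, TypeError) as err:
        # Missing fields or malformed JSON: report a client error.
        result = {"status": "error", "detail": repr(err)}
        status = "400 Bad Request"
    body = json.dumps(result).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]


def main():
    """Container entry point: listen on the port Cloud Run provides."""
    port = int(os.environ.get("PORT", "8080"))
    make_server("0.0.0.0", port, app).serve_forever()
```

The container's entry point would call `main()`. Since the service is just an HTTP handler in a container, it can use any library or runtime, which is where the extra flexibility comes from.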
4. PostgreSQL instead of MySQL
In Ch 3, I used to employ Cloud SQL with MySQL. I changed it to Cloud SQL with PostgreSQL.
PostgreSQL is #4 on DB-Engines while MySQL is #2, but PostgreSQL is on the way up and MySQL is on the way down.
db-engines.com/en/ranking
5. Exploration in BigQuery
Ch 3 now includes a new section on using the BigQuery UI for data exploration -- Preview, Table Explorer, etc. make the ELT workflow easier.
6. BI Engine makes connections between Data Studio and BigQuery fast
With BI Engine making Data Studio calls to BigQuery snappy, I changed the dashboard to use BigQuery rather than Cloud SQL as its backend.
Data Studio defaults have also gotten better, so less text
7. Arrays in SQL
BigQuery SQL supports arrays and UNNEST, so a lot of the bash scripting I had to do in order to create a contingency table has now gone away.
It's easy to create the contingency table with a single SQL call now, instead of doing hacky string replacement
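A sketch of how such a single query could be put together. The helper below only builds the BigQuery SQL text; the table name, the column names (`dep_delay`, `arr_delay`), the output column names, and the 15-minute arrival cutoff are all assumptions for illustration. The thresholds go into an array literal, and `UNNEST` cross-joins each one against the flights table.

```python
def contingency_query(table, dep_thresholds, arr_cutoff=15):
    """Build one BigQuery SQL statement covering every departure-delay threshold.

    `table` is inserted as-is, so it must come from trusted code, not user input.
    """
    thresholds = ", ".join(str(int(t)) for t in dep_thresholds)
    cutoff = int(arr_cutoff)
    return f"""
SELECT
  threshold,
  COUNTIF(dep_delay <  threshold AND arr_delay <  {cutoff}) AS correct_nocancel,
  COUNTIF(dep_delay <  threshold AND arr_delay >= {cutoff}) AS wrong_nocancel,
  COUNTIF(dep_delay >= threshold AND arr_delay <  {cutoff}) AS wrong_cancel,
  COUNTIF(dep_delay >= threshold AND arr_delay >= {cutoff}) AS correct_cancel
FROM `{table}`, UNNEST([{thresholds}]) AS threshold
GROUP BY threshold
ORDER BY threshold
""".strip()


# Hypothetical table name, for illustration only.
query = contingency_query("dsongcp.flights", [10, 15, 20])
```

Each row of the result is one contingency table, so adding a threshold means adding a number to the array rather than editing a script.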
8. JSON rather than CSV
JSON makes code easier to write, and easier to understand. Use JSON rather than CSV both for I/O and for passing a bunch of variables around as strings.
before and after ...
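As an illustration of the general point rather than the book's actual before/after code, here is the same record passed around as CSV and as JSON. The field names and values are made up.

```python
import csv
import io
import json

record = {
    "airport": "JFK",
    "latitude": 40.64,
    "longitude": -73.78,
    "delays": [14.0, -3.0],
}

# CSV: fields are positional, everything comes back as a string,
# and the nested list needs an ad hoc encoding (here, ';'-separated).
buf = io.StringIO()
csv.writer(buf).writerow(
    [
        record["airport"],
        record["latitude"],
        record["longitude"],
        ";".join(str(d) for d in record["delays"]),
    ]
)
fields = next(csv.reader(io.StringIO(buf.getvalue())))
csv_latitude = float(fields[1])  # reader must know column order and type
csv_delays = [float(x) for x in fields[3].split(";")]

# JSON: names, types, and nesting travel with the data.
line = json.dumps(record)
parsed = json.loads(line)
json_latitude = parsed["latitude"]
json_delays = parsed["delays"]
```

With JSON, adding a field does not shift any positions, so existing readers keep working.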
9. Use Beam Python even for streaming.
Beam Python is more concise and has similar performance now in Dataflow.
This is the real-time streaming pipeline in Chapter 4 that computes average delays at airports.
github.com/GoogleCloudPla…
vs
github.com/GoogleCloudPla…
Java vs. Python
10. DML in BigQuery!
Between two iterations of the streaming pipeline, I need to clear out the contents of the destination table. Before, we'd have to use the command-line tool or use the UI and put up with the annoyance of typing in "delete" to confirm.
Now, just use TRUNCATE
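A small sketch of running that statement from Python. It assumes `client` is a `google.cloud.bigquery.Client` (or anything with the same `query(sql)` method returning a job with `result()`); the helper name and any table name passed to it are hypothetical.

```python
def truncate_table(client, table):
    """Remove all rows from `table` while keeping the table and its schema.

    `table` is inserted as-is, so it must come from trusted code, not user input.
    """
    sql = f"TRUNCATE TABLE `{table}`"
    client.query(sql).result()  # block until the statement has finished
    return sql
```

Between pipeline runs this would be called as, for example, `truncate_table(bigquery.Client(), "dsongcp.streaming_delays")`, where the table name is only a placeholder.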