The first attempt was a U-Net implementation consisting of (1) a spatial downsampler + CNN and (2) a spatial upsampler + CNN, with skip connections to preserve information at different scales.
This beat optical flow methods, but not by much. arxiv.org/abs/1912.12132
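To make the architecture concrete, here's a minimal Keras sketch of the U-Net idea (my own toy version with made-up layer sizes, not the paper's model): a convolutional downsampling path, an upsampling path, and skip connections that carry the fine-scale detail across.

```python
# Toy U-Net sketch: encoder (downsample + CNN), decoder (upsample + CNN),
# skip connections so the decoder keeps access to fine spatial scales.
import tensorflow as tf
from tensorflow.keras import layers

def tiny_unet(input_shape=(256, 256, 1)):
    inp = layers.Input(shape=input_shape)
    # (1) spatial downsampler + CNN
    c1 = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    p1 = layers.MaxPooling2D()(c1)
    c2 = layers.Conv2D(64, 3, padding="same", activation="relu")(p1)
    p2 = layers.MaxPooling2D()(c2)
    b = layers.Conv2D(128, 3, padding="same", activation="relu")(p2)   # bottleneck
    # (2) spatial upsampler + CNN, with skip connections
    u2 = layers.Concatenate()([layers.UpSampling2D()(b), c2])
    c3 = layers.Conv2D(64, 3, padding="same", activation="relu")(u2)
    u1 = layers.Concatenate()([layers.UpSampling2D()(c3), c1])
    c4 = layers.Conv2D(32, 3, padding="same", activation="relu")(u1)
    out = layers.Conv2D(1, 1, activation="sigmoid")(c4)   # e.g., P(rain > threshold) per pixel
    return tf.keras.Model(inp, out)
```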
#2 MetNet (arxiv.org/abs/2003.12140) predicts the probability that rainfall exceeds some threshold. It has: (1) a spatial downsampler and CNN, (2) a temporal LSTM, and (3) a spatial aggregator with self-attention to learn which parts of the input to focus on.
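For intuition, here's a rough Keras sketch of that three-stage pipeline (a toy stand-in, not the real MetNet: MultiHeadAttention replaces the paper's axial attention, and every size below is made up):

```python
# MetNet-style pipeline sketch: (1) per-frame spatial downsampling,
# (2) a ConvLSTM over time, (3) self-attention over spatial positions,
# ending in per-pixel probabilities over rain-rate bins.
import tensorflow as tf
from tensorflow.keras import layers

def metnet_like(frames=8, size=64, channels=4, n_bins=16):
    inp = layers.Input(shape=(frames, size, size, channels))
    # (1) spatial downsampler + CNN, applied to every time step
    x = layers.TimeDistributed(
        layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"))(inp)
    # (2) temporal encoder (a convolutional LSTM)
    x = layers.ConvLSTM2D(64, 3, padding="same", return_sequences=False)(x)
    # (3) spatial aggregator: self-attention over flattened spatial positions
    h, w, c = x.shape[1], x.shape[2], x.shape[3]
    seq = layers.Reshape((h * w, c))(x)
    seq = layers.MultiHeadAttention(num_heads=4, key_dim=c)(seq, seq)
    x = layers.Reshape((h, w, c))(seq)
    # per-pixel categorical distribution over rainfall-rate bins
    out = layers.Conv2D(n_bins, 1, activation="softmax")(x)
    return tf.keras.Model(inp, out)
```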
MetNet gets much better accuracy scores. In fact, where the U-Net was barely able to beat HRRR after an hour, MetNet successfully predicts rainfall up to 8 hours ahead! This is far beyond what most people think nowcasting methods are capable of.
Exciting ... but ...
MetNet learns that producing a blurred output is an easier way to get high accuracy scores than actually learning storm structure.
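A tiny numerical illustration of why blur wins on pixelwise scores (a made-up two-pixel example, not from the paper): if a storm is equally likely to land in either pixel, hedging with 0.5 everywhere scores better than committing to a sharp storm that is wrong half the time.

```python
# Expected MSE when a storm is equally likely to be in the left or right pixel.
import numpy as np

left  = np.array([1.0, 0.0])   # sharp prediction: storm in the left pixel
right = np.array([0.0, 1.0])   # sharp prediction: storm in the right pixel
blur  = np.array([0.5, 0.5])   # hedged, blurry prediction

def expected_mse(pred):
    # average the error over the two equally likely outcomes
    return 0.5 * np.mean((pred - left) ** 2) + 0.5 * np.mean((pred - right) ** 2)

print(expected_mse(left))   # 0.5  -> a sharp (but sometimes wrong) storm is penalized
print(expected_mse(blur))   # 0.25 -> the blurry field gets the better score
```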
Variational generative models are probabilistic by nature (the encoder predicts a probability distribution; the decoder generates images from samples drawn from it).
So generative models provide a way to represent uncertainty that is different from the traditional ensemble methods used in meteorology.
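To see what "probabilistic by nature" means in code, here's a bare-bones variational autoencoder in Keras (a generic illustration with made-up sizes, not the DeepMind model; the KL loss term is omitted for brevity): the encoder outputs a mean and log-variance, and the decoder generates an image from a random sample, so repeated forward passes give different plausible fields.

```python
# Minimal VAE sketch: encoder predicts a distribution (mean, log-variance),
# decoder generates an image from a sample drawn from that distribution.
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 32

encoder_inp = layers.Input(shape=(64, 64, 1))
h = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(encoder_inp)
h = layers.Flatten()(h)
z_mean = layers.Dense(latent_dim)(h)
z_logvar = layers.Dense(latent_dim)(h)

def sample(args):
    mean, logvar = args
    eps = tf.random.normal(tf.shape(mean))
    return mean + tf.exp(0.5 * logvar) * eps   # reparameterization trick

z = layers.Lambda(sample)([z_mean, z_logvar])

d = layers.Dense(32 * 32 * 32, activation="relu")(z)
d = layers.Reshape((32, 32, 32))(d)
decoder_out = layers.Conv2DTranspose(1, 3, strides=2, padding="same",
                                     activation="sigmoid")(d)

vae = tf.keras.Model(encoder_inp, decoder_out)   # same input, different output each pass
```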
That said, the DeepMind paper was done in the UK, and the UK doesn't really have much convective weather. So the jury's out.
I'm updating the @OReillyMedia "Data Science on GCP" book to the 2nd ed. amazon.com/Data-Science-G…
It's been 5 years since I wrote the first version. Best practices have changed. @googlecloud has gotten broader, deeper, easier.
As I update each chapter, I will note key changes.🧵
1. Data roles have become specialized.
In Ch 1, I predicted that data roles would converge: data analysts, data scientists, and data engineers would not remain 3 separate roles. While that's happened in startups and some tech companies, many enterprises have instead built specialized teams.
2. ELT is now best practice.
In Ch 2, I used to ETL the data (using the BTS server) before loading it to Cloud Storage.
Now, I load the *raw* data into BigQuery, using it as a data lake, and then do transformation/cleanup in SQL using views. So, ELT instead of ETL.
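Here's a hedged sketch of that ELT flow with the BigQuery Python client (the bucket, dataset, table, and column names below are illustrative placeholders, not the book's actual ones):

```python
# ELT sketch: Extract+Load the raw files into BigQuery first,
# then Transform with SQL exposed as a view.
from google.cloud import bigquery

client = bigquery.Client()

# E + L: load the *raw* CSV files straight from Cloud Storage into BigQuery
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,          # no upfront schema design or cleanup
)
client.load_table_from_uri(
    "gs://my-bucket/raw/flights_2015_*.csv",      # placeholder bucket/path
    "my-project.my_dataset.flights_raw",          # placeholder table
    job_config=job_config,
).result()

# T: transformation/cleanup lives in SQL, as a view over the raw table
client.query("""
  CREATE OR REPLACE VIEW `my-project.my_dataset.flights` AS
  SELECT
    PARSE_DATE('%Y-%m-%d', FL_DATE) AS flight_date,   -- placeholder columns
    CAST(DEP_DELAY AS FLOAT64) AS dep_delay
  FROM `my-project.my_dataset.flights_raw`
  WHERE DEP_DELAY IS NOT NULL
""").result()
```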
The biggest change I observed between my last visit to India (Dec 2019) and now is how much the Aadhaar card (national ID card) is woven into everything.
🧵
The Aadhaar card provides significant benefits in terms of fraud prevention, especially with government services.
It also helps Indian businesses know their customers, while KYC is a work-in-progress for most American ones.
The Aadhaar card is one reason micropayments have taken off.
But the Aadhaar card's success has been driven by the Indian bureaucrat's unique mixture of authoritarianism and laziness.
For example, in the course of one hour at a bank branch, I saw 3 customers unable to access their accounts.
A key decision you have to make for each ML problem is whether to (1) buy a vendor's pre-built solution or (2) build your own.
Make this decision based on whether you have access to more data than the vendor.
This is also a handy rule for choosing between vendors.
When building your own ML solution, avoid the temptation to build from scratch.
The best return on investment early in your projects is going to come from collecting more data (both more "rows" and more "columns", by breaking down data silos).
Many data engineers and CIOs underestimate an ironic consequence of a dramatic increase in data volumes.
The larger the data volume gets, the more sense it makes to process the data *more* frequently!
🧵
To see why, say a business creates a daily report based on its website traffic, and this report takes 2 hours to create.
If the website traffic grows 4x, the report will take 8 hours to create. So, the tech team quadruples the number of machines.
This is wrong-headed!
2/
Instead, consider an approach that makes the reports more timely (see the sketch after this list):
* Compute statistics on 6 hours of data 4 times a day
* Aggregate these 6 hourly reports to create daily reports
* You can update your "daily" report four times a day.
* Data in report is only 6 hrs old!
3/
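Here's a toy pandas sketch of that pattern (the data and statistics are made up): compute partial statistics on each 6-hour window, then roll four windows up into the "daily" report.

```python
# Micro-batch sketch: 6-hourly partial stats, aggregated into a daily report.
import pandas as pd

# pretend this is one day of website hits, one row per request
hits = pd.DataFrame({
    "ts": pd.date_range("2022-01-01", periods=24 * 60, freq="min"),
    "latency_ms": range(24 * 60),
})

# stats per 6-hour window: each run touches only 1/4 of the day's data
six_hourly = hits.groupby(pd.Grouper(key="ts", freq="6H")).agg(
    requests=("latency_ms", "size"),
    total_latency_ms=("latency_ms", "sum"),
)

# "daily" report = aggregation of the four 6-hourly partial results;
# refresh it after every window and it is never more than 6 hours stale
daily = pd.DataFrame({
    "requests": [six_hourly["requests"].sum()],
    "avg_latency_ms": [six_hourly["total_latency_ms"].sum()
                       / six_hourly["requests"].sum()],
})
print(daily)
```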