lol. So, we're doing some image processing with TPUs. We want to save the results directly to our cloud bucket, rather than having the results be transmitted to our VM, saved locally, then uploaded to our cloud bucket. Got a funny idea...
I guess this will be a ramble:
TPUs support a limited number of operations. But what you get in exchange is a blazingly-fast TPU.
A TPU consists of 8 cores, plus a CPU. (Yes, the TPU has a CPU -- weird concept, but think of it like a big computer with 8 GPUs. Obviously, a computer with GPUs has a CPU.)
In the same way that GPUs are much more restrictive than CPUs – it's a lot easier to write programs for CPUs than GPUs! – the TPU cores are much more restrictive than the TPU's CPU.
But that's a positive statement. It means you get some nice flexibility with the TPU's CPU.
(3.51 minutes for v3-512 is slightly faster than their posted results of 3.85min, too!)
This raises a question: *why* is the official benchmark so blazingly fast? That’s about 930 examples/sec per core. When I tried to write my own code, I could only get 250ex/sec per core. Are they cheating? *gasp*