from @DivitaVohra, an overview of @Spotify's ML platform. super cool to hear how a product manager thinks about the problem of supporting ML systems
from @jeffboudier, an overview of the awesome work being done at @huggingface, with a focus on democratization of best practices, e.g. fast inference with Infinity
from @MarkMoyou of @nvidia, a really lucid overview of how to achieve low latency and high throughput in models on GPU. the visualization of sync/async GPU and CPU dataloaders is slick!
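(not from the talk itself, just a minimal PyTorch-flavored sketch of that sync/async overlap; the toy dataset and model here are made up)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# toy data and model, purely for illustration
data = TensorDataset(torch.randn(1024, 64), torch.randint(0, 10, (1024,)))
model = torch.nn.Linear(64, 10).cuda()

# pin_memory puts batches in page-locked host RAM so copies to the GPU can be async;
# num_workers > 0 lets CPU workers prepare the next batch while the GPU computes
loader = DataLoader(data, batch_size=128, pin_memory=True, num_workers=2)

for x, y in loader:
    # non_blocking=True overlaps the host-to-device copy with ongoing GPU work
    x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
```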
and last (in temporal order!) from my top 4, @LambdaAPI COO Mitesh Agrawal on how to build a datacenter for GPU-accelerated ML
🔑: different compute, storage, and networking needs for different parts of the pipeline (HPO, distributed training, inference)
really cool new #AISTATS2022 paper presenting 1) a particular setting for model monitoring and 2) a provably optimal strategy for requesting ground truth labels in that setting.
plus a bonus example, and theorem, on why you shouldn't just do anomaly detection on logits!
Read through these awesome notes by @chipro and noticed something interesting about distribution shifts: they form a lattice, so you can represent them like you do sets, i.e. using a Venn diagram!
I find this view super helpful for understanding shifts, so let's walk through it.
(inb4 pedantry: the above diagram is an Euler diagram, not a Venn diagram, meaning not all possible joins are represented. that is good, actually, for reasons to be revealed!)
From the notes: the joint distribution of data X and targets Y is shifting. We can decompose the joint into two pieces (a marginal and a conditional) in two separate ways (conditioning on X or on Y).
There are four major classes of distribution shift, defined by which pieces vary and which don't.
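Spelling that out (standard definitions, not a direct quote from the notes):

P(X, Y) = P(Y|X) P(X) = P(X|Y) P(Y)

- covariate shift: P(X) changes, P(Y|X) stays the same
- label shift: P(Y) changes, P(X|Y) stays the same
- concept drift: P(Y|X) changes, P(X) stays the same
- the fourth, rarely-studied case: P(X|Y) changes, P(Y) stays the same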
There's been some back-and-forth about this paper on getting gradients without doing backpropagation, so I took a minute to write up an analysis of what breaks and how it might be fixed.
tl;dr: the estimated gradients are _really_ noisy! like wow
The main result I claim is an extension of Thm 1 in the paper. They prove that the _expected value_ of the gradient estimate is the true gradient, and I worked out the _variance_ of the estimate.
It's big! Each entry's variance is on the order of the squared norm of the entire true gradient 😬
(Sketch of the proof: nothing is correlated, everything has 0 mean and is symmetric around the origin, the only relevant terms are chi-squared r.v.s with known variances that get scaled by the gradient norms. gaussians are fun!)
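If you'd rather see the noise than take my word for it, here's a quick numpy Monte Carlo sketch (my own toy check, not code from the paper): the forward-gradient estimate is ĝ = (g·v) v with v ~ N(0, I), and each entry's empirical variance comes out at ||g||² + g_i².

```python
import numpy as np

# Monte Carlo check of the forward-gradient estimator ghat = (g·v) v, v ~ N(0, I).
# g is just a fixed random vector standing in for the true gradient.
rng = np.random.default_rng(0)
d, n = 100, 100_000
g = rng.normal(size=d)

v = rng.normal(size=(n, d))
ghat = (v @ g)[:, None] * v                  # one estimate per row

print(np.abs(ghat.mean(axis=0) - g).max())   # ~0: unbiased, as in Thm 1
print(ghat.var(axis=0)[:3])                  # per-entry variance...
print((g @ g + g**2)[:3])                    # ...matches ||g||^2 + g_i^2
```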
the final video for the @weights_biases Math4ML series, on probability, is now up on YouTube!
@_ScottCondron and I talk entropies, divergence, and loss functions
🔗:
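here's a tiny numpy illustration of how those three fit together (my toy numbers, not an excerpt from the video): cross-entropy splits into entropy plus KL divergence.

```python
import numpy as np

# H(p, q) = H(p) + KL(p || q) for two made-up discrete distributions
p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model's distribution

entropy = -(p * np.log(p)).sum()
kl = (p * np.log(p / q)).sum()
cross_entropy = -(p * np.log(q)).sum()

print(np.isclose(cross_entropy, entropy + kl))   # True
```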
this is the final video in a four-part series of "exercise" videos, where Scott and I work through a collection of Jupyter notebooks with automatically-graded Python coding exercises on math concepts
New video series out this week (and into next!) on the @weights_biases YouTube channel.
They're Socratic livecoding sessions where @_ScottCondron and I work through the exercise notebooks for the Math4ML class.
Details in 🧵⤵️
Socratic: following an ancient academic tradition, I try to trick @_ScottCondron into being wrong, so that students can learn from mistakes and see their own learning process reflected in the content.
(i was inspired to try this style out by the @PyTorchLightnin Master Class series, in which @_willfalcon and @alfcnz talk nitty-gritty of DL with PyTorch+Lightning while writing code. strong recommend!)