Jascha Sohl-Dickstein
Senior staff research scientist, Google Brain. Inventor of diffusion models. Machine learning $\otimes$ physics $\otimes$ neuroscience. @jascha@sigmoid.social
Feb 12 8 tweets 3 min read
Have you ever done a dense grid search over neural network hyperparameters? Like a *really dense* grid search? It looks like this (!!). Bluish colors correspond to hyperparameters for which training converges, reddish colors to hyperparameters for which training diverges. The boundary between trainable and untrainable neural network hyperparameter configurations is *fractal*! And beautiful!

Here is a grid search over a different pair of hyperparameters -- this time learning rate and the mean of the parameter initialization distribution.
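A minimal sketch of this kind of dense 2D sweep, assuming a toy one-hidden-unit tanh network, full-batch gradient descent, and a simple converged/diverged criterion. All choices here are illustrative placeholders, not the setup behind the images in the thread; the fractal structure shows up with denser grids and the networks described there:

```python
import numpy as np

# Toy 1D regression data.
x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
y = np.array([-0.8, -0.4, 0.1, 0.5, 0.9])

def final_loss(lr, init_mean, steps=300, seed=0):
    """Train y ~ w2 * tanh(w1 * x) by full-batch gradient descent.

    Returns the final mean squared error, or np.inf if training blows up.
    """
    rng = np.random.default_rng(seed)
    w1, w2 = rng.normal(init_mean, 1.0, size=2)
    for _ in range(steps):
        h = np.tanh(w1 * x)
        err = w2 * h - y
        loss = np.mean(err ** 2)
        if not np.isfinite(loss) or loss > 1e6:
            return np.inf
        # Analytic gradients of the mean squared error.
        dw2 = np.mean(2 * err * h)
        dw1 = np.mean(2 * err * w2 * (1 - h ** 2) * x)
        w1 -= lr * dw1
        w2 -= lr * dw2
    return loss

# Dense grid over (learning rate, initialization mean); ranges are arbitrary.
lrs = np.logspace(-2, 1, 100)
init_means = np.linspace(-3.0, 3.0, 100)
trainable = np.array([[np.isfinite(final_loss(lr, m)) for lr in lrs]
                      for m in init_means])

# `trainable` is a boolean image: True = converged (blue-ish in the plots
# above), False = diverged (red-ish). Visualize with e.g. plt.imshow(trainable).
```

Each pixel in the plots corresponds to one complete training run at a single hyperparameter pair; the sweep just repeats that run over a fine grid and colors the outcome.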
Mar 9, 2023 6 tweets 5 min read
The hot mess theory of AI misalignment (+ an experiment!)
sohl-dickstein.github.io/2023/03/09/coh…

There are two ways an AI could be misaligned. It could monomaniacally pursue the wrong goal (supercoherence), or it could act in ways that don't pursue any consistent goal (hot mess/incoherent). Most work on AI misalignment risk is based on an assumption that more intelligent AI will also be more coherent. This is an assumption we can test! I collected subjective judgements of intelligence and coherence from colleagues in ML and neuro.
Nov 18, 2022 8 tweets 5 min read
If there is one thing the deep learning revolution has taught us, it's that neural nets will outperform hand-designed heuristics, given enough compute and data.

But we still use hand-designed heuristics to train our models. Let's replace our optimizers with trained neural nets! If you are training models with < 5e8 parameters, for < 2e5 training steps, then with high probability this LEARNED OPTIMIZER will beat or match the tuned optimizer you are currently using, out of the box, with no hyperparameter tuning (!).

velo-code.github.io
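For intuition about what a learned optimizer is, here is a deliberately tiny sketch, assuming a per-parameter MLP that maps simple features (gradient, momentum) to an update. This is not the VeLO architecture or its API (see the linked code for that); in a real learned optimizer the MLP weights are meta-trained across many tasks, whereas here they are random placeholders that only show the shape of the computation:

```python
import numpy as np

class TinyLearnedOptimizer:
    """Illustrative per-parameter learned optimizer (not VeLO itself)."""

    def __init__(self, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        # These weights would be meta-trained in a real learned optimizer.
        self.W1 = rng.normal(0.0, 0.1, size=(2, hidden))  # features -> hidden
        self.W2 = rng.normal(0.0, 0.1, size=(hidden, 1))  # hidden -> step
        self.beta = 0.9                                    # momentum decay

    def init_state(self, params):
        return {"momentum": np.zeros_like(params)}

    def update(self, grads, state, params):
        m = self.beta * state["momentum"] + (1 - self.beta) * grads
        feats = np.stack([grads, m], axis=-1)      # per-parameter features
        hidden = np.tanh(feats @ self.W1)
        step = (hidden @ self.W2)[..., 0]          # one scalar step per parameter
        return params - 1e-3 * step, {"momentum": m}

# Drop-in use inside an ordinary training loop:
#   opt = TinyLearnedOptimizer()
#   state = opt.init_state(params)
#   params, state = opt.update(grads, state, params)
```

The point of VeLO is that the analogue of `W1`/`W2` above has already been meta-trained at large scale, which is why the resulting update rule can work out of the box across a wide range of target models.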
Nov 7, 2022 8 tweets 5 min read
My first blog post ever! Be harsh, but, you know, constructive.

Too much efficiency makes everything worse: overfitting and the strong version of Goodhart's law
sohl-dickstein.github.io/2022/11/06/str…

🧵

The phenomenon of overfitting in machine learning maps onto a class of failures that frequently happen in the broader world: in politics, economics, science, and beyond. Doing too well at targeting a proxy objective can make the thing you actually care about get much, much worse.
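The machine learning face of this is easy to reproduce numerically. A minimal sketch, assuming polynomial curve fitting as the stand-in for a model, training error as the proxy objective, and error on fresh data as the thing we actually care about (the blog post's examples are much broader than this toy):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """The underlying function we actually care about recovering."""
    return np.sin(2 * np.pi * x)

# Small noisy training set (the proxy) and a clean evaluation set (the goal).
x_train = rng.uniform(0.0, 1.0, 15)
y_train = f(x_train) + rng.normal(0.0, 0.2, x_train.shape)
x_test = np.linspace(0.0, 1.0, 200)
y_test = f(x_test)

for degree in [1, 3, 5, 9, 14]:
    coeffs = np.polyfit(x_train, y_train, degree)   # optimize the proxy harder
    proxy_loss = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    true_loss = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}  proxy (train) {proxy_loss:.4f}  true (test) {true_loss:.4f}")

# The proxy loss keeps falling as the model gets more expressive, while the
# true loss typically falls and then rises sharply: past a point, doing
# better on the proxy makes the objective you actually care about worse.
```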
Jan 26, 2021 5 tweets 2 min read
CALL FOR TASKS CAPTURING LIMITATIONS OF LARGE LANGUAGE MODELS

We are soliciting contributions of tasks to a *collaborative* benchmark designed to measure and extrapolate the capabilities and limitations of large language models. Submit tasks at github.com/google/BIG-Ben…
#BIGbench

All accepted task submitters will be co-authors on the paper releasing the benchmark. Teams at Google and OpenAI will further evaluate BIG-Bench on their best-performing model architectures, across models spanning from tens of thousands through hundreds of billions of parameters.
Aug 8, 2020 14 tweets 6 min read
"Finite Versus Infinite Neural Networks: an Empirical Study." arxiv.org/abs/2007.15801 This paper contains everything you ever wanted to know about infinite width networks, but didn't have the computational capacity to ask! Like really a lot of content. Let's dive in. Infinite width Neural Network Gaussian Process (NNGP) and Neural Tangent Kernel (NTK) predictions can outperform finite networks, depending on architecture and training practices. For fully connected networks the infinite width limit reliably outperforms the finite network. Image