Jascha Sohl-Dickstein
Member of the technical staff @ Anthropic. Most (in)famous for inventing diffusion models. AI + physics + neuroscience + dynamics.
Sep 28
Title: Advice for a young investigator in the first and last days of the Anthropocene

Abstract: Within just a few years, it is likely that we will create AI systems that outperform the best humans on all intellectual tasks. This will have implications for your research and career! I will give practical advice, and concrete criteria to consider, when choosing research projects and making professional decisions in these last few years before AGI.

This is my current go-to academic talk. It's mostly targeted at early career scientists. It gets diverse and strong reactions. Let's try it here. Posting slides with speaker notes...

--

The title is a play on a very opinionated and pragmatic book by the Nobel Prize winner Santiago Ramón y Cajal, one of the founders of modern neuroscience.

To get you in the right mindset, on the right we have a plot of GDP vs time.
That is you, standing precariously on the top of that curve.
You are thinking to yourself -- I live in a pretty normal world.
Some things are going to change, but the future is going to look mostly like a linear extrapolation of the present.

And the plot should suggest that this may not be the right perspective on the future.

This plot, by the way, looks surprisingly similar even if you plot it on a log scale. We didn't stabilize on our current rate of growth until around 1950.

No notes, just a talk outline.
Feb 12, 2024
Have you ever done a dense grid search over neural network hyperparameters? Like a *really dense* grid search? It looks like this (!!). Bluish colors correspond to hyperparameters for which training converges, reddish colors to hyperparameters for which training diverges. The boundary between trainable and untrainable neural network hyperparameter configurations is *fractal*! And beautiful!

Here is a grid search over a different pair of hyperparameters -- this time learning rate and the mean of the parameter initialization distribution.
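For readers who want to poke at this themselves, here is a minimal sketch of the kind of sweep involved. The toy network, the full-batch gradient descent loop, and the hyperparameter ranges below are my own assumptions for illustration, not the setup behind the images above.

```python
# A minimal sketch (toy 1-hidden-layer tanh net, full-batch gradient descent,
# made-up ranges) of a dense 2D sweep: color each (learning rate, init mean)
# cell by whether training converges or diverges.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 2))                       # tiny fixed dataset
y = np.sin(X[:, :1])

def diverges(lr, init_mean, steps=200, hidden=8):
    """Train a 2-layer tanh net with full-batch GD; return True if the loss blows up."""
    init = np.random.default_rng(1)                # same init draw for every cell
    W1 = init_mean + 0.5 * init.normal(size=(2, hidden))
    W2 = init_mean + 0.5 * init.normal(size=(hidden, 1))
    for _ in range(steps):
        h = np.tanh(X @ W1)
        err = h @ W2 - y
        loss = np.mean(err ** 2)
        if not np.isfinite(loss) or loss > 1e6:
            return True
        g_out = 2 * err / len(X)                   # backprop by hand
        g_W2 = h.T @ g_out
        g_h = (g_out @ W2.T) * (1 - h ** 2)
        g_W1 = X.T @ g_h
        W1 -= lr * g_W1
        W2 -= lr * g_W2
    return False

lrs = np.logspace(-2, 1, 64)                       # dense sweep over learning rate...
means = np.linspace(-2, 2, 64)                     # ...and init-distribution mean
grid = np.array([[diverges(lr, m) for lr in lrs] for m in means])
# Plot `grid` (e.g. with plt.imshow) and zoom in: the converge/diverge boundary is ragged at every scale.
```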
Mar 9, 2023
The hot mess theory of AI misalignment (+ an experiment!)
sohl-dickstein.github.io/2023/03/09/coh…

There are two ways an AI could be misaligned. It could monomaniacally pursue the wrong goal (supercoherence), or it could act in ways that don't pursue any consistent goal (hot mess/incoherent). Most work on AI misalignment risk is based on an assumption that more intelligent AI will also be more coherent. This is an assumption we can test! I collected subjective judgements of intelligence and coherence from colleagues in ML and neuro.
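To make the test concrete, here is a hypothetical analysis sketch (not the original notebook; the ratings below are random placeholders): collect paired intelligence and coherence ratings and check their rank correlation.

```python
# Hypothetical sketch: are subjective intelligence and coherence ratings
# positively rank-correlated, as the supercoherence assumption predicts?
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
intelligence = rng.uniform(0, 100, size=20)   # placeholder ratings, one per rated system
coherence = rng.uniform(0, 100, size=20)      # (e.g. thermostat, ant colony, LLM, human)

rho, p = spearmanr(intelligence, coherence)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
# rho > 0 would support "smarter implies more coherent";
# rho < 0 would support the hot mess picture. These numbers are made up.
```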
Nov 18, 2022
If there is one thing the deep learning revolution has taught us, it's that neural nets will outperform hand-designed heuristics, given enough compute and data.

But we still use hand-designed heuristics to train our models. Let's replace our optimizers with trained neural nets! If you are training models with < 5e8 parameters, for < 2e5 training steps, then with high probability this LEARNED OPTIMIZER will beat or match the tuned optimizer you are currently using, out of the box, with no hyperparameter tuning (!).

velo-code.github.io
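To unpack what "replace the optimizer with a trained neural net" means, here is a conceptual sketch only: the per-parameter features and tiny MLP below are made up for illustration and are not VeLO's architecture (see the link above for the real thing). In a real learned optimizer the update network's weights are meta-trained across many tasks; here they are just random.

```python
# Conceptual sketch of a learned optimizer: a small MLP maps per-parameter
# features (gradient, momentum, step count) to an update, replacing a
# hand-designed rule like `w -= lr * g`.
import numpy as np

rng = np.random.default_rng(0)
theta1 = rng.normal(scale=0.1, size=(3, 16))   # update-net weights (meta-learned in practice, random here)
theta2 = rng.normal(scale=0.1, size=(16, 1))

def learned_update(grad, momentum, step):
    """Per-parameter update produced by the update network."""
    feats = np.stack([grad, momentum, np.full_like(grad, np.log1p(step))], axis=-1)
    h = np.tanh(feats @ theta1)
    return (h @ theta2)[..., 0]                # one scalar update per parameter

# Inner training loop: the learned optimizer replaces the hand-designed step.
w = rng.normal(size=100)
m = np.zeros_like(w)
for step in range(10):
    g = 2 * w                                  # gradient of a toy quadratic loss ||w||^2
    m = 0.9 * m + 0.1 * g
    w = w + learned_update(g, m, step)
```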
Nov 7, 2022
My first blog post ever! Be harsh, but, you know, constructive.

Too much efficiency makes everything worse: overfitting and the strong version of Goodhart's law
sohl-dickstein.github.io/2022/11/06/str…

🧵

The phenomenon of overfitting in machine learning maps onto a class of failures that frequently happen in the broader world: in politics, economics, science, and beyond. Doing too well at targeting a proxy objective can make the thing you actually care about get much, much worse.
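Here is a minimal sketch of that proxy-vs-true-objective gap in its most familiar ML form; the toy dataset and the polynomial-degree knob are illustrative choices of mine, not from the post.

```python
# Minimal overfitting / strong-Goodhart sketch: optimizing the proxy (training
# error) harder and harder eventually makes the true objective (test error) much worse.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 15)
x_test = np.linspace(-1, 1, 200)
truth = lambda x: np.sin(3 * x)
y_train = truth(x_train) + 0.2 * rng.normal(size=x_train.shape)

for degree in [1, 3, 9, 14]:                       # more capacity = better proxy optimization
    coeffs = np.polyfit(x_train, y_train, degree)
    proxy = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)          # training error (proxy)
    true_obj = np.mean((np.polyval(coeffs, x_test) - truth(x_test)) ** 2)  # what we actually care about
    print(f"degree {degree:2d}: train MSE {proxy:.4f}, test MSE {true_obj:.2f}")
```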
Jan 26, 2021
CALL FOR TASKS CAPTURING LIMITATIONS OF LARGE LANGUAGE MODELS

We are soliciting contributions of tasks to a *collaborative* benchmark designed to measure and extrapolate the capabilities and limitations of large language models. Submit tasks at github.com/google/BIG-Ben… #BIGbench

All accepted task submitters will be co-authors on the paper releasing the benchmark. Teams at Google and OpenAI will further evaluate BIG-Bench on their best-performing model architectures, across models ranging from tens of thousands to hundreds of billions of parameters.
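Purely as a hypothetical illustration of what a simple question-answer task contribution contains (the field names below are my guesses, not the official BIG-bench schema; the repo documents the real format):

```python
# Hypothetical illustration only: a toy task as a Python dict. Field names are
# assumptions; consult github.com/google/BIG-bench for the actual submission format.
toy_task = {
    "name": "three_digit_addition",            # hypothetical task name
    "description": "Add two three-digit numbers.",
    "keywords": ["arithmetic", "limitations"],
    "examples": [
        {"input": "123 + 456 =", "target": "579"},
        {"input": "900 + 101 =", "target": "1001"},
    ],
}
```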
Aug 8, 2020
"Finite Versus Infinite Neural Networks: an Empirical Study." arxiv.org/abs/2007.15801 This paper contains everything you ever wanted to know about infinite width networks, but didn't have the computational capacity to ask! Like really a lot of content. Let's dive in. Infinite width Neural Network Gaussian Process (NNGP) and Neural Tangent Kernel (NTK) predictions can outperform finite networks, depending on architecture and training practices. For fully connected networks the infinite width limit reliably outperforms the finite network. Image