Rylan Schaeffer
Nov 1, 2022 · 16 tweets
Very excited to announce our #NeurIPS2022 paper No Free Lunch from Deep Learning in Neuroscience: A Case Study through Models of the Entorhinal-Hippocampal Circuit.

It's a story about NeuroAI, told through a story about grid & place cells.

Joint w/ @KhonaMikail @FieteGroup 1/15
@KhonaMikail @FieteGroup The promises of deep learning-based models of the brain are that they (1) shed light on the brain’s fundamental optimization problems/solutions, and/or (2) make novel predictions. We show, using deep network models of the MEC-HPC circuit, that one may get neither! 2/15
@KhonaMikail @FieteGroup Prior work claims that training networks to path integrate generically creates grid units. We empirically show & analytically explain why grid-like units emerge only in a small, biologically invalid region of hyperparameter space chosen post hoc by the programmer. 3/15
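For readers new to this literature, here is a minimal sketch of the path-integration setup these claims concern (an illustrative architecture and dimensions, not the paper's exact configuration):

import torch
import torch.nn as nn

# Minimal path-integration model (illustrative, not the paper's exact setup).
# The network receives a sequence of 2D velocities and is trained to report
# position via a readout against some supervised spatial target encoding.
class PathIntegrator(nn.Module):
    def __init__(self, n_hidden=256, n_targets=512):
        super().__init__()
        self.rnn = nn.RNN(input_size=2, hidden_size=n_hidden, batch_first=True)
        self.readout = nn.Linear(n_hidden, n_targets)

    def forward(self, velocities):           # (batch, time, 2)
        states, _ = self.rnn(velocities)     # (batch, time, n_hidden)
        return self.readout(states)          # predicted target activations

The question the thread interrogates is whether hexagonal grid-like tuning in the hidden states follows from the task itself or from the choice of supervised target.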
@KhonaMikail @FieteGroup Result 1: Of the >11,000 networks we trained, most learned to path integrate accurately, but <10% of the networks able to do so exhibited even **possibly** grid-like units (using a generous measure of “grid-like”). Path integration does not create grid units! 4/15
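For context, “grid-like” is usually quantified with a grid score computed from a unit's spatial autocorrelogram. A minimal sketch of the standard recipe (masking details and thresholds vary across papers):

import numpy as np
from scipy.ndimage import rotate

# Standard grid-score recipe (sketch; exact masking/thresholds vary by paper).
# A hexagonal firing pattern correlates with itself under 60/120-degree
# rotations but not under 30/90/150-degree rotations.
def grid_score(ratemap: np.ndarray) -> float:
    # Spatial autocorrelogram via zero-padded FFT
    padded = [2 * n - 1 for n in ratemap.shape]
    f = np.fft.fft2(ratemap - ratemap.mean(), s=padded)
    autocorr = np.fft.fftshift(np.fft.ifft2(f * np.conj(f)).real)

    def corr_at(angle_deg):
        rotated = rotate(autocorr, angle_deg, reshape=False)
        return np.corrcoef(autocorr.ravel(), rotated.ravel())[0, 1]

    return min(corr_at(60), corr_at(120)) - max(corr_at(30), corr_at(90), corr_at(150))

A unit counts as “grid-like” if its score exceeds some threshold; the point above is that even a generous threshold is rarely crossed.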
@KhonaMikail @FieteGroup Result 2: Grid units emerge only under a specific (& problematic - more later!) supervised target encoding. Cartesian & Radial readouts never yielded grid units, nor did Gaussian-shaped place cell-like readouts. Difference-of-Softmaxes readouts are necessary! 5/15
@KhonaMikail @FieteGroup What is this choice of supervised target, and why is it problematic? To produce grid-like units, the “place cell” population **must** have: (i) a single field per place cell, (ii) a single population-wide scale, (iii) a specific tuning curve called a Difference of Softmaxes. 6/15
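Concretely, a Difference-of-Softmaxes target looks roughly like this (a sketch based on published implementations; the widths are illustrative):

import numpy as np

# Difference-of-Softmaxes (DoS) place-cell targets (sketch; widths illustrative).
# Each "place cell" i has exactly one center c_i; its activation at position x
# is a narrow population-normalized bump minus a wider one.
def dos_targets(pos, centers, sigma1=0.12, sigma2=0.24):
    # pos: (n_points, 2); centers: (n_cells, 2); widths in meters (12 cm, 24 cm)
    sq_dists = ((pos[:, None, :] - centers[None, :, :]) ** 2).sum(-1)

    def softmax_bump(sigma):
        logits = -sq_dists / (2 * sigma ** 2)
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    return softmax_bump(sigma1) - softmax_bump(sigma2)

Note that sigma1 here is the “place cell width” hyperparameter that reappears below in Result 5.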
@KhonaMikail @FieteGroup But real place cells don’t have any of these! Place cells have (i) multiple fields per cell, with (ii) heterogeneous scales, and (iii) diverse tuning curves nothing like Difference-of-Softmaxes. Shoutout to @MariRSosa for helping me find the beautiful example tuning curve! 7/15
@KhonaMikail @FieteGroup @MariRSosa In order to produce grid-like units, one must use biologically incorrect supervised targets to bake the desired result into the networks. When grid-like units do emerge, do they at least have key properties of grid cells (multiple modules, specific ratios between modules)? No! 8/15
@KhonaMikail @FieteGroup @MariRSosa Result 3: Multiple modules do not emerge - over a sweep around ideal hyperparameters, the grid period distribution is always unimodal, in contrast with the brain. Artificial grid periods are set by a hyperparameter choice and so do not provide a fundamental prediction. 9/15
@KhonaMikail @FieteGroup @MariRSosa Result 4: We can analytically explain these empirical results using Fourier analysis of a Turing instability, similar to that used in first-principles continuous attractor models. 10/15
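For intuition, the generic Turing-instability argument runs roughly as follows (a schematic of the standard pattern-formation calculation, not the paper's exact derivation):

% Linear rate dynamics with an effective lateral interaction kernel J:
\[
\partial_t r(x,t) = -r(x,t) + (J * r)(x,t)
\]
% Fourier transforming decouples the spatial modes:
\[
\dot{\hat r}(k,t) = \bigl(-1 + \hat J(k)\bigr)\,\hat r(k,t)
\quad\Rightarrow\quad
\hat r(k,t) \propto e^{(-1 + \hat J(k))\,t}
\]
% The fastest-growing wavenumber $k^\ast = \arg\max_k \hat J(k)$ dominates,
% so the emergent period $\approx 2\pi/k^\ast$ is set by the kernel's width,
% i.e., by the DoS target's hyperparameters rather than by the task.

This is why the grid period tracks a hyperparameter choice rather than constituting a fundamental prediction.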
@KhonaMikail @FieteGroup @MariRSosa Result 5: Grid-like unit emergence is highly sensitive to one hyperparameter -- the width of the “place cells” -- and occurs far less often if that hyperparameter is changed by a tiny amount: e.g., 12 cm works well; 11 cm and 13 cm do not. 11/15
@KhonaMikail @FieteGroup @MariRSosa Result 6: What happens if we try making the supervised target “place cells” more biologically realistic by adding a small amount of heterogeneity and permitting place cells to have > 1 field? Grid-like units don’t appear, even though task performance is unaffected! 12/15
@KhonaMikail @FieteGroup @MariRSosa Takeaway for MEC/HPC: (1) Biologically incorrect supervised targets are specifically chosen to bake grid-like units into the networks, even though (2) the emergent grid-like units lack key properties of biological grid cells (multiple modules, module ratios). 13/15
@KhonaMikail @FieteGroup @MariRSosa Takeaway for NeuroAI: It is highly improbable that a path integration objective for ANNs would have produced grid cells as a novel prediction, had grid cells not been known to exist. Thus, our results challenge the notion that DL offers a free lunch for Neuroscience. 14/15
@KhonaMikail @FieteGroup @MariRSosa Full paper & reviews: openreview.net/forum?id=syU-X…
Public code: github.com/FieteLab/Fiete…

Questions, comments & criticisms welcome! 15/15
Also important to note: @mikkelhei's lab independently found the same result:

"When analysing the spacing of cells with high grid score we could not find multiple modules."

biorxiv.org/content/10.110…

More from @RylanSchaeffer

Oct 23
📢New preprint📢

🔄Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World 🔄

A deeper dive into the effects of self-generated synthetic data on model-data feedback loops

w/ @JoshuaK92829 @ApratimDey2 @MGerstgrasser @rm_rafailov @sanmikoyejo

1/9
The increasing presence of AI-generated content on the internet raises a critical question:

What happens when #GenerativeAI is pretrained on web-scale datasets containing data created by earlier models?

Many have prophesied that such models will progressively degrade - Model Collapse!

(fig. from @NaturePortfolio)

2/9
Contribution #1: The model collapse phenomenon studied by the @NaturePortfolio 2024 paper is attributable to deleting data en masse between model-fitting iterations.

If data instead accumulate over time, then model collapse is avoided.

Multivariate Gaussian modeling:

3/9
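A minimal sketch of that fit-then-sample loop, with a 1D Gaussian standing in for the paper's multivariate case (illustrative sample sizes):

import numpy as np

rng = np.random.default_rng(0)

# Iterated Gaussian fitting (sketch): fit a Gaussian, sample synthetic data,
# then either REPLACE the training pool or ACCUMULATE synthetic with past data.
def iterate(accumulate, n_iters=200, n_samples=100):
    pool = rng.normal(0.0, 1.0, n_samples)  # real data, true std = 1
    for _ in range(n_iters):
        mu, sigma = pool.mean(), pool.std()           # fit the model
        synthetic = rng.normal(mu, sigma, n_samples)  # sample from it
        pool = np.concatenate([pool, synthetic]) if accumulate else synthetic
    return pool.std()

print("replace:   ", iterate(accumulate=False))  # std tends to drift toward 0
print("accumulate:", iterate(accumulate=True))   # std stays near 1

Replacing data compounds estimation error at every iteration; accumulating anchors every fit to the original real data.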
Oct 14
My 2nd-to-last #neuroscience paper will appear @unireps!!

🧠🧠 Maximizing Neural Regression Scores May Not Identify Good Models of the Brain 🧠🧠

w/ @KhonaMikail @neurostrow @BrandoHablando @sanmikoyejo

Answering a puzzle 2 years in the making



1/12 openreview.net/forum?id=vbtj0…
Our story begins in 2014: an influential methodology in #neuroscience is pioneered by @dyamins & Jim DiCarlo, arguing that task-optimized deep networks should be considered good models of the brain if (linear) regressions predict biological population responses well.

2/12
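In sketch form, the methodology looks like this (illustrative shapes; papers differ on the regression variant and cross-validation details):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Neural regression score (sketch): linearly map model activations to recorded
# neural responses and score predictions on held-out stimuli.
def neural_regression_score(model_acts, neural_resps):
    # model_acts: (n_stimuli, n_units); neural_resps: (n_stimuli, n_neurons)
    X_tr, X_te, Y_tr, Y_te = train_test_split(
        model_acts, neural_resps, random_state=0)
    reg = Ridge(alpha=1.0).fit(X_tr, Y_tr)
    return reg.score(X_te, Y_te)  # held-out R^2

A high score is then read as evidence that the network is a good model of the brain; that inferential step is what the paper scrutinizes.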
This neural regression methodology becomes wildly popular in vision, audition, and language.

A NeurIPS 2021 Spotlight extends the regression methodology to spatial navigation in medial entorhinal cortex (MEC).

They find certain deep networks are ✨amazing✨ models of MEC.

3/12
Jul 26
Yesterday, I tweeted that model collapse appears when researchers intentionally induce it in ways that don't match what is done in practice

Let me explain using the Shumailov et al. @Nature 2024 paper's methodology as an example

Paper: nature.com/articles/s4158…

🧵⬇️

1/N
Model collapse arose from asking: what happens when synthetic data from previous generative models enter the pretraining data supply used to train new generative models?

I like Shumailov et al.'s phrasing:

"What happens to GPT generations GPT-{n} as n increases?"

2/N
Let's identify realistic pretraining conditions for frontier AI models to make sure we study the correct setting

1. Amount of data: 📈 Llama went from 1.4T tokens to 2T tokens to 15T tokens

2. Amount of chips: 📈 Llama went from 2k to 4k to 16k GPUs

3/N
Jun 10
❤️‍🔥❤️‍🔥Excited to share our new paper ❤️‍🔥❤️‍🔥

**Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?**

w/ @haileysch__ @BrandoHablando @gabemukobi @varunrmadan @herbiebradley @ai_phd @BlancheMinerva @sanmikoyejo



1/N arxiv.org/abs/2406.04391
Predictable behavior from scaling AI systems is desirable. While scaling laws are well established, how *specific* downstream capabilities scale is significantly muddier, e.g., @sy_gadre @lschmidt3 @ZhengxiaoD @jietang




Why?

2/N arxiv.org/abs/2403.08540
arxiv.org/abs/2403.15796
@sy_gadre @lschmidt3 @ZhengxiaoD @jietang We identify a new factor for widely used multiple-choice QA benchmarks, e.g., MMLU:

Downstream performance is computed from negative log-likelihoods via a sequence of transformations that progressively deteriorate the statistical relationship between performance and scale.

3/N
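Schematically, that chain of transformations looks like this (a sketch; the paper analyzes how each step affects predictability):

import numpy as np

# From per-choice negative log-likelihoods to benchmark accuracy (sketch).
# Each step discards information about the underlying model probabilities.
def mcqa_accuracy(nll_correct, nll_incorrect):
    # nll_correct: (n_questions,); nll_incorrect: (n_questions, n_distractors)
    p_correct = np.exp(-nll_correct)             # NLL -> probability mass
    p_incorrect = np.exp(-nll_incorrect)
    right = p_correct > p_incorrect.max(axis=1)  # compare against distractors
    return right.mean()                          # 0/1 outcomes -> accuracy

The correct choice's NLL may improve smoothly with scale, but per the paper, the comparison against distractors and the 0/1 thresholding are lossy steps that degrade the statistical relationship with scale.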
May 1
What happens when generative models are trained on their own outputs?

Prior works foretold a catastrophic feedback loop, a curse of recursion, with models descending into madness as they consume their own outputs. Are we poisoning the very data necessary to train future models?

1/N
Excited to announce our newest preprint!

Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

w/ @MGerstgrasser @ApratimDey2 @rm_rafailov @sanmikoyejo @danintheory @Andr3yGR @Diyi_Yang David Donoho



2/N arxiv.org/abs/2404.01413
@MGerstgrasser @ApratimDey2 @rm_rafailov @sanmikoyejo @danintheory @Andr3yGR @Diyi_Yang Many prior works consider training models solely on data generated by the preceding model, i.e., data are replaced at each model-fitting iteration. Replacing data leads to collapse, but isn't done in practice.

What happens if data instead accumulate across iterations?

3/N
Mar 28, 2023
A few weeks ago, Stanford AI Alignment @SAIA_Alignment read @AnthropicAI's "Superposition, Memorization, and Double Descent." Double descent is relatively easy to describe, but **why** does double descent occur?



1/8 transformer-circuits.pub/2023/toy-doubl…
@SAIA_Alignment @AnthropicAI Prior work answers why double descent occurs, but we wanted an intuitive explanation that doesn't require random matrix theory or statistical mechanics. Our new preprint identifies and interprets the **3** ingredients necessary for double descent, using ordinary linear regression!



2/8 arxiv.org/abs/2303.14151
@SAIA_Alignment @AnthropicAI Using intro linear algebra, we show the difference between the best possible predictions and the fitted model's predictions, in both the underparameterized & overparameterized regimes, revealing an interaction between **3 quantities that are necessary to produce double descent**.

3/8
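A self-contained illustration of double descent in ordinary linear regression (a sketch in the spirit of the preprint, with illustrative sizes):

import numpy as np

rng = np.random.default_rng(0)
n_features, n_test = 50, 1000
w_true = rng.normal(size=n_features)
X_test = rng.normal(size=(n_test, n_features))
y_test = X_test @ w_true

# Minimum-norm least squares tends to spike in test error when the number of
# training samples crosses the number of features (interpolation threshold).
for n_train in [10, 25, 45, 50, 55, 100, 500]:
    X = rng.normal(size=(n_train, n_features))
    y = X @ w_true + 0.1 * rng.normal(size=n_train)  # small label noise
    w_hat = np.linalg.pinv(X) @ y                    # min-norm solution
    test_mse = np.mean((X_test @ w_hat - y_test) ** 2)
    print(f"n_train={n_train:4d}  test MSE={test_mse:.3f}")

Expect the largest error near n_train = n_features = 50, where small singular values of X amplify the label noise.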
