Very excited to announce our #NeurIPS2022 paper No Free Lunch from Deep Learning in Neuroscience: A Case Study through Models of the Entorhinal-Hippocampal Circuit.
It's a story about NeuroAI, told through a story about grid & place cells.
@KhonaMikail @FieteGroup The promises of deep learning-based models of the brain are that they (1) shed light on the brain’s fundamental optimization problems/solutions, and/or (2) make novel predictions. We show, using deep network models of the MEC-HPC circuit, that one may get neither! 2/15
@KhonaMikail @FieteGroup Prior work claims that training networks to path integrate generically creates grid units (left). We empirically show & analytically explain why grid-like units emerge only in a small, biologically invalid subset of hyperparameter space chosen post hoc by the programmer (right). 3/15
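For readers new to this setup, here’s a minimal sketch of the standard path-integration training pipeline: an RNN receives velocity inputs and is trained, with supervision, to output a target spatial code at each timestep. The layer sizes and names below are illustrative assumptions, not any specific paper’s exact configuration.

```python
import torch
import torch.nn as nn

class PathIntegratorRNN(nn.Module):
    """Sketch of the common setup: recurrent units integrate velocities;
    a linear readout is trained to match a supervised spatial target
    (e.g. a "place cell" code). Grid-like tuning, when it appears,
    is sought in the recurrent hidden units."""
    def __init__(self, n_hidden=256, n_place_cells=512):
        super().__init__()
        self.rnn = nn.RNN(input_size=2, hidden_size=n_hidden, batch_first=True)
        self.readout = nn.Linear(n_hidden, n_place_cells)

    def forward(self, velocities):           # velocities: (batch, time, 2)
        hidden, _ = self.rnn(velocities)     # (batch, time, n_hidden)
        return self.readout(hidden)          # predicted spatial code (batch, time, n_place_cells)
```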
@KhonaMikail @FieteGroup Result 1: Of the >11,000 networks we trained, most learned to accurately path integrate, but <10% of the networks able to do so exhibited **possible** grid-like units (using a generous measure of “grid-like”). Path integration does not create grid units! 4/15
@KhonaMikail @FieteGroup Result 2: Grid units emerge only under a specific (& problematic - more later!) supervised target encoding. Cartesian & Radial readouts never yielded grid units, nor did Gaussian-shaped place cell-like readouts. Difference-of-Softmaxes readouts are necessary! 5/15
@KhonaMikail @FieteGroup What is this choice of supervised target, and why is it problematic? To produce grid-like units, the “place cell” population **must** have: (i) a single field per place cell, (ii) a single population-wide scale, (iii) a specific tuning curve called a Difference of Softmaxes. 6/15
@KhonaMikail @FieteGroup But real place cells don’t have any of these! Place cells have (i) multiple fields per cell, with (ii) heterogeneous scales, and (iii) diverse tuning curves nothing like Difference-of-Softmaxes. Shoutout to @MariRSosa for helping me find the beautiful example tuning curve! 7/15
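For concreteness, here’s a minimal sketch of what the (biologically unrealistic) Difference-of-Softmaxes target looks like: a narrow softmax bump over place-field centers minus a broader softmax surround. The function name, widths, and surround ratio are illustrative assumptions, not the exact values from prior work.

```python
import numpy as np

def dos_place_cell_targets(pos, centers, sigma=0.12, surround_ratio=2.0):
    """Difference-of-Softmaxes (DoS) supervised targets (sketch).
    pos: (T, 2) trajectory positions; centers: (N, 2) place-field centers.
    Returns a (T, N) target code: narrow softmax bump minus broad softmax surround."""
    d2 = ((pos[:, None, :] - centers[None, :, :]) ** 2).sum(-1)    # (T, N) squared distances

    def softmax(logits):
        logits = logits - logits.max(axis=1, keepdims=True)
        e = np.exp(logits)
        return e / e.sum(axis=1, keepdims=True)

    center = softmax(-d2 / (2 * sigma ** 2))                       # narrow excitatory bump
    surround = softmax(-d2 / (2 * (surround_ratio * sigma) ** 2))  # broader inhibitory surround
    return center - surround

# Toy usage in a 1 m x 1 m arena
rng = np.random.default_rng(0)
targets = dos_place_cell_targets(rng.uniform(0, 1, (100, 2)), rng.uniform(0, 1, (512, 2)))
```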
@KhonaMikail @FieteGroup @MariRSosa In order to produce grid-like units, one needs to use biologically incorrect supervised targets to bake the desired result into the networks. When grid-like units emerge, do they at least have key properties of grid cells (multiple modules, specific ratios btwn modules)? No! 8/15
@KhonaMikail @FieteGroup @MariRSosa Result 3: Multiple modules do not emerge - over a sweep around ideal hyperparameters, the grid period distribution is always unimodal, in contrast with the brain. Artificial grid periods are set by a hyperparameter choice and so do not provide a fundamental prediction. 9/15
@KhonaMikail @FieteGroup @MariRSosa Result 4: We can analytically explain these empirical results using Fourier analysis of a Turing instability, similar to that in first-principles continuous attractor models. 10/15
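Rough intuition (my shorthand here, not the paper’s full derivation): in a Turing-type linear stability analysis, a translation-invariant interaction amplifies the spatial wavenumber where its Fourier transform peaks, so a periodic pattern emerges when that peak sits at a nonzero wavenumber.

```latex
% Linearizing a rate model \tau \dot{r} = -r + W * \phi(r) about a uniform
% steady state r_0, a perturbation at wavenumber k grows at rate
\lambda(k) \propto -1 + \phi'(r_0)\,\hat{W}(k),
\qquad k^{*} = \arg\max_{k} \hat{W}(k),
\qquad \text{emergent period} \approx \frac{2\pi}{k^{*}}.
```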
@KhonaMikail @FieteGroup @MariRSosa Result 5: Grid-like unit emergence is highly sensitive to one hyperparameter -- the width of the “place cells” -- and occurs much less often if the hyperparameter is changed by a tiny amount, e.g. 12 cm works well, but 11 cm and 13 cm do not. 11/15
@KhonaMikail @FieteGroup @MariRSosa Result 6: What happens if we try making the supervised target “place cells” more biologically realistic by adding a small amount of heterogeneity and permitting place cells to have > 1 field? Grid-like units don’t appear, even though task performance is unaffected! 12/15
@KhonaMikail @FieteGroup @MariRSosa Takeaway for MEC/HPC: (1) Biologically incorrect supervised targets are specifically chosen to bake grid-like units into the networks, even though (2) the emergent grid-like units lack key properties of biological grid cells (multiple modules, module ratios). 13/15
@KhonaMikail @FieteGroup @MariRSosa Takeaway for NeuroAI: It is highly improbable that a path integration objective for ANNs would have produced grid cells as a novel prediction, had grid cells not been known to exist. Thus, our results challenge the notion that DL offers a free lunch for Neuroscience. 14/15
The increasing presence of AI-generated content on the internet raises a critical question:
What happens when #GenerativeAI is pretrained on web-scale datasets containing data created by earlier models?
Many have prophesied that such models will progressively degrade - Model Collapse!
(fig. from @NaturePortfolio)
2/9
Contribution #1: The model collapse phenomenon studied by the @NaturePortfolio 2024 paper is attributable to deleting data en masse between model-fitting iterations (left).
If data instead accumulate over time, then model collapse is avoided.
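Here’s a toy illustration of the replace-vs-accumulate distinction, fitting a 1-D Gaussian over and over (a cartoon of the setting with illustrative names and values, not our paper’s actual experiments): replacing data each iteration drives the fitted variance toward zero, while accumulating keeps it roughly stable.

```python
import numpy as np

rng = np.random.default_rng(0)

def iterate_fits(n_iters=300, n_per_iter=100, accumulate=False):
    """Repeatedly fit a 1-D Gaussian, then sample 'synthetic' data from the fit.
    accumulate=False: each fit sees only the previous model's samples (replace).
    accumulate=True: each fit sees all real + synthetic data gathered so far."""
    data = rng.normal(0.0, 1.0, n_per_iter)        # initial 'real' data
    for _ in range(n_iters):
        mu, sigma = data.mean(), data.std()        # model fitting
        synthetic = rng.normal(mu, sigma, n_per_iter)
        data = np.concatenate([data, synthetic]) if accumulate else synthetic
    return sigma ** 2                              # variance of the final fitted model

print("replace:    final fitted variance ~", round(iterate_fits(accumulate=False), 3))
print("accumulate: final fitted variance ~", round(iterate_fits(accumulate=True), 3))
```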
Our story begins in 2014: An influential methodology in #neuroscience is pioneered by @dyamins & Jim DiCarlo, arguing that task-optimized deep networks should be considered good models of the brain if (linear) regressions predict biological population responses well
2/12
This neural regressions methodology becomes wildly popular in vision, audition, language
A NeurIPS 2021 Spotlight extends this regressions methodology to spatial navigation in medial entorhinal cortex (MEC)
They find certain deep networks are ✨amazing✨ models of MEC
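For anyone new to the recipe, here’s a minimal sketch of the neural-regressions methodology with synthetic stand-in data (shapes, variable names, and the choice of sklearn’s RidgeCV are illustrative assumptions): linearly map model-unit activations to recorded responses and score held-out prediction.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins: activations of a task-optimized network and recorded neural responses
model_acts = rng.normal(size=(500, 256))                      # (n_stimuli, n_model_units)
neural_resp = (model_acts @ rng.normal(size=(256, 64))
               + 0.5 * rng.normal(size=(500, 64)))            # (n_stimuli, n_neurons)

X_tr, X_te, Y_tr, Y_te = train_test_split(model_acts, neural_resp, random_state=0)
reg = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_tr, Y_tr)  # ridge-regularized linear map
print("held-out R^2 ('neural predictivity'):", reg.score(X_te, Y_te))
```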
Model collapse arose from asking: what happens when synthetic data from previous generative models enters the pretraining data supply used to train new generative models?
I like Shumailov et al.'s phrasing:
"What happens to GPT generations GPT-{n} as n increases?"
2/N
Let's identify realistic pretraining conditions for frontier AI models to make sure we study the correct setting:
1. Amount of data: 📈 Llama went from 1.4T tokens to 2T tokens to 15T tokens
2. Amount of chips: 📈 Llama went from 2k to 4k to 16k GPUs
Predictable behavior from scaling AI systems is desirable. While scaling laws are well established, how *specific* downstream capabilities scale is significantly muddier, e.g. @sy_gadre @lschmidt3 @ZhengxiaoD @jietang
@sy_gadre @lschmidt3 @ZhengxiaoD @jietang We identify a new factor for widely used multiple-choice QA benchmarks, e.g. MMLU:
Downstream performance is computed from negative log likelihoods via a sequence of transformations that progressively deteriorate the statistical relationship between performance and scale
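Concretely, here’s a stripped-down sketch of that chain of transformations under one common scoring scheme (function and variable names are illustrative; real benchmarks add choice-specific normalizations that we omit here): per-choice negative log likelihoods become probabilities restricted to the choices, then a hard argmax, then averaged 0/1 correctness.

```python
import numpy as np

def mcqa_accuracy_from_nlls(nlls, correct_idx):
    """nlls: (n_questions, n_choices) negative log likelihoods of each choice's text.
    Chain: NLL -> probability mass renormalized over the choices -> argmax -> accuracy."""
    log_p = -np.asarray(nlls)
    p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    p = p / p.sum(axis=1, keepdims=True)          # probabilities restricted to the listed choices
    predictions = p.argmax(axis=1)                # hard choice per question
    return (predictions == np.asarray(correct_idx)).mean()

# Toy usage: 3 questions, 4 choices each
nlls = np.array([[2.1, 0.7, 3.0, 2.5],
                 [1.2, 1.1, 1.3, 0.9],
                 [0.4, 2.2, 2.0, 1.8]])
print(mcqa_accuracy_from_nlls(nlls, correct_idx=[1, 0, 0]))   # -> 0.666...
```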
What happens when generative models are trained on their own outputs?
Prior works foretold a catastrophic feedback loop, a curse of recursion, descending into madness as models consume their own outputs. Are we poisoning the very data necessary to train future models?
1/N
Excited to announce our newest preprint!
Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data
w/ @MGerstgrasser @ApratimDey2 @rm_rafailov @sanmikoyejo @danintheory @Andr3yGR @Diyi_Yang, and David Donoho
@MGerstgrasser @ApratimDey2 @rm_rafailov @sanmikoyejo @danintheory @Andr3yGR @Diyi_Yang Many prior works consider training models solely on data generated by the preceding model, i.e. data are replaced at each model-fitting iteration. Replacing data leads to collapse, but isn’t done in practice.
What happens if data instead accumulate across each iteration?
A few weeks ago, Stanford AI Alignment @SAIA_Alignment read @AnthropicAI 's "Superposition, Memorization, and Double Descent." Double descent is relatively easy to describe, but **why** does double descent occur?
@SAIA_Alignment @AnthropicAI Prior work answers why double descent occurs, but we wanted an intuitive explanation that doesn’t require RMT or stat mech. Our new preprint identifies & interprets the **3** necessary ingredients for double descent, using ordinary linear regression!
@SAIA_Alignment @AnthropicAI Using intro linear algebra, we show the difference between the best possible prediction and the fitted model’s predictions, in both the underparam & overparam regimes, revealing an interaction btwn **3 quantities that are necessary to produce double descent**
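For a taste of the setup (a toy numpy sketch, not the preprint’s exact experiments or notation): minimum-norm least squares with a fixed number of features already shows the characteristic test-error spike near the interpolation threshold as the number of training samples is swept.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_test, noise = 50, 2_000, 0.5
w_true = rng.normal(size=n_features)
X_test = rng.normal(size=(n_test, n_features))
y_test = X_test @ w_true + noise * rng.normal(size=n_test)

# Sweep the number of training samples past the interpolation threshold (== n_features)
for n_train in (10, 25, 45, 50, 55, 100, 400):
    X = rng.normal(size=(n_train, n_features))
    y = X @ w_true + noise * rng.normal(size=n_train)
    w_hat = np.linalg.pinv(X) @ y                    # minimum-norm least-squares fit
    mse = np.mean((X_test @ w_hat - y_test) ** 2)
    print(f"n_train={n_train:4d}  test MSE={mse:10.2f}")   # spikes near n_train == n_features
```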