@AI_for_Science@KhonaMikail@FieteGroup The central promise of DL-based models of the brain is that they (1) shed light on the brain’s fundamental optimization problems/solutions, and/or (2) make novel predictions. We show, using DL models of grid cells in the MEC-HPC circuit, that one often gets neither 2/13
@AI_for_Science@KhonaMikail@FieteGroup Prior work claims that training artificial networks (ANNs) on a path integration task generically creates grid cells (a). We empirically show and analytically explain why grid cells only emerge in a small subset of hyperparameter space chosen post-hoc by the programmer (b). 3/13
@AI_for_Science@KhonaMikail@FieteGroup Result 1: Of the >3500 networks we trained, 60% learned to accurately encode position but only 7% exhibited **possible** grid-like cells (and the hyperparameter sweep was already biased towards settings that create grid cells) 4/13
@AI_for_Science@KhonaMikail@FieteGroup Result 2: Grid cell emergence requires a highly specific supervised target encoding: simple Cartesian and radial spatial readouts never yielded grid cells, nor did Gaussian-shaped place cell-like readouts. Grid cell emergence required difference-of-softmaxed-Gaussian readouts 5/13
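For concreteness, here is a minimal numpy sketch of what a difference-of-softmaxed-Gaussians (DoS) readout target can look like; the widths, arena size, and cell count below are illustrative assumptions, not the exact settings used in the trained networks.

```python
import numpy as np

def dos_readout_targets(pos, centers, sigma_narrow=0.12, sigma_wide=0.24):
    """Toy difference-of-softmaxed-Gaussians (DoS) readout target (illustrative values only)."""
    d2 = np.sum((centers - pos) ** 2, axis=1)        # squared distance from position to each center
    narrow = np.exp(-d2 / (2 * sigma_narrow ** 2))
    narrow /= narrow.sum()                           # softmax of the narrow Gaussian
    wide = np.exp(-d2 / (2 * sigma_wide ** 2))
    wide /= wide.sum()                               # softmax of the wide Gaussian
    return narrow - wide                             # center-surround (DoS) supervised target vector

centers = np.random.default_rng(0).uniform(0, 2.2, size=(512, 2))   # hypothetical 2.2 m box, 512 cells
targets = dos_readout_targets(np.array([1.1, 1.1]), centers)
```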
@AI_for_Science@KhonaMikail@FieteGroup Result 3: Artificial grid periods are set by a hyperparameter choice and so do not provide a fundamental prediction; multiple modules do not emerge. Over a wide sweep producing ideal grid units, the grid period distribution is unimodal, in contrast with the multiple discrete periods found in the brain 6/13
@AI_for_Science@KhonaMikail@FieteGroup Result 4: We can analytically explain why we observe these empirical results, using Fourier analysis of a Turing instability similar to that in first-principles continuous attractor models 7/13
@AI_for_Science@KhonaMikail@FieteGroup Result 5: Grid unit emergence is highly sensitive to one hyperparameter -- the readout receptive field width -- and does not occur if the hyperparameter is changed by a tiny amount, e.g. 12 cm yields grid units, 11 cm and 13 cm do not 8/13
@AI_for_Science@KhonaMikail@FieteGroup Result 7: Grid cell emergence in prev publications also relies on a *critical but unstated* implementation detail. We use Fourier analysis and numerical simulations to explain why this particular and unusual implementation choice is necessary. 9/13
@AI_for_Science@KhonaMikail@FieteGroup Result 8: Artificial grid units disappear with more biologically realistic place cells: adding a small amount of heterogeneity to place cell receptive fields causes grid cells to disappear 10/13
@AI_for_Science@KhonaMikail@FieteGroup Takeaway: It is highly improbable that a path integration objective for ANNs would have produced grid cells as a novel prediction, had grid cells not been known to exist. Thus, our results challenge the notion that DL offers a free lunch for Neuroscience 11/13
@AI_for_Science@KhonaMikail@FieteGroup Prospective Puzzle: ANN grid models have been claimed to explain variance in mouse MEC activity almost as well as variance explained by other mice. How are these networks able to predict mouse MEC neural activity so well? 12/13
@AI_for_Science@KhonaMikail@FieteGroup Prospective Answer: Deep networks may appear to be better models of biological networks because they provide higher-dimensional bases than alternative models, and thus trivially achieve higher correlation scores for linear regression-based comparisons. 13/13
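A minimal sketch of why dimensionality alone can inflate regression-based comparison scores, assuming a toy setup with purely random "model" features and pure-noise "neural" responses (not the cross-validated pipelines used in the actual comparisons):

```python
import numpy as np

rng = np.random.default_rng(0)
n_stimuli, n_neurons = 200, 50
neural = rng.normal(size=(n_stimuli, n_neurons))      # stand-in "biological" responses: pure noise

for dim in (10, 50, 100, 190):                        # dimensionality of the candidate model's basis
    features = rng.normal(size=(n_stimuli, dim))      # random features with no brain-relevant content
    coef = np.linalg.lstsq(features, neural, rcond=None)[0]
    pred = features @ coef
    r2 = 1 - ((neural - pred) ** 2).sum() / ((neural - neural.mean(0)) ** 2).sum()
    print(f"basis dim = {dim:3d}   in-sample R^2 = {r2:.2f}")   # R^2 climbs purely with dimensionality
```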
The increasing presence of AI-generated content on the internet raises a critical question:
What happens when #GenerativeAI is pretrained on web-scale datasets containing data created by earlier models?
Many have prophesied that such models will progressively degrade - Model Collapse!
(fig. from @NaturePortfolio)
2/9
Contribution #1: The model collapse phenomenon studied by the @NaturePortfolio 2024 paper is attributable to deleting data en masse between model-fitting iterations (left).
If data instead accumulate over time, then model collapse is avoided
Our story begins in 2014: An influential methodology in #neuroscience is pioneered by @dyamins & Jim DiCarlo, arguing that task-optimized deep networks should be considered good models of the brain if (linear) regressions predict biological population responses well
2/12
This neural regressions methodology becomes wildly popular in vision, audition, language
A NeurIPS 2021 Spotlight extends this regressions methodology to spatial navigation in medial entorhinal cortex (MEC)
They find certain deep networks are ✨amazing ✨ models of MEC
Model collapse arose from asking: what happens when synthetic data from previous generative models enters the pretraining data supply used to train new generative models?
I like Shumailov et al.'s phrasing:
"What happens to GPT generations GPT-{n} as n increases?"
2/N
Let's identify realistic pretraining conditions for frontier AI models to make sure we study the correct setting
1. Amount of data: 📈 Llama went from 1.4T tokens to 2T tokens to 15T tokens
2. Amount of chips: 📈 Llama went from 2k to 4k to 16k GPUs
Predictable behavior from scaling AI systems is desirable. While scaling laws are well established, how *specific* downstream capabilities scale is significantly muddier, e.g. @sy_gadre @lschmidt3 @ZhengxiaoD @jietang
@sy_gadre @lschmidt3 @ZhengxiaoD @jietang We identify a new factor for widely-used multiple choice QA benchmarks e.g. MMLU:
Downstream performance is computed from negative log likelihoods via a sequence of transformations that progressively deteriorate the statistical relationship between performance and scale
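A hedged sketch of that chain of transformations for a multiple-choice benchmark; the function and variable names are illustrative, not any benchmark harness's actual code.

```python
import numpy as np

def mcqa_accuracy_from_nlls(nll_choices, correct_idx):
    """Chain of transformations from per-choice negative log-likelihoods to accuracy (illustrative)."""
    log_p = -nll_choices                          # 1. NLL -> log-likelihood (varies smoothly with scale)
    p = np.exp(log_p)                             # 2. exponentiate to probabilities
    p_rel = p / p.sum(axis=1, keepdims=True)      # 3. renormalize against the incorrect choices
    picked = p_rel.argmax(axis=1)                 # 4. argmax discards all margin information
    return (picked == correct_idx).mean()         # 5. mean 0/1 correctness = reported benchmark accuracy

# toy example: 3 questions, 4 answer choices each
nlls = np.array([[1.2, 0.4, 2.0, 1.7], [0.9, 1.1, 1.0, 1.3], [2.1, 2.0, 0.3, 1.8]])
print(mcqa_accuracy_from_nlls(nlls, correct_idx=np.array([1, 0, 2])))   # -> 1.0
```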
What happens when generative models are trained on their own outputs?
Prior works foretold a catastrophic feedback loop, a curse of recursion, in which models descend into madness as they consume their own outputs. Are we poisoning the very data necessary to train future models?
1/N
Excited to announce our newest preprint!
Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data
w/ @MGerstgrasser @ApratimDey2 @rm_rafailov @sanmikoyejo @danintheory @Andr3yGR @Diyi_Yang, and David Donoho
@MGerstgrasser @ApratimDey2 @rm_rafailov @sanmikoyejo @danintheory @Andr3yGR @Diyi_Yang Many prior works consider training models solely on data generated by the preceding model i.e. data are replaced at each model-fitting iteration. Replacing data leads to collapse, but isn’t done in practice.
What happens if data instead accumulate across each iteration?
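A toy illustration of the two settings, using a one-dimensional Gaussian as a stand-in for a generative model (sizes and seed are arbitrary assumptions, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_generations = 100, 500
real = rng.normal(0.0, 1.0, size=n)                    # the original real data

def final_std(accumulate):
    data = real.copy()
    for _ in range(n_generations):
        mu, sigma = data.mean(), data.std()            # "fit" a Gaussian model to the current data
        synthetic = rng.normal(mu, sigma, size=n)      # sample a new synthetic dataset from the model
        data = np.concatenate([data, synthetic]) if accumulate else synthetic
    return data.std()

print("replace data each generation:    std =", round(final_std(False), 3))  # typically collapses toward 0
print("accumulate data each generation: std =", round(final_std(True), 3))   # stays near the real scale (~1)
```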
A few weeks ago, Stanford AI Alignment @SAIA_Alignment read @AnthropicAI 's "Superposition, Memorization, and Double Descent." Double descent is relatively easy to describe, but **why** does double descent occur?
@SAIA_Alignment @AnthropicAI Prior work answers why double descent occurs, but we wanted an intuitive explanation that doesn’t require random matrix theory (RMT) or statistical mechanics. Our new preprint identifies and interprets the **3** necessary ingredients for double descent, using ordinary linear regression!
@SAIA_Alignment @AnthropicAI Using intro linear algebra, we show the difference between the best possible predictions and the fit model’s predictions, in both the underparameterized & overparameterized regimes, revealing an interaction between **3 quantities that are necessary to produce double descent**
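Not the preprint's derivation, but a minimal numerical sketch of the phenomenon it dissects, using minimum-norm ordinary least squares with illustrative sizes; test error typically spikes near the interpolation threshold (d = n_train) and falls again past it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d_total = 40, 2000, 200
beta = rng.normal(size=d_total) / np.sqrt(d_total)     # true coefficients of the data-generating process

X_tr = rng.normal(size=(n_train, d_total))
X_te = rng.normal(size=(n_test, d_total))
y_tr = X_tr @ beta + 0.1 * rng.normal(size=n_train)
y_te = X_te @ beta + 0.1 * rng.normal(size=n_test)

for d in (5, 20, 35, 40, 45, 80, 200):                 # number of features the model is allowed to use
    w = np.linalg.pinv(X_tr[:, :d]) @ y_tr             # minimum-norm least-squares solution
    mse = np.mean((X_te[:, :d] @ w - y_te) ** 2)
    print(f"d = {d:3d}   test MSE = {mse:.3f}")        # peak expected near d = n_train
```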