A thread on the history of RL/ML based on Andy Barto's talk #RLC2024: the Reinforcement Learning Conference.
Beyond seeing friends & giving talks/panel, talking to @RichardSSutton & hearing Andy Barto revived a need for attention to historical psych/neuro influences on AI.
1/n🧵
Andy Barto started off the talk defining RL in terms of Search (trial & error, generate & test, variation & selection) + Memory (caching past solutions),
leading to RL as "General contextual search".
He then took us through a tour of historical intellectual influences on RL 2/n
First, RL work with @RichardSSutton & contemporary intellectual influences:
The logic of computers group @ Michigan (Burks, Holland, Zeigler: cellular automata, modeling, simulation)
The systems neuroscience center @ Amherst (Arbib, Kilmer, Spinelli: Adaptive Intelligence). 3/n
Learning by Trial & Error in RL is a la Klopf's law of effect for synaptic plasticity: Hedonistic Neurons maximize a local analog of pleasure & minimize a local analog of pain. Synapses active during an action potential become eligible for change: increase weight if rewarded, decrease if punished 4/n
Andy Barto distinguished 2 kinds of Eligibility for weight change.
Contingent Eligibility depends on pre- and post-synaptic activity, leading to a 3-factor learning rule.
Non-contingent Eligibility is triggered by only pre-synaptic activity, leading to a 2-factor learning rule. 5/n
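To make the two kinds of eligibility concrete, here is a minimal sketch. The function names, the learning rate `lr`, and the choice of a scalar reinforcement signal as the extra factor are my assumptions for illustration, not formulas from the talk.

```python
def contingent_eligibility(pre, post):
    """Eligible only when pre- AND post-synaptic activity coincide."""
    return pre * post


def noncontingent_eligibility(pre):
    """Eligible whenever the pre-synaptic input was active."""
    return pre


def three_factor_update(w, pre, post, reward, lr=0.1):
    # Three factors: pre-synaptic activity, post-synaptic activity, reward.
    return w + lr * reward * contingent_eligibility(pre, post)


def two_factor_update(w, pre, signal, lr=0.1):
    # Two factors: pre-synaptic activity and a reinforcement signal
    # (assumed here to be a scalar such as a prediction error).
    return w + lr * signal * noncontingent_eligibility(pre)
```

With the 3-factor rule a rewarded synapse changes only if the neuron actually fired; with the 2-factor rule the post-synaptic activity plays no role.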
Barto & @RichardSSutton's seminal first paper (1981): Toward a modern theory of adaptive networks: expectation & prediction.
Influences include
-Klopf: learning by trial & error (the idea dates back to the 1800s)
-adaptive intelligence
-inspired by synaptic plasticity
-Sutton's BA in psych! 6/n
Barto acknowledging the many disciplines, interactions, & lineages of ideas that shaped RL was refreshing.
-TD error & alg for temporal credit assignment were inspired by interactions with psychology
-Pole balancing paper in collaboration with Chuck Anderson
Next: actor-critic 7/n
The Actor-Critic architecture:
The actor is responsible for learning the policy, a mapping from states to actions, to decide which action to take in a given state.
An adaptive critic uses a value function to evaluate the actor's policy & translate reward into a TD error for learning. 8/n
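A minimal tabular sketch of that actor/critic split, assuming a softmax policy over action preferences and a one-step TD error. The names (`V`, `prefs`, `actor_critic_step`) and step sizes are hypothetical; this is one common textbook form, not necessarily the exact architecture shown in the talk.

```python
import math


def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(V, prefs, s, a, r, s_next, done,
                      gamma=0.9, alpha_v=0.1, alpha_pi=0.1):
    """One update after taking action a in state s; returns the TD error.

    V     -- list of state values (the critic), mutated in place
    prefs -- per-state lists of action preferences (the actor), mutated in place
    """
    # Critic: compare the old estimate V[s] with the new one r + gamma * V[s_next].
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]
    # Actor gradient uses the policy as it was when the action was chosen.
    probs = softmax(prefs[s])
    V[s] += alpha_v * td_error
    # Actor: the TD error, not the raw reward, drives the policy change.
    for b in range(len(prefs[s])):
        indicator = 1.0 if b == a else 0.0
        prefs[s][b] += alpha_pi * td_error * (indicator - probs[b])
    return td_error
```

A positive TD error makes the chosen action more likely in that state and a negative one makes it less likely, while the critic nudges its own estimate toward the target.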
Second part: Barto dove into the early history of machine learning
- Thomas Ross 1933 Thinking machine
- Stevenson Smith 1935 (psychology) Robot rats
- Grey Walter 1948 (neuroscience) Machina Speculatrix
- Alan Turing 1948: Pleasure-Pain system, earliest call to implement RL? 9/n
- Farley & Clark 1954 first simulation of ANN learning on a digital computer
- Minsky 1954 "Neural Nets and the brain-model problem", SNARCs (stochastic neural-analog reinforcement calculators), "Steps toward AI" (1961).
Challenge: Structural & Temporal credit assignment 10/n
- Farley & Clark 1955 generalization of pattern recognition in a self-organizing system
- Frank Rosenblatt 1958 Perceptron "Foundation of AI"
- Arthur Samuel 1959-67 Checkers player (was RL)
- Widrow & Hoff 1960 Adaptive Linear Neuron, Widrow-Hoff algorithm, LMS 11/n
- Widrow et al 1973 Selective Bootstrap Adaptation. Rewarded? Treat the committed action as the target, then do LMS. Punished? Treat the alternative action as the target, then do LMS.
- Michael Tsetlin 1960s Learning Automata (& modeling biological systems 1973), teams & games 12/n
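The selective bootstrap idea above is easy to sketch for a single linear threshold unit with two actions (+1 / -1). The function names, the learning rate `lr`, and the ±1 action coding are my assumptions for illustration; the LMS step itself is the standard Widrow-Hoff update.

```python
def lms_update(w, x, target, lr=0.1):
    """Widrow-Hoff / LMS: move the linear output w.x toward the target."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    err = target - pred
    return [wi + lr * err * xi for wi, xi in zip(w, x)]


def selective_bootstrap_step(w, x, rewarded, lr=0.1):
    """Returns (new_weights, action) for one rewarded or punished trial."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    action = 1.0 if s >= 0 else -1.0  # the committed action
    # Rewarded: bootstrap on the unit's own action.
    # Punished: use the alternative action as the target instead.
    target = action if rewarded else -action
    return lms_update(w, x, target, lr), action
```

So the critic's reward/punish signal never supplies a target directly; it only selects which of the two possible targets the supervised LMS step should use.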
- Schultz, Dayan, & Montague 1997 Reward-Prediction-Error in brains: Phasic activity of dopamine neurons signals the error between an old & a new estimate of future reward
Barto noted the critic in RL can take virtually any unmodeled influence & turn that into a learning signal 13/n
Dopamine-inspired Collective Learning: reinforcement is broadcast to a team of RL units. A potential alt to backprop, but RL broadcast didn't scale because of the structural credit assignment problem (getting the signal to the right place).
cf Barto 1985 Learning by statistical cooperation 14/n
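A rough sketch of what "broadcasting reinforcement to a team" can look like, assuming Bernoulli-logistic units and a REINFORCE-style update. This is my own illustration with hypothetical names (`broadcast_update`, `lr`), not the rule from Barto 1985.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def broadcast_update(weights, x, actions, reward, lr=0.1):
    """Every unit gets the SAME scalar reward; no unit-specific error signal.

    weights -- one weight vector per unit
    x       -- shared input vector
    actions -- 0/1 output each unit actually emitted on this trial
    """
    new_weights = []
    for w, a in zip(weights, actions):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))  # firing probability
        # Each unit correlates the broadcast reward with its own deviation
        # from its expected output (a - p).
        new_weights.append([wi + lr * reward * (a - p) * xi
                            for wi, xi in zip(w, x)])
    return new_weights
```

Because each unit only sees one shared scalar, the per-unit signal gets noisier as the team grows, which is one way to read the scaling problem mentioned above.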
In the end, a call for critical thinking. Andy Barto shared stories cautioning about the challenges of designing reward signals.
Quoting Norbert Wiener's example of The Monkey's Paw:
"... it grants what you asked for, not what you should have asked for or what you intend"
15/n
Andy Barto then thanked his former students, said RL is not a cult, & found himself facing a standing ovation from the audience.
I appreciated the history of ML from the POV of developing a learning framework, & how interdisciplinary interactions of ideas shaped it.
TY #RLC2024
n/n
Many thanks to all organizers, esp @MarlosCMachado & @robertarail for heroic paper awards & for making RLC a warm & friendly experience.
Special thanks to @RichardSSutton for continual support & generous discussions on both specific & big picture ideas on learning & intelligence.
Delighted to share our #neurips2023 paper w @grockious @hmd_palangi et al
Evaluating Cognitive Maps & Planning in LLMs with CogEval
We test planning in 8 LLMs.
Failures like hallucinating invalid paths/falling in loops don't support emergent planning. 1/n arxiv.org/abs/2309.15129
Recently, an influx of studies has claimed emergent cognitive abilities in LLMs & doomers warn of AI planning a takeover.
But can LLMs plan?!
Such claims often lack systematic evaluation involving multiple tasks, control conditions, iterations, stats, etc.
We make 2 contributions. 2/n
1-We propose CogEval: a cognitive science-inspired protocol for systematic evaluation of cognitive capacities in LLMs.
Inspired by @mcxfrank's "Experimentology" CogEval operationalizes a capacity w multiple tasks, iterations, domains, & can be applied to various abilities. 3/n
Navigation Turing Test (NTT): Learning to Evaluate Human-Like Navigation arxiv.org/abs/2105.09637
accepted @ ICML
We propose a method to evaluate human-like navigation
🧵1/n
Many algorithms pass benchmarks, like navigation from a given location to a goal location in 3D games.
But passing benchmarks guarantees neither human-like navigation behavior nor cognitively or neurally plausible human-like algorithms/representations. This matters whether...2/n
...the goal is to use the algorithm to understand human behavior or cognition, as in cog neuro,
or to design agents that generate human-like behavior in Xbox games so humans can play w agents as a team.
Would pursuing these goals simultaneously accelerate achieving both? 3/n
Thrilled to share new work w Stacey Sinclair & @profcikara! Computational Justice: Simulating Structural Bias and Interventions.
We ran an agent-based simulation of structural bias, with params set from studies.
Then we simulated/compared different interventions. 1/n biorxiv.org/content/biorxi…
We distinguish interpersonal bias (sexism) & structural bias, & allow social learning. We exclude gender differences in interpersonal bias to isolate the effect of structural bias. Unequal gender ratios => gender differences in # sexist comments received & increase in p(sexism). 2/n
According to empirical findings, 40% of women confront sexism: 10% 3/3 times, 10% 2/3, 20% 1/3. Men perceive sexism reported by women 50% of the time => we set their p(confront) = 1/2 of women's. Receiving sexism or an objection has a cost => costs to women & institutions are higher in unequal ratios. 3/n