"Using Large Language Models to Simulate Multiple Humans"
What if we asked language models to complete text describing what a human would do in a situation? Would they produce realistic answers? How close to human behavior would they get? [1/14]
The authors of this paper answer these questions by simulating classic psych studies, with participant responses given by GPT-3 variants. [2/14]
Studies they consider include: classifying sentences as grammatical or ungrammatical, the Milgram electric shock study (en.wikipedia.org/wiki/Milgram_e…), and the ultimatum game. [3/14]
They found that the responses of the large models roughly matched what humans actually did in these experiments. E.g., the relative frequency of… [6/14]
…subjects accepting an offer in the simulated ultimatum game matched the true relative frequency when the experiment is run on humans. [7/14]
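To make the setup concrete, here's a rough sketch of what simulating one ultimatum-game participant with a GPT-3 completion model could look like. This is my own illustration, not the authors' actual prompt; the prompt wording and model choice are assumptions, and it uses the legacy OpenAI completions API.

```python
# Illustrative only: simulate an ultimatum-game participant by asking a GPT-3
# completion model to finish a description of the situation. The prompt text
# and model name are assumptions, not the paper's exact setup.
import openai

prompt = (
    "Alex and Sam take part in a study. Alex is given $10 and offers Sam $3, "
    "keeping $7. If Sam accepts the offer, both keep their money; if Sam "
    "rejects it, neither gets anything.\n"
    "Sam decides to"
)

response = openai.Completion.create(
    model="davinci",      # one of the GPT-3 variants; exact choice is an assumption
    prompt=prompt,
    max_tokens=5,
    temperature=1.0,
)
print(response["choices"][0]["text"])  # e.g. " accept the offer" or " reject it"
```

Run something like this many times, varying names and offer amounts, and you get a distribution of simulated accept/reject decisions to compare against human data.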
There’s a similar result for the Milgram shock study. Like humans, GPT-3 simulating humans would rather give high-voltage shocks to a subject than disobey the experimenter. [8/14]
This is an impressive level of similarity to human behavior, but I’ll add a couple caveats. First, as the authors note, there are likely descriptions of these studies in the training data. They tried to tweak the text and structure, but there could still be leakage. [9/14]
Second, the responses people give in these experiments vary greatly across cultures (authors.library.caltech.edu/2278/). So we can say the models match what some (mostly Western) humans do, but not what all humans do. [10/14]
But overall, I really appreciate this paper. It’s a set of questions I’ve never seen anyone ask before, with rigorous experiments to answer them. Publishing papers that… [11/14]
…don’t fit into an existing mold can be hard, so I just want to praise them for getting off the beaten path and trying something interesting. [12/14]
This is also further evidence for my thesis that cognitive science is becoming relevant for AI again.
"Understanding Scaling Laws for Recommendation Models"
For two years, the AI world has had this glorious period of believing that big tech companies just need more compute to make their models better, not more user data.
That period is ending. Here's what happened: [1/14]
In 2020, OpenAI published a paper (arxiv.org/abs/2001.08361) assessing the relative effects of scaling up models vs datasets. They found that scaling up models had *way* higher returns. [2/14]
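For reference, here's a sketch of the shape of that result, with approximate fitted constants from the paper (treat them as illustrative):

```python
# Power-law scaling from the 2020 paper, in rough form: test loss falls as a
# power law in parameter count N and dataset size D. Constants below are the
# approximate fitted values reported there; treat them as illustrative.
def loss_from_params(N, alpha_N=0.076, N_c=8.8e13):
    return (N_c / N) ** alpha_N

def loss_from_data(D, alpha_D=0.095, D_c=5.4e13):
    return (D_c / D) ** alpha_D

# The compute-optimal fit puts most of a growing budget into model size:
# roughly model size ~ C^0.73 vs data ~ C^0.27, so 10x the compute means
# ~5.4x the parameters but only ~1.9x the data.
def compute_optimal_growth(compute_factor):
    return {"model_size": compute_factor ** 0.73, "data": compute_factor ** 0.27}

print(compute_optimal_growth(10.0))
```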
The party was on. We got libraries like DeepSpeed (github.com/microsoft/Deep…) that let you train huge models across countless GPUs. We got trillion-parameter… [3/14]
"No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects"
Instead of using a pooling layer or having a stride for your conv, just use a space-to-depth op followed by a non-strided conv. [1/8]
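Here's a minimal PyTorch sketch of that substitution (my own illustration, not the authors' exact module): pixel_unshuffle rearranges each 2x2 spatial block into channels, so a stride-1 conv can downsample without throwing away pixels.

```python
# Space-to-depth + stride-1 conv as a drop-in replacement for a stride-2 conv.
# Resolution still drops 2x, but pixels move into channels instead of being skipped.
import torch
import torch.nn as nn

class SpaceToDepthConv(nn.Module):
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        self.space_to_depth = nn.PixelUnshuffle(scale)              # C -> C * scale^2
        self.conv = nn.Conv2d(in_ch * scale ** 2, out_ch,
                              kernel_size=3, stride=1, padding=1)   # no stride, no pooling

    def forward(self, x):
        return self.conv(self.space_to_depth(x))

x = torch.randn(1, 64, 32, 32)
y = SpaceToDepthConv(64, 128)(x)
print(y.shape)  # torch.Size([1, 128, 16, 16])
```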
This substitution seems to be an improvement. [2/8]
This is especially true for small models and when detecting small objects. Most importantly, these improvements seem to hold even when conditioning on single-image inference latency. This is important because it's easy to do "better" when you're slower. [3/8]
"Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models"
Another optimizer paper attempting to descend through a crowded valley to beat Adam. But...maybe this one actually does? [1/11]
Their update equation is fairly straightforward, and complements the gradient momentum term with a difference-of-gradients momentum term. [2/11]
It does have an extra hyperparameter compared to Adam (β3), but they hardcode it to 0.08 in all their experiments, so it’s apparently not important to tune. [3/11]
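For the curious, here's a rough NumPy sketch of the update's structure as I read it: a gradient-momentum term plus a difference-of-gradients momentum term, divided by an adaptive second-moment denominator. Bias correction and the exact default coefficients are omitted, so treat this as illustrative rather than a faithful reimplementation.

```python
# Rough sketch of an Adan-style update for one parameter tensor (illustrative;
# see the paper for bias correction and recommended coefficient values).
import numpy as np

def adan_style_step(theta, g, g_prev, state, lr, beta1, beta2, beta3,
                    weight_decay=0.0, eps=1e-8):
    m, v, n = state["m"], state["v"], state["n"]
    diff = g - g_prev
    m = (1 - beta1) * m + beta1 * g                    # gradient momentum
    v = (1 - beta2) * v + beta2 * diff                 # difference-of-gradients momentum
    n = (1 - beta3) * n + beta3 * (g + (1 - beta2) * diff) ** 2
    step = lr * (m + (1 - beta2) * v) / (np.sqrt(n) + eps)
    theta = (theta - step) / (1 + lr * weight_decay)   # decoupled weight decay
    state.update(m=m, v=v, n=n)
    return theta, state
```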
"What Can Transformers Learn In-Context? A Case Study of Simple Function Classes"
Can models learn new, non-trivial functions...with no parameter changes? Turns out the answer is yes, with in-context learning: [1/11]
In-context learning is when you include some examples as text in the prompt at test time. Here's a great illustration from @sewon__min et al. (arxiv.org/abs/2202.12837). [2/11]
What's new in this paper is that they systematically assess how well in-context learning works for various well-defined function classes. [3/11]
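To make "well-defined function classes" concrete, here's a tiny sketch of the kind of in-context episode involved, using linear functions. The helper names are mine; in the paper, the (x, f(x)) pairs are fed to a transformer as a sequence and it must predict f(x) for the query with no weight updates.

```python
# Build one in-context episode for the class of linear functions: a context of
# (x_i, f(x_i)) pairs plus a query point whose label the model must predict.
import numpy as np

def make_linear_episode(dim=8, n_examples=20, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)                       # hidden function f(x) = w . x
    xs = rng.normal(size=(n_examples + 1, dim))    # last row is the query
    ys = xs @ w
    context = list(zip(xs[:-1], ys[:-1]))
    query_x, target_y = xs[-1], ys[-1]
    return context, query_x, target_y

context, query_x, target_y = make_linear_episode()
print(len(context), query_x.shape, target_y)
```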
"Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP"
How well do image-text models trained on a given dataset generalize to other datasets? [1/12]
The answer is: it’s complicated. Different pretraining datasets work better for different downstream datasets. [2/12]
One interesting but inconvenient result is that mixing more upstream datasets doesn’t necessarily work better. The benefits of the best dataset get diluted by others. [3/12]
"Language Models Can Teach Themselves to Program Better"
This paper changed my thinking about what future language models will be good at, mostly in a really concerning way. Let's start with some context: [1/11]
To teach models to program, you used to give them a natural language prompt. But recent work has shown that you can instead just show them a unit test and tell them to… [2/11]
…generate a program that satisfies it (a “programming puzzle”). This is way nicer because it’s simpler and you can just run the code to see if it works. [3/11]
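Concretely, a programming puzzle is just a verifier function; any program whose output makes the verifier return True counts as a solution, so model-generated candidates can be checked automatically by running them. A toy example (mine, not from the paper):

```python
# A toy programming puzzle: the "unit test" is a verifier f, and a solution is
# any value that makes f return True. Checking is just executing code.
def puzzle(s: str) -> bool:
    """Find a string of length 10 containing exactly three 'a' characters."""
    return len(s) == 10 and s.count("a") == 3

def solve() -> str:          # a candidate solution (e.g., generated by a model)
    return "aaabbbbbbb"

assert puzzle(solve())       # run the code to see if it works
```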