2/ Student feedback is a fundamental problem in scaling education.
Providing good feedback is hard: existing approaches give canned responses, cryptic error messages, or simply reveal the answer.
3/ Providing feedback is also hard for ML: there’s not a ton of data, teachers frequently change their assignments, and student solutions are open-ended and long-tailed.
Supervised learning doesn’t work here. We weren’t sure this problem could even be solved with ML.
4/ But, we can frame it as a few-shot learning problem! Using data from past HWs and exams of Stanford’s intro CS course, we train a model to give feedback for a new problem with only ~20 examples.
Humans are critical to this process: instructors define a rubric & feedback text.
5/ Because this is open-ended Python code, our base architecture is transformers + prototypical networks.
But, there are many important details for this to *actually* work: task augmentation, question & rubric text as side info, preprocessing, and code pre-training.
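A minimal sketch of the prototypical-network classification step (assuming a transformer encoder has already embedded each student solution; the function and variable names here are illustrative, not the paper's actual code):

```python
import numpy as np

def prototypical_predict(support_emb, support_labels, query_emb):
    """Classify query embeddings by distance to class prototypes.

    support_emb: (n_support, d) encoder embeddings of the ~20 labeled examples
    support_labels: (n_support,) integer rubric labels
    query_emb: (n_query, d) embeddings of new student solutions
    """
    classes = np.unique(support_labels)
    # Prototype = mean embedding of each class's support examples
    protos = np.stack([support_emb[support_labels == c].mean(axis=0)
                       for c in classes])
    # Squared Euclidean distance from each query to each prototype
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return classes[d2.argmin(axis=1)]  # nearest prototype wins
```

In the actual system the encoder would also condition on the question and rubric text mentioned above; this sketch only shows the metric-learning step.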
6/ How well does this work?
In offline expts, meta-learning:
* achieves 8%-21% greater accuracy than supervised learning
* comes within 8% of a human TA on held-out exams.
Ablations show a >10% difference in accuracy with different design choices.
7/ Most importantly, this model was deployed on 16,000 student solutions in Code-in-Place, where it was previously not possible to give feedback.
In a randomized blind A/B test, students preferred model feedback slightly *more* than human feedback AND rated its usefulness 4.6/5.
8/ We also did several checks for bias. Among the countries & gender identities with the most representation (i.e. largest statistical power), we see no signs of bias.
Not too surprising given the model only sees typed Python code w/o comments, but super important to check.
9/ At the beginning of this project, we had no idea that the goal would be possible, let alone deployable.
I still remember very naively approaching @chrispiech about using meta-learning for education after watching a talk he gave >1.5 years ago. :)
10/ It’s super exciting to see both real-world impact of meta-learning algorithms + substantive progress on AI for education.
What should ML models do when there's a *perfect* correlation between spurious features and labels?
This is hard b/c the problem is fundamentally _underdefined_
DivDis can solve this problem by learning multiple diverse solutions & then disambiguating arxiv.org/abs/2202.03418
🧵
Prior works have made progress on robustness to spurious features but also have important weaknesses:
- They can't handle perfect/complete correlations
- They often need labeled data from the target distr. for hparam tuning
DivDis can address both challenges, using 2 stages:
1. The Diversify stage learns multiple functions that minimize training error but have differing predictions on unlabeled target data
2. The Disambiguate stage uses a few active queries to identify the correct function
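The two stages can be sketched in a toy form (assuming the diverse heads have already been trained; the names and the simple accuracy-based selection rule here are illustrative simplifications, not the paper's implementation):

```python
import numpy as np

def disagreement(preds_a, preds_b):
    # Diversify objective (toy proxy): fraction of unlabeled target
    # points where two heads make differing predictions
    return (preds_a != preds_b).mean()

def disambiguate(heads, query_x, query_y):
    """Pick the head that best matches a few actively queried labels."""
    accs = [np.mean(h(query_x) == query_y) for h in heads]
    return heads[int(np.argmax(accs))]
```

For example, if the training labels are perfectly correlated with both shape and background, one head may latch onto each feature; the heads agree on training data but disagree on decorrelated target data, and a couple of labeled queries reveal which one is correct.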
To get reward functions that generalize, we train domain-agnostic video discriminators (DVD) with:
* a lot of diverse human data, and
* a small, narrow set of robot demos
The idea is super simple: predict if two videos are performing the same task or not.
(2/5)
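The "super simple" pairwise objective above amounts to building same-task / different-task video pairs and training a binary classifier on them. A toy sketch of the pair construction (the data layout and function name are hypothetical):

```python
import random

def make_pairs(videos_by_task, n_pairs, rng=random.Random(0)):
    """Build (video_a, video_b, same_task) training pairs for the discriminator.

    videos_by_task: dict mapping a task name to a list of videos of that task
    (human and robot videos mixed together).
    """
    tasks = list(videos_by_task)
    pairs = []
    for _ in range(n_pairs):
        if rng.random() < 0.5:
            # Positive pair: two different videos of the same task
            t = rng.choice(tasks)
            a, b = rng.sample(videos_by_task[t], 2)
            pairs.append((a, b, 1))
        else:
            # Negative pair: videos of two different tasks
            t1, t2 = rng.sample(tasks, 2)
            pairs.append((rng.choice(videos_by_task[t1]),
                          rng.choice(videos_by_task[t2]), 0))
    return pairs
```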
This discriminator can be used as a reward by feeding in a human video of the desired task and a video of the robot’s behavior.
We use it by planning with a learned visual dynamics model.
(3/5)
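A minimal sketch of using the discriminator score as a planning reward (the function names are illustrative, and the candidate-ranking loop stands in for whatever planner is actually paired with the learned visual dynamics model):

```python
import numpy as np

def dvd_reward(discriminator, human_video, robot_video):
    # Reward = discriminator's probability that the two videos show the same task
    return discriminator(human_video, robot_video)

def plan(dynamics_model, discriminator, human_video, candidate_action_seqs, state):
    """Pick the action sequence whose predicted rollout best matches the demo."""
    scores = [
        dvd_reward(discriminator, human_video, dynamics_model(state, actions))
        for actions in candidate_action_seqs
    ]
    return candidate_action_seqs[int(np.argmax(scores))]
```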