Aaron Roth
Sep 19
Aligning an AI with human preferences might be hard. But there is more than one AI out there, and users can choose which to use. Can we get the benefits of a fully aligned AI without solving the alignment problem? In a new paper we study a setting in which the answer is yes.
Imagine predictive medicine LLMs being used by a doctor to help treat patients. One is made by Merck. The doctor wants to cure his patients as quickly as possible. The Merck LLM also tries to do this, but has a preference for Merck drugs, resulting in substantial misalignment.
Say there is another predictive medicine AI out there made by Pfizer. It is similar but has a preference for Pfizer drugs. Both AIs are substantially misaligned, but they are differently misaligned, and the doctor's utility is well approximated by the average of the AI utilities.
Merck and Pfizer both know that they are competing in an AI marketplace, and deploy their LLMs strategically to advance their goals. Given this competition, perhaps the doctor can use both AIs to the same effect as if there were an AI that was perfectly aligned.
We give several mathematical models of AI competition in which the answer is yes, provided that the user's utility function lies in the convex hull of the AI utility functions. Under this condition, all equilibria of the game between AI providers lead to high user utility.
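To make the condition concrete (my notation, not the thread's): write u for the user's utility and u_1, ..., u_k for the AI providers' utilities, each scoring the same set of outcomes. The assumption is roughly

    u \approx \sum_{i=1}^{k} \lambda_i u_i ,
    \qquad \lambda_i \ge 0 ,
    \qquad \sum_{i=1}^{k} \lambda_i = 1 ,

and the doctor example above is the special case k = 2 with lambda_1 = lambda_2 = 1/2.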
Even if all of the AI providers are very badly aligned, as long as the user's utility can be approximated by some non-negative linear combination of their utilities, the user does as well as they would with a perfectly aligned AI. Alignment emerges from competition.
We give simple experiments (where LLM personas are generated with prompt variation) demonstrating that representing user utility functions somewhere in the convex hull of LLM utility functions is a much easier target than finding a single well-aligned LLM utility function.
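Here is a minimal sketch of that comparison (mine, not the paper's experimental setup): treat each utility function as a hypothetical vector of scores over a finite set of outcomes, and measure how far the user's vector is from the convex hull of the providers' vectors.

    # Sketch only: utilities are made-up vectors of scores over outcomes,
    # not utilities elicited from LLM personas as in the paper.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    n_outcomes, n_providers = 50, 5
    U = rng.normal(size=(n_providers, n_outcomes))   # row i = provider i's utility
    user = rng.normal(size=n_outcomes)               # the user's utility

    def distance_to_convex_hull(user, U):
        """L2 distance from `user` to the convex hull of the rows of U."""
        k = U.shape[0]
        res = minimize(
            lambda lam: np.sum((lam @ U - user) ** 2),
            x0=np.full(k, 1.0 / k),
            bounds=[(0.0, 1.0)] * k,
            constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}],
            method="SLSQP",
        )
        return np.sqrt(res.fun), res.x

    hull_dist, weights = distance_to_convex_hull(user, U)
    single_best = min(np.linalg.norm(user - u) for u in U)
    print(f"closest single provider: {single_best:.2f}, convex hull: {hull_dist:.2f}")

The hull distance is never worse than the best single provider, and with several differently biased providers it is typically much smaller, which is the shape of the comparison the thread's experiments make.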
This is intuitive. Think about politics. It might be hard to find a public figure who agrees with you on every issue --- but it's probably not hard to find one who is to the right of you on most issues, and one who is to the left of you on most issues. Convex hulls are large.
We conduct another simple experiment in which we explicitly compute equilibria amongst differently misaligned sets of agents. The results validate our theory --- user utility can sometimes match our worst-case bound (so it's tight), but is often much better.
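For flavor, here is a toy version of that kind of computation (mine; the game in the paper is a specific model of providers competing for the user, not an arbitrary bimatrix game): brute-force the pure equilibria of a small two-provider game and read off the user's utility, taken here to be the average of the two providers' payoffs.

    # Toy illustration, not the paper's model: two providers each choose one of
    # n actions; (A, B) are their payoff matrices; the user's utility is the
    # average of the two. We enumerate pure-strategy Nash equilibria.
    import itertools
    import numpy as np

    rng = np.random.default_rng(1)
    n = 4
    A = rng.uniform(size=(n, n))     # provider 1's payoff
    B = rng.uniform(size=(n, n))     # provider 2's payoff
    user = (A + B) / 2               # user utility = average of provider utilities

    equilibria = [
        (i, j)
        for i, j in itertools.product(range(n), repeat=2)
        if A[i, j] >= A[:, j].max() and B[i, j] >= B[i, :].max()
    ]
    if not equilibria:
        print("no pure equilibrium in this draw")
    for i, j in equilibria:
        print(f"equilibrium ({i}, {j}): user utility {user[i, j]:.3f} "
              f"(best possible {user.max():.3f})")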
I'm excited about this line of work. We get clean results in a stylized setting, and there is much to do to bring these kinds of ideas closer to practice. But I think that ideas from market and mechanism design should have lots to say about the practical alignment problem too.
The paper is here: arxiv.org/abs/2509.15090 and is joint work with the excellent @natalie_collina, @SurbhiGoel_, Emily Ryu, and Mirah Shi.

More from @Aaroth

Apr 9, 2024
We'd like LLM outputs to be accompanied by confidence scores, indicating the confidence we should have in them. But what semantics should a confidence score have? A minimal condition is calibration: e.g. when we express 70% confidence, we should be correct 70% of the time. But...
LLM prompts can be very different. A model might be much more likely to hallucinate when asked for citations to the functional analysis literature vs. when asked for state capitals. Calibrated models can be systematically over-confident for one and under-confident for the other.
Multicalibrated confidence scores are calibrated not just overall, but conditional on extra context. Traditionally multicalibration is used with tabular data where features are explicit. What should we multicalibrate with respect to when we are scoring LLM completions?
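A small sketch of the gap multicalibration is meant to close (my toy numbers, not from the thread): a model that always says 70% can be calibrated on average while being under-confident on easy prompts and over-confident on hard ones.

    # Sketch: calibrated overall, miscalibrated within each kind of prompt.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    hard = rng.integers(0, 2, size=n).astype(bool)   # e.g. citation requests
    accuracy = np.where(hard, 0.5, 0.9)              # how often the answer is right
    correct = rng.random(n) < accuracy
    confidence = np.full(n, 0.7)                     # the model always says 70%

    def gap(conf, correct, mask):
        """Stated confidence minus empirical accuracy on a slice."""
        return conf[mask].mean() - correct[mask].mean()

    everyone = np.ones(n, dtype=bool)
    print("overall:      ", round(gap(confidence, correct, everyone), 3))   # ~ 0.0
    print("easy prompts: ", round(gap(confidence, correct, ~hard), 3))      # ~ -0.2
    print("hard prompts: ", round(gap(confidence, correct, hard), 3))       # ~ +0.2

Multicalibration asks the near-zero gap in the first line to hold on each slice (and intersections of slices), not just in aggregate.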
May 10, 2023
To what extent can calibrated predictions be viewed as "real probabilities" --- somehow a measure of truth, rather than estimation, even when there is no underlying probabilistic process? I'll explain a simple but striking early result of Philip Dawid that isn't so well known. 🧵
Calibration asks that probability estimates be self-consistent: averaged over all of the days I claim it should rain 20% of the time, it should rain 20% of the time. Similarly for 30%, 40%, etc. On its own calibration is quite weak.
If it happens to rain on cloudy days (which constitute half the days) but not on clear days, I can predict a 50% chance of rain every day and be calibrated. Or I can predict a 100% chance of rain on cloudy days and a 0% chance of rain on clear days, and be calibrated.
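A quick check of that example (a sketch of mine, not from the thread): both forecasters below pass a calibration test on the same weather, even though only one of them is informative.

    # Both forecasters are calibrated on a world where it rains exactly on cloudy days.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    cloudy = rng.random(n) < 0.5              # half the days are cloudy
    rain = cloudy                             # it rains exactly on cloudy days

    constant = np.full(n, 0.5)                # always predict 50%
    informed = np.where(cloudy, 1.0, 0.0)     # 100% on cloudy days, 0% on clear days

    def is_calibrated(pred, outcome, tol=0.01):
        """Within each predicted level, the empirical rain rate should match it."""
        return all(abs(outcome[pred == p].mean() - p) <= tol for p in np.unique(pred))

    print(is_calibrated(constant, rain), is_calibrated(informed, rain))   # True True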
Oct 3, 2022
Our new paper gives very simple algorithms that promise "multivalid" conformal prediction sets for exchangeable data. This means they are valid not just marginally, but also conditionally on (intersecting!) group membership, and in a threshold-calibrated manner. I'll explain! 🧵
Instead of making point predictions, we can quantify uncertainty by producing "prediction sets" --- sets of labels that contain the true label with (say) 90% probability. The problem is, in a k-label prediction problem, there are 2^k prediction sets. The curse of dimensionality!
One of the great ideas of conformal prediction is that if we can find a good "non-conformity score" s(x,y) telling us how unusual a label y seems for features x, we can focus on a 1-parameter family of prediction sets P(x, t) = {y : s(x,y) < t}. Now the problem is just to find t.
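As a point of reference, the vanilla (marginal, non-multivalid) version of this recipe is split conformal prediction; a rough sketch of it, with made-up scores, looks like this:

    # Plain split conformal prediction -- an illustration of P(x, t) = {y : s(x, y) < t},
    # not the multivalid algorithm from the paper.
    import numpy as np

    def conformal_threshold(cal_scores, alpha=0.1):
        """Threshold t from nonconformity scores s(x_i, y_i) on a calibration set."""
        n = len(cal_scores)
        q = np.ceil((n + 1) * (1 - alpha)) / n        # finite-sample corrected quantile
        return np.quantile(cal_scores, min(q, 1.0))

    def prediction_set(scores_for_x, t):
        """P(x, t): all labels whose nonconformity score is at most the threshold."""
        return [y for y, s in enumerate(scores_for_x) if s <= t]

    rng = np.random.default_rng(0)
    cal_scores = rng.random(500)                      # pretend s(x_i, y_i) on held-out data
    t = conformal_threshold(cal_scores, alpha=0.1)
    print(prediction_set([0.05, 0.40, 0.97], t))      # typically [0, 1] with t near 0.9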
Jun 3, 2022
Machine Learning is really good at making point predictions --- but it sometimes makes mistakes. How should we think about which predictions we should trust? In other words, what is the right way to think about the uncertainty of particular predictions? A thread about new work 🧵
First, some links. Here is our paper: arxiv.org/abs/2206.01067 Here is me giving a talk about it: simonsfoundation.org/event/robust-a… It’s joint work with Bastani, Gupta, @crispy_jung, Noarov, and Ramalingam. Our code will shortly be available on github, in the repository linked in the paper.
A natural way to quantify uncertainty is to predict a set of labels rather than a single one. Pick a degree of certainty --- say 90%. For every prediction we make, we'd like to return the smallest set of labels that is guaranteed to contain the true label 90% of the time.
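One way to sanity-check the 90% claim on held-out data (a sketch of mine, not the paper's evaluation code) is to measure empirical coverage directly:

    # Empirical coverage: how often the true label lands in its prediction set.
    def coverage(prediction_sets, true_labels):
        hits = sum(y in s for s, y in zip(prediction_sets, true_labels))
        return hits / len(true_labels)

    print(coverage([{0, 2}, {1}, {0, 1, 3}], [2, 1, 4]))   # 2/3 ~= 0.667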
Dec 18, 2020
Ok, I don't have favorite papers for 2020, but I did learn a couple of things this year. Here is a slow-moving thread of the ideas I learned about this year. Many of these ideas are old, but they were new to me! 🧵
1) Calibration via the minmax theorem. This is an old idea of Sergiu Hart's, that he originally communicated verbally to Foster and Vohra (it appears with credit in their classic 1998 paper). Sergiu wrote it up this year in this short note: ma.huji.ac.il/hart/papers/ca…
Suppose you are a weather forecaster, and you predict the probability of rain each day. Your forecasts are calibrated if on all of the days on which you predict a 10% chance of rain, it rains 10% of the time, etc. One way to be calibrated is to really understand weather.
Dec 8, 2019
A nice new fairness paper by Blum and Stangl: arxiv.org/pdf/1912.01094… They show that if there are two populations with the same base rate, but then data is biased either by undersampling positive examples from population B, or by corrupting positive labels in population B... 1/3
Then ERM subject to the constraint of equalizing true positive rates across groups recovers the optimal classifier on the original (unbiased) data distribution. Other fairness constraints (like also equalizing false positive rates, or asking for demographic parity) don't. 2/3
The really nice thing about this result is that unlike some other methods (like reweighting) for correcting biased data collection, this handles a relatively wide range of bias models in a detail-free way: the ERM algorithm doesn't need to know the parameters of the bias model.
