Amit Sharma · May 2, 2023
New paper: On the unreasonable effectiveness of LLMs for causal inference.

GPT-4 achieves new SoTA on a wide range of causal tasks: graph discovery (97%, 13-point gain), counterfactual reasoning (92%, 20-point gain) & actual causality.

How is this possible?🧵
arxiv.org/abs/2305.00050
LLMs do so by bringing in a new kind of reasoning based on text & metadata. We call it knowledge-based causal reasoning, distinct from existing data-based methods.

Essentially, LLMs approximate human domain knowledge: a big win for causal tasks that often depend on human input.
On pairwise causal discovery, LLMs like GPT-3.5/4 obtain >90% accuracy in detecting causal direction (does A cause B?) on the Tübingen benchmark, which spans physics, engineering, medicine & soil science. The prompt uses only the variable names and asks for the more likely causal direction. Previous SoTA: 83%.
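To make the setup concrete, here is a minimal sketch of what such a pairwise query could look like, using the OpenAI chat API. The prompt wording and model name are illustrative, not the paper's verbatim prompt:

```python
# Minimal sketch: ask an LLM for the likely causal direction between two
# variables, using only their names (no data).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def causal_direction(var_a: str, var_b: str, model: str = "gpt-4") -> str:
    prompt = (
        f"Which cause-and-effect relationship is more likely?\n"
        f"A. {var_a} causes {var_b}\n"
        f"B. {var_b} causes {var_a}\n"
        f"Answer with a single letter: A or B."
    )
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic answers help benchmarking
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

print(causal_direction("altitude", "mean air temperature"))  # expect "A"
```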
We obtain similar high accuracies on a specialized medical dataset on neuropathic pain. Here the relationships are not at all obvious, yet GPT-4 obtains 96% accuracy in detecting the correct direction.
The choice of prompt has a big impact, as we describe in the paper.
Let's move to a harder task: discovering the full causal graph. Previous work on the medical dataset predicted that LLMs won't work (low F1 score of 0.1).

Not quite. With simple prompt tuning, F1 shoots up to 0.7. On an arctic science dataset, GPT-4 outperforms recent deep learning methods.
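One simple way to extend pairwise queries to a full candidate graph (a sketch of the general recipe, not the paper's exact procedure) is to query every pair of variables and keep the edges the model affirms:

```python
# Sketch: build a candidate causal graph by querying each pair of variables.
# Prompt wording, model name, and the three-way choice are all illustrative.
from itertools import combinations

import networkx as nx
from openai import OpenAI

client = OpenAI()

def edge_query(a: str, b: str, model: str = "gpt-4") -> str:
    prompt = (
        f"Which is most likely?\n"
        f"A. {a} causes {b}\n"
        f"B. {b} causes {a}\n"
        f"C. Neither directly causes the other.\n"
        f"Answer with a single letter: A, B, or C."
    )
    resp = client.chat.completions.create(
        model=model, temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def discover_graph(variables: list[str]) -> nx.DiGraph:
    g = nx.DiGraph()
    g.add_nodes_from(variables)
    for a, b in combinations(variables, 2):  # O(n^2) LLM calls
        ans = edge_query(a, b)
        if ans.startswith("A"):
            g.add_edge(a, b)
        elif ans.startswith("B"):
            g.add_edge(b, a)  # an answer of "C" adds no edge
    return g
```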
Of course, LLMs make silly errors too (e.g., answering that an abalone's diameter causes its age), and we wouldn't want to trust them yet for critical applications.

But the surprising part is how **few** such errors there are on datasets spanning a broad range of human knowledge.
There is a big implication for causal effect inference: rather than relying on humans to provide the full graph, LLMs can be used to create candidate graphs or help critique graphs.

This is great, since building the graph is perhaps the most challenging part of causal analysis.
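For example, an LLM-proposed graph can then be dropped into a standard effect-inference pipeline. A hypothetical sketch using DoWhy follows; the dataset, variable names, and graph are made-up placeholders:

```python
# Sketch: feed an LLM-proposed candidate graph (after human review) into a
# standard effect-inference workflow with DoWhy.
import pandas as pd
from dowhy import CausalModel

df = pd.read_csv("smoking_study.csv")  # hypothetical dataset

# Candidate graph suggested by the LLM, encoded in GML.
gml = """graph [directed 1
  node [id "stress" label "stress"]
  node [id "smoking" label "smoking"]
  node [id "cancer" label "cancer"]
  edge [source "stress" target "smoking"]
  edge [source "stress" target "cancer"]
  edge [source "smoking" target "cancer"]
]"""

model = CausalModel(data=df, treatment="smoking", outcome="cancer", graph=gml)
estimand = model.identify_effect()   # finds the backdoor set {stress}
estimate = model.estimate_effect(
    estimand, method_name="backdoor.linear_regression")
print(estimate.value)
```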
The second part of the paper focuses on counterfactual reasoning. Can LLMs infer cause and effect from natural language?

This relates to actual causality, a notoriously challenging task due to the human factors involved in judging the relevant variables and their causal contributions.
Here too, GPT-3.5/4 outperform existing algorithms. On the CRASS benchmark of predicting outcomes under everyday counterfactual situations, GPT-4 obtains 92% accuracy, 20 points higher than the previous SoTA.

Ex.: "A woman sees a fire. What would happen if a woman had touched the fire?"
Next, can LLMs infer necessary and sufficient causes? We consider 15 challenging vignettes from the actual causality literature. GPT-3.5 fails here, but GPT-4 still achieves 86% accuracy.

Remarkable, because we now have a tool that can go directly from messy human text to causal attribution.
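For reference, Pearl gives these notions formal counterfactual definitions (the probabilities of necessity and of sufficiency); the vignettes probe whether the model's judgments track them:

```latex
% Probability of necessity: given that X and Y both occurred, Y would not
% have occurred had X not occurred.
PN = P\big(Y_{X=0} = 0 \mid X = 1,\; Y = 1\big)

% Probability of sufficiency: given that X and Y were both absent,
% setting X would have produced Y.
PS = P\big(Y_{X=1} = 1 \mid X = 0,\; Y = 0\big)
```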
While LLMs can infer the relevant variables from text, assessing human factors (e.g., is an action considered normal or not?) remains a hard task for LLMs. GPT-3.5/4 obtain poor accuracy on a causal judgment task from Big Bench that requires an algorithm to match human intuition.
Overall, LLMs bring a fresh new capability to causal inference, complementary to existing methods. We see a promising future for causality where LLMs assist and automate various steps in causal inference, seamlessly transitioning between knowledge-based and data-based reasoning.
That said, LLMs are not perfect & have unpredictable failure modes. Robustness checks also reveal memorized causal relationships that partially explain their performance.

So we still need principled causal algorithms. The upshot is that LLMs can be used to expand their reach & capability.
Looking ahead, this work raises more questions than it answers: how can LLMs help reinvent or augment existing causal tasks, and how can we make LLM reasoning more robust?

Paper: arxiv.org/abs/2305.00050
w/ @emrek @osazuwa @ChenhaoTan
On a personal note, I was a wild skeptic when I started looking at LLMs. But frankly, the results are too good to ignore!

We welcome your feedback. In particular, if you have ideas on causal tasks or examples where you expect LLMs to fail, do let us know. We can try them and add them to the paper.

More from @amt_shrma

Dec 20, 2022
#ChatGPT obtains SoTA accuracy on the Tübingen causal discovery benchmark, spanning cause-effect pairs across physics, biology, engineering and geology. Zero-shot, no training involved.

I'm beyond spooked. Can LLMs infer causality? 🧵 w/ @ChenhaoTan
The benchmark contains 108 pairs of variables and the task is to infer which one causes the other. Best accuracy using causal discovery methods is 70-80%. On 75 pairs we've evaluated, ChatGPT obtains 92.5%.

And it doesn't even use the data. All it needs are the variable names!
To infer causal direction, we create a custom prompt for ChatGPT, "Does changing A cause a change in B? Please answer in a single word: Yes or No." A and B correspond to variables in the benchmark.

That's it. All in a few hours of work. See github.com/amit-sharma/ch… for details.
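A minimal evaluation loop along these lines might look as follows. The CSV format and label parsing here are assumptions for illustration; the actual benchmark distributes per-pair data files with metadata:

```python
# Sketch: score an LLM on cause-effect pairs using the yes/no prompt above.
# Assumes a CSV with columns var_a, var_b, label ("A->B" or "B->A").
import csv

from openai import OpenAI

client = OpenAI()

def ask(a: str, b: str) -> bool:
    prompt = (f"Does changing {a} cause a change in {b}? "
              f"Please answer in a single word: Yes or No.")
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative stand-in for ChatGPT
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

correct = total = 0
with open("pairs.csv") as f:
    for row in csv.DictReader(f):
        predicted_forward = ask(row["var_a"], row["var_b"])
        if predicted_forward == (row["label"] == "A->B"):
            correct += 1
        total += 1
print(f"accuracy: {correct / total:.1%}")
```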
Jul 29, 2022
This ICML, I was (pleasantly) surprised to see a bunch of papers on causal inference, specifically on how machine learning can help in estimating causal effects.

Here are five cool ones. 🧵
1. On measuring causal contributions via do-interventions proceedings.mlr.press/v162/jung22a.h…

Shapley value, but each possible change in input is a do-intervention. Simple, principled idea for attribution. By @yonghanjung @patrickbloebaum @eliasbareinboim et al.
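The idea, roughly paraphrased: keep the Shapley combinatorics, but define the value of a coalition of features through an intervention rather than observational conditioning:

```latex
% Shapley attribution for feature i over feature set N, where the coalition
% value v(S) is defined via a do-intervention.
\phi_i = \sum_{S \subseteq N \setminus \{i\}}
         \frac{|S|!\,(|N|-|S|-1)!}{|N|!}
         \left[\, v(S \cup \{i\}) - v(S) \,\right],
\qquad
v(S) = \mathbb{E}\!\left[\, Y \mid \mathrm{do}(X_S = x_S) \,\right]
```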
2. Evaluating causal inference methods arxiv.org/abs/2202.04208

An ambitious, best-effort take on the fundamental problem of causal inference: train a generator to create many similar datasets with known causal effects & use them for model selection. By Harsh Parikh et al.
Jul 20, 2022
There is a lot of excitement about causal machine learning, but in what ways exactly can causality help with ML tasks?

In my work, I've seen four: enforcing domain knowledge, invariant regularizers, "counterfactual" augmentation & better framing for fairness & explanation. 🧵👇🏻
1) Enforcing domain knowledge: ML models can learn spurious correlations. Can we avoid this by using causal knowledge from experts?
Rather than full causal graphs, eliciting information on key relationships is a practical alternative. See the #icml2022 paper on how to enforce them: arxiv.org/abs/2111.12490
2) Invariant regularizers: For out-of-distribution generalization, another way is to add regularization constraints.

Causality can help us find the correct constraint for any dataset. It's also easy to show that no single constraint can work everywhere. Algorithm: arxiv.org/abs/2206.07837
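As one concrete instance of such a regularizer (a well-known example, not the linked paper's algorithm), the IRMv1 penalty of Arjovsky et al. penalizes environments where rescaling the classifier would lower the risk:

```python
# Sketch: IRMv1-style invariance penalty (Arjovsky et al., 2019).
import torch
import torch.nn.functional as F

def irm_penalty(logits: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Gradient of the per-environment risk w.r.t. a dummy scale w = 1.0;
    # at an invariant predictor this gradient vanishes in every environment.
    w = torch.tensor(1.0, requires_grad=True)
    risk = F.binary_cross_entropy_with_logits(logits * w, y.float())
    (grad,) = torch.autograd.grad(risk, [w], create_graph=True)
    return grad.pow(2)

# Total loss per training step:
#   loss = sum over environments e of [risk_e + lam * irm_penalty(logits_e, y_e)]
```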
Apr 10, 2019
Great to see the discussion on using DAGs versus potential outcomes for applied causal inference work with @yudapearl, @Jabaluck @autoregress @analisereal @PHuenermund

My take: DAGs and PO are compatible, and the best analyses benefit from using both. 1/7
In a KDD tutorial with @emrek, we outline how you can use DAGs and potential outcomes together for causal analysis and discuss empirical examples. 2/7 causalinference.gitlab.io/kdd-tutorial/
In fact, they are not only compatible; a proof in one framework can always be represented in the other. For an identification strategy in a recent paper on recommendation systems, I was able to write the same proof using DAGs and without DAGs.
projecteuclid.org/euclid.aoas/15…
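As a small illustration of that compatibility, backdoor adjustment and the potential-outcomes adjustment formula are the same quantity written in the two notations (under the corresponding graph/ignorability assumptions):

```latex
% DAG notation: backdoor adjustment over covariates Z
P(Y = y \mid \mathrm{do}(X = x)) = \sum_{z} P(Y = y \mid X = x, Z = z)\, P(Z = z)

% Potential-outcomes notation: under conditional ignorability Y(x) \perp X \mid Z
\mathbb{E}[Y(x)] = \mathbb{E}_{Z}\big[\, \mathbb{E}[Y \mid X = x, Z] \,\big]
```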
