Andrew White 🐦‍⬛
Head of Sci/cofounder @FutureHouseSF. Prof of chem eng @UofR (on sabbatical). Automating science with AI and robots in biology. Corvid enthusiast
Dec 31, 2024 7 tweets 4 min read
Finishing 2024 with one more research result! We’ve trained small language agents to do hard sci tasks: engineering proteins, manipulating DNA, and working with sci literature in a new library called Aviary. We beat humans and frontier LLMs on these tasks!

Aviary is a gymnasium of new scientific environments. Using behavior cloning, expert iteration, and consensus sampling we’ve trained Llama-3.1 8B agents to very high accuracy on challenging multi-step tasks. And at low cost!
futurehouse.org/research-annou…
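For readers curious what consensus sampling looks like in practice, here is a minimal sketch: run several independent rollouts of a stochastic agent on the same task and majority-vote the final answers. `sample_agent` is a hypothetical stand-in for one rollout, not Aviary's actual API.

```python
from collections import Counter

def consensus_answer(sample_agent, task, k=5):
    """Run k independent agent rollouts on a task and majority-vote the answers.

    sample_agent(task) is a hypothetical callable that performs one stochastic
    rollout and returns the agent's final answer as a string.
    """
    answers = [sample_agent(task) for _ in range(k)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / k  # consensus answer and the fraction of rollouts agreeing
```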
Oct 23, 2024 6 tweets 3 min read
We’ve just finished writing the missing 15,616 Wikipedia articles to get complete coverage of all 19,255 human genes. We used PaperQA2, which has higher accuracy than existing human-written Wikipedia articles, as judged by blinded biology PhD students and postdocs. 1/5
These articles considered almost 1M research papers. With our current infra, we could rewrite all 19.3k articles every week so they are always up to date. We could rewrite all articles about research on Wikipedia every three weeks. 2/5 wikicrow.ai
May 8, 2024 8 tweets 4 min read
ChemCrow is out today in @NatMachIntell! ChemCrow is an agent that uses chem tools and a cloud-based robotic lab for open-ended chem tasks. It’s been a journey to get to publication and I’d like to share some history about it. It started back in 2022. 1/8

I was working as a red teamer for GPT-4 and kept getting hallucinated molecules when trying to get up to trouble in chemistry. Then I tried the ReAct agent (from @ShunyuYao12) and quickly saw real molecules. This work eventually became public in the GPT-4 technical report. 2/8
Jun 6, 2023 4 tweets 2 min read
How can you learn to predict peptide properties without negative examples? This happens often when trying to analyze outputs from screening results. We explore various approaches in this new paper from @MehradAnsari. 1/4

biorxiv.org/content/10.110…

Peptide screening usually gives only positive examples, which makes it difficult to train a classifier. There is prior work on this, including one-class SVMs. We evaluate these and propose a modified algorithm built on "spies". 2/4
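For context, the classic "spy" step in positive-unlabeled learning works roughly like the sketch below: hide a fraction of known positives among the unlabeled data, train a positives-vs-unlabeled classifier, and treat unlabeled points scored below every spy as reliable negatives. This is the standard technique the paper builds on, not its modified algorithm, and logistic regression here is an arbitrary stand-in for the actual model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def spy_reliable_negatives(X_pos, X_unlab, spy_frac=0.15, seed=0):
    """Return the unlabeled points that score below every hidden 'spy' positive."""
    rng = np.random.default_rng(seed)
    spy_idx = rng.choice(len(X_pos), max(1, int(spy_frac * len(X_pos))), replace=False)
    keep_idx = np.setdiff1d(np.arange(len(X_pos)), spy_idx)

    # Train positives vs. (unlabeled + spies), pretending the spies are unlabeled.
    X = np.vstack([X_pos[keep_idx], X_pos[spy_idx], X_unlab])
    y = np.concatenate([np.ones(len(keep_idx)), np.zeros(len(spy_idx) + len(X_unlab))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # Any unlabeled point less "positive-looking" than the worst spy is a reliable negative.
    threshold = clf.predict_proba(X_pos[spy_idx])[:, 1].min()
    return X_unlab[clf.predict_proba(X_unlab)[:, 1] < threshold]
```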
Apr 12, 2023 4 tweets 2 min read
How can you check if a molecule is present in a >10B dataset in 0.2 ms? With bloom filters! Check out our preprint on bloom filters by @4everstudent95 1/4

Code: github.com/whitead/molblo…
Paper: arxiv.org/abs/2304.05386

Bloom filters are fast and can store ultra-large chemical libraries in RAM, at the cost of a false positive rate of 0.005 (can tune this!) 2/4
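As a toy illustration of the idea (not the molbloom implementation itself, which is far more optimized), a Bloom filter hashes each SMILES string to a few positions in a bit array; membership tests can give rare false positives but never false negatives:

```python
import hashlib

class SmilesBloom:
    """Toy Bloom filter over SMILES strings: k hash positions per molecule in a bit array."""

    def __init__(self, n_bits=1 << 24, n_hashes=7):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, smiles):
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{smiles}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, smiles):
        for p in self._positions(smiles):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, smiles):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(smiles))

bf = SmilesBloom()
bf.add("CCO")              # ethanol
print("CCO" in bf)         # True
print("c1ccccc1" in bf)    # False with high probability (benzene was never added)
```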
Apr 12, 2023 6 tweets 3 min read
Our preprint on using GPT-4 as an agent with tools for chemistry is out! We call it ChemCrow. Working with @SamCox822, @drecmb, and @pschwllr, we developed a set of tools for synthesis/cond, safety, commercial availability, patents, paper-qa

arxiv.org/abs/2304.05376 1/5

We, unsurprisingly, found that GPT-4 with tools is much better than GPT-4 alone. Here it outlines a synthesis for atorvastatin complete with steps, an ingredient list, cost, and suppliers. We implement this with @LangChainAI (great library!)
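The core pattern is a ReAct-style loop: the LLM emits thoughts and tool calls as text, the harness executes the named tool and feeds the observation back. Below is a minimal sketch of that loop; the tool stubs and the text protocol are hypothetical placeholders, not ChemCrow's actual code or LangChain's API.

```python
def price_lookup(smiles: str) -> str:
    return "supplier/price placeholder"   # stand-in for a commercial-availability tool

def safety_check(smiles: str) -> str:
    return "safety summary placeholder"   # stand-in for a safety tool

TOOLS = {"price_lookup": price_lookup, "safety_check": safety_check}

def run_agent(llm, question: str, max_steps: int = 10) -> str:
    """llm is any callable mapping the transcript so far to the model's next step."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action:" in step:
            # Expected format: "Action: tool_name[input]"
            call = step.split("Action:", 1)[1].strip()
            name, _, arg = call.partition("[")
            observation = TOOLS[name.strip()](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return "No final answer within the step budget."
```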
Apr 3, 2023 4 tweets 2 min read
I've been exploring if GPT-4 and other models (please give me a key @AnthropicAI!!) can do "algebra" of molecules. Let's see a few examples 1/4

demo: whitead.github.io/svelte-chem-al…

First - "mutate." Basically, create similar molecules from the given molecule. This is interesting for modifying compounds in design or XAI - building out local chemical spaces. 2/4
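As a rough illustration of what a "mutate" call can look like (a hypothetical sketch, not the demo's actual code; it assumes the openai>=1.0 Python client and an API key):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def mutate(smiles: str, n: int = 5) -> list[str]:
    """Ask the model for n small modifications of a molecule, returned as SMILES."""
    prompt = (
        f"Propose {n} molecules that are small modifications of {smiles} "
        "(change one atom or one functional group). Reply with one SMILES per line, no prose."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]

print(mutate("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin -> nearby local chemical space
```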
Mar 16, 2023 10 tweets 4 min read
Can GPT-4 do drug discovery? No, but it can help. Let's walk through GPT-4 proposing new drugs. This is called knowledge-based screening. We're trying to fill a list of plausible compounds that could lead to new drugs based on research papers. 1/n

This is one small step in drug discovery. There are many others! The compounds GPT-4 proposes have to be made and tested, and even then they are just the start of a path toward a new drug. Let's do a new example for psoriasis by targeting a known protein, TYK2. Here is the prompt. 2/n
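A practical note on that pipeline: whatever the model proposes should at minimum be parseable chemistry before anyone spends time on it. A small hedged sketch of that sanity check with RDKit (the SMILES list is illustrative, not GPT-4's actual TYK2 output):

```python
from rdkit import Chem

proposed = ["CC(=O)Nc1ccc(O)cc1", "not_a_molecule", "c1ccc2[nH]ccc2c1"]  # illustrative model output
valid = [s for s in proposed if Chem.MolFromSmiles(s) is not None]       # drop unparseable strings
print(valid)  # only real, parseable molecules move on to triage and testing
```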
Feb 10, 2023 6 tweets 3 min read
My research group's @LangChainAI hackathon projects 🙌 Great job to all of them, and I hope anyone reading this gets a glimpse into the future of chemistry. These were done in 1 week. 1/5

The first, from @GWellawatte: the input is a protein structure PDB ID and a question about the protein, and the output is a cited answer about it. It works by downloading the papers affiliated with that ID from the PDB. 2/5
Feb 10, 2023 5 tweets 2 min read
OK, kids are in bed. Time to learn @Gradio. Wish me luck!

Some notes: I wish there was a default to have Python error messages show up somewhere (maybe a standard component)
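One workaround for the wish above (a hedged sketch, assuming the gradio package): wrap the callback so any exception's traceback is returned into the output component instead of disappearing.

```python
import traceback
import gradio as gr

def show_errors(fn):
    def wrapped(*args):
        try:
            return fn(*args)
        except Exception:
            return traceback.format_exc()  # surface the Python error in the UI
    return wrapped

def predict(text: str) -> str:
    return text[::-1]  # placeholder for a real model call

gr.Interface(fn=show_errors(predict), inputs="text", outputs="text").launch()
```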
Feb 2, 2023 6 tweets 2 min read
I just paid $60 to embed the text of the entire Lord of the Rings trilogy so I could have GPT answer a question I've wondered about all evening: do the people of Middle Earth poop? 1/6

I did this using @gpt_index and @LangChainAI to bring up all the relevant passages from the books and combine them into a chain of prompts answered by GPT-3.5. There were not many relevant passages to work with, so the model had some trouble. 2/6
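For anyone wanting to reproduce the pattern, the sketch below is a generic retrieve-then-read pipeline (embed passages, pull the most similar ones, stuff them into one prompt), not the exact @gpt_index/@LangChainAI code used here. It assumes the openai>=1.0 Python client.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def answer(question: str, passages: list[str], k: int = 5) -> str:
    """Retrieve the k passages most similar to the question and answer from them."""
    p_vecs, q_vec = embed(passages), embed([question])[0]
    sims = p_vecs @ q_vec / (np.linalg.norm(p_vecs, axis=1) * np.linalg.norm(q_vec))
    context = "\n---\n".join(passages[i] for i in np.argsort(sims)[-k:])
    prompt = f"Using only these passages:\n{context}\n\nAnswer the question: {question}"
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content
```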
Aug 7, 2022 6 tweets 3 min read
New preprint on pre-trained models for Bayesian optimization (BO) of sequences! We show LLMs trained on protein seqs can replace Gaussian processes in BO. Examples: BO of peptide inhibitors with AlphaFold and iterative design of proteins. 1/6
biorxiv.org/content/10.110…

We wanted to combine the few-shot capabilities of pre-trained models with BO. We found that deep ensembles can give LLMs uncertainty and that the reparameterization trick enables gradients on sequences. This combines the explore/exploit of BO with the accuracy of LLMs. 2/6
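A minimal sketch of the acquisition step, assuming you already have per-ensemble-member predictions for a pool of candidate sequences (the reparameterization trick and the AlphaFold scoring from the paper are not reproduced here): the ensemble mean serves as the surrogate prediction, the spread across members serves as the uncertainty, and mean + beta * std gives an upper-confidence-bound acquisition.

```python
import numpy as np

def ucb_select(ensemble_preds: np.ndarray, beta: float = 2.0) -> int:
    """Pick the index of the next candidate sequence to test.

    ensemble_preds has shape (n_models, n_candidates): each row holds one
    deep-ensemble member's predicted label (e.g. binding score) per candidate.
    """
    mean = ensemble_preds.mean(axis=0)   # surrogate prediction
    std = ensemble_preds.std(axis=0)     # disagreement between members = uncertainty
    return int(np.argmax(mean + beta * std))

# Toy usage: 4 ensemble members scoring 3 candidate sequences.
preds = np.array([[0.1, 0.7, 0.4],
                  [0.2, 0.6, 0.9],
                  [0.1, 0.8, 0.2],
                  [0.2, 0.7, 0.5]])
print(ucb_select(preds))  # balances high predicted value against high disagreement
```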
Apr 13, 2022 10 tweets 3 min read
I've put together a few of my favorite discussions on the details of doing molecular dynamics. I'll add more as they come. Hopefully they're useful to you! 🧵1/n

2/n A discussion about assessing uncertainty in metadynamics
Aug 31, 2021 7 tweets 5 min read
1/6 For the last few months @glenhocky and I have been asking what large language models (LLMs) can do for chemistry. In our new preprint, we show LLMs know a bit of chemistry and can do a lot, like compute the dissociation curve of H2.
arxiv.org/abs/2108.13360

2/6 LLMs that can generate code have reached accuracy that makes them usable in research. In their training, they picked up knowledge of chemistry. If you ask @OpenAI's Codex to draw caffeine, it knows both how to draw a molecule and the structure of caffeine.
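To make the H2 claim concrete, a dissociation-curve calculation of the kind such models generate looks like this (a hedged sketch assuming the pyscf package; the paper's actual model-generated code may differ): scan the H-H distance and compute the Hartree-Fock energy at each point.

```python
import numpy as np
from pyscf import gto, scf

bond_lengths = np.linspace(0.4, 3.0, 14)  # H-H separation in Angstrom
energies = []
for d in bond_lengths:
    mol = gto.M(atom=f"H 0 0 0; H 0 0 {d}", basis="sto-3g")
    energies.append(scf.RHF(mol).kernel())  # restricted Hartree-Fock total energy (Hartree)

for d, e in zip(bond_lengths, energies):
    print(f"{d:.2f} A   {e:.5f} Ha")
```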