Excited to announce PoET, our (@timt1630, @tbepler1) retrieval-augmented generative protein language model that achieves state-of-the-art unsupervised variant function prediction performance on #ProteinGym. #MachineLearning #ProteinML 1/9
Inspired by evolution, PoET conditions on observed protein sequences to infer fitness constraints and extrapolate a generative distribution over protein sequences. This lets PoET focus on any level of homology, from superfamilies to families to subfamilies and beyond. 2/9
The key idea in the design of PoET was to create a transformer that conditions on homologous sequences but does not require aligned inputs. Our solution was to model whole protein families as a sequence-of-sequences generation problem. 3/9
Because the order of sequences in a family is arbitrary, we developed a unique transformer layer that attends efficiently to ordered residues within each sequence while treating the sequences themselves as an unordered set. 4/9
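A minimal sketch of the "ordered within a sequence, unordered across sequences" idea (not the actual PoET tiered layer; the helper names and the separator-free packing here are illustrative assumptions): restart relative positions at every sequence boundary and restrict causal attention to tokens of the same sequence, so nothing in the positional signal encodes which homolog came first.

```python
import torch

def build_position_ids(seq_lengths):
    # Relative positions restart at 0 for every sequence in the prompt, so the
    # positional signal never says where a homolog sits in the concatenation.
    return torch.cat([torch.arange(n) for n in seq_lengths])

def within_sequence_causal_mask(seq_lengths):
    # Block-diagonal causal mask: a token attends only to earlier tokens of the
    # SAME sequence; a separate, order-free attention over sequences would sit
    # on top of this in a full model.
    total = sum(seq_lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in seq_lengths:
        mask[start:start + n, start:start + n] = torch.tril(
            torch.ones(n, n, dtype=torch.bool)
        )
        start += n
    return mask

# Three homologs of lengths 5, 3, and 4 packed into one prompt.
print(build_position_ids([5, 3, 4]).tolist())        # [0,1,2,3,4, 0,1,2, 0,1,2,3]
print(within_sequence_causal_mask([5, 3, 4]).shape)  # torch.Size([12, 12])
```

Because positions are only meaningful within a sequence, adding more homologs to the prompt never pushes the model outside the positional range it saw in training, which is one way to see why longer contexts remain in-distribution.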
This was also critical for enabling PoET to extrapolate to context lengths well beyond what we used during training. A PoET model trained with 8k context tokens easily generalizes to 64k context lengths and beyond. 5/9
As a retrieval-augmented language model, PoET is not limited to its training data: it can learn from sequences in any database without retraining. I’m really excited to see what’s possible with creative prompt/context engineering! 6/9
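A hedged sketch of what prompt/context engineering could look like for a retrieval-augmented protein LM: retrieve homologs with whatever search tool and homology cutoff you like, then pack them into a single conditioning prompt. The function name, the `$` separator, and the token budget below are assumptions for illustration, not PoET's actual preprocessing.

```python
def make_context(homologs, max_tokens=8192, sep="$"):
    # Greedily pack retrieved homologs into one conditioning prompt; swapping the
    # retrieval set (superfamily vs. subfamily hits, different databases) changes
    # what the model is conditioned on without any retraining.
    context, used = [], 0
    for seq in homologs:
        if used + len(seq) + 1 > max_tokens:
            break
        context.append(seq)
        used += len(seq) + 1
    return sep.join(context) + sep

prompt = make_context(["MKTAYIAKQR", "MKSAYIAKQK", "MKTAFIAQQR"], max_tokens=64)
print(prompt)
```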
PoET generates high-diversity, high-fitness variants and is not limited to substitutions: it can generate and score indels as well! 7/9
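Illustrative only: with a conditional sequence model, a natural fitness proxy is the variant's length-normalized log-likelihood given the family context, and this works identically for substitutions and indels because no alignment to a reference is needed. The uniform toy model below is a stand-in so the snippet runs; it is not the PoET API.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def uniform_log_prob(variant, context=None):
    # Toy stand-in for a conditional model: uniform over the 20 amino acids.
    return len(variant) * math.log(1.0 / len(AMINO_ACIDS))

def score_variant(log_prob_fn, context, variant):
    # Length-normalized conditional log-likelihood; substitutions and indels are
    # scored the same way because nothing has to be aligned.
    return log_prob_fn(variant, context=context) / len(variant)

wild_type = "MKTAYIAKQR"
deletion  = "MKTAYIAKR"   # indel: one residue removed
print(score_variant(uniform_log_prob, context=None, variant=wild_type))
print(score_variant(uniform_log_prob, context=None, variant=deletion))
```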
Here, we used #MachineLearning to design high-diversity antibody variants with orders of magnitude greater potency than could be found with conventional directed mutagenesis. #ProteinML #AntibodyEngineering
Our method uses protein embeddings and Bayesian ML to design optimized antibody variant libraries, and we compare directly with other methods in a head-to-head prospective design study. 2/4
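For intuition, a hedged sketch of one common instantiation of "embeddings + Bayesian ML" for library design: fit a Gaussian process on fixed protein embeddings of assayed variants, then select a library by an uncertainty-aware acquisition score. The data, kernel, and acquisition rule below are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 32))   # embeddings of assayed variants (toy data)
y_train = rng.normal(size=50)         # measured potency/binding (toy data)
X_pool  = rng.normal(size=(500, 32))  # embeddings of candidate variants (toy data)

# Gaussian process regression gives a posterior mean and uncertainty per candidate.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
gp.fit(X_train, y_train)
mean, std = gp.predict(X_pool, return_std=True)

# Upper-confidence-bound acquisition: favor candidates that are predicted potent
# and/or poorly explored, then take the top hits as a 96-variant library.
ucb = mean + 1.0 * std
library = np.argsort(-ucb)[:96]
print(library[:10])
```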
It was wonderful to collaborate with Lin Li, Rajmonda Caceres, Matt Walsh, and the rest of the @MITLL, @MIT, and @AAlphaBio teams on this project! 3/4