We have trained ESMFold to predict full atomic protein structure directly from language model representations of a single sequence. Accuracy is competitive with AlphaFold on most proteins, with an order of magnitude faster inference. By @MetaAI Protein Team.
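A minimal sketch of that single-sequence inference path, assuming the public fair-esm package and its esmfold_v1 checkpoint; the sequence below is only a placeholder:

```python
# A minimal sketch of single-sequence folding with ESMFold, assuming the
# public fair-esm package (pip install "fair-esm[esmfold]") and a CUDA GPU.
import torch
import esm

model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

# Placeholder amino-acid sequence; substitute any protein of interest.
sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"

with torch.no_grad():
    pdb_str = model.infer_pdb(sequence)

with open("prediction.pdb", "w") as f:
    f.write(pdb_str)
```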
We train ESM2 language models from 8M up to 15B parameters. Improvements in language modeling perplexity and learning of structure continue through 15B. ESM2 at 150M parameters is better than ESM1b at 650M parameters.
As ESM2 processes a protein sequence, a picture of the protein’s structure materializes in its internal states, enabling atomic-resolution prediction of the 3D structure even though the language model was trained only on sequences.
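To make that concrete, here is a sketch of pulling those internal states out of ESM2 with the fair-esm package; the checkpoint and example sequence are illustrative (released checkpoints range from esm2_t6_8M_UR50D up to esm2_t48_15B_UR50D):

```python
# A minimal sketch, assuming the fair-esm package: extract per-residue
# representations and the unsupervised contact map from an ESM2 checkpoint.
import torch
import esm

# The 650M model is used here as a middle ground between 8M and 15B.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    results = model(tokens, repr_layers=[33], return_contacts=True)

per_residue_repr = results["representations"][33]  # [batch, seq_len, hidden]
contacts = results["contacts"]                     # predicted residue-residue contacts
```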
There are billions of protein sequences with unknown structure and function, many from metagenomic sequencing. ESMFold makes it feasible to map this structural space in practical timescales. We were able to fold a random sample of 1M metagenomic sequences in a few hours.
A large fraction of the predictions are high confidence and differ from any known experimental structure. Many have sequences without matches in annotated sequence databases. We think ESMFold can help us understand regions of protein space that are distant from existing knowledge.
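A hedged sketch of that kind of large-scale screen, assuming the fair-esm ESMFold API and biotite for reading the pLDDT that ESMFold writes into the PDB B-factor column; the sequence names, file paths, and the 70 pLDDT cutoff are illustrative assumptions, not the pipeline used for the 1M-sequence run:

```python
# Fold each sequence, read its mean pLDDT from the B-factor column of the
# output PDB, and keep high-confidence predictions. Assumes fair-esm and biotite.
import torch
import esm
import biotite.structure.io as bsio

model = esm.pretrained.esmfold_v1().eval().cuda()

sequences = {
    "metagenomic_seq_001": "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG",
    # ... millions more in the real setting
}

high_confidence = {}
for name, seq in sequences.items():
    with torch.no_grad():
        pdb_str = model.infer_pdb(seq)
    path = f"{name}.pdb"
    with open(path, "w") as f:
        f.write(pdb_str)
    struct = bsio.load_structure(path, extra_fields=["b_factor"])
    mean_plddt = struct.b_factor.mean()  # ESMFold stores pLDDT (0-100) in the B-factor field
    if mean_plddt > 70:                  # illustrative confidence cutoff
        high_confidence[name] = mean_plddt
```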
In two new papers we have found that the ESM2 language model generalizes beyond natural proteins, and enables programmable generation of complex and modular protein structures.
ESM2 learns the design principles of proteins. With @uwproteindesign we experimentally validated 152 ESM2 designs, including de novo generations outside the space of natural proteins (<20% sequence identity to known proteins).
We implemented a high-level programming language for generative protein design with ESM2. This made it possible to program the generation of large proteins and complexes with intricate and modular structures.
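The programming language itself is described in the paper; as a generic illustration of the underlying idea of generating sequences from ESM2 (not that system), here is a sketch of iterative masked sampling using the fair-esm API:

```python
# A generic illustration of language-model-guided sequence generation with
# ESM2 via iterative masked sampling. This is NOT the high-level programming
# language described above; it only sketches sampling from the model's
# learned sequence distribution.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()

# Indices of the 20 standard amino acids; sampling is restricted to these.
aa_idx = torch.tensor([alphabet.get_idx(a) for a in "ACDEFGHIKLMNPQRSTVWY"])

# Start from an all-mask "canvas" of the desired length (60 is an arbitrary choice).
length = 60
tokens = torch.full((1, length + 2), alphabet.mask_idx, dtype=torch.long)
tokens[0, 0] = alphabet.cls_idx
tokens[0, -1] = alphabet.eos_idx

with torch.no_grad():
    # Reveal positions one at a time in random order, sampling each residue
    # from the model's conditional distribution given what is already placed.
    for pos in (torch.randperm(length) + 1).tolist():
        logits = model(tokens)["logits"][0, pos]
        probs = torch.softmax(logits[aa_idx], dim=-1)
        choice = torch.multinomial(probs, 1).item()
        tokens[0, pos] = aa_idx[choice]

designed = "".join(alphabet.get_tok(i) for i in tokens[0, 1:-1].tolist())
print(designed)
```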
1/9 Today we’re excited to release Transformer models pre-trained on evolutionary-scale protein sequence data, along with a major update to our preprint from last year.
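A minimal sketch of loading one of the released pre-trained models, assuming the torch.hub entry points in the facebookresearch/esm repository (the same models are installable via pip as the fair-esm package):

```python
# Load a released pre-trained protein language model and its alphabet
# through torch.hub; the checkpoint name here is one of the published entries.
import torch

model, alphabet = torch.hub.load("facebookresearch/esm:main", "esm1b_t33_650M_UR50D")
model.eval()
```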
2/9 We added extensive new benchmarks for remote homology, secondary structure, long-range contacts, and mutational effect prediction. Improvements to the downstream models make these features SOTA across multiple benchmarks.
3/9 There are two larger questions we’re interested in answering: (1) can language models learn biology from sequences, and (2) are there favorable scaling laws for data and model parameters, similar to those observed in NLP? In new work we find support for both.