In two new papers we have found that the ESM2 language model generalizes beyond natural proteins, and enables programmable generation of complex and modular protein structures.
ESM2 learns the design principles of proteins. With @uwproteindesign we experimentally validated 152 ESM2 designs, including de novo generations outside the space of natural proteins (<20% sequence identity to known proteins).
We have trained ESMFold to predict full atomic protein structure directly from language model representations of a single sequence. Accuracy is competitive with AlphaFold on most proteins, with an order of magnitude faster inference. By @MetaAI Protein Team.
biorxiv.org/content/10.110…
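For readers who want to try single-sequence prediction themselves, here is a minimal sketch, assuming the publicly released fair-esm package (pip install "fair-esm[esmfold]") and its esmfold_v1 checkpoint; it is an illustration of the released interface, not the exact pipeline from the paper.

```python
import torch
import esm

# Load the released ESMFold model (checkpoint name from the fair-esm repo).
model = esm.pretrained.esmfold_v1()
model = model.eval()

# A single amino-acid sequence; no MSA or template search is needed.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR"

with torch.no_grad():
    # Returns the predicted atomic coordinates as a PDB-format string.
    pdb_string = model.infer_pdb(sequence)

with open("prediction.pdb", "w") as f:
    f.write(pdb_string)
```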
We train ESM2 language models from 8M up to 15B parameters. Improvements in language modeling perplexity and learning of structure continue through 15B. ESM2 at 150M parameters is better than ESM1b at 650M parameters.
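To make "language modeling perplexity" concrete, here is a sketch that scores one sequence with a masked pseudo-perplexity, assuming the fair-esm API and the released 150M-parameter checkpoint (esm2_t30_150M_UR50D); this is an illustrative metric, not the exact evaluation protocol used in the paper.

```python
import torch
import esm

# Load a pretrained ESM2 model and its tokenizer/alphabet.
model, alphabet = esm.pretrained.esm2_t30_150M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
_, _, tokens = batch_converter([("protein", seq)])

# Mask one position at a time and score the true residue under the model.
total_nll, n = 0.0, 0
with torch.no_grad():
    for i in range(1, tokens.shape[1] - 1):  # skip BOS and EOS tokens
        masked = tokens.clone()
        true_tok = masked[0, i].item()
        masked[0, i] = alphabet.mask_idx
        logits = model(masked)["logits"]
        logp = torch.log_softmax(logits[0, i], dim=-1)[true_tok]
        total_nll -= logp.item()
        n += 1

# Lower pseudo-perplexity means the model assigns higher likelihood to the sequence.
print("pseudo-perplexity:", torch.exp(torch.tensor(total_nll / n)).item())
```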
Dec 4, 2020 • 5 tweets • 1 min read
Very exciting results this week from AlphaFold in CASP14. An incredible and inspiring achievement by the DeepMind team. Many new possibilities.
The *attention* mechanism is key to the result. Interestingly, we find exactly the same in our work on *unsupervised* learning for proteins.
The idea in protein language modeling: learn biology directly from patterns in sequences from across evolution.
Protein language modeling is unsupervised, i.e. it learns from sequences, not structures. (AlphaFold learns from structures).
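A sketch of what this looks like in practice: a model pretrained only on sequences can be queried for residue-residue contacts derived from its attention maps. This assumes the fair-esm package and its esm1b_t33_650M_UR50D checkpoint; the return_contacts interface is the repo's documented way to read contacts off the attention.

```python
import torch
import esm

# Load a pretrained protein language model (trained on sequences only, no structures).
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
_, _, tokens = batch_converter([("protein", seq)])

with torch.no_grad():
    out = model(tokens, return_contacts=True)

# L x L matrix of predicted contact probabilities, derived from the attention maps.
contacts = out["contacts"][0]
print(contacts.shape)
```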
Sep 2, 2020 • 9 tweets • 3 min read
1/9 Today we’re excited to release Transformer models pre-trained on evolutionary-scale protein sequence data along with a major update to our preprint from last year:
Paper: biorxiv.org/content/10.110…
Models: github.com/facebookresear…
2/9 We added extensive new benchmarks for remote homology, secondary structure, long-range contacts, and mutational effects. Improvements to downstream models lead to SOTA features across multiple benchmarks.
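A sketch of how the pre-trained models are typically used for downstream benchmarks like these: extract per-residue representations and pool them into sequence-level features for a separate predictor. This assumes the fair-esm package; layer 33 is the final layer of the esm1b_t33_650M_UR50D model.

```python
import torch
import esm

# Load ESM-1b and extract representations to use as features for downstream models.
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

data = [("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])

# Final-layer embeddings: (batch, length, 1280). Mean-pool over residues
# (dropping BOS/EOS) to get one feature vector per sequence.
per_residue = out["representations"][33]
per_sequence = per_residue[:, 1:-1].mean(dim=1)
print(per_sequence.shape)
```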