Really excited to announce that AntiBERTa is now published in @Patterns_CP! Here we describe a transformer model that demonstrates understanding of antibody sequences 🧵 (1/6)
We pre-train a transformer model based on RoBERTa, exclusively on full-length antibody/B-cell receptor sequences, using the masked language modelling (MLM) objective. FYI, other similar transformers include BioPhi (@prihodad), ABLang (@HegelundOlsen), AntiBERTy (@jeffruffolo) (2/6)
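For anyone unfamiliar with MLM, here's a minimal sketch of the masking step, assuming the standard BERT/RoBERTa recipe (pick ~15% of positions; of those, 80% become a mask token, 10% a random residue, 10% stay unchanged). The function name `mask_sequence`, the `[MASK]` string and the example fragment are illustrative assumptions, not AntiBERTa's actual tokenizer or settings.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "[MASK]"  # assumed mask token string, for illustration only


def mask_sequence(seq, mask_prob=0.15, rng=None):
    """Return (tokens, labels) for one MLM training example.

    labels[i] is the original residue where the model must predict it,
    and None where the position is not part of the loss.
    """
    rng = rng or random.Random()
    tokens, labels = [], []
    for aa in seq:
        if rng.random() < mask_prob:
            labels.append(aa)  # the model is asked to recover this residue
            roll = rng.random()
            if roll < 0.8:
                tokens.append(MASK)                     # 80%: mask token
            elif roll < 0.9:
                tokens.append(rng.choice(AMINO_ACIDS))  # 10%: random residue
            else:
                tokens.append(aa)                       # 10%: left unchanged
        else:
            labels.append(None)
            tokens.append(aa)
    return tokens, labels


# Illustrative heavy-chain-like fragment (an assumption, not from the paper)
example = "EVQLVESGGGLVQPGGSLRLSCAAS"
tokens, labels = mask_sequence(example, rng=random.Random(0))
```

The model only sees `tokens` and is trained to recover the residues stored in `labels`, which is why no V gene or B cell labels are needed at this stage.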
We show that the embeddings pick up nuanced features of BCR/antibody sequences. For example, V gene usage, mutational load, and remarkably, B cell provenance. This is all done in a zero-shot setting, i.e. none of these labels were provided during pre-training. (3/6)
Transformers are powered by self-attention and AntiBERTa is no exception. We see that the self-attention maps correlate broadly with positions of contact. While not perfect, AntiBERTa does seem to understand some pairwise dependencies (4/6)
Finally, we fine-tune the model for paratope prediction and show that it can achieve SOTA performance. This helps us think about novel ways in which we can investigate convergence in repertoire datasets, such as using Paratyping (@EveRichardson20). (5/6)
First, the metrics are RMSDs based on aligning the C/N/CA/CB atoms across the chain, then calculating the RMSD across a region, i.e. align every residue of the VH, then calculate RMSD across CDRH3, or CDRH1, etc. This is on ~35 antibodies of the ImmuneBuilder test set (2/5)
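To make the "align on the whole chain, score on a region" idea concrete, here's a rough sketch using a Kabsch superposition in NumPy. It assumes the predicted and reference coordinates are already matched atom-for-atom as N x 3 arrays; `kabsch`, `region_rmsd` and `region_mask` are hypothetical names for illustration, not the benchmark's actual code.

```python
import numpy as np


def kabsch(P, Q):
    """Rotation R and translation t that best superpose P onto Q (both N x 3)."""
    Pc, Qc = P.mean(axis=0), Q.mean(axis=0)
    H = (P - Pc).T @ (Q - Qc)                # 3 x 3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = Qc - R @ Pc
    return R, t


def region_rmsd(pred, ref, region_mask):
    """Superpose on ALL atoms (e.g. the whole VH), then score only a region.

    region_mask is a boolean array selecting the atoms of one region,
    e.g. CDRH3 (how the region is defined is up to the caller).
    """
    R, t = kabsch(pred, ref)
    aligned = pred @ R.T + t
    diff = aligned[region_mask] - ref[region_mask]
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))
```

Because the fit uses every residue in the chain, a loop that is well-shaped locally but sits in the wrong place relative to the framework still gets penalised, which a loop-only superposition would hide.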
ESMFold's CDRH3 accuracies are better than what I expected. Where it's let down is on the "canonical" CDRs. It would've been nice to compare the VH-VL orientations and talk about how ImmuneBuilder doesn't generate D-amino acids, etc. (3/5)
First, it's pretty crazy we even have antibody-specific tools, since #AlphaFold2, #ESMFold, #OmegaFold, all do a decent job at antibody modelling. However, antibody-specific tools have -some- feature that's necessary (e.g. being MSA-free) (2/6)
The demand is likely due to interest from pharma & biotech, but we don't have anywhere near the same level of interest for other polymorphic proteins like TCRs and MHCs (🤔). Regardless, with such interest, I think an antibody-specific CASP should be resurrected! (3/6)
Context: a single-chain Fv (scFv) is an antibody construct whose heavy and light chains are linked. It's not the conventional "Y" shape molecule, and is useful for engineering / phage display, etc. See @AlissaHummer's post blopig.com/blog/2021/07/a… (2/5)
Thermostability (measured by TS50, the temperature at which the scFv loses binding) is weakly predicted by 0-shot and fine-tuning via transformers (ESM-1v + ESM-1b). CNNs using sequence and structural (energy) convolutions perform better (?) [hard to tell, sorry!🙈] (3/5)
Predicting Ab-Ag interactions is a sub-problem of the protein-protein interaction problem. There are many facets to consider here, including, but not limited to, identifying the correct antigen (let alone the correct epitope), the correct paratope, orientation, etc (2/5)
@antibodymap's team show first that for true Ab-Ag pairs (i.e. those where we know the Ab binds the antigen) and false Ab-Ag pairs (i.e. an Ag was randomly assigned to an Ab), the pLDDT scores are not clearly separable, suggesting score-based discrimination is HARD. (3/5)