Sergey Ovchinnikov 🇺🇦
Oct 24, 2021 · 12 tweets
End-to-end learning of multiple sequence alignments with differentiable Smith-Waterman
biorxiv.org/content/10.110…
A fun collaboration with Samantha Petti, Nicholas Bhattacharya, @proteinrosh, @JustasDauparas, @countablyfinite, @keitokiddo, @srush_nlp & @pkoo562 (1/8)
Many methods, like GREMLIN, MSA Transformer, RoseTTAFold and AlphaFold, rely on an input MSA generated by non-differentiable methods. (2/8)
We ask the question: what if we make the red arrow (the alignment step) differentiable and optimize end-to-end? (3/8)
To accomplish this, we implement a differentiable alignment module (LAM). More specifically, a vectorized/striped Smith-Waterman in #JAX that is extremely fast. (4/8)

Special thanks to @jakevdp for #JAX help! 😎
github.com/google/jax/dis…
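
For intuition, here is a minimal (unvectorized) sketch of the smooth Smith-Waterman idea in JAX: replacing every max with a temperature-scaled logsumexp makes the alignment score differentiable, and its gradient with respect to the similarity matrix is a soft alignment. Function and parameter names here are illustrative, not the paper's API; the paper's version is the fast vectorized one described above.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def smooth_sw(sim, gap=-1.0, temp=1.0):
    """Smooth local-alignment score for a similarity matrix sim (L1 x L2).
    Every max() in the usual recursion becomes temp * logsumexp(x / temp),
    so the score is differentiable in sim."""
    L1, L2 = sim.shape
    smax = lambda x: temp * logsumexp(x / temp)
    H = jnp.zeros((L1 + 1, L2 + 1))  # H[i, j]: best score ending at (i, j)
    for i in range(1, L1 + 1):       # plain loops for clarity; the paper
        for j in range(1, L2 + 1):   # vectorizes this over diagonals
            H = H.at[i, j].set(smax(jnp.array([
                0.0,                                   # restart (local alignment)
                H[i - 1, j - 1] + sim[i - 1, j - 1],   # match/mismatch
                H[i - 1, j] + gap,                     # gap in one sequence
                H[i, j - 1] + gap])))                  # gap in the other
    return smax(H.ravel())           # smooth max over all cells

# The gradient w.r.t. sim is a soft alignment matrix that downstream
# modules can consume.
sim = jnp.zeros((5, 7))
soft_aln = jax.grad(smooth_sw)(sim)
```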
Given that AlphaFold and LAM are both conveniently implemented in #JAX, as a proof of concept we backprop through AlphaFold and LAM to maximize the confidence metrics (pLDDT and pAE). (5/8)
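
Schematically, the proof of concept is just gradient ascent on the confidence score through the composed pipeline. A hedged sketch, with toy stand-ins for LAM and AlphaFold (these are not the real APIs):

```python
import jax
import jax.numpy as jnp

# Toy stand-ins so the sketch runs; in the paper these slots are filled by
# the LAM module and the JAX implementation of AlphaFold.
def smooth_align(params):          # stand-in for the differentiable aligner
    return jax.nn.softmax(params, axis=-1)

def mean_confidence(msa):          # stand-in for mean pLDDT (or negative pAE)
    return jnp.mean(msa ** 2)

def neg_confidence(params):
    return -mean_confidence(smooth_align(params))

params = jnp.zeros((16, 64, 21))   # hypothetical alignment parameters
grad_fn = jax.jit(jax.grad(neg_confidence))
for _ in range(100):               # gradient ascent on confidence
    params = params - 0.1 * grad_fn(params)
```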
Maximizing pLDDT (and potentially "learning" a more optimal MSA) often improves structure prediction accuracy relative to our initial input MSAs. (6/8)
LAM also allows us to convert GREMLIN into SMURF (Smooth Markov Unaligned Random Field), which simultaneously learns an MSA, coevolution, and conservation for a given RNA/protein family. (7/8)
Learning the MSA + coevolution end-to-end matches, and sometimes exceeds, the performance of precomputed MSAs on proteins and RNA for the task of contact prediction. (8/8)
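
To unpack tweet 7/8: the coevolution/conservation piece is a GREMLIN-style Markov random field. Below is a minimal sketch of its pseudolikelihood on an already-aligned MSA; SMURF's contribution is replacing that fixed alignment with the smooth, learned one. Shapes and names are illustrative, not the paper's code.

```python
import jax
import jax.numpy as jnp

def mrf_pseudolikelihood(msa, v, w):
    """msa: (N, L, A) one-hot aligned sequences.
    v: (L, A) conservation terms; w: (L, A, L, A) coevolution couplings."""
    w = 0.5 * (w + w.transpose(2, 3, 0, 1))       # symmetrize couplings
    L = msa.shape[1]
    w = w * (1.0 - jnp.eye(L))[:, None, :, None]  # no self-couplings
    # logits[n, i, a] = v[i, a] + sum_{j, b} w[i, a, j, b] * msa[n, j, b]
    logits = v + jnp.einsum('iajb,njb->nia', w, msa)
    logp = jax.nn.log_softmax(logits, axis=-1)
    return jnp.sum(msa * logp)                    # maximize (plus L2 on w)
```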
We'll make the code public in a day or two. The owner of our shared GitHub account is currently traveling. 😂
The source code is now public: 👀
github.com/spetti/SMURF
Note: We are not the first to implement a differentiable alignment module in bio.

Previous Implementations:
bmcbioinformatics.biomedcentral.com/articles/10.11…

Most recent iterations:
pytorch-struct:
github.com/harvardnlp/pyt…

deepblast (CUDA):
biorxiv.org/content/10.110…

Julia:
live.juliacon.org/talk/QB8EC8
Oops! Thanks to @thesteinegger for pointing out that we had actually implemented an "anti-diagonal", not a "striped", vectorization of Smith-Waterman.

First described by Wozniak et al., "Using video-oriented instructions to speed up sequence comparison" (1997).

bmcbioinformatics.biomedcentral.com/articles/10.11…
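
To illustrate the anti-diagonal idea: every cell (i, j) of the dynamic-programming table depends only on cells with smaller i + j, so all cells on the same anti-diagonal i + j = k can be updated in one vectorized step. A rough hard-max sketch (the smooth version swaps maximum for logsumexp); variable names are illustrative:

```python
import jax.numpy as jnp

def sw_antidiagonal(sim, gap=-1.0):
    """Smith-Waterman filled one anti-diagonal at a time: each update
    below touches every cell with i + j == k at once."""
    L1, L2 = sim.shape
    i = jnp.arange(L1 + 1)                 # row index within a diagonal
    prev2 = jnp.zeros(L1 + 1)              # H on diagonal k - 2
    prev1 = jnp.zeros(L1 + 1)              # H on diagonal k - 1
    best = 0.0
    for k in range(2, L1 + L2 + 1):        # diagonal index k = i + j
        j = k - i
        valid = (i >= 1) & (i <= L1) & (j >= 1) & (j <= L2)
        s = sim[jnp.clip(i - 1, 0, L1 - 1), jnp.clip(j - 1, 0, L2 - 1)]
        match = jnp.roll(prev2, 1) + s     # from H[i-1, j-1]
        up    = jnp.roll(prev1, 1) + gap   # from H[i-1, j]
        left  = prev1 + gap                # from H[i, j-1]
        new = jnp.maximum(0.0, jnp.maximum(match, jnp.maximum(up, left)))
        new = jnp.where(valid, new, 0.0)   # boundary rows/cols stay 0
        best = jnp.maximum(best, new.max())
        prev2, prev1 = prev1, new
    return best
```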


More from @sokrypton

Jun 30
Weekend project: comparing ESM3 from @EvoscaleAI to ESM2 and inv_cov. The ultimate test of a protein language model is how well the pairwise dependencies it learns correlate with structure. (1/8)
Traditional methods approximate this signal by taking a multiple sequence alignment of a protein family and computing the inverse covariance matrix. For pLMs, we extract it by computing a Jacobian over the sequence track (for ESM3, the structure track is masked). (2/8)
Each dot is a different protein family; I'm reporting contact accuracy for each, comparing invcov(msa) to cat_jac(esm(seq)). ESM3 is doing significantly better at this task! (3/8)
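
A hedged sketch of the "Jacobian over the sequence track" idea (the toy model and all names below are illustrative; the real use plugs in ESM): differentiate every output logit with respect to every input position/letter, then reduce the (L, A, L, A) tensor to an L x L dependency map with a norm plus the average-product correction (APC).

```python
import jax
import jax.numpy as jnp

def categorical_jacobian_map(model, x):
    """x: (L, A) one-hot sequence; model: one-hot -> (L, A) logits."""
    J = jax.jacfwd(model)(x)                       # (L, A, L, A)
    F = jnp.sqrt(jnp.sum(J ** 2, axis=(1, 3)))     # Frobenius norm per (i, j)
    F = 0.5 * (F + F.T)                            # symmetrize
    apc = F.mean(0, keepdims=True) * F.mean(1, keepdims=True) / F.mean()
    return F - apc                                 # average-product correction

# Toy linear "pLM" so the sketch runs end to end.
L, A = 10, 21
W = jax.random.normal(jax.random.PRNGKey(0), (L, A, L, A))
model = lambda x: jnp.einsum('iajb,jb->ia', W, x)
x = jax.nn.one_hot(jnp.zeros(L, dtype=jnp.int32), A)
cmap = categorical_jacobian_map(model, x)          # (L, L) dependency map
```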
Jun 16
Towards the end of the presentation, I went down a bit of a rabbit hole trying to demonstrate that AF3 may still be learning to invert the covariance matrix, which is needed to extract the coevolution signal from the input multiple sequence alignment (MSA). (1/9)
For context, traditional methods like GREMLIN extract coevolution from the input MSA. If you assume the data is non-categorical, you can approximate the coevolution signal via the inverse covariance matrix. (2/9) arxiv.org/abs/1906.02598
The inverse can be computed by downweighting the largest eigenvectors by 1/eigenvalue.

Fun fact: the L2 regularization weight (aka shrinkage) in the previous slide is used as a pseudo-count to avoid dividing by zero: 1/(eigenvalue + l2reg)

(3/9)
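
In code form, the point in (3/9) amounts to the following (a minimal sketch; variable names are mine):

```python
import jax.numpy as jnp

def inv_cov(X, l2reg=0.1):
    """X: (N, D) mean-centered, one-hot-encoded MSA features."""
    C = X.T @ X / X.shape[0]                 # covariance matrix
    eigval, eigvec = jnp.linalg.eigh(C)
    # Downweight each eigenvector by 1/(eigenvalue + l2reg); the shrinkage
    # term acts as a pseudo-count so near-zero eigenvalues don't blow up.
    return (eigvec / (eigval + l2reg)) @ eigvec.T
```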
Nov 27, 2023
A recent preprint from @Lauren_L_Porter shows that it's sometimes possible to sample the alternative conformation of metamorphic proteins by removing the MSA. Though I think this is a very interesting observation, I disagree that coevolution is not used when it is provided. (1/9) https://www.biorxiv.org/content/10.1101/2023.11.21.567977v2
We believe AlphaFold has learned some approximation of an "energy function" and a limited ability to explore. But this is often not enough to find the correct conformation, and often an MSA is required to reduce the search space. (2/9) https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.129.238101
For small single-domain monomeric proteins (that were in the training set), we see that AlphaFold often fails to predict from a single sequence. Adding extra information (such as conservation [PSSM or MSA, but with coevolution ablated via column shuffling]) helps. (3/9)
Feb 27, 2023
Puzzle: The residue index encodes the position embedding for models like AlphaFold. This residue index is converted into an offset matrix. (1/3)
What do you think will happen if a modified offset matrix is used instead? [answer will be posted later] (2/3)
The N and C terminus form a peptide bond! (3/3)🤓
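
A hedged sketch of what's going on (my own variable names; the exact modification shown in the tweet's image may differ): making the offset matrix cyclic means positions 0 and L-1 look like immediate neighbors, so the network treats the termini as bonded.

```python
import jax.numpy as jnp

def offset_matrix(L):
    idx = jnp.arange(L)
    return idx[:, None] - idx[None, :]       # standard relative offsets

def cyclic_offset_matrix(L):
    # wrap offsets around a circle of length L
    return (offset_matrix(L) + L // 2) % L - L // 2

# offset_matrix(5)[4, 0] == 4, but cyclic_offset_matrix(5)[4, 0] == -1:
# the C terminus appears one residue before the N terminus, as if bonded.
```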
Feb 26, 2023
AlphaFold inverted to hallucinate de novo proteins of up to 600 amino acids in length 🤯

(the animation attached to the original tweet shows the designed protein docked into cryo-EM density)

Exciting work with:
@chrisfrank662, @AKhoshouei, Yosta de Stigter, Dominik Schiewitz, @ShihaoFeng18, @hendrik_dietz
Jan 30, 2023
We've been working on adding AlphaFold v2.3.1 support to ColabFold. 😎 Here is the notebook for those interested in testing: colab.research.google.com/github/sokrypt… (1/5)
The major update is AlphaFold_multimer_v3, an updated multimer model from @DeepMind. Initial tests from @milot_mirdita show an improvement over v2, though it's unclear whether the improvements come from the new params or the protocol (run for more recycles, with early stopping). (2/5)
If you have a GPU with bfloat16 support (any of the A* series), the model (for both pTM and multimer) was updated to use bfloat16 and fused triangle attention. This should significantly reduce GPU memory requirements, allowing inference of larger proteins/complexes. (3/5)
