#Alphafold by #deepmind used solid interdisciplinary intuitions for algorithm/model design. It wasn't just a rinse-and-repeat machine learning exercise. Details on methods are limited, but here's my best interpretation (+some predictions) so far: [1/n]
Protein sequence databases provide us with samples that have de facto passed the fitness test of evolution and are information-rich. "Genetics search" is a retrieval step to find nearest neighbors, as defined by sequence alignment. Why do we need nearest neighbors (NNs), you ask?
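[Aside] To make the retrieval idea concrete, here's a toy sketch in Python. Real pipelines search massive databases with profile-based tools (HHblits, jackhmmer, MMseqs2), not the naive identity scoring below; the helper names here are made up for illustration.

```python
# Toy sketch of "genetics search": rank database sequences by similarity
# to the query and keep the nearest neighbors. identity_score and
# retrieve_neighbors are hypothetical stand-ins for real profile-search
# tools, which work on alignments rather than raw identity.

def identity_score(a: str, b: str) -> float:
    """Fraction of identical positions (assumes equal-length, pre-aligned)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def retrieve_neighbors(query: str, database: list[str], k: int = 3) -> list[str]:
    """Return the k database sequences most similar to the query."""
    return sorted(database, key=lambda s: identity_score(query, s), reverse=True)[:k]

database = ["MKTAYIAK", "MKTAHIAK", "MQTAYLAK", "GGGGGGGG"]
query = "MKTAYIAK"
msa = [query] + retrieve_neighbors(query, database)  # the MSA used downstream
```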
There's a neat principle/intuition called coevolution that can help explain. The pattern of correlated mutations observed across related sequences can give clues to protein structure and function. Read more here: gremlin.bakerlab.org/gremlin_faq.php
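[Aside] A minimal way to see the coevolution signal: mutual information between two columns of an MSA. High MI means the positions tend to mutate together, a classic hint that they touch in 3D. (GREMLIN itself fits a Potts/Markov random field model, which corrects for indirect couplings; MI is just the simplest version of the intuition.)

```python
import math
from collections import Counter

def mutual_information(msa: list[str], i: int, j: int) -> float:
    """MI between alignment columns i and j; high values suggest the two
    positions co-mutate, hinting they may be in contact in 3D."""
    n = len(msa)
    col_i = [s[i] for s in msa]
    col_j = [s[j] for s in msa]
    p_i = {a: c / n for a, c in Counter(col_i).items()}
    p_j = {a: c / n for a, c in Counter(col_j).items()}
    p_ij = {ab: c / n for ab, c in Counter(zip(col_i, col_j)).items()}
    return sum(p * math.log(p / (p_i[a] * p_j[b])) for (a, b), p in p_ij.items())

msa = ["MKTAYIAK", "MKTAHIAK", "MQTAYLAK", "MKTAYIAK"]
print(mutual_information(msa, 1, 4))  # do columns 1 and 4 covary?
```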
The retrieval step is critical to the success of #alphafold, and it relies on decades of scientific advances in cost-effective protein sequencing, curation of protein databases (including metagenomics in BFD), and efficient search software [such as that developed by @thesteinegger]
Next up, the "embed" step. Essentially, we need to transform all the protein sequences into vectors in a useful embedding space. #deepmind hasn't provided any details. But it's worth mentioning as an aside that this is extensively studied in ML, and more recently in protein ML too
[Further tangent] you can use a transformer-based representation for embedding protein sequences. Intuitions from self-supervised learning (eg masked/autoregressive language modeling) on large-scale raw databases are used to do this well. #alphafold may or may not have used this
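[Still on the tangent] Here's a miniature of what masked language modeling on protein sequences looks like, in PyTorch. To be clear: #deepmind hasn't confirmed any of this, and every module and size below is a made-up stand-in.

```python
import torch
import torch.nn as nn

VOCAB = 21        # 20 amino acids + a [MASK] token
MASK_ID = 20
D_MODEL = 64

# Minimal masked-LM setup: embed tokens, run a transformer encoder,
# and predict the identity of masked residues from context.
embed = nn.Embedding(VOCAB, D_MODEL)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(D_MODEL, VOCAB)

tokens = torch.randint(0, 20, (8, 50))          # batch of 8 sequences, length 50
mask = torch.rand(tokens.shape) < 0.15          # hide ~15% of positions
inputs = tokens.masked_fill(mask, MASK_ID)

logits = head(encoder(embed(inputs)))           # (8, 50, VOCAB)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()
# After training, encoder(embed(...)) outputs serve as per-residue embeddings.
```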
Now comes the more novel and unexplained work... learning the "sequence-residue edges" and "residue-residue edges". FYI, a residue refers to the building block (eg token) of proteins. There are a couple of intuitions and inductive biases that were broadly involved:
An attention-based technique was used, which has shown promise across ML in language/vision/etc. This allows for efficient learning (ie capturing relations between elements) and uncovering broader principles: lilianweng.github.io/lil-log/2018/0…
Also, architecturally, the model is allowed to attend to the related sequences (the MSA) found in "genetics search". In an iterative fashion, residue-residue interactions/edges are updated with the residue information learned across evolutionarily related sequences.
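Here's a rough sketch of how I picture one round of that iterative update (attend within the MSA, then pool covariation into the residue-residue edges). Emphasis on *picture*: the real architecture hasn't been published, so every layer below is my guess.

```python
import torch
import torch.nn as nn

N_SEQ, L, D, D_PAIR = 16, 50, 64, 32   # MSA depth, length, channel sizes

# One hypothetical round of the update: attend within the MSA, then
# project residue-residue statistics into the pair representation.
row_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
to_pair = nn.Linear(D * D, D_PAIR)

msa = torch.randn(N_SEQ, L, D)          # per-sequence, per-residue features
pair = torch.zeros(L, L, D_PAIR)        # residue-residue "edges"

msa, _ = row_attn(msa, msa, msa)        # each sequence attends along its residues
# Outer product over the sequence dimension summarizes covariation
# between every pair of MSA columns (i, j).
outer = torch.einsum("sic,sjd->ijcd", msa, msa) / N_SEQ
pair = pair + to_pair(outer.reshape(L, L, D * D))
```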
The residue-residue edges/interactions can then be transformed into distance matrices, which describe the distance between every pair of building blocks in a protein... essentially the crux of the protein structure prediction problem
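In code terms, that transformation can be as simple as a per-edge classification head over distance bins (the CASP13 version of #alphafold famously predicted *distributions* over binned distances, ie a distogram, rather than single values). The head below is a hypothetical minimal version.

```python
import torch
import torch.nn as nn

L, D_PAIR, N_BINS = 50, 32, 64   # residues, pair-edge channels, distance bins

# Hypothetical distogram head: map each residue-residue edge to a
# distribution over binned pairwise distances.
dist_head = nn.Linear(D_PAIR, N_BINS)

pair = torch.randn(L, L, D_PAIR)                    # the learned edges
logits = dist_head(pair)                            # (L, L, N_BINS)
logits = 0.5 * (logits + logits.transpose(0, 1))    # enforce d(i,j) == d(j,i)
distogram = logits.softmax(dim=-1)                  # distance distributions
```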
The predicted protein structure is further refined through intuitions/tools derived from biophysical modeling [Amber force fields] that describe the physics of interacting particles in a system. This translates to small movements in the overall structure to ensure physical viability
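The flavor of that refinement, in toy form: treat the structure's energy as a differentiable function of coordinates and take small downhill steps. A real Amber force field has bond, angle, torsion, electrostatic, and van der Waals terms; the single harmonic bond term below is only there to show the "small movements" idea.

```python
import torch

# Toy relaxation: gradient descent on a harmonic bond energy, nudging
# predicted C-alpha coordinates toward plausible geometry. A stand-in
# for real force-field minimization, not Amber itself.
coords = torch.randn(50, 3, requires_grad=True)   # predicted C-alpha positions
IDEAL_CA_CA = 3.8                                 # typical C-alpha spacing (angstroms)
opt = torch.optim.SGD([coords], lr=1e-2)

for _ in range(200):
    opt.zero_grad()
    bonds = (coords[1:] - coords[:-1]).norm(dim=-1)   # consecutive distances
    energy = ((bonds - IDEAL_CA_CA) ** 2).sum()       # harmonic penalty
    energy.backward()
    opt.step()                                        # small downhill movement
```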
Lastly, the model is differentiable and trained end-to-end (as opposed to a sequential pipeline of specialized models) ... allowing for loss propagation "holistically" through all internal representations.
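What "end-to-end" buys you, in a self-contained toy: one loss at the output, with gradients flowing back through every stage at once. Again, the TinyFold model below is entirely made up; it just mimics the sequence -> representation -> pairwise-distance pipeline shape.

```python
import torch
import torch.nn as nn

# Tiny stand-in model: embedding -> encoder -> pairwise distance-bin head.
class TinyFold(nn.Module):
    def __init__(self, vocab=21, d=64, n_bins=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), 2
        )
        self.pair_head = nn.Linear(2 * d, n_bins)

    def forward(self, tokens):                     # tokens: (B, L)
        h = self.encoder(self.embed(tokens))       # (B, L, d)
        B, L, d = h.shape
        # Build pair features by concatenating residue i and j embeddings.
        hi = h.unsqueeze(2).expand(B, L, L, d)
        hj = h.unsqueeze(1).expand(B, L, L, d)
        return self.pair_head(torch.cat([hi, hj], dim=-1))  # (B, L, L, bins)

model = TinyFold()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

tokens = torch.randint(0, 20, (2, 50))             # toy batch of sequences
target_bins = torch.randint(0, 64, (2, 50, 50))    # "true" binned distances
loss = nn.functional.cross_entropy(
    model(tokens).reshape(-1, 64), target_bins.reshape(-1)
)
loss.backward()   # one loss; gradients reach embedding, encoder, and head alike
opt.step()
```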
LMK if there's something I missed or if you have any further interpretations! Even with the limited detail provided, it's a great case study in building cool models with solid intuitions. There's so much more work to be done in proteins; this is just the first step! [n/n]
