#AlphaFold by #DeepMind used solid interdisciplinary intuitions for algorithm/model design. It wasn't just a rinse-and-repeat machine learning exercise. Details on the methods are limited, but here's my best interpretation (+ some predictions) so far: [1/n]
Protein sequence databases provide us with samples that have de facto passed the fitness test of evolution and are information-rich. "Genetics search" is a retrieval step to find nearest neighbors as defined by sequence alignment. Why do we need nearest neighbors (NNs), you ask?
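To make the retrieval idea concrete, here's a minimal sketch of what a nearest-neighbor search over a sequence database could look like. In practice this is done with specialized alignment/search tools (e.g. jackhmmer, HHblits, MMseqs2), not the toy k-mer overlap used here; the query and database below are made up.

```python
# Toy nearest-neighbor retrieval over a protein sequence database.
# A stand-in for real MSA search tools (jackhmmer / HHblits / MMseqs2);
# similarity here is crude k-mer overlap rather than a proper alignment score.

def kmers(seq, k=3):
    """Set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard overlap of k-mer sets, a rough proxy for alignment similarity."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / max(len(ka | kb), 1)

def genetics_search(query, database, top_n=5):
    """Return the top_n most similar sequences (the 'nearest neighbors')."""
    ranked = sorted(database, key=lambda s: similarity(query, s), reverse=True)
    return ranked[:top_n]

# Hypothetical example
database = ["MKTAYIAKQR", "MKTAYLAKQR", "MATAYIAKQR", "GGGGGGGGGG"]
hits = genetics_search("MKTAYIAKQR", database, top_n=3)
```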
There's a neat principle/intuition called coevolution that helps explain why. The pattern of correlated mutations observed across related sequences gives clues to protein structure and function. Read more here: gremlin.bakerlab.org/gremlin_faq.php
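A minimal Python sketch of the coevolution signal: columns of an MSA that mutate together have high mutual information, hinting that the corresponding residues are in contact. The tiny toy MSA below is made up, and real methods (e.g. GREMLIN) fit a global statistical model rather than raw pairwise MI.

```python
import numpy as np
from collections import Counter

# Toy MSA: 4 aligned sequences, 5 columns each (made-up data).
msa = [
    "ACDEF",
    "ACDQF",
    "GCDEW",
    "GCDQW",
]

def column(msa, j):
    return [seq[j] for seq in msa]

def mutual_information(msa, i, j):
    """Raw mutual information between MSA columns i and j (no corrections)."""
    n = len(msa)
    ci, cj = column(msa, i), column(msa, j)
    p_i, p_j = Counter(ci), Counter(cj)
    p_ij = Counter(zip(ci, cj))
    mi = 0.0
    for (a, b), count in p_ij.items():
        pij = count / n
        mi += pij * np.log(pij / ((p_i[a] / n) * (p_j[b] / n)))
    return mi

# Columns 0 and 4 covary (A<->F, G<->W) -> high MI, a coevolution hint.
print(mutual_information(msa, 0, 4), mutual_information(msa, 0, 1))
```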
The retrieval step is critical to the success of #AlphaFold and relies on decades of scientific advances: cost-effective protein sequencing, curation of protein databases (including metagenomic data in BFD), and efficient search software [such as that developed by @thesteinegger].
Next up, the "embed" step. Essentially we need to transform all the protein sequences into vectors in a useful embedding space. #DeepMind hasn't provided any details, but it's worth mentioning as an aside that this is extensively studied in ML, and more recently in protein ML too.
[Further tangent] You can use a transformer-based representation for embedding protein sequences. Intuitions from self-supervised learning (e.g. masked/autoregressive language modeling) on large-scale raw databases are used to do this well. #AlphaFold may or may not have used this.
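As a hedged illustration of that tangent (not AlphaFold's actual embedding), here's a tiny PyTorch sketch of a transformer encoder that could be pretrained with masked language modeling and then reused to embed sequences. The vocabulary, layer sizes, and mean-pooling choice are arbitrary.

```python
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"                    # 20 amino acids
VOCAB = {a: i + 1 for i, a in enumerate(AA)}   # 0 reserved for [MASK]/pad

class TinyProteinEncoder(nn.Module):
    """Small transformer encoder; could be pretrained with masked-LM and its
    pooled hidden states reused as sequence embeddings (illustrative only)."""
    def __init__(self, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB) + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.to_logits = nn.Linear(d_model, len(VOCAB) + 1)  # masked-LM head

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))    # (batch, length, d_model)
        return h, self.to_logits(h)

def tokenize(seq):
    return torch.tensor([[VOCAB[a] for a in seq]])

model = TinyProteinEncoder()
hidden, logits = model(tokenize("MKTAYIAKQR"))
sequence_embedding = hidden.mean(dim=1)         # mean-pool residues -> one vector
```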
Now comes the more novel and unexplained work... learning the "sequence-residue edges" and "residue-residue edges". FYI, a residue is the building block (think token) of a protein. A couple of intuitions and inductive biases were broadly involved:
An attention-based technique was used, which has shown promise across ML in language/vision/etc. It allows for efficient learning (i.e. capturing relations between elements) and uncovering broader principles: lilianweng.github.io/lil-log/2018/0…
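For readers new to attention, here's a bare-bones numpy version of scaled dot-product attention, the primitive behind all of this (a generic illustration, not AlphaFold's specific attention variants):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each query mixes values from all keys,
    weighted by relevance; this is how 'relations between elements' are captured."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))   # (n_queries, n_keys)
    return weights @ V                        # (n_queries, d_value)

# Toy example: 5 residues, 8-dim features (random stand-ins for learned projections).
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 8))
updated = attention(Q, K, V)
```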
Also, architecturally, the model can attend to the related sequences (the MSA) found in "genetics search". In an iterative fashion, the residue-residue interactions/edges are updated with residue information learned across evolutionarily related sequences.
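My rough interpretation of that iteration, in numpy: aggregate per-residue features across MSA rows (e.g. with an averaged outer product) and fold the result into a residue-residue "pair" representation, then repeat. This is a guess at the mechanism, not DeepMind's disclosed architecture; the names and shapes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_seqs, n_res, d = 16, 10, 8                     # MSA depth, sequence length, feature size

msa_repr = rng.normal(size=(n_seqs, n_res, d))   # per-sequence, per-residue features
pair_repr = np.zeros((n_res, n_res, d))          # residue-residue "edges"

def update_pairs(msa_repr, pair_repr):
    """Fold MSA information into the pair representation via an averaged
    outer product over sequences (one plausible coupling, purely illustrative)."""
    outer = np.einsum("sid,sjd->ijd", msa_repr, msa_repr) / msa_repr.shape[0]
    return pair_repr + outer

for _ in range(3):                               # the "iterative fashion"
    pair_repr = update_pairs(msa_repr, pair_repr)
    # (a full model would also update msa_repr with attention conditioned on pair_repr)
```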
The residue-residue edges/interactions can then be transformed into distance matrices, which describe the pairwise distances between the building blocks of a protein... essentially the crux of the protein structure prediction problem.
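Continuing the sketch above, the pair representation can be read out as a symmetric matrix of predicted inter-residue distances. The linear readout here is made up for simplicity; a real model would more likely predict a distribution over distances per pair than a single value.

```python
import numpy as np

rng = np.random.default_rng(1)
n_res, d = 10, 8
pair_repr = rng.normal(size=(n_res, n_res, d))   # from the previous step

# Hypothetical learned readout: project each edge's features to a scalar distance.
w = rng.normal(size=d)
raw = pair_repr @ w                              # (n_res, n_res)
dist = np.abs((raw + raw.T) / 2.0)               # symmetrize, keep non-negative
np.fill_diagonal(dist, 0.0)                      # a residue is 0 Å from itself
```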
The predicted protein structure is further refined with intuitions/tools from biophysical modeling [Amber force fields], which describe the physics of interacting particles in a system. This translates to small movements of the overall structure to ensure physical viability.
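A toy stand-in for that refinement: gradient descent on a simple harmonic bond potential that nudges consecutive Cα atoms toward a ~3.8 Å spacing. Real relaxation uses a full Amber force field (bonds, angles, electrostatics, van der Waals); this only illustrates the "small movements to ensure viability" idea.

```python
import numpy as np

rng = np.random.default_rng(2)
coords = rng.normal(scale=5.0, size=(10, 3))   # fake predicted Cα coordinates (Å)
IDEAL, K, LR = 3.8, 1.0, 0.05                  # target Cα-Cα distance, spring constant, step size

def relax_step(coords):
    """One gradient step on E = sum_i K * (|r_{i+1} - r_i| - IDEAL)^2."""
    grad = np.zeros_like(coords)
    for i in range(len(coords) - 1):
        delta = coords[i + 1] - coords[i]
        dist = np.linalg.norm(delta)
        g = 2 * K * (dist - IDEAL) * delta / dist   # dE/d(coords[i+1])
        grad[i + 1] += g
        grad[i] -= g
    return coords - LR * grad

for _ in range(200):                            # small movements, many steps
    coords = relax_step(coords)
```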
Lastly, the model is differentiable and trained end-to-end (as opposed to a sequential pipeline of specialized models)... allowing the loss to propagate "holistically" through all internal representations.
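To contrast with a staged pipeline, a minimal PyTorch sketch of end-to-end training: two toy modules (an "embed" stage and a "pair/distance" stage) composed into one computation graph, so a single backward pass updates both. The shapes, modules, and loss are invented for illustration.

```python
import torch
import torch.nn as nn

class Embed(nn.Module):
    """Stand-in for the sequence/MSA embedding stage."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(21, 32)
    def forward(self, x):
        return torch.relu(self.proj(x))

class PairHead(nn.Module):
    """Stand-in for the pair/distance prediction stage."""
    def __init__(self):
        super().__init__()
        self.out = nn.Linear(64, 1)
    def forward(self, h):
        # build pairwise features by concatenating residue i and j embeddings
        i = h.unsqueeze(1).expand(-1, h.shape[0], -1)
        j = h.unsqueeze(0).expand(h.shape[0], -1, -1)
        return self.out(torch.cat([i, j], dim=-1)).squeeze(-1)

embed, head = Embed(), PairHead()
opt = torch.optim.Adam(list(embed.parameters()) + list(head.parameters()), lr=1e-3)

x = torch.randn(10, 21)                        # fake per-residue input features
target = torch.rand(10, 10) * 20               # fake distance map (Å)

pred = head(embed(x))                          # one differentiable graph, end to end
loss = nn.functional.mse_loss(pred, target)
loss.backward()                                # loss propagates through BOTH modules
opt.step()
```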
LMK if there's something I missed or if you have any further interpretations! Even with the limited detail provided, it's a great case study in building cool models with solid intuitions. There's so much more work to be done in proteins; this is just the first step! [n/n]