#AlphaFold by #DeepMind used solid interdisciplinary intuitions for algorithm/model design. It wasn't just a rinse-and-repeat machine learning exercise. Details on the methods are limited, but here's my best interpretation (+ some predictions) so far: [1/n]
Protein sequence databases provide us with samples that have de facto passed the fitness test of evolution and are information-rich. "Genetics search" is a retrieval step to find nearest neighbors as defined by sequence alignment. Why do we need nearest neighbors (NNs), you ask?
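To make the retrieval idea concrete, here's a minimal sketch of what a nearest-neighbor search over a sequence database could look like. In practice this is done with specialized alignment/search tools (e.g. jackhmmer, HHblits, MMseqs2), not the toy k-mer overlap used here; the query and database below are made up.

```python
# Toy nearest-neighbor retrieval over a protein sequence database.
# A stand-in for real MSA search tools (jackhmmer / HHblits / MMseqs2);
# similarity here is crude k-mer overlap rather than a proper alignment score.

def kmers(seq, k=3):
    """Set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard overlap of k-mer sets, a rough proxy for alignment similarity."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / max(len(ka | kb), 1)

def genetics_search(query, database, top_n=5):
    """Return the top_n most similar sequences (the 'nearest neighbors')."""
    ranked = sorted(database, key=lambda s: similarity(query, s), reverse=True)
    return ranked[:top_n]

# Hypothetical example
database = ["MKTAYIAKQR", "MKTAYLAKQR", "MATAYIAKQR", "GGGGGGGGGG"]
hits = genetics_search("MKTAYIAKQR", database, top_n=3)
```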
There's a neat principle/intuition called coevolution that helps explain why. The pattern of correlated mutations observed across related sequences gives clues to protein structure and function. Read more here: gremlin.bakerlab.org/gremlin_faq.php
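A minimal Python sketch of the coevolution signal: columns of an MSA that mutate together have high mutual information, hinting that the corresponding residues are in contact. The tiny toy MSA below is made up, and real methods (e.g. GREMLIN) fit a global statistical model rather than raw pairwise MI.

```python
import numpy as np
from collections import Counter

# Toy MSA: 4 aligned sequences, 5 columns each (made-up data).
msa = [
    "ACDEF",
    "ACDQF",
    "GCDEW",
    "GCDQW",
]

def column(msa, j):
    return [seq[j] for seq in msa]

def mutual_information(msa, i, j):
    """Raw mutual information between MSA columns i and j (no corrections)."""
    n = len(msa)
    ci, cj = column(msa, i), column(msa, j)
    p_i, p_j = Counter(ci), Counter(cj)
    p_ij = Counter(zip(ci, cj))
    mi = 0.0
    for (a, b), count in p_ij.items():
        pij = count / n
        mi += pij * np.log(pij / ((p_i[a] / n) * (p_j[b] / n)))
    return mi

# Columns 0 and 4 covary (A<->F, G<->W) -> high MI, a coevolution hint.
print(mutual_information(msa, 0, 4), mutual_information(msa, 0, 1))
```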
The retrieval step is critical to the success of #AlphaFold and relies on decades of scientific advances: cost-effective protein sequencing, curation of protein databases (including metagenomic data in BFD), and efficient search software [such as that developed by @thesteinegger].
Next up, the "embed" step. Essentially we need to transform all the protein sequences into vectors in a useful embedding space. #DeepMind hasn't provided any details, but it's worth mentioning as an aside that this is extensively studied in ML, and more recently in protein ML too.
[Further tangent] You can use a transformer-based representation for embedding protein sequences. Intuitions from self-supervised learning (e.g. masked/autoregressive language modeling) on large-scale raw databases are used to do this well. #AlphaFold may or may not have used this.
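As a hedged illustration of that tangent (not AlphaFold's actual embedding), here's a tiny PyTorch sketch of a transformer encoder that could be pretrained with masked language modeling and then reused to embed sequences. The vocabulary, layer sizes, and mean-pooling choice are arbitrary.

```python
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"                    # 20 amino acids
VOCAB = {a: i + 1 for i, a in enumerate(AA)}   # 0 reserved for [MASK]/pad

class TinyProteinEncoder(nn.Module):
    """Small transformer encoder; could be pretrained with masked-LM and its
    pooled hidden states reused as sequence embeddings (illustrative only)."""
    def __init__(self, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB) + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.to_logits = nn.Linear(d_model, len(VOCAB) + 1)  # masked-LM head

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))    # (batch, length, d_model)
        return h, self.to_logits(h)

def tokenize(seq):
    return torch.tensor([[VOCAB[a] for a in seq]])

model = TinyProteinEncoder()
hidden, logits = model(tokenize("MKTAYIAKQR"))
sequence_embedding = hidden.mean(dim=1)         # mean-pool residues -> one vector
```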
Now comes the more novel and unexplained work... learning the "sequence-residue edges" and "residue-residue edges". FYI, a residue is the building block (think token) of a protein. A couple of intuitions and inductive biases were broadly involved:
An attention-based technique was used, which has shown promise across ML in language/vision/etc. It allows for efficient learning (i.e. capturing relations between elements) and uncovering broader principles: lilianweng.github.io/lil-log/2018/0…
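For readers new to attention, here's a bare-bones numpy version of scaled dot-product attention, the primitive behind all of this (a generic illustration, not AlphaFold's specific attention variants):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each query mixes values from all keys,
    weighted by relevance; this is how 'relations between elements' are captured."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))   # (n_queries, n_keys)
    return weights @ V                        # (n_queries, d_value)

# Toy example: 5 residues, 8-dim features (random stand-ins for learned projections).
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 8))
updated = attention(Q, K, V)
```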
Also, architecturally, the model can attend to the related sequences (the MSA) found in "genetics search". In an iterative fashion, the residue-residue interactions/edges are updated with residue information learned across evolutionarily related sequences.
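My rough interpretation of that iteration, in numpy: aggregate per-residue features across MSA rows (e.g. with an averaged outer product) and fold the result into a residue-residue "pair" representation, then repeat. This is a guess at the mechanism, not DeepMind's disclosed architecture; the names and shapes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_seqs, n_res, d = 16, 10, 8                     # MSA depth, sequence length, feature size

msa_repr = rng.normal(size=(n_seqs, n_res, d))   # per-sequence, per-residue features
pair_repr = np.zeros((n_res, n_res, d))          # residue-residue "edges"

def update_pairs(msa_repr, pair_repr):
    """Fold MSA information into the pair representation via an averaged
    outer product over sequences (one plausible coupling, purely illustrative)."""
    outer = np.einsum("sid,sjd->ijd", msa_repr, msa_repr) / msa_repr.shape[0]
    return pair_repr + outer

for _ in range(3):                               # the "iterative fashion"
    pair_repr = update_pairs(msa_repr, pair_repr)
    # (a full model would also update msa_repr with attention conditioned on pair_repr)
```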
The residue-residue edges/interactions can then be transformed into distance matrices, which describe the pairwise distances between the building blocks of a protein... essentially the crux of the protein structure prediction problem.
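Continuing the sketch above, the pair representation can be read out as a symmetric matrix of predicted inter-residue distances. The linear readout here is made up for simplicity; a real model would more likely predict a distribution over distances per pair than a single value.

```python
import numpy as np

rng = np.random.default_rng(1)
n_res, d = 10, 8
pair_repr = rng.normal(size=(n_res, n_res, d))   # from the previous step

# Hypothetical learned readout: project each edge's features to a scalar distance.
w = rng.normal(size=d)
raw = pair_repr @ w                              # (n_res, n_res)
dist = np.abs((raw + raw.T) / 2.0)               # symmetrize, keep non-negative
np.fill_diagonal(dist, 0.0)                      # a residue is 0 Å from itself
```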
The predicted protein structure is further refined with intuitions/tools from biophysical modeling [Amber force fields], which describe the physics of interacting particles in a system. This translates to small movements of the overall structure to ensure physical viability.
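A toy stand-in for that refinement: gradient descent on a simple harmonic bond potential that nudges consecutive Cα atoms toward a ~3.8 Å spacing. Real relaxation uses a full Amber force field (bonds, angles, electrostatics, van der Waals); this only illustrates the "small movements to ensure viability" idea.

```python
import numpy as np

rng = np.random.default_rng(2)
coords = rng.normal(scale=5.0, size=(10, 3))   # fake predicted Cα coordinates (Å)
IDEAL, K, LR = 3.8, 1.0, 0.05                  # target Cα-Cα distance, spring constant, step size

def relax_step(coords):
    """One gradient step on E = sum_i K * (|r_{i+1} - r_i| - IDEAL)^2."""
    grad = np.zeros_like(coords)
    for i in range(len(coords) - 1):
        delta = coords[i + 1] - coords[i]
        dist = np.linalg.norm(delta)
        g = 2 * K * (dist - IDEAL) * delta / dist   # dE/d(coords[i+1])
        grad[i + 1] += g
        grad[i] -= g
    return coords - LR * grad

for _ in range(200):                            # small movements, many steps
    coords = relax_step(coords)
```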
Lastly, the model is differentiable and trained end-to-end (as opposed to a sequential pipeline of specialized models)... allowing the loss to propagate "holistically" through all internal representations.
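To contrast with a staged pipeline, a minimal PyTorch sketch of end-to-end training: two toy modules (an "embed" stage and a "pair/distance" stage) composed into one computation graph, so a single backward pass updates both. The shapes, modules, and loss are invented for illustration.

```python
import torch
import torch.nn as nn

class Embed(nn.Module):
    """Stand-in for the sequence/MSA embedding stage."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(21, 32)
    def forward(self, x):
        return torch.relu(self.proj(x))

class PairHead(nn.Module):
    """Stand-in for the pair/distance prediction stage."""
    def __init__(self):
        super().__init__()
        self.out = nn.Linear(64, 1)
    def forward(self, h):
        # build pairwise features by concatenating residue i and j embeddings
        i = h.unsqueeze(1).expand(-1, h.shape[0], -1)
        j = h.unsqueeze(0).expand(h.shape[0], -1, -1)
        return self.out(torch.cat([i, j], dim=-1)).squeeze(-1)

embed, head = Embed(), PairHead()
opt = torch.optim.Adam(list(embed.parameters()) + list(head.parameters()), lr=1e-3)

x = torch.randn(10, 21)                        # fake per-residue input features
target = torch.rand(10, 10) * 20               # fake distance map (Å)

pred = head(embed(x))                          # one differentiable graph, end to end
loss = nn.functional.mse_loss(pred, target)
loss.backward()                                # loss propagates through BOTH modules
opt.step()
```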
LMK if there's something I missed or if you have any further interpretations! Even with the limited detail provided, it's a great case study in building cool models with solid intuitions. There's so much more work to be done in proteins; this is just the first step! [n/n]