1/ AlphaFold is a revolutionary leap for biology, and it has gifted us the AlphaFold Database (AFDB). But what happens when we use that data to train other models? We found a crucial catch. 🧵
2/🔬 The Challenge: Systematic Bias
The problem isn't AlphaFold's accuracy, which is phenomenal. The issue is that the AFDB carries a systematic bias: its structures are "too perfect" and don't capture the full, messy diversity of experimentally determined structures in the PDB.
3/💡 The Evidence: A Drop in Performance
We saw this clearly in inverse folding. Models trained on PDB data generalize well. But the exact same models trained on AFDB data struggle, with performance dropping by up to 26.6% when tested on real-world structures!
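For context, the usual inverse folding metric here is sequence recovery: the fraction of residues where the model's top prediction matches the native amino acid. A minimal sketch, assuming per-residue logits over the 20 amino acid types (function name and tensor shapes are illustrative, not from the paper):

```python
import torch

def sequence_recovery(pred_logits: torch.Tensor, native_seq: torch.Tensor,
                      mask: torch.Tensor) -> float:
    """Fraction of scored residues where the argmax prediction
    matches the native amino acid.

    pred_logits: [L, 20] per-residue logits over amino acid types
    native_seq:  [L] integer-encoded native sequence
    mask:        [L] boolean, True for residues to score
    """
    pred = pred_logits.argmax(dim=-1)       # predicted residue types
    correct = (pred == native_seq) & mask   # matches on scored positions
    return correct.sum().item() / mask.sum().item()
```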
4/🔎 Visualizing the Bias
Ramachandran plots of PDB structures (left) show broad, natural variation. AFDB structures (middle) are tightly clustered in "allowed" regions.
The mixed plot (right) shows how AFDB conformations occupy a narrower, more idealized subspace.
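If you want to reproduce plots like these: the φ/ψ backbone dihedrals a Ramachandran plot shows can be computed directly from N, CA, C coordinates. A minimal NumPy sketch (helper names are ours, not from the paper):

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Dihedral angle (degrees) defined by four 3D points."""
    b0 = -(p1 - p0)
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # project b0 and b2 onto the plane perpendicular to b1
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.degrees(np.arctan2(y, x))

def phi_psi(backbone):
    """backbone: [L, 3, 3] float array of N, CA, C coords per residue.
    Returns (phi, psi) arrays for the L-2 interior residues."""
    N, CA, C = backbone[:, 0], backbone[:, 1], backbone[:, 2]
    L = len(backbone)
    phi = [dihedral(C[i-1], N[i], CA[i], C[i]) for i in range(1, L - 1)]
    psi = [dihedral(N[i], CA[i], C[i], N[i+1]) for i in range(1, L - 1)]
    return np.array(phi), np.array(psi)
```

Scatter the resulting (φ, ψ) pairs for PDB vs. AFDB structures and the difference in spread is visible immediately.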
5/🛠️ The Solution: DeSAE
We built DeSAE, trained only on experimental PDB data.
By learning to reconstruct native structures from corrupted inputs, DeSAE implicitly learns the manifold of natural, physically plausible protein conformations.
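The thread doesn't spell out DeSAE's architecture, but the recipe it describes (corrupt inputs, reconstruct the natives) is a denoising autoencoder. A toy sketch of that training loop, with the layer sizes, names, and Gaussian corruption scheme all assumed for illustration:

```python
import torch
import torch.nn as nn

class DenoisingStructureAE(nn.Module):
    """Toy denoising autoencoder over backbone coordinates. By learning
    to map corrupted structures back to the natives, it internalizes
    the manifold of physically plausible conformations."""

    def __init__(self, n_residues: int, hidden: int = 256, latent: int = 64):
        super().__init__()
        d = n_residues * 3  # flattened CA coordinates
        self.encoder = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                     nn.Linear(hidden, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                     nn.Linear(hidden, d))

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: [B, n_residues, 3]
        B = coords.shape[0]
        z = self.encoder(coords.reshape(B, -1))
        return self.decoder(z).reshape(B, -1, 3)

def train_step(model, native, optimizer, sigma=0.5):
    corrupted = native + sigma * torch.randn_like(native)  # corrupt input
    recon = model(corrupted)
    loss = ((recon - native) ** 2).mean()  # reconstruct the native structure
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```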
6/💥 The Result: Performance Recovered!
The impact is dramatic. Training inverse folding models on Debiased AFDB leads to massive performance gains.
For PiFold, this debiasing step recovered most of the lost performance, with sequence recovery jumping by an incredible +18.5%!
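Putting it together, the implied pipeline is: train the autoencoder on PDB structures, pass raw AFDB structures through it, and train the inverse folding model on the reconstructions. A hypothetical sketch (function and variable names are ours, not from the paper):

```python
import torch

# Hypothetical names: `desae` is a trained DenoisingStructureAE (see above),
# `afdb_coords` is a [N, L, 3] tensor of raw AFDB backbone coordinates.
def debias_afdb(desae, afdb_coords, batch_size=128):
    """Project AFDB structures through the PDB-trained autoencoder,
    pulling them back toward the experimentally observed manifold."""
    desae.eval()
    out = []
    with torch.no_grad():
        for i in range(0, len(afdb_coords), batch_size):
            out.append(desae(afdb_coords[i:i + batch_size]))
    return torch.cat(out)

# The inverse folding model (e.g. PiFold) is then trained on
# debias_afdb(desae, afdb_coords) instead of the raw AFDB structures.
```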