Jason Crawford
Dec 1, 2020 · 31 tweets
Today Google @DeepMind announced that their deep learning system AlphaFold has achieved unprecedented levels of accuracy on the “protein folding problem”, a grand challenge problem in computational biochemistry.

What is this problem, and why is it hard?

deepmind.com/blog/article/a…
I spent a couple years on this problem in a junior role in the early days of @DEShawResearch, so it's close to my heart. DESRES (as we called it internally) took an approach of building a specialized supercomputing architecture, called Anton: Image
Proteins are long chains of amino acids. Your DNA encodes these sequences, and RNA helps manufacture proteins according to this genetic blueprint.

Proteins are synthesized as linear chains, but they don't stay that way. They fold up into complex, globular shapes. Image
One part of the chain might coil up into a tight spiral called an α-helix. Another part might fold back and forth on itself to create a wide, flat piece called a β-sheet. Image
The sequence of amino acids itself is called primary structure. Components like the α-helix and β-sheet are called secondary structure.

Then these components fold up together to create unique, complex shapes. This is called tertiary structure. Image
This looks like a mess. Why does this big tangle of amino acids matter?

Protein structure is not random! Each protein folds in a specific, unique, and largely predictable way that is essential to its function. Image
The physical shape of a protein gives it a good fit to targets it might bind with. Other physical properties matter too, especially the distribution of electrical charge within the protein, as shown here (positive charge in blue, negative in red): Image
If a protein is essentially a self-assembling nanomachine, then the main purpose of the amino acid sequence is to produce the unique shape, charge distribution, etc. that determines the protein's function.
*How* exactly this happens, in the body, is still not fully understood, and is an active area of research. In any case, understanding structure is crucial to understanding function.
But the DNA sequence only gives us the primary structure of a protein. How can we learn its secondary and tertiary structure—the exact shape of the blob?

This problem is called “protein structure determination”, and there are two basic approaches: measurement and prediction.
Experimental methods can measure protein structure. But it isn't easy: an optical microscope can't resolve the structures.

For a long time, X-ray crystallography was the main method. NMR has also been used, and more recently, a technique called cryogenic electron microscopy. Image
But these methods are difficult, expensive, and time-consuming, and they don't work for all proteins.

Notably, proteins embedded in the cell membrane (such as the ACE2 receptor that SARS-CoV-2, the virus behind COVID-19, binds to) fold in the lipid bilayer of the cell and are difficult to crystallize. Image
Because of this, we have determined the structure of only a tiny fraction (roughly 0.1%) of the proteins that we've sequenced. Google notes that there are 180M protein sequences in the Universal Protein database (UniProt), but only ~170k structures in the Protein Data Bank.

We need a better method.
Remember, though, that the secondary and tertiary structures are mostly a function of the primary structure, which we know from genetic sequencing.

What if, instead of *measuring* a protein's structure, we could *predict* it?
This is “protein structure prediction”, or colloquially, the “protein folding problem.” Computational biochemists have been working on it for decades.

How could we approach this?
The obvious way is to directly simulate the physics. Model the forces on each atom, given its location, charge, and chemical bonds. Calculate accelerations and velocities based on that, and evolve the system step by step. This is called “molecular dynamics” (MD).
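To make "evolve the system step by step" concrete, here is a minimal sketch of one common MD integration scheme, velocity Verlet. Everything in it is an illustrative assumption rather than what any real MD package does: the "force field" is just harmonic springs between neighbouring atoms on a 1D chain, and the names `forces`, `velocity_verlet`, and `total_energy` are hypothetical.

```python
def forces(positions, k=1.0, r0=1.0):
    """Toy force field (assumption): harmonic springs between chain neighbours, in 1D."""
    f = [0.0] * len(positions)
    for i in range(len(positions) - 1):
        stretch = (positions[i + 1] - positions[i]) - r0
        f[i] += k * stretch       # a stretched bond pulls atom i forward
        f[i + 1] -= k * stretch   # and pulls atom i+1 back
    return f


def velocity_verlet(positions, velocities, masses, dt, n_steps):
    """Advance the system n_steps timesteps of size dt."""
    x, v = list(positions), list(velocities)
    f = forces(x)
    for _ in range(n_steps):
        # Half-step velocity update from the current forces (a = F / m).
        v = [vi + 0.5 * dt * fi / m for vi, fi, m in zip(v, f, masses)]
        # Full-step position update.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        # Recompute forces at the new positions, then finish the velocity update.
        f = forces(x)
        v = [vi + 0.5 * dt * fi / m for vi, fi, m in zip(v, f, masses)]
    return x, v


def total_energy(x, v, masses, k=1.0, r0=1.0):
    """Kinetic plus spring potential energy, useful as a sanity check."""
    kinetic = sum(0.5 * m * vi * vi for m, vi in zip(masses, v))
    potential = sum(0.5 * k * ((x[i + 1] - x[i]) - r0) ** 2 for i in range(len(x) - 1))
    return kinetic + potential
```

A real simulation replaces `forces` with bonded and non-bonded terms over 3D coordinates, which is where nearly all of the computational cost goes.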
The problem is that this is *extremely* computationally intensive.

A typical protein has hundreds of amino acids, which means thousands of atoms. But the environment also matters: the protein interacts with surrounding water when folding. So it's more like 30k atoms to simulate.
And there are electrostatic interactions between every pair of atoms, so naively that's ~450M pairs, an O(N²) problem. (There are smart algorithms to make this O(N log N).)

Also IIRC you end up needing to run for something like 10^9 to 10^12 timesteps.

It's a pain.
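The arithmetic behind those numbers, as a rough sketch. The 30k atom count and the 10^9 to 10^12 step range come from the thread; the function names are hypothetical, and the N log N figure ignores constant factors, so treat it as an order-of-magnitude comparison only.

```python
import math


def pair_count(n_atoms):
    """Number of distinct atom pairs: N * (N - 1) / 2."""
    return n_atoms * (n_atoms - 1) // 2


def naive_pair_evaluations(n_atoms, n_steps):
    """Total pairwise interactions if every pair is evaluated at every timestep."""
    return pair_count(n_atoms) * n_steps


def n_log_n(n_atoms):
    """Rough per-step work for an O(N log N) method, constants ignored."""
    return n_atoms * math.log2(n_atoms)


n_atoms = 30_000
print(f"pairs per step (naive):  {pair_count(n_atoms):,}")  # 449,985,000
print(f"N log N per step (rough): {n_log_n(n_atoms):,.0f}")
for n_steps in (10**9, 10**12):
    total = naive_pair_evaluations(n_atoms, n_steps)
    print(f"{n_steps:.0e} steps -> {total:.2e} naive pair evaluations")
```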
OK, but we don't have to simulate the entire folding process.

Another approach is to find the structure that *minimizes potential energy*. Objects tend to come to rest at energy minima, so this is a good heuristic. The same model that gives us forces for MD can calculate energy.
With this approach, we can try a whole bunch of candidate structures and pick the one with lowest energy.

The problem, of course, is where do you get the structures from? There are just way too many—molecular biologist Cyrus Levinthal estimated 10^300 (!)
Of course, you can be much smarter than trying all of them at random. But there are still too many.
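Here is a toy sketch of "try a bunch of candidates, keep the lowest energy", and of why it doesn't scale. The model is an assumption made up for illustration: beads on a 2D square lattice, three possible turns per joint, and an energy of -1 for each pair of non-bonded beads sitting on adjacent sites. Real energy functions are far richer. Likewise, writing 10^300 as 10 states for each of 300 residues is just one way to illustrate the order of magnitude quoted above.

```python
from itertools import product

TURNS = (-1, 0, 1)  # turn right, go straight, turn left


def build_chain(turns):
    """Place beads on a square lattice, starting along +x, following the turns."""
    dx, dy = 1, 0
    x, y = 1, 0
    coords = [(0, 0), (1, 0)]
    for t in turns:
        if t == 1:       # rotate heading +90 degrees
            dx, dy = -dy, dx
        elif t == -1:    # rotate heading -90 degrees
            dx, dy = dy, -dx
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords


def energy(coords):
    """Toy energy: -1 per non-bonded contact; overlapping beads are forbidden."""
    if len(set(coords)) != len(coords):
        return float("inf")
    e = 0
    for i in range(len(coords)):
        for j in range(i + 2, len(coords)):  # skip bonded neighbours
            if abs(coords[i][0] - coords[j][0]) + abs(coords[i][1] - coords[j][1]) == 1:
                e -= 1
    return e


def lowest_energy_structure(n_beads):
    """Brute force: enumerate every turn sequence and keep the minimum-energy one."""
    candidates = product(TURNS, repeat=n_beads - 2)
    best = min(candidates, key=lambda turns: energy(build_chain(turns)))
    return best, energy(build_chain(best))


def n_conformations(states_per_residue, n_residues):
    """Size of the search space if each residue independently picks a state."""
    return states_per_residue ** n_residues
```

`lowest_energy_structure(10)` only has to check 3^8 = 6,561 candidates, which is instant. `n_conformations(10, 300)` is 10^300, which no amount of hardware can enumerate.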

So there have been many attempts to get faster at doing these kinds of calculations.
Anton, the supercomputer from @DEShawResearch, used specialized hardware—a custom integrated circuit. IBM also has a computational bio supercomputer, Blue Gene. Image
Stanford created Folding@home to leverage the massively distributed power of ordinary home computers.

The Foldit project from UW makes folding a game, to augment computation with human intuition. Image
Still, for a long time, no technique was able to predict a wide variety of protein structures with high accuracy. A biennial competition called CASP, which compares algorithms against experimentally measured structures, saw top scores of 30–40%… until recently: Image
So how does AlphaFold work? It uses multiple deep neural nets to learn different functions relevant to each protein. One key function is a prediction of the final *distances* between pairs of amino acids. This guides the algorithm to the final structure.
In one version of the algorithm, they then derived a potential function from this prediction, and applied simple gradient descent—which worked remarkably well. (I can't tell from what I've been able to read today if this is still what they're doing.)
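To give a feel for "derive a potential from predicted distances, then run gradient descent", here is a minimal sketch. It is not DeepMind's method: the squared-error potential, the fixed learning rate, and every function name below are assumptions chosen for illustration, and the "predicted" distances are simply computed from a known toy structure instead of coming from a neural net.

```python
import math
import random


def pairwise_distances(coords):
    """Matrix of distances between every pair of points."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def potential_and_gradient(coords, target):
    """Assumed potential: sum over pairs of (current distance - target distance)^2."""
    n = len(coords)
    value = 0.0
    grad = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diff = [a - b for a, b in zip(coords[i], coords[j])]
            r = math.sqrt(sum(c * c for c in diff)) or 1e-12  # avoid division by zero
            err = r - target[i][j]
            value += err * err
            for k in range(3):
                g = 2.0 * err * diff[k] / r  # d(err^2)/d(x_i[k])
                grad[i][k] += g
                grad[j][k] -= g
    return value, grad


def fold_by_gradient_descent(start, target, lr=0.02, n_steps=5000):
    """Plain gradient descent on the coordinates, fixed step size."""
    coords = [list(p) for p in start]
    for _ in range(n_steps):
        _, grad = potential_and_gradient(coords, target)
        for point, g in zip(coords, grad):
            for k in range(3):
                point[k] -= lr * g[k]
    return coords


# Toy demo: a short helix-like chain, a perturbed starting guess, then descent.
true_coords = [(math.cos(t), math.sin(t), 0.5 * t) for t in range(6)]
target = pairwise_distances(true_coords)
rng = random.Random(0)
start = [[c + rng.uniform(-0.2, 0.2) for c in p] for p in true_coords]
before, _ = potential_and_gradient(start, target)
after, _ = potential_and_gradient(fold_by_gradient_descent(start, target), target)
print(f"potential before: {before:.4f}, after: {after:.2e}")
```

Note that distances alone only pin down the shape up to rotation, translation, and mirror image, so the recovered coordinates need not match the original ones point for point.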
A general advantage of AlphaFold over some previous methods is that it doesn't need to make assumptions about the structure. Some methods work by splitting the protein into regions, figuring out each region, then putting them back together. AlphaFold doesn't need to do this.
@DeepMind seems to be calling the protein folding problem solved, which strikes me as simplistic, but in any case this appears to be a major advance. Experts outside Google are calling it “fantastic”, “gamechanging”, etc.

sciencemag.org/news/2020/11/g…
Between protein folding and CRISPR, genetic engineering now has two very powerful new tools in its toolbox. Maybe the 2020s will be to biotech what the 1970s were to computing.

Congrats to the researchers at @DeepMind on this breakthrough!
Blog post version of this thread, with image credits, at @rootsofprogress: rootsofprogress.org/alphafold-prot…
PS: the impact AlphaFold can have on pharmaceuticals
