Some perspective on #AlphaFold with the rather eye-popping announcement from @DeepMind and @emblebi (which I am Director of) of all known proteins (~200 million) having an AlphaFold prediction run.
First off, at some conceptual level this is "just" about the scalability of computational methods - if you can do one on a computer, you can do 200 million. Still, instantiating this resource in a systematic manner makes it real, and this instantiation is not trivial.
There are three types of engineering needed here; one is knowing, tracking and organising all known proteins - this is the mission of @uniprot, a joint project of @emblebi, @sib and PIR of @Georgetown. Conceptually it looks pretty simple but ... bamboozling detail, reconciliation, etc
The second is being able to run AlphaFold on this number. AlphaFold was a tour de force in AI *development* and one of the most complex Deep Neural Networks around, but once developed, running it is not the most computationally demanding task.
Still, DeepMind and their access to whizzy Alphabet hardware (eg, TPUs) made this work, and of course, just to stress, the really impressive thing is the development of AlphaFold in the first place.
The third is storing, indexing, integrating and displaying the structures (alphafold.ebi.ac.uk), where the @PDBeurope team @emblebi led by Sameer Velankar has been key. Easy-to-use websites are surprisingly - and counterintuitively - hard to make.
So - hats off to all involved; most obviously the @DeepMind AlphaFold team for developing it and following through on open source code, data and results, and the @UniProt and @PDBeurope teams @emblebi
Being able to "just download" the whole prediction set is going to - I am sure - stimulate entirely new research directions. As important is the on-demand "oh I am going to make a mutation on my protein, I wonder where it is on the structure" for ... *every known protein*
But - let's also step back and ask what made this transformation of this area of science by AI feasible. This starts with the vision from @demishassabis at @DeepMind that these sorts of problems can fall to AI - and that would be a bold call now, let alone in 2010.
Demis is clearly a visionary at a number of levels, but the other clear thing is that he creates, motivates and praises a team with real depth - John Jumper being the most obvious for AlphaFold but DeepMind has a real ... deep bench of people. Very impressive.
All the AI talent in the world, though, can't easily solve science problems, posed by the universe, without data - and lots of it. Here the long-established community norm in molecular biology for sharing data - in particular in structural biology - is a key enabler.
Computational biology has two wellsprings: structural biology, where the fiendishly complex 3D structures and the ways to study them - crystals and X-rays, and now cryo-EM - demanded computers; and genomics/genetics, where the scale and complexity of the datasets is eye-watering
Structural biology was at the forefront of computational biology in the 1970s, and unsurprisingly they took the "share your knowledge and reagents" mindset into "you must share the data". The community insisted on deposition of 3D coordinates to "The Protein Data Bank"
The Protein Data Bank (PDB) was originally run on Long Island, NY, associated with one of the DOE labs, Brookhaven National Laboratory. It evolved over time into the wwPDB (world-wide PDB) with partners in the US @buildmodels, Europe @PDBeurope and Japan @PDBj_en.
One needs to remember that this mindset of sharing data predates the internet - predates the creation of TCP/IP - and information then was shared by exchanging tapes in the post; FORTRAN and VAX machines ruled.
The community norm of releasing data on publication of the paper, and the consistent stewardship by the wwPDB of this global dataset - shared openly, used by all - is the bedrock that enabled #AlphaFold.
But... talented AI researchers and broad, excellent, unencumbered open data were probably not enough. In my view - and many other people's - it needed one more thing - a competition. This is CASP.
The protein folding problem is deceptively simple. Strings of amino acids, well represented by just a string of letters, are (usually) selected by evolution to fold into one specific structure which is stable enough, for example, that one can crystallise it as a regular 3D array.
There are only 20 amino acids. In the main chain of the polymer each amino acid has two bonds it can rotate around (given the Greek letters Phi and Psi), and even this space is full of clear steric constraints.
The strong solvation network of water means that certain amino acids want to minimise their disruption of this network (hydrophobic) whereas other amino acids want to participate in it (hydrophilic), and different amino acids have different shapes.
I remember doing some Psi/Phi angle stuff on a computer in the early 90s (I was hanging out in Peter Campbell's NMR lab in Oxford) and I - along with many other people - thought... surely this can't be too hard. It is, after all, "just physics" and it is constrained.
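(For the curious, the kind of calculation I mean is tiny - here is a minimal sketch in Python of a backbone torsion angle, with illustrative function and variable names, assuming you already have the backbone atom coordinates as numpy 3-vectors:)

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Torsion (dihedral) angle in degrees defined by four atom positions.

    Phi for residue i uses C(i-1), N(i), CA(i), C(i);
    Psi for residue i uses N(i), CA(i), C(i), N(i+1).
    """
    b0 = p0 - p1          # first bond, pointing back from p1 to p0
    b1 = p2 - p1          # central bond, the axis of rotation
    b2 = p3 - p2          # last bond
    b1 = b1 / np.linalg.norm(b1)
    # components of b0 and b2 perpendicular to the central bond
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    return np.degrees(np.arctan2(np.dot(np.cross(b1, v), w), np.dot(v, w)))

# Hypothetical usage, with coordinates taken from a structure file:
# phi_i = dihedral(C_prev, N_i, CA_i, C_i)
# psi_i = dihedral(N_i, CA_i, C_i, N_next)
```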
And indeed, cleverer people than me attacked this problem with bigger and bigger computers in the 90s with more and more sophistication, and they made some eye-catching claims of being able to solve this problem... which... when new structures came out... proved not to be true...
A number of structural biologists, led by John Moult, decided to tame this mess and ran a competition - the Critical Assessment of Structure Prediction (CASP); John and colleagues canvassed the experimental community for structures likely to be determined in the next year.
They posted this list (as amino acid sequences) to computational biologists, who had to submit their predictions before *any human* had seen the result. True blind prediction.
CASP had to rapidly learn how to do many things: how to sort the easier problems (eg, closely related sequences) from the harder ("new fold"); how to assess how good a fit was; how to handle this number of predictions.
The early CASPs were a bit messy but this rapidly became *the* place to critically test (because in effect it was impossible to cheat) computational predictions - it also formed a community of scientists chiselling away at this problem - defining it, making metrics etc
And this competition provided a simply excellent target for AI. This is both on the formal level - what is my objective function? Is there a metric for how right or wrong I am? - and on a social level - you just turned up and competed.
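(To make "a metric" concrete, here is a toy sketch in Python of a GDT_TS-flavoured score - the real CASP assessment searches over superpositions and uses much more careful machinery; this sketch assumes the predicted and experimental CA coordinates are already paired up and superposed, and the function name is mine:)

```python
import numpy as np

def gdt_ts_like(pred_ca, true_ca, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Toy GDT_TS-style score (0-100): the mean fraction of CA atoms whose
    predicted position lies within each distance cutoff (in Angstroms) of the
    experimental position. Real GDT optimises the superposition per cutoff;
    here we assume the model is already superposed onto the true structure."""
    dists = np.linalg.norm(pred_ca - true_ca, axis=1)   # per-residue CA-CA distance
    return 100.0 * np.mean([(dists <= c).mean() for c in cutoffs])

# Hypothetical usage: pred_ca and true_ca are (N, 3) numpy arrays of CA coordinates
# score = gdt_ts_like(pred_ca, true_ca)   # higher is better; ~100 means near-perfect
```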
So - for me - AlphaFold needed three things (a) AI vision and talent, (b) broad, extensive, truly open data, provided by the community and stewarded over time and space (c) a considered and well run competition with open rules and metrics.
Are there more problems that fit this? Almost certainly yes - there is plenty of AI vision and talent, both in DeepMind and of course broader than that (a nod here to academics such as @OliverStegle and @anshulkundaje as two such examples)
We have broad, stewarded datasets, from genomes @ensembl to expression values in cells @ExpressionAtlas to functional descriptions @UniProt to pathways @reactome to chemical/protein interactions @ChEMBL (to name a subset of @emblebi resources spanning molecular biology)
I think we need to add more of these formalised competitions - they sharpen the scientific question and help on both the technical side (what precisely are we trying to do? what is a continuous scoring metric?) and the social side (all comers, forming a community)
So I think there is plenty more for AI to get into in biology specifically, and science more generally. When I next have a break I might pen some thoughts on the interpretability of AI (basically... it is less scary than it looks in my view), but... the horizon here is wide.
Ugh - Iain Campbell (Oxford NMR) not Peter Campbell (Sanger Cancer Biologist). Doh!
<sigh>. Another moment reading a paper where an author says "with the difference in outcomes between ethnic groups this shows that there are genetic components to this process". Nope. It does *not*.
Reminder: ethnicity (or race) is something you tick on forms from a societally defined set of categories; genetics is the variation of DNA you inherited from your parents. Very, very different things.
The only solid connection between these two concepts is that in most societies skin pigmentation is a salient feature for many ethnic groups, and skin pigmentation is driven by genetics and, for some genetic backgrounds, sun exposure as well.
Great to see this paper by the @OpenTargets team led by @DunhamReal getting broad recognition. Human genetics is a massive and in theory near-complete “natural perturbation experiment” which we can observe in humans
(The “in theory” is that every base in the human genome should be mutated at least once in the global population. A small minority of these will be dominant lethals - eg preventing development - which interestingly we can often see in the inverse - a lack of observed mutation in a gene)
Of course the “in theory” here implies a global genetics process that can pick up and characterise every interesting phenotype - and ideally the inverse - ie global phenotyping and genotyping.
I don’t get this. At its core a blockchain is a write-only public database with a distributed (zero-trust) commit scheme. Any software/regulation stack implies clients (individuals/companies) trusting that software to represent the regulation. on.ft.com/3ykoYjL
So … the trust needed in the system that everything works, money is safe, small investors won’t be screwed over for technical reasons (eg putting something on the appropriate chain wrong) etc means there has to be … trust and regulation … so it isn’t a zero-trust scenario
Given that, why not have a mutual organisation that runs a write-only public database, guarantees full downloads to anyone who wants them, and shows and audits its commit logs regularly to participants. You know, a bit like those things called stock markets, but databased
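(To illustrate the "write-only database with auditable commit logs" part without the distributed commit machinery - a toy sketch in Python; the class and method names are made up for illustration, not any real system:)

```python
import hashlib
import json
import time

class AppendOnlyLog:
    """Toy append-only log: each record carries the hash of the previous one,
    so anyone who downloads the log can audit that history was not rewritten.
    This is the 'write-only public database' idea, run by one mutual operator,
    without a blockchain's distributed (zero-trust) commit scheme on top."""

    def __init__(self):
        self.records = []

    def append(self, payload):
        """Append a JSON-serialisable payload; returns the new record's hash."""
        prev_hash = self.records[-1]["hash"] if self.records else "0" * 64
        body = {"t": time.time(), "payload": payload, "prev": prev_hash}
        body["hash"] = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.records.append(body)
        return body["hash"]

    def audit(self):
        """Recompute the hash chain; True only if no record has been altered."""
        prev = "0" * 64
        for rec in self.records:
            body = {k: rec[k] for k in ("t", "payload", "prev")}
            expected = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if rec["prev"] != prev or rec["hash"] != expected:
                return False
            prev = rec["hash"]
        return True

# Hypothetical usage:
# log = AppendOnlyLog()
# log.append({"trade": "buy", "shares": 100, "ticker": "ABC"})
# assert log.audit()
```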
Having participated in many of the discussions about race/ethnicity and genetics, both online and in person, a rather obvious fact has become clearer to me: the way race and ethnicity align with identity means this conversation is super-complex.
(this is, of course, a duh! moment. Of course this is related to identity! but before you think I am just stating the obvious I have found it interesting how identity *shapes the way people absorb information* in this space)
Just to get people on the same page: race or ethnicity is usually an identification process where people are given a number of boxes to choose from ("White British", "African-American", "Han Chinese" etc) and they choose one, with some "mixed" or "other" box to use.
We still have mountains to climb for genomic medicine, but I want to pause, catch breath, and look around - how far we've come, and what does the landscape look like from here.
First some definitions - genomic medicine for me means genome-wide measurement; the germline genome is one, the somatic (cancer) genome another, RNAseq a third. It's the comprehensive molecular measurement of a sample, which can apply for a lifetime (eg germline) or an instant
Genomic medicine merges with data-driven medicine; it is perhaps rightly a strict subset of data-driven medicine (you have to have good data processing to do genomic medicine) but it has this key molecular measurement at its core
I am not American, but I do love America: a wonderful country (great cities; great wilderness; great small towns), great people from all stripes of life and just an uplifting can-do attitude. America - and Americans - at its/their best is... brilliant.
I first went to America when I was 19 - a "gap year" at Cold Spring Harbor Laboratory, with weekends up in Boston to meet friends there. Working hard on science, playing hard in Boston and NYC; I learnt everything from running gels to C programming to keg parties.
I spent a life-broadening summer in Baltimore when I was 24, as an intern in the Mayor's office (Kurt Schmoke), learning about US city politics and administration first hand, from investment banks to policing in east Baltimore. Quite an eye-opener.