Simon Barnett
Jun 16 23 tweets 9 min read
The following paper is one of the most interesting and thoughtful I've read in quite some time.

The authors offer a new framework for understanding if genetic mutations are harmless (benign) or dangerous (pathogenic).

Spoiler: AlphaFold 2 is involved.

nature.com/articles/s4146…
First, I'd like to credit the authors: .@capra_lab, .@rodendm, and .@computbiolgeek. I'm sure they'll correct me if I butcher anything.

I originally stumbled on the paper after reading a thread about it by .@RyanDhindsa, which I've linked below:

Some Background:

The basis of medical genetics is understanding how #DNA mutations (variants) give rise to disease.

Recall that inherited DNA sequence variants can sometimes alter proteins by changing the identity of an amino acid.

This is called a 'missense mutation'.
In the above example, the normal (wild-type) DNA sequence reads ...CAT..., which codes for the amino acid histidine. The mutated version reads ...CGT..., which replaces histidine with a different amino acid called arginine.

Recall that proteins = amino acids strung together.
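
To make the CAT > CGT example concrete, here's a minimal Python sketch that looks up both codons and labels the change. The table is deliberately partial (just four codons from the standard genetic code), and the function name `classify_snv` is my own invention for illustration.

```python
# Partial codon table: only the codons needed for this example
# (taken from the standard genetic code).
CODON_TO_AA = {
    "CAT": "His",
    "CAC": "His",
    "CGT": "Arg",
    "CGC": "Arg",
}


def classify_snv(ref_codon: str, alt_codon: str) -> str:
    """Label a single-codon change as synonymous or missense.

    Hypothetical helper: it only knows the codons listed above.
    """
    ref_aa = CODON_TO_AA[ref_codon]
    alt_aa = CODON_TO_AA[alt_codon]
    if ref_aa == alt_aa:
        return f"synonymous ({ref_aa} unchanged)"
    return f"missense ({ref_aa} > {alt_aa})"


print(classify_snv("CAT", "CGT"))  # missense (His > Arg)
print(classify_snv("CAT", "CAC"))  # synonymous (His unchanged)
```

The second call shows why not every letter change matters: CAT and CAC both code for histidine, so the protein is unchanged.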
Extrapolating from the example, the million-dollar question is whether a missense variant (and the resulting protein variant) is benign or pathogenic.

Today, most single-nucleotide variants (SNVs) we find in patients are variants of uncertain significance (VUSs).
This means the variant is a bit like Schrödinger's Cat. Until we have enough information to place it in the benign/likely benign (B/LB) or pathogenic/likely pathogenic (P/LP) bins, it could be either.

Fortunately, most VUSs that get reclassified are downgraded to B/LB.

ncbi.nlm.nih.gov/pmc/articles/P…
To place a variant along the pathogenicity spectrum, large clinical labs use an ensemble of methods ranging from a simple literature review (who's seen this before?) to physically recreating the mutation in a lab experiment to see what happens.

gimjournal.org/article/S1098-…
Once they confidently interpret a variant as benign or pathogenic, labs often submit this 'truth label' to public databases (e.g. ClinVar), which other labs can reference if they happen to find the same mutation in one of their patients.

Still, think about how large sequence space is. Every one of us has ~6 billion base pairs of DNA in our cells. The exact letter change, the location of the letter change, and the superposition of many letter changes in concert can create enormous complexity.
SNVs are also only one type of variant - there are insertions/deletions (indels), large structural variations, and many other types that are beyond the scope of this thread.

Suffice it to say, we need better tools to tell whether SNVs are dangerous, even when they aren't in a database.
Before describing their new method (COSMIS), the authors walk through several contemporary approaches to interpret SNVs.

1. Interspecies conservation = Is the sequence conserved in related species across long timespans? If so, it's probably important that it stay unmutated.
2. Intraspecies population frequency = In humans, is the variant extremely rare? If so, it's more likely to be dangerous to carry because natural selection has weeded it out. This reminds me a bit of the 'survivorship bias' plane.

deanyeong.com/article/surviv…
Still, population-based measures look at correlation only. They don't tell you anything about HOW a protein is altered by an upstream DNA sequence change. Nor do they give insight into how a protein variant then gives rise to a disease state.
3. Some newer, functionally-informed metrics overcome the limitations of statistical approaches but may be low resolution. That is, they can say if a gene or region of a gene is essential, but don't go down to the amino-acid level.
Many of these tools consider only the 1D sequences of genes or proteins. But while these molecules can be represented as strings of characters, they are, in fact, dynamic 3D entities that interact with their environments.
The authors' framework (COSMIS) analyzes missense-mutated proteins in 3D. Here, they define a 'contact set' as amino acids that are physically close in 3D space.

In the authors' figure, j2 and j5 are close in 1D (sequence) space, but don't interact in 3D (real) space.
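
Here's a rough sketch of the contact-set idea in Python. Everything in it is an assumption made for illustration: the coordinates are made up, the 8 Å cutoff is a number I picked for the example, and `contact_set` is a hypothetical helper, not the authors' code. I also chose to count the central residue as part of its own set.

```python
import math

# Made-up 3D coordinates in angstroms, keyed by sequence position.
coords = {
    10: (0.0, 0.0, 0.0),
    14: (12.0, 0.0, 0.0),  # near position 10 in sequence, far in 3D
    50: (0.0, 5.0, 0.0),   # far from position 10 in sequence, near in 3D
}

CUTOFF = 8.0  # assumed distance cutoff, for illustration only


def contact_set(center, coords, cutoff=CUTOFF):
    """Return the positions within `cutoff` of `center` in 3D space.

    The central residue itself is included (distance 0).
    """
    origin = coords[center]
    return sorted(
        pos for pos, xyz in coords.items()
        if math.dist(origin, xyz) <= cutoff
    )


print(contact_set(10, coords))  # [10, 50]
```

Position 14 sits only four residues away in the sequence but falls outside the set, while position 50 is forty residues away and falls inside it. That is the gap between 1D and 3D thinking.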
Why does that matter?

Remember the missense variant from earlier, which swapped one (+) charged amino acid for another (H>R)?

Consider what would happen if the switch were to a (-) charged amino acid instead. Opposite charges attract, so the new residue could pull on nearby (+) charged neighbors, perhaps misfolding the protein in a destructive way.
Conceptually, the authors took millions of known SNVs, mapped those variants to (1D) protein sequences, and then overlaid those on 3D protein structures.

Some structures were experimentally determined, but many were simulated using .@DeepMind's AlphaFold 2 (AF2).
By subtracting the expected # of missense variants from the observed # (and dividing by the standard deviation of the expected distribution), the authors created the COSMIS score. It's negative when fewer variants show up than expected.

The lower it is, the more likely a missense variant is pathogenic (dangerous).
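
The arithmetic behind a score like this is just a z-score, and it can be sketched in a few lines of Python. This is a toy under stated assumptions, not the authors' implementation: the thread doesn't cover how the paper builds its expected distribution, so `null_counts` below is made-up data and `constraint_z` is a hypothetical name. I wrote it as (observed - expected) / SD so that fewer variants than expected gives a lower score, which matches "lower = more likely pathogenic".

```python
import statistics


def constraint_z(observed, null_counts):
    """Z-score of an observed variant count against a null distribution.

    `null_counts` stands in for the expected distribution; its mean is the
    expected count and its standard deviation is the divisor.
    """
    expected = statistics.mean(null_counts)
    sd = statistics.pstdev(null_counts)
    return (observed - expected) / sd


# Made-up null distribution: mean 10, standard deviation 2.
null_counts = [8, 12, 8, 12, 8, 12, 8, 12]

depleted = constraint_z(4, null_counts)   # far fewer variants than expected
tolerant = constraint_z(12, null_counts)  # slightly more than expected

print(depleted)  # negative: this site looks constrained
print(tolerant)  # positive: this site tolerates variation
```

A strongly negative value means the population carries far fewer missense variants at that site than chance would predict, which hints that selection is removing them.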
Not only does COSMIS outperform other 1D and 3D-aware variant interpretation metrics, but it also carries information complementary to phylogenetic (cross-species conservation) models, as shown in panels (a) and (c) of the attached figure, respectively.
Altogether, this work shows how thinking in 3D helps scientists and clinicians interpret the functional consequences of DNA sequence changes.

Yes, DNA, RNA, and proteins are information, but they're also fragile, physical things.
Each of the 20 amino acids is unique in size, shape, charge, and other qualities. By considering how they interact in 3D space, COSMIS better predicts how the overall function of a protein changes.

While not a panacea, it's a great step towards linking sequence and effect.
