A Thread (🧵)

Let's discuss long-read #sequencing, optical mapping, and the implications of a recent study (linked below).

Please view my disclosures at the end. I've intentionally made this thread more technical as I think it's necessary.

First, let's deconstruct the paper's title: "De Novo Assembly of 64 Haplotype-Resolved Human Genomes of Diverse Ancestry and Integrated Analysis of Structural Variation".

De novo (Latin for "anew") assembly involves reconstructing a genome without the help of a reference.
Assembling a #genome de novo is like solving a jigsaw puzzle without using the picture on the front of the box. You could start with the corners, assemble the edges, and try to fill in the rest using color- or shape-matching methods.

Let me use an analogy.
Imagine the jigsaw pieces are sequence reads, the heuristics you use to put the pieces together are assembly algorithms, and the final product (hopefully) is a complete reference genome, as shown in the drawing below. I'll extend this further.
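To make the puzzle analogy concrete, here's a minimal sketch of one classic assembly heuristic — greedy overlap merging, where the two reads sharing the longest suffix/prefix overlap are repeatedly joined. This is a toy illustration, not how production assemblers (which use overlap graphs or de Bruijn graphs) actually work:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` matching a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left: contigs stay fragmented
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

# Three overlapping "reads" reassemble into one contig:
print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGGAT"]))
# → ['ATGGCGTACGGAT']
```

Longer reads mean bigger puzzle pieces with more distinctive overlaps — which is the core appeal of LRS for de novo work.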
Most sequencing in the world is done with a reference, just as most jigsaw puzzles are finished with the box nearby. That is to say, de novo sequencing is a substantially smaller market opportunity than #clinical sequencing.
Clinical sequencing works by comparing a patient's genome against a reference genome. The differences between the two are called #variants, as shown below. Detecting variants with a reference genome is easier in the same way that solving a jigsaw with the photo is easier.
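In the simplest possible case — two already-aligned sequences of equal length — variant detection reduces to reporting the positions where the patient differs from the reference. A toy sketch (real variant callers work on millions of read alignments and must handle indels, quality scores, and ploidy):

```python
def call_snvs(reference, sample):
    """Report (position, ref_base, alt_base) wherever the aligned
    sample differs from the reference. Positions are 0-based."""
    assert len(reference) == len(sample), "sequences must be aligned"
    return [(i, r, s)
            for i, (r, s) in enumerate(zip(reference, sample))
            if r != s]

print(call_snvs("ACGTACGT", "ACGAACGT"))  # → [(3, 'T', 'A')]
```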
My point, which I'll reinforce later, is that conclusions drawn from this paper apply most directly to de novo sequencing, not clinical sequencing. This distinction will resurface, so keep the jigsaw puzzle analogy in your back pocket.
What is a #haplotype-resolved genome?

Humans are #diploid organisms, meaning we carry two copies of each chromosome - one from each parent. Each parental copy is a haplotype. When a variant is haplotype-resolved, we know which parent it was inherited from.
The process of assigning variants to their parental haplotype is called phasing. Unlike a 'squashed' short-read assembly, a haplotype-resolved, long-read assembly gives researchers a clearer view of the functional consequences of genetic variants (a process called annotation).
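A toy illustration of one common phasing approach, trio-based phasing (function and encoding are hypothetical, for illustration only): when a child is heterozygous and exactly one parent carries the alternate allele, the variant can be assigned to that parent's haplotype.

```python
def phase_variant(child_gt, mother_gt, father_gt):
    """Assign a child's heterozygous variant to a parental haplotype.
    Genotypes are alternate-allele counts: 0=hom-ref, 1=het, 2=hom-alt."""
    if child_gt != 1:
        return "not heterozygous"
    mother_has = mother_gt > 0
    father_has = father_gt > 0
    if mother_has and not father_has:
        return "maternal"
    if father_has and not mother_has:
        return "paternal"
    return "ambiguous"  # both (or neither) parent carries the allele

print(phase_variant(1, 1, 0))  # → maternal
print(phase_variant(1, 0, 1))  # → paternal
print(phase_variant(1, 1, 1))  # → ambiguous
```

The study itself phased without parental data, using read-based and Strand-seq information — trios are just the easiest way to see what "haplotype-resolved" means.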
Importantly, the researchers' method focused on an ancestrally diverse group of individuals. By generating more diverse reference genomes, we can improve variant calling performance for everyone. A dangerous variant in one person could be benign in another!

Let's shift gears.
I'm now going to discuss several methods and instruments, some of which were used in this paper. As a reminder, methods and instruments are NOT companies. Please don't misinterpret my opinions/critiques of technologies/tools as my feelings towards a company.
Like other de novo efforts, the researchers' method combined multiple sequencing technologies. I've outlined them below. This is common in de novo research settings, where discovery rates are optimized over cost, turnaround time, or ease of use.
While this 'kitchen-sink' approach may suit de novo sequencing, I have rarely seen a clinical setting wherein genomic data is fed from one sequencer to another. This is another reason why I think it's improper to generalize cost comparisons derived from this paper.
The pooling of methods may be why some believe LRS is, in general, 20-40X more expensive than optical mapping, which is incorrect. For example, #WGS on the Sequel IIe - as it would be used in a clinical setting - is quoted at ~$3.7K, not $10-20K (see cost decline below).
Let's now discuss the study's results.

Across all samples, the researchers identified 18,207,906 unique variants, which break down in the following manner (see illustration below). We'll begin with the smallest variants (SNVs) and end with the largest structural variants (SVs).
Recall there are ~3B base pairs (bp) in the human genome. An SNV is a single-letter (1 bp) change. An indel (a portmanteau of insertion + deletion) is <50 bp. SVs vary widely in size, beginning at 50 bp and ranging up to tens of thousands of bp.
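Those size cutoffs can be expressed as a simple classifier (a sketch using the conventional thresholds above; real pipelines read these from VCF records):

```python
def classify_variant(ref_allele, alt_allele):
    """Classify a variant by size of change, using common cutoffs:
    SNV = 1 bp substitution, indel < 50 bp, SV >= 50 bp."""
    size = abs(len(ref_allele) - len(alt_allele))
    if size == 0 and len(ref_allele) == 1:
        return "SNV"
    if size < 50:
        return "indel"
    return "SV"

print(classify_variant("A", "G"))             # → SNV
print(classify_variant("A", "ATTG"))          # → indel (3 bp insertion)
print(classify_variant("A", "A" + "T" * 60))  # → SV (60 bp insertion)
```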
You can see the vast majority of variation is small, though rarer SVs often impart larger functional changes. SVs tend to be highly repetitive (e.g., AAAA) and/or contain repetitive motifs (e.g., ATGATGATG). This causes short reads to fail, as referenced multiple times in the paper.
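A tiny illustration of why repeats defeat short reads: a read shorter than the repeat region matches equally well at many positions, so it can't be placed uniquely, while a long read spanning the whole repeat has only one valid placement.

```python
def placements(read, reference):
    """All start positions where `read` matches `reference` exactly.
    More than one hit means the read cannot be placed uniquely."""
    return [i for i in range(len(reference) - len(read) + 1)
            if reference[i:i + len(read)] == read]

repeat_region = "ATG" * 6  # an 18 bp tandem repeat

print(placements("ATGATG", repeat_region))   # short read: → [0, 3, 6, 9, 12]
print(placements("ATG" * 6, repeat_region))  # spanning read: → [0]
```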
HiFi, CLR, sequencing by synthesis (SBS), and #nanopore sequencing all aim to detect the maximum # of variants, though with varying degrees of success. I won't compare them here except to say we think short reads are ineffective, even w/ improved graph-based assemblers.
As shown here, the researchers' method fully resolved nearly all variants, except for 28% of large SVs, which were detected only via OM. Though, if you count partially detected calls, OM adds only 6% more clarity. In fairness, I'll stick with 28% because it doesn't change my point.
The conclusion should be that the HiFi/CLR + SS method identified >99% of the variation in this study, with OM adding an additional 28% of 1% - that is, 0.28% more variant calls. OM isn't capable of calling variants <500 bp, which make up ~99% of the total variation.
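The arithmetic behind that figure - 28% of the roughly 1% of calls that are large SVs - works out as follows (the 1% is an approximation from the study's variant breakdown):

```python
large_sv_fraction = 0.01  # large SVs ≈ 1% of all variant calls here
om_only_fraction = 0.28   # share of large SVs resolved only by OM

added_calls = om_only_fraction * large_sv_fraction
print(f"{added_calls:.2%}")  # → 0.28%
```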
This does NOT mean that OM isn't important, needed, or exciting. I think it's all three, but to say that LRS detects only 72% of SVs as opposed to OM is incorrect. To say that LRS detects just 72% of LARGE SVs is more accurate, but omits where LRS technology is going.
We disagree outright with the claim that sequence reads aren't getting longer; longer reads are clearly an R&D goal for both fluorescence- and nanopore-based LRS providers. I'm unaware of a theoretical 'upper limit' on read lengths, especially for nanopore-based methods.
In fact, the recent telomere-to-telomere (T2T) assembly - the most complete reference genome to date - used only HiFi and ultra-long nanopore reads, the latter for long-range scaffolding, as shown below. I'd argue sufficiently longer reads will improve large SV calling.

Still, I agree in principle that massively complex SVs, segmental duplications (SDs), regions near centromeres, and very heterogeneous cancer samples w/ low #allele fraction are perfectly cut out for OM-first approaches.
Indeed, I'm equally excited for upcoming presentations on these areas in a few weeks. I agree, in principle, that OM will likely always outperform LRS for detecting the most esoteric SV events, though that margin may erode with time.
I've even written about some of these cases, such as chromothripsis, where I now believe OM has a key role to play in discovery. This paper also shows how OM is vital to generating haplotype-resolved, de novo genomes.

I'll admit another place where my knowledge is limited. What does multiplexing look like on Saphyr? Throughput is one thing, which again I'll admit has improved a shocking amount, but can a lab do targeted work using UMIs for sample batching?
In conclusion, there's no single measure for 'better' in genomics. That's like saying a racecar is better than a Jeep Wrangler. It all depends on how fast you need to go and what rocks you may need to climb.

General: bit.do/eyRo8

I'm focused on technology, products, and methods, not comparing companies. This is not a buy, sell, or hold recommendation for any security.
Disclosure(s) Cont.

To YouTubers/vloggers, please do not spread rumors. Instead, try to hear out my angle, come up with your own take, or give insight where I may be wrong so I can learn.
Edit: Should be 0.28%, not 0.028% more large SV calls.

Thread by Simon Barnett (@sbarnettARK)