Laurent Jacob

3 Dec, 14 tweets, 8 min read

Happy to introduce CALDERA, a microbial GWAS method by @hector_rdb extending DBGWAS to better detect polymorphic sequences.

TLDR: we run one test for each closed connected subgraph of the DBG, where DBGWAS ran one per node. Each subgraph can capture several versions of a gene.

De Bruijn graphs (DBGs) connect overlapping k-mers.

Compaction merges each simple path into a single node representing a longer subsequence, termed a unitig.

Local variation (e.g., SNPs) creates bubbles in the graph.
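A minimal sketch of these three ideas on toy data, under stated assumptions: the helper names (`build_dbg`, `compact`) are hypothetical, reverse complements are ignored, and isolated cycles are skipped. It is not the code used by DBGWAS or CALDERA.

```python
def kmers(seq, k):
    """All k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_dbg(seqs, k):
    """Nodes are k-mers; an edge u -> v exists when u and v overlap by k-1."""
    nodes = {km for s in seqs for km in kmers(s, k)}
    succ = {u: sorted(u[1:] + c for c in "ACGT" if u[1:] + c in nodes)
            for u in nodes}
    pred = {u: sorted(c + u[:-1] for c in "ACGT" if c + u[:-1] in nodes)
            for u in nodes}
    return succ, pred


def compact(succ, pred):
    """Merge each maximal non-branching path into one unitig string."""
    def starts_unitig(u):
        # u continues a path only if it has a single predecessor
        # that itself has a single successor.
        return not (len(pred[u]) == 1 and len(succ[pred[u][0]]) == 1)

    unitigs = []
    for u in sorted(succ):
        if not starts_unitig(u):
            continue
        path, cur = u, u
        # Extend while the path neither forks nor merges.
        while len(succ[cur]) == 1 and len(pred[succ[cur][0]]) == 1:
            cur = succ[cur][0]
            path += cur[-1]
        unitigs.append(path)
    return unitigs


# Two toy strains differing by a single SNP (G vs C at position 5).
strains = ["ATCGGATTCA", "ATCGCATTCA"]
succ, pred = build_dbg(strains, k=4)
unitigs = compact(succ, pred)
print(unitigs)
```

On this toy input the compacted graph has four unitigs: a shared prefix, a shared suffix, and two parallel branches carrying the two alleles. Those two branches are the bubble.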

https://twitter.com/ljacob/status/1085556757383077890

DBGWAS tests the association between a phenotype and the presence profile of unitigs. It then relies on a compacted de Bruijn graph built across all strains to visualize the significant unitigs.
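To make "testing a presence profile" concrete, here is a small sketch that uses Fisher's exact test as a stand-in. This is an assumption for illustration: it ignores population structure and is not necessarily the test DBGWAS applies. The function names are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher p-value for the 2x2 table [[a, b], [c, d]]."""
    n1, n2 = a + b, c + d          # row totals
    m1 = a + c                     # first column total
    denom = comb(n1 + n2, m1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(n1, x) * comb(n2, m1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, m1 - n2), min(n1, m1)
    # Sum the probabilities of all tables at most as likely as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def association_pvalue(profile, phenotype):
    """profile[i] = 1 if the unitig is present in strain i; phenotype[i] in {0, 1}."""
    pairs = list(zip(profile, phenotype))
    a = sum(1 for x, y in pairs if x and y)          # present, case
    b = sum(1 for x, y in pairs if x and not y)      # present, control
    c = sum(1 for x, y in pairs if not x and y)      # absent, case
    d = sum(1 for x, y in pairs if not x and not y)  # absent, control
    return fisher_exact_two_sided(a, b, c, d)


phenotype = [1, 1, 1, 1, 0, 0, 0, 0]
p_perfect = association_pvalue([1, 1, 1, 1, 0, 0, 0, 0], phenotype)
p_everywhere = association_pvalue([1] * 8, phenotype)
print(p_perfect, p_everywhere)
```

A unitig present exactly in the four cases gets a small p-value, while a unitig present in every strain gets a p-value of 1.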

A gene with a unique set of k-mers and the exact same sequence in all strains would be compacted to a unitig and lead to a single test in DBGWAS.

But in practice, several variants of a gene co-exist. Each unitig may only be present in a subset of strains that contain the gene.

DBGWAS helps visualize this situation, because significant unitigs originating from the same polymorphic gene or plasmid lead to quasi-linear connected subgraphs in the compacted DBG.

But testing at the unitig level is less powerful than testing at the gene/plasmid level.

CALDERA solves this issue by testing the association of the phenotype with the presence profile of any unitig in a connected subgraph (i.e., a logical "or" over the profiles of its unitigs).
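The logical "or" can be sketched in a few lines. The unitig names and profiles below are made up for illustration, and `or_profile` is a hypothetical helper, not part of the CALDERA software.

```python
def or_profile(subgraph, profiles):
    """Strain i gets a 1 if any unitig of the subgraph is present in strain i."""
    n_strains = len(next(iter(profiles.values())))
    return [int(any(profiles[u][i] for u in subgraph))
            for i in range(n_strains)]


# Two variants of the same gene, each carried by half of the resistant strains.
profiles = {
    "variant_A": [1, 1, 0, 0, 0, 0],
    "variant_B": [0, 0, 1, 1, 0, 0],
}
phenotype = [1, 1, 1, 1, 0, 0]

gene_profile = or_profile(["variant_A", "variant_B"], profiles)
print(gene_profile)
```

Neither variant alone matches the phenotype, but the profile of the subgraph containing both does, which is where the gain in power comes from.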

The number of these subgraphs is exponential in the size of the DBG (a few million nodes).

We rely on the concept of testability to limit both the number of explored subgraphs and their impact on multiple testing correction.

The intuitive idea of testability is that the presence profile of many (larger) subgraphs will mostly contain 1s.

Such a profile cannot possibly be associated with any phenotype and can be discarded without counting towards correction.
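One way to make this precise is to compute the smallest p-value a profile could ever reach, given only how many 1s it contains. The sketch below assumes Fisher's exact test, which may differ from the statistic CALDERA uses; `min_pvalue` and `is_testable` are hypothetical names.

```python
from math import comb


def min_pvalue(x, n, n1):
    """Smallest two-sided Fisher p-value reachable by a profile with x ones,
    among n strains of which n1 are cases."""
    n2 = n - n1
    denom = comb(n, x)
    # The two most extreme tables: as many 1s as possible in the cases,
    # or as many as possible in the controls.
    a_max = min(x, n1)
    a_min = max(0, x - n2)
    p_hi = comb(n1, a_max) * comb(n2, x - a_max) / denom
    p_lo = comb(n1, a_min) * comb(n2, x - a_min) / denom
    return min(p_hi, p_lo)


def is_testable(x, n, n1, threshold):
    """A profile is testable if its best-case p-value can pass the threshold."""
    return min_pvalue(x, n, n1) <= threshold


n, n1 = 20, 10
for x in (10, 19, 20):
    print(x, min_pvalue(x, n, n1), is_testable(x, n, n1, threshold=0.05))
```

With 20 strains and 10 cases, a profile with 19 ones cannot do better than p = 0.5, so it can be dropped before testing, whereas a profile with 10 ones could reach about 5e-6.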

Our main technical contribution is an algorithm that efficiently enumerates subgraphs while allowing for testability-based pruning.
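To give a feel for why pruning helps, here is a deliberately naive sketch, not the CALDERA algorithm. It grows connected subgraphs one node at a time and stops growing as soon as the "or" profile has too many 1s, since adding nodes can only add 1s. The cap `max_ones` stands in for a testability criterion, and the graph, profiles and function name are assumptions made for the example.

```python
def enumerate_with_pruning(adj, profiles, max_ones):
    """Return the connected node sets whose "or" profile has at most max_ones ones."""
    n_strains = len(next(iter(profiles.values())))

    def ones(nodes):
        # Number of strains containing at least one unitig of the node set.
        return sum(any(profiles[u][i] for u in nodes) for i in range(n_strains))

    seen, kept = set(), []
    stack = [frozenset([u]) for u in sorted(adj)]
    while stack:
        sub = stack.pop()
        if sub in seen:
            continue
        seen.add(sub)
        if ones(sub) > max_ones:
            # Prune: every subgraph containing this one has at least as many 1s.
            continue
        kept.append(sub)
        # Grow by one neighbouring node so the set stays connected.
        frontier = {v for u in sub for v in adj[u]} - sub
        for v in sorted(frontier):
            stack.append(sub | {v})
    return kept


# Toy compacted DBG: a path a - b - c - d, with presence profiles over 6 strains.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
profiles = {
    "a": [1, 0, 0, 0, 0, 0],
    "b": [0, 1, 0, 0, 0, 0],
    "c": [0, 0, 1, 1, 1, 0],
    "d": [0, 0, 0, 0, 0, 1],
}
kept = enumerate_with_pruning(adj, profiles, max_ones=2)
print(sorted(sorted(s) for s in kept))
```

On this path graph, 4 of the 10 connected subgraphs survive, and none of the subgraphs that extend a pruned one is ever grown further.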

CALDERA handles a 20,000-node DBG in a few hours, while existing methods would take 2+ days.

It is also more powerful than unitig-level testing.

CALDERA doesn't scale to the millions of nodes of bacterial GWAS, but even a few stages of our exploration algorithm help recover long polymorphic sequences (here, a plasmid).

The preprint is here: biorxiv.org/content/10.110…

And we provide software here:
github.com/HectorRDB/Cald… with a visualization tool akin to DBGWAS, thanks to @leandro_ishi

All this was achieved by @hector_rdb during his PhD with @cendrinou @UCBStatistics and @fannyperraudeau @Pendulum_Co.

Congrats on this great work :)

We were greatly inspired by the successful use of testability by @kmborgwardt and colleagues to test genomic intervals (with @udo_g, @hito_maro and @random_roll:
academic.oup.com/bioinformatics…) or combinations of features (with @Laetitia_Ppx: papers.nips.cc/paper/2016/has…)

The same group also has a recent preprint with @Giulia_Muzio, @leslieobray, @Laetitia_Ppx and @JulianeKlatt on testing subgraphs, with a different paradigm (biorxiv.org/content/10.110…).

If you are interested in CALDERA, you should also check it out!
