Happy to introduce CALDERA, a microbial GWAS method by @hector_rdb extending DBGWAS to better detect polymorphic sequences.

TLDR: we do one test for each closed connected subgraph of the DBG, where DBGWAS did one per node. Each subgraph can capture several versions of a gene.
De Bruijn graphs (DBGs) connect overlapping k-mers.

Compaction groups simple paths into a single node representing a longer subsequence, termed a unitig.

Local variation (e.g., SNPs) creates bubbles in the graph.
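
To make this concrete, here is a minimal Python sketch of a de Bruijn graph, its compaction into unitigs, and the bubble created by a SNP. It is only an illustration, not how DBGWAS builds its graph: the function names (`build_dbg`, `compact`), the two toy sequences and k = 4 are all assumptions, only the forward strand is used (no reverse complements), and isolated cycles are not handled.

```python
def build_dbg(seqs, k):
    """Nodes are k-mers; an edge links two k-mers overlapping by k-1 bases."""
    nodes = {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}
    succ = {n: [n[1:] + b for b in "ACGT" if n[1:] + b in nodes] for n in nodes}
    pred = {n: [b + n[:-1] for b in "ACGT" if b + n[:-1] in nodes] for n in nodes}
    return nodes, succ, pred


def compact(nodes, succ, pred):
    """Merge maximal non-branching paths into unitigs (toy version)."""
    def starts_unitig(n):
        # A node continues a unitig only if it has a single predecessor
        # and that predecessor has a single successor.
        return not (len(pred[n]) == 1 and len(succ[pred[n][0]]) == 1)

    unitigs, seen = [], set()
    for n in sorted(nodes):
        if not starts_unitig(n):
            continue
        path, cur = n, n
        seen.add(n)
        # Extend while the path stays simple (one way out, one way in).
        while (len(succ[cur]) == 1
               and len(pred[succ[cur][0]]) == 1
               and succ[cur][0] not in seen):
            cur = succ[cur][0]
            seen.add(cur)
            path += cur[-1]
        unitigs.append(path)
    return unitigs


# Two hypothetical strains differing by a single SNP (T vs A).
strains = ["AACCGTTAGC", "AACCGATAGC"]
nodes, succ, pred = build_dbg(strains, k=4)
print(sorted(compact(nodes, succ, pred)))
```

On this toy input the shared flanks become one unitig each, and the SNP gives two parallel unitigs between them: that is the bubble.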
DBGWAS tests the association between a phenotype and the presence profile of unitigs. It then relies on a compacted de Bruijn graph built across all strains to visualize the significant unitigs.

A gene with a unique set of k-mers and the exact same sequence in all strains would be compacted to a unitig and lead to a single test in DBGWAS.

But in practice, several variants of a gene co-exist. Each unitig may only be present in a subset of strains that contain the gene.
DBGWAS helps visualize this situation, because significant unitigs originating from the same polymorphic gene or plasmid lead to quasi-linear connected subgraphs in the compacted DBG.

But testing at the unitig level is less powerful than testing at the gene/plasmid level.
CALDERA solves this issue by testing the association of the phenotype with the presence profile of any unitig in a connected subgraph (i.e., a logical "or" over the profiles of its unitigs).
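
A small Python sketch of the "or" idea, under loud assumptions: the 8 strains, the two unitig profiles and the phenotype are made up, and the one-sided Fisher exact test is only a simple stand-in for whatever association test is actually used. The helper names `or_profile` and `fisher_one_sided` are hypothetical.

```python
from math import comb


def or_profile(profiles):
    """Presence profile of a subgraph: 1 if any of its unitigs is present."""
    return [int(any(column)) for column in zip(*profiles)]


def fisher_one_sided(profile, phenotype):
    """P(overlap >= observed) under the hypergeometric null (enrichment)."""
    n = len(profile)
    x = sum(profile)        # strains carrying the pattern
    n1 = sum(phenotype)     # strains with the phenotype
    a = sum(p and y for p, y in zip(profile, phenotype))
    return sum(comb(n1, i) * comb(n - n1, x - i)
               for i in range(a, min(x, n1) + 1)) / comb(n, x)


# Made-up example: the gene exists in two variants, each seen in half of
# the phenotype-positive strains.
phenotype = [1, 1, 1, 1, 0, 0, 0, 0]
variant_a = [1, 1, 0, 0, 0, 0, 0, 0]
variant_b = [0, 0, 1, 1, 0, 0, 0, 0]

gene = or_profile([variant_a, variant_b])
print(fisher_one_sided(variant_a, phenotype))  # each variant alone
print(fisher_one_sided(gene, phenotype))       # "or" over both variants
```

In this toy case each variant alone gives a p-value of 6/28, while the "or" profile gives 1/70, which is the gain in power from testing at the gene level.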
The number of these subgraphs is exponential in the size of the DBG (a few million nodes).

We rely on the concept of testability to limit both the number of explored subgraphs and their impact on multiple testing correction.
The intuitive idea of testability is that the presence profile of many (larger) subgraphs will mostly contain 1s.

Such a profile cannot possibly be associated with any phenotype and can be discarded without counting towards correction.
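
Here is a hedged sketch of why such profiles are untestable, again using a one-sided Fisher exact test as an assumed stand-in for the real test. The smallest p-value a profile can ever reach depends only on how many 1s it contains, not on where they are. The names `min_attainable_pvalue` and `is_testable`, and the numbers (8 strains, 4 with the phenotype, level 0.05), are illustrative assumptions.

```python
from math import comb


def min_attainable_pvalue(x, n1, n):
    """Smallest one-sided Fisher p-value reachable by any profile with x ones,
    given n strains of which n1 have the phenotype."""
    a_max = min(x, n1)  # most extreme table: maximal overlap with the phenotype
    return comb(n1, a_max) * comb(n - n1, x - a_max) / comb(n, x)


def is_testable(x, n1, n, alpha):
    """A profile is testable if it could in principle reach the level alpha."""
    return min_attainable_pvalue(x, n1, n) <= alpha


n, n1, alpha = 8, 4, 0.05
for x in range(1, n + 1):
    print(x, min_attainable_pvalue(x, n1, n), is_testable(x, n1, n, alpha))
```

With these toy numbers, a profile with 7 ones out of 8 cannot go below 0.5, so it can be dropped without being counted in the correction. For this particular test, once x is at least n1, adding more 1s can only raise the minimal p-value.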
Our main technical contribution is an algorithm that efficiently enumerates subgraphs while allowing for testability-based pruning.
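
To give a feel for what pruning buys, here is a deliberately naive Python sketch. It is not CALDERA's algorithm: it ignores closedness, grows connected node sets one neighbour at a time, and uses a hypothetical cutoff `max_ones` on the number of 1s in the "or" profile as a crude proxy for testability. The function name `explore` and the toy graph are assumptions.

```python
def explore(adj, profiles, max_ones):
    """Enumerate connected node sets whose OR profile has at most max_ones 1s.

    adj: node -> set of neighbouring nodes (undirected graph)
    profiles: node -> 0/1 presence list over strains
    """
    n_strains = len(next(iter(profiles.values())))

    def ones(sub):
        return sum(any(profiles[v][i] for v in sub) for i in range(n_strains))

    seen, kept = set(), []
    frontier = [frozenset([v]) for v in adj]
    while frontier:
        next_frontier = []
        for sub in frontier:
            if sub in seen:
                continue
            seen.add(sub)
            if ones(sub) > max_ones:
                # Prune: growing the subgraph can only add 1s to the OR
                # profile, so no superset can come back under the cutoff.
                continue
            kept.append(sub)
            neighbours = {w for v in sub for w in adj[v]} - sub
            next_frontier.extend(sub | {w} for w in neighbours)
        frontier = next_frontier
    return kept


# Toy path graph a - b - c with made-up presence profiles over 4 strains.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
profiles = {"a": [1, 0, 0, 0], "b": [0, 1, 0, 0], "c": [0, 0, 1, 1]}
print(explore(adj, profiles, max_ones=2))
```

Here {b, c} is pruned, so {a, b, c} is never kept either. The sketch still stores every visited set, which is exactly what does not scale and why a smarter enumeration is needed.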

CALDERA handles a 20,000-node DBG in a few hours while existing methods would take 2+ days.

It is also more powerful than unitig-level testing.
CALDERA doesn't scale to the millions of nodes of bacterial GWAS, but even a few stages of our exploration algorithm help recover long polymorphic sequences (here, a plasmid).
The preprint is here: biorxiv.org/content/10.110…

And we provide software here:
github.com/HectorRDB/Cald… with a visualization tool akin to DBGWAS, thanks to @leandro_ishi
All this was achieved by @hector_rdb during his PhD with @cendrinou @UCBStatistics and @fannyperraudeau @Pendulum_Co.

Congrats on this great work :)
We were greatly inspired by the successful use of testability by @kmborgwardt and colleagues to test genomic intervals (with @udo_g, @hito_maro and @random_roll:
academic.oup.com/bioinformatics…) or combinations of features (with @Laetitia_Ppx: papers.nips.cc/paper/2016/has…)
The same group also has a recent preprint with @Giulia_Muzio, @leslieobray, @Laetitia_Ppx and @JulianeKlatt on testing subgraphs, with a different paradigm (biorxiv.org/content/10.110…).

If you are interested in CALDERA, you should also check it out!
