Happy to introduce CALDERA, a microbial GWAS method by @hector_rdb extending DBGWAS to better detect polymorphic sequences.
TLDR: we run one test for each closed connected subgraph of the DBG, where DBGWAS ran one per node. Each subgraph can capture several versions of a gene.
De Bruijn graphs (DBGs) connect overlapping k-mers.
Compaction merges each simple path into a single node representing a longer subsequence, termed a unitig.
Local variation (e.g., SNPs) creates bubbles in the graph.
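To make the bubble concrete, here is a minimal sketch in Python. It is not the DBGWAS or CALDERA implementation: the helper names (`kmers`, `build_dbg`) and the two toy sequences are made up for illustration, and reverse complements and compaction are ignored for brevity.

```python
from collections import defaultdict

def kmers(seq, k):
    """All k-mers of seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_dbg(sequences, k):
    """Toy node-centric DBG: nodes are k-mers, and an edge u -> v exists
    when the last k-1 bases of u equal the first k-1 bases of v.
    Assumption: forward strand only, no reverse complements."""
    nodes = set()
    for s in sequences:
        nodes.update(kmers(s, k))
    edges = defaultdict(set)
    for u in nodes:
        for base in "ACGT":
            v = u[1:] + base
            if v in nodes:
                edges[u].add(v)
    return nodes, edges

# Two hypothetical strains differing by a single SNP (G vs A at one position).
ref = "ACGTTGCAAGC"
alt = "ACGTTACAAGC"
nodes, edges = build_dbg([ref, alt], k=4)

# Count incoming edges to find where the two paths merge again.
in_degree = defaultdict(int)
for u in nodes:
    for v in edges[u]:
        in_degree[v] += 1

branching = sorted(u for u in nodes if len(edges[u]) > 1)  # bubble opens
merging = sorted(v for v in nodes if in_degree[v] > 1)     # bubble closes
print(branching, merging)
```

In this toy graph the two alleles share a prefix and a suffix, and each allele contributes its own k-mers in between, which is the bubble.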
DBGWAS tests the association between a phenotype and the presence profile of unitigs. It then relies on a compacted de Bruijn graph built across all strains to visualize the significant unitigs.
A gene with a unique set of k-mers and the exact same sequence in all strains would be compacted to a unitig and lead to a single test in DBGWAS.
But in practice, several variants of a gene co-exist. Each unitig may only be present in a subset of strains that contain the gene.
DBGWAS helps visualize this situation, because significant unitigs originating from the same polymorphic gene or plasmid lead to quasi-linear connected subgraphs in the compacted DBG.
But testing at the unitig level is less powerful than testing at the gene/plasmid level.
CALDERA solves this issue by testing the association of the phenotype with the presence profile of any unitig in a connected subgraph (i.e., a logical "or" over the profiles of its unitigs).
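The logical "or" can be sketched in a few lines of Python. The unitig names, profiles and phenotype below are invented toy data, and `subgraph_profile` is a hypothetical helper, not a function from CALDERA.

```python
def subgraph_profile(unitig_profiles, subgraph):
    """Presence profile of a subgraph: a strain gets 1 if it contains
    at least one unitig of the subgraph (logical OR over unitigs)."""
    n_strains = len(next(iter(unitig_profiles.values())))
    return [
        int(any(unitig_profiles[u][i] for u in subgraph))
        for i in range(n_strains)
    ]

# Toy data: three variants of one gene, each present in few strains.
unitig_profiles = {
    "variant_a": [1, 1, 0, 0, 0, 0],
    "variant_b": [0, 0, 1, 0, 0, 0],
    "variant_c": [0, 0, 0, 1, 0, 0],
}
phenotype = [1, 1, 1, 1, 0, 0]  # e.g. resistant = 1, susceptible = 0

gene_profile = subgraph_profile(
    unitig_profiles, ["variant_a", "variant_b", "variant_c"]
)
print(gene_profile)
```

Here no single variant matches the phenotype, but the subgraph-level profile does, which is the intuition behind the gain in power.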
The number of these subgraphs is exponential in the size of the DBG (a few million nodes).
We rely on the concept of testability to limit both the number of explored subgraphs and their impact on multiple testing correction.
The intuitive idea of testability is that the presence profile of many (larger) subgraphs will mostly contain 1s.
Such a profile cannot possibly be associated with any phenotype and can be discarded without counting towards correction.
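A small Python sketch of this idea, under a simplifying assumption: it uses a one-sided Fisher exact test as a stand-in, which is not necessarily the test CALDERA uses. For fixed margins, the smallest p-value a profile can reach is the probability of the most extreme 2x2 table, and `min_attainable_pvalue` is a hypothetical helper computing it.

```python
from math import comb

def min_attainable_pvalue(x, n1, n):
    """Smallest one-sided Fisher p-value reachable by a presence profile
    with x ones among n strains, n1 of which carry the phenotype.
    Assumption: Fisher's exact test, used here only for illustration."""
    def pmf(a):
        # Hypergeometric probability that a of the x ones fall among
        # the n1 strains carrying the phenotype.
        return comb(n1, a) * comb(n - n1, x - a) / comb(n, x)

    a_max = min(x, n1)            # most extreme positive association
    a_min = max(0, x - (n - n1))  # most extreme negative association
    return min(pmf(a_max), pmf(a_min))

n, n1 = 20, 10  # toy cohort: 20 strains, 10 with the phenotype
p_full = min_attainable_pvalue(19, n1, n)      # profile is almost all 1s
p_balanced = min_attainable_pvalue(10, n1, n)  # profile is half 1s
print(p_full, p_balanced)
```

With these toy numbers the nearly-all-1s profile cannot go below 0.5 whatever the phenotype, so it can be pruned without counting as a test, while the balanced profile can in principle reach a very small p-value.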
Our main technical contribution is an algorithm that efficiently enumerates subgraphs while allowing for testability-based pruning.
CALDERA handles a 20,000-node DBG in a few hours, while existing methods would take 2+ days.
It is also more powerful than unitig-level testing.
CALDERA doesn't scale to the millions of nodes of bacterial GWAS, but even a few stages of our exploration algorithm help recover long polymorphic sequences (here, a plasmid).
@metapredict The GEO repository did not specify which groups were used for the ROC curves. I offered to re-run my analysis if Dr Timmons sent me this information, but he didn't. When the editor asked him to provide code reproducing his results, he declined and threatened legal action.
@metapredict Most people I told about this said it was nonsense, as there was no way such an action could be won. Eventually there was indeed no action, but I don't think these threats are nonsense.
@metapredict The editor only asked for cosmetic changes to my correspondence, yet it took more than two years to be published, mostly waiting for Genome Biology's legal counsel and editorial integrity departments to give the green light.