Our goal was to decide on the library size needed to sequence a cohort of ~90 in-house Leukaemia RNA-Seq samples to allow good sensitivity for both DE analyses and mutation calling.
We used publicly available deeply sequenced RNA-Seq samples (from Leucegene) from the same cancer type and called mutations using 6 methods (combinations of 3 popular callers and filtering methods).
We called mutations on the initial and randomly downsampled data.
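As a rough illustration of what random downsampling of paired-end data involves, here is a minimal sketch. It is not the pipeline used in the study: the function names, the `(fragment_id, mate, sequence)` read representation and the fixed seed are all assumptions made for the example. The point it shows is that sampling is done per fragment, so both mates of a pair are kept or dropped together.

```python
import random


def downsample_fragments(fragment_ids, target, seed=0):
    """Pick `target` fragment IDs at random (hypothetical helper).

    If fewer than `target` fragments exist, all of them are returned.
    """
    ids = sorted(set(fragment_ids))
    if target >= len(ids):
        return set(ids)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return set(rng.sample(ids, target))


def filter_reads(reads, kept_ids):
    """Keep every read whose fragment ID was selected.

    `reads` is a list of (fragment_id, mate, sequence) tuples, an
    assumed toy representation rather than a real FASTQ/BAM record.
    """
    return [read for read in reads if read[0] in kept_ids]


# Toy library of 10 fragments, each with two mates.
toy_reads = [(f"frag{i}", mate, "ACGT") for i in range(10) for mate in (1, 2)]
kept = downsample_fragments((r[0] for r in toy_reads), target=4, seed=1)
subset = filter_reads(toy_reads, kept)
```

On real data the same idea is normally applied with an existing tool rather than in pure Python; the sketch only makes the "sample fragments, not individual reads" logic explicit.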
Previous work suggested that 20M PE reads is enough for accurate detection of DE genes, but our results suggest that between 30M and 40M 100 bp PE reads are needed to recover 90–95% of the initial variants on recurrently mutated myeloid genes.
Our results showed that while the choice of a caller didn’t have a large impact on the sensitivity of SNV calling, targeted approaches are required for INDELs.
Considering only the SNVs on recurrently mutated myeloid genes, we were able to replicate a similar result using 136 RNA-Seq samples from the TCGA-LAML cohort, namely a 6% average loss in sensitivity using 40M fragments instead of ~60M (max available in TCGA-LAML).
This study also provides a direct connection between the total library size and the on-site variant features (VAF and total depth). Across all strategies and in both datasets, the changes in sensitivities stabilise and remain above 75% when the total depth is larger than 20X.
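To make the sensitivity figures concrete, here is a small sketch of how recovery of the initial variants could be computed overall and split by total depth at the variant site. It is an illustration under assumptions, not the study's code: the function names, the `(chrom, pos, ref, alt)` variant keys, the toy depths and the choice to measure depth in the downsampled data are all hypothetical.

```python
def sensitivity(truth, called):
    """Fraction of initial (truth) variants that are also in `called`."""
    truth = set(truth)
    if not truth:
        return float("nan")
    return len(truth & set(called)) / len(truth)


def sensitivity_by_depth(truth_depth, called, threshold=20):
    """Split sensitivity by total depth at the variant site.

    `truth_depth` maps each initial variant to its total depth
    (assumed here to be measured in the downsampled sample).
    Variants with depth strictly above `threshold` go in the
    "above" group, all others in the "at_or_below" group.
    """
    above = {v for v, depth in truth_depth.items() if depth > threshold}
    at_or_below = set(truth_depth) - above
    return {
        "above": sensitivity(above, called),
        "at_or_below": sensitivity(at_or_below, called),
    }


# Toy example: four initial variants, three recovered after downsampling.
initial = {
    ("chr1", 100, "A", "T"): 5,
    ("chr2", 200, "G", "C"): 25,
    ("chr3", 300, "C", "T"): 30,
    ("chr4", 400, "T", "G"): 10,
}
recovered = {("chr2", 200, "G", "C"), ("chr3", 300, "C", "T"), ("chr4", 400, "T", "G")}

overall = sensitivity(initial, recovered)
by_depth = sensitivity_by_depth(initial, recovered, threshold=20)
```

In this toy case the overall sensitivity is 0.75, with both deeper sites recovered and one of the two shallow sites missed.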
This study is a starting point to inform cost-effective analyses of cancer transcriptomes. However, as cancers are extremely heterogeneous, a rigorous assessment of the cohort characteristics is necessary to determine the optimal library size for variant detection.