We are excited to release a GIAB benchmark for 273 challenging medically relevant genes, such as SMN1, with @sedlazeck, @infoecho, and many others, using a hifiasm diploid assembly from @lh3lh3. In the process, we identified and corrected false duplication errors in GRCh38. 1/n
We use the new benchmark to demonstrate a solution that improves short-read accuracy from 8% to 100% in important genes on GRCh38 (CBS, CRYAA, and KCNE1), working with @GenomeRef to mask 5 GRCh38 false duplications on chr21 and building on recent results from @chrisamiller et al. 2/n
With #T2T, we are developing a more comprehensive list of false duplications that cover >1 Mbp in GRCh38, so stay tuned for an improved masked GRCh38 soon. An initial masked GRCh38 is under ftp-trace.ncbi.nlm.nih.gov/ReferenceSampl…. The #T2T reference from @aphillippy and @khmiga fixes these as well!
Notes from our curation of >1000 variants in this benchmark are in Supplementary File 5, and there are a lot of nice examples of short- and long-read alignments to challenging genes in IGV screenshots in the main and supplementary figures. 4/n
We were also able to test performance with and without the hs37d5 decoy sequence for GRCh37, illuminating many places where it reduces false positives as intended, but also a few where it creates false negatives, in the genes CYP4F12 and LMF1.