Nanopore sequencing allows identification of base modifications, such as methylation, directly from raw current data. Prevailing approaches, including deep learning (DL) methods, require training data covering all possible sequence contexts. This data can be prohibitively expensive or impossible to obtain for some modifications. Hence, research into DNA modifications focuses on the most prevalent modification in human DNA: 5mC in a CpG context. Improved generalisation is required to reach the technology’s full potential: calling any modification from raw current data.
We developed ReQuant, an algorithm to impute full k-mer-based modification models from limited k-mer context examples. Our method is highly accurate for calling modifications (CpG/GpC methylation and CpG glucosylation) in Lambda phage R9 data when fitted on ≤25% of all possible 6-mers containing a modification, and it extends to human R10 data.
The success of our approach shows that DNA modifications have a consistent and therefore predictable effect on Nanopore current levels, suggesting that interpretable rule-based imputation in unseen contexts is possible. Our approach circumvents the complexity of modification-specific DL tools and enables modification calling when not all sequence contexts can be obtained, opening up a vast field of biological base modification research.
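To make the idea of imputing unseen contexts concrete, the sketch below fits a simple additive rule on a quarter of all sequence contexts and predicts the current shift for the rest. It is an illustration under stated assumptions, not the ReQuant algorithm: the additive position-wise model, the five-base context length, the synthetic current shifts, and all names (`one_hot`, `FLANK`, `true_weights`) are hypothetical.

```python
import itertools
import numpy as np

BASES = "ACGT"
FLANK = 5  # hypothetical: number of bases surrounding the modified base

def one_hot(context):
    """Encode a context as an intercept plus four indicators per position."""
    x = np.zeros(1 + 4 * len(context))
    x[0] = 1.0
    for pos, base in enumerate(context):
        x[1 + 4 * pos + BASES.index(base)] = 1.0
    return x

rng = np.random.default_rng(0)
contexts = ["".join(p) for p in itertools.product(BASES, repeat=FLANK)]
X = np.array([one_hot(c) for c in contexts])

# Synthetic ground truth (assumption): the modification shifts the current
# level by a sum of independent per-position, per-base contributions.
true_weights = rng.normal(0.0, 1.0, X.shape[1])
true_shift = X @ true_weights
observed_shift = true_shift + rng.normal(0.0, 0.1, len(contexts))

# Fit on 25% of the contexts only, mimicking limited training data.
n_train = len(contexts) // 4
train_idx = rng.choice(len(contexts), size=n_train, replace=False)
unseen = np.ones(len(contexts), dtype=bool)
unseen[train_idx] = False

# Least-squares fit of the additive rule on the observed contexts.
weights, *_ = np.linalg.lstsq(X[train_idx], observed_shift[train_idx], rcond=None)

# Impute the shift for every context, including those never observed.
imputed = X @ weights
max_error = np.abs(imputed[unseen] - true_shift[unseen]).max()
print(f"trained on {n_train} of {len(contexts)} contexts")
print(f"max imputation error on unseen contexts: {max_error:.3f}")
```

Because the synthetic shifts are additive by construction, the fit recovers them from a small subset of contexts. Real current data need not follow such a simple rule, so this only illustrates why a consistent modification effect makes imputation feasible.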