TY - JOUR
T1 - Training data diversity enhances the basecalling of novel RNA modification-induced nanopore sequencing readouts
AU - Wang, Ziyuan
AU - Liu, Ziyang
AU - Fang, Yinshan
AU - Zhang, Hao Helen
AU - Sun, Xiaoxiao
AU - Hao, Ning
AU - Que, Jianwen
AU - Ding, Hongxu
N1 - Publisher Copyright: © The Author(s) 2025.
PY - 2025/12
Y1 - 2025/12
AB - Accurately basecalling sequence backbones in the presence of nucleotide modifications remains a substantial challenge in nanopore sequencing bioinformatics. It has been extensively demonstrated that state-of-the-art basecallers are less compatible with modification-induced sequencing signals. A precise basecalling, on the other hand, serves as the prerequisite for virtually all the downstream analyses. Here, we report that basecallers exposed to diverse training modifications gain the generalizability to analyze novel modifications. With synthesized oligos as the model system, we precisely basecall various out-of-sample RNA modifications. From the representation learning perspective, we attribute this generalizability to basecaller representation space expanded by diverse training modifications. Taken together, we conclude increasing the training data diversity as a paradigm for building modification-tolerant nanopore sequencing basecallers.
UR - http://www.scopus.com/inward/record.url?scp=85215954988&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85215954988&partnerID=8YFLogxK
U2 - 10.1038/s41467-025-55974-z
DO - 10.1038/s41467-025-55974-z
M3 - Article
C2 - 39814719
AN - SCOPUS:85215954988
SN - 2041-1723
VL - 16
JO - Nature Communications
JF - Nature Communications
IS - 1
M1 - 679
ER -