Beyond DNA: ML-Empowered Nanopore Base-Calling of 12-Letter Genetic Alphabets

Sneha Mittal; Milan Jena; Biswarup Pathak

doi:10.26434/chemrxiv-2024-xv61j

Theoretical and Computational Chemistry

02 December 2024, Version 1

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The standard 4-letter genetic code (A, T, G, C) is the blueprint of life on Earth. Beyond this foundational framework lies the realm of artificial genetics, which has now expanded the genetic alphabet to as many as 12 letters (A, T, G, C, B, S, P, Z, X, K, J, V). Strikingly, while detection methods for genomics and transcriptomics have progressed to their “fourth generation” and have been successfully commercialized, detection methods for artificial genetics remain at a nascent stage, the “zeroth generation”. Herein, within a combined density functional theory (DFT) and machine learning (ML) framework, we report a next-generation solid-state nanopore sequencing approach to both assess and decode DNA with expanded alphabets. For assessment, we use ML regression tools that predict the transmission signatures of each natural nucleobase and xenonucleobase with low mean squared error, as validated against DFT. Parameterizing the SMILES (simplified molecular input line entry system) strings of the expanded alphabets, including isomers, allows the structural, molecular, and bonding configurations of the nucleobases to be incorporated into the predictions. Further, custom ML classification tools are developed, and each of the standard, isoG/isoC, hachimoji, and supernumerary codes is decoded with SHAP (SHapley Additive exPlanations) explainability. By introducing ML-accelerated nanopore sequencing of supernumerary DNA, we pave the way for rapid analysis of expanded alphabets, offering insights into life’s possibilities across the cosmos.
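As a rough illustration of what turning SMILES strings into numerical inputs can look like, the sketch below maps the four natural nucleobases to a small feature dictionary. It is not the authors' descriptor set: the `smiles_features` helper and its feature names are hypothetical, and the character-counting shortcut assumes Kekulé-form SMILES with only single-letter elements (C, N, O), which holds for the four strings used here.

```python
from collections import Counter

# Kekulé-form SMILES for the four natural nucleobases (free bases, no sugar).
NATURAL_BASES = {
    "A": "C1=NC2=NC=NC(=C2N1)N",
    "G": "C1=NC2=C(N1)C(=O)NC(=N2)N",
    "C": "C1=C(NC(=O)N=C1)N",
    "T": "CC1=CNC(=O)NC1=O",
}


def smiles_features(smiles: str) -> dict:
    """Hypothetical toy descriptor: heavy-atom counts, double bonds, rings.

    Assumes single-letter elements and single-digit ring-closure labels,
    so plain character counting is sufficient.
    """
    counts = Counter(smiles)
    # Each ring is opened and closed by the same digit, so digits come in pairs.
    ring_digits = sum(counts[d] for d in "123456789")
    return {
        "n_C": counts["C"],
        "n_N": counts["N"],
        "n_O": counts["O"],
        "n_double": counts["="],
        "n_rings": ring_digits // 2,
    }


features = {base: smiles_features(s) for base, s in NATURAL_BASES.items()}
for base, feats in features.items():
    print(base, feats)
```

Counts of this kind are identical for structural isomers (for example, G and isoG share the same molecular formula), so a descriptor intended to cover the isomers mentioned in the abstract would need to encode connectivity as well.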

Supplementary materials

Title

Beyond DNA: ML-Empowered Nanopore Base-Calling of 12-Letter Genetic Alphabets

Description

Dynamic Configurations and Relative Energy Value; Tuned Hyperparameters for ML Regression Algorithms; Tuned Hyperparameters for ML Classification Algorithms; Learning Curve and Population Stability Index (PSI)

License

The content is available under CC BY NC ND 4.0

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) declare that they have sought and gained approval from the relevant ethics committee/IRB for this research and its publication.
