Theoretical and Computational Chemistry

Reconstruction of lossless molecular representations



SMILES is the dominant molecular representation in AI-based chemical applications, but its internal structure is also responsible for certain issues. Here, we explore the idea that structural fingerprints can serve as efficient alternatives to unique molecular representations. To this end, we assessed how efficiently fingerprints can be converted back to molecules. Using a neural machine translation (NMT) approach, we successfully reconstructed molecules from their fingerprints with a high level of accuracy. Our approach therefore establishes structural fingerprints as strong representational tools in chemical natural language processing (NLP) applications by restoring the connectivity information that is lost during the fingerprint transformation. This comprehensive study addresses the major limitation of structural fingerprints that precludes their use in NLP models. Our findings should enhance the efficiency of models in generative and translational fields.
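The abstract does not say how a fingerprint is presented to the translation model. As a minimal sketch, assuming the fingerprint is a fixed-length bit vector and that the source sequence is the list of its on-bit indices, the framing as a translation task could look like the following. The function names, the 8-bit toy fingerprint, and the character-level SMILES tokenizer are all hypothetical and chosen for illustration only; the toy bit vector is not a real fingerprint of any molecule.

```python
from typing import List, Sequence


def fingerprint_to_tokens(bits: Sequence[int]) -> List[str]:
    """Encode a binary fingerprint as tokens naming its on-bit positions.

    Assumption: the model's source "sentence" is the ordered list of
    indices whose bit is 1. The abstract does not confirm this encoding.
    """
    if any(b not in (0, 1) for b in bits):
        raise ValueError("fingerprint must contain only 0 and 1")
    return [str(i) for i, b in enumerate(bits) if b == 1]


def tokens_to_fingerprint(tokens: Sequence[str], n_bits: int) -> List[int]:
    """Invert fingerprint_to_tokens for a fingerprint of known length."""
    bits = [0] * n_bits
    for token in tokens:
        index = int(token)
        if not 0 <= index < n_bits:
            raise ValueError(f"bit index {index} out of range")
        bits[index] = 1
    return bits


def smiles_to_tokens(smiles: str) -> List[str]:
    """Naive character-level tokenizer for the target sequence.

    Illustrative only: it would split multi-character atoms such as
    "Cl" or "Br", so a real tokenizer needs more care.
    """
    return list(smiles)


# Hypothetical training pair: a made-up 8-bit fingerprint as the source
# and the SMILES string of ethanol as the target.
toy_fingerprint = [0, 1, 0, 0, 1, 1, 0, 0]
source_tokens = fingerprint_to_tokens(toy_fingerprint)
target_tokens = smiles_to_tokens("CCO")
```

The token encoding itself is reversible, as `tokens_to_fingerprint` shows. The information loss described in the abstract occurs earlier, when the molecule is reduced to a bit vector, and recovering the connectivity from that vector is the task assigned to the translation model.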



Supplementary material

Supplementary Information
Supporting Figures and Tables.