Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition

Sabrina Jaeger; Simone Fulle; Samo Turk

doi:10.26434/chemrxiv.5513581.v1

Theoretical and Computational Chemistry

23 October 2017, Version 1

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Inspired by natural language processing techniques, we introduce Mol2vec, an unsupervised machine learning approach for learning vector representations of molecular substructures. Just as Word2vec models place the vectors of closely related words close together in vector space, Mol2vec learns substructure vectors that point in similar directions for chemically related substructures. A compound can then be encoded as a vector by summing the vectors of its individual substructures and, for instance, fed into supervised machine learning approaches to predict compound properties. The underlying substructure embeddings are obtained by training an unsupervised machine learning model on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations, and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. Its prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as the reference compound representation. Mol2vec can easily be combined with ProtVec, which applies the same Word2vec concept to protein sequences, resulting in a proteochemometric approach that is alignment independent and can therefore also be used for proteins with low sequence similarity.
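The encoding step described in the abstract, summing substructure vectors into a single compound vector, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the substructure identifiers, the three-dimensional toy vectors, and the helper names `compound_vector` and `cosine_similarity` are hypothetical and are not taken from the Mol2vec code or a trained model. Skipping unknown identifiers is also a simplification of this sketch.

```python
import math

# Toy 3-dimensional embeddings keyed by hypothetical substructure identifiers.
# A real model would supply learned vectors; these values are made up.
SUBSTRUCTURE_VECTORS = {
    "id_aromatic_c": [0.9, 0.1, 0.0],
    "id_hydroxyl_o": [0.1, 0.8, 0.2],
    "id_amine_n": [0.2, 0.7, 0.3],
    "id_methyl_c": [0.7, 0.0, 0.1],
}


def compound_vector(substructure_ids, table=SUBSTRUCTURE_VECTORS):
    """Sum the vectors of all substructures found in a compound."""
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    for sid in substructure_ids:
        vec = table.get(sid)
        if vec is None:
            # Assumption of this sketch: identifiers without a vector are skipped.
            continue
        total = [t + v for t, v in zip(total, vec)]
    return total


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Three toy "compounds", each given as a list of substructure identifiers.
phenol_like = compound_vector(["id_aromatic_c"] * 6 + ["id_hydroxyl_o"])
aniline_like = compound_vector(["id_aromatic_c"] * 6 + ["id_amine_n"])
methanol_like = compound_vector(["id_methyl_c", "id_hydroxyl_o"])

print(cosine_similarity(phenol_like, aniline_like))
print(cosine_similarity(phenol_like, methanol_like))
```

With these toy values, the two aromatic compounds end up pointing in more similar directions than the aromatic and the aliphatic one. The resulting dense vectors are what would be passed to a supervised model as compound features.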

Keywords

Machine learning

Artificial neural networks

High dimensional embeddings

Feature engineering

Chemistry

Supplementary materials

Jaeger et al 2017 - SI

Now Published

Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition

Sabrina Jaeger, Simone Fulle, Samo Turk

Journal article: Journal of Chemical Information and Modeling, Volume 58, Issue 1

Online publication date: Jan 10, 2018

Version History

Oct 23, 2017 Version 1

Metrics

Views: 8,264

Downloads: 5,477

License

The content is available under CC BY NC ND 4.0

Author’s competing interest statement

No conflict of interest
