ChemRxiv
These are preliminary reports that have not been peer-reviewed. They should not be regarded as conclusive, guide clinical practice/health-related behavior, or be reported in news media as established information. For more information, please see our FAQs.
g6-LGI-GEN_v01b_corrected.pdf (1.39 MB)
0/0

Deep Generative Model for Sparse Graphs using Text-Based Learning with Augmentation in Generative Examination Networks

preprint
revised on 07.10.2019 and posted on 11.10.2019 by Ruud van Deursen, Guillaume Godin
Graphs and networks are a key research tool for a variety of science fields, most notably chemistry, biology, engineering and social sciences. Modeling and generation of graphs with efficient sampling is a key challenge for graphs. In particular, the non-uniqueness, high dimensionality of the vertices and local dependencies of the edges may render the task challenging. We apply our recently introduced method, Generative Examination Networks (GENs) to create the first text-based generative graph models using one-line text formats as graph representation. In our GEN, a RNN-generative model for a one-line text format learns autonomously to predict the next available character. The training is stopped by an examination mechanism checking validating the percentage of valid graphs generated. We achieved moderate to high validity using dense g6 strings (random 67.8 +/- 0.6, canonical 99.1 +/- 0.2). Based on these results we have adapted the widely used SMILES representation for molecules to a new input format, which we call linear graph input (LGI). Apart from the benefits of a short compressible text-format, a major advantage include the possibility to randomize and augment the format. The generative models are evaluated for overall performance and for reconstruction of the property space. The results show that LGI strings are very well suited for machine-learning and that augmentation is essential for the performance of the model in terms of validity, uniqueness and novelty. Lastly, the format can address smaller and larger dataset of graphs and the format can be easily adapted to define another meaning of the characters used in the LGI-string and can address sparse graph problems in used in other fields of science.

History

Email Address of Submitting Author

ruud.van.deursen@firmenich.com

Institution

Firmenich SA

Country

Switzerland

ORCID For Submitting Author

0000-0002-8124-7325

Declaration of Conflict of Interest

No conflict of interest

Version Notes

Version 0.2, corrected typos in the authors' email address.

Exports