These are preliminary reports that have not been peer-reviewed. They should not be regarded as conclusive, guide clinical practice/health-related behavior, or be reported in news media as established information. For more information, please see our FAQs.
g6-LGI-GEN_v01b_corrected.pdf (1.39 MB)
Deep Generative Model for Sparse Graphs using Text-Based Learning with Augmentation in Generative Examination Networks
Preprints are manuscripts made publicly available before they have been submitted for formal peer review and publication. They might contain new research findings or data. Preprints can be a draft or final version of an author's research but must not have been accepted for publication at the time of submission.
Graphs and networks are a key research tool for a variety of science fields, most notably chemistry, biology, engineering and social sciences. Modeling and generation of graphs with efficient sampling is a key challenge for graphs. In particular, the non-uniqueness, high dimensionality of the vertices and local dependencies of the edges may render the task challenging. We apply our recently introduced method, Generative Examination Networks (GENs) to create the first text-based generative graph models using one-line text formats as graph representation. In our GEN, a RNN-generative model for a one-line text format learns autonomously to predict the next available character. The training is stopped by an examination mechanism checking validating the percentage of valid graphs generated. We achieved moderate to high validity using dense g6 strings (random 67.8 +/- 0.6, canonical 99.1 +/- 0.2). Based on these results we have adapted the widely used SMILES representation for molecules to a new input format, which we call linear graph input (LGI). Apart from the benefits of a short compressible text-format, a major advantage include the possibility to randomize and augment the format. The generative models are evaluated for overall performance and for reconstruction of the property space. The results show that LGI strings are very well suited for machine-learning and that augmentation is essential for the performance of the model in terms of validity, uniqueness and novelty. Lastly, the format can address smaller and larger dataset of graphs and the format can be easily adapted to define another meaning of the characters used in the LGI-string and can address sparse graph problems in used in other fields of science.