Capturing the autonomous self-assembly of molecular building blocks in computer simulations is a persistent challenge because it requires modeling complex interactions and accessing long time scales. Advanced sampling methods can bridge these time scales but typically require accurate low-dimensional representations of the transition pathways. In this work, we demonstrate, for the self-assembly of two single-stranded DNA fragments into a ring-like structure, how autoencoder architectures based on unsupervised neural networks can be employed to reliably expose transition pathways and to provide a suitable low-dimensional representation. The assembly occurs as a two-step process through two distinct half-bound states, which are correctly identified by the neural network. We exploit this latent-space representation to construct a Markov state model that predicts the four molecular conformations and the transition rates between them. Our work opens up new avenues for the computational modeling of multi-step and hierarchical self-assembly, which has proven challenging so far.
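
As a rough illustration of the final step, the sketch below estimates a Markov state model from a trajectory that has already been discretized into four states. It is a minimal sketch under stated assumptions, not the procedure used in this work: the function names, the lag time of one frame, and the toy trajectory are all hypothetical, and the reading of the four states as unbound, two half-bound, and fully bound is an assumption made only to label the example. In practice the state labels would come from clustering the autoencoder's latent coordinates.

```python
import numpy as np


def estimate_transition_matrix(dtraj, n_states, lag=1):
    """Count transitions at the given lag and row-normalize the counts."""
    counts = np.zeros((n_states, n_states))
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        counts[i, j] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows of states that never occur as a starting point stay all-zero.
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)


def stationary_distribution(T):
    """Left eigenvector of T for the largest eigenvalue, normalized to sum to 1."""
    evals, evecs = np.linalg.eig(T.T)
    idx = np.argmax(evals.real)
    pi = np.abs(evecs[:, idx].real)
    return pi / pi.sum()


# Hypothetical toy trajectory of state labels (illustration only, not data
# from this work). Assumed labels: 0 = unbound, 1 and 2 = half-bound,
# 3 = fully bound ring.
dtraj = np.array([0, 0, 1, 1, 3, 3, 1, 0, 0, 2, 2, 3, 3, 2, 0, 0, 1, 3, 3, 3])

T = estimate_transition_matrix(dtraj, n_states=4, lag=1)
pi = stationary_distribution(T)

print("Transition matrix:")
print(np.round(T, 3))
print("Stationary distribution:", np.round(pi, 3))
```

The off-diagonal entries of `T` give per-lag transition probabilities between states, from which rates can be derived by dividing by the lag time when that lag is short compared with the state lifetimes. A simple count-based estimator like this one does not enforce detailed balance, which a production analysis would normally address.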