Abstract
Mass spectrometry is a widely used technique for identifying molecules in a variety of applications, including organic synthesis and metabolomics. Recently, a deep neural network model for mass spectrometry has been developed and numerically assessed. However, we confirmed that the model performs poorly for specific targets such as highly fluorinated compounds. To address this, this study introduces a simple dataset undersampling scheme based on molecular similarity. The model trained on the undersampled dataset showed improved predictive performance for fluorinated compounds while largely maintaining its performance for non-fluorinated compounds. This improvement is probably attributable to a reduction in bit collisions of extended-connectivity fingerprints (ECFPs). The undersampling approach is general and applicable to any specific target.
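As a rough illustration of similarity-based undersampling, the sketch below keeps only those training molecules whose fingerprint is sufficiently similar to at least one target molecule. It is a minimal sketch under stated assumptions, not the procedure used in this study: the Tanimoto coefficient as the similarity measure, the threshold value, the direction of the filter, and the names `tanimoto` and `undersample` are all hypothetical, and the toy fingerprints are hand-written sets of on-bit indices rather than real ECFPs.

```python
from typing import Dict, FrozenSet, Iterable, List

# A fingerprint is represented by the indices of its "on" bits (assumption).
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def undersample(
    dataset: Dict[str, Fingerprint],
    targets: Iterable[Fingerprint],
    threshold: float = 0.3,  # hypothetical cutoff, not taken from the study
) -> List[str]:
    """Return names of molecules whose similarity to any target reaches the threshold."""
    target_list = list(targets)
    return [
        name
        for name, fp in dataset.items()
        if any(tanimoto(fp, t) >= threshold for t in target_list)
    ]


# Toy data: hypothetical on-bit sets, not real ECFPs.
toy_dataset = {
    "mol_a": frozenset({1, 2, 3, 4}),
    "mol_b": frozenset({1, 2, 9}),
    "mol_c": frozenset({20, 21, 22}),
}
toy_targets = [frozenset({1, 2, 3, 5})]

kept = undersample(toy_dataset, toy_targets, threshold=0.3)
print(kept)
```

In this toy case, `mol_c` shares no bits with the target and is dropped, while the other two molecules are retained. With fewer dissimilar molecules in the training set, fewer distinct substructures compete for the same hashed fingerprint bits, which is consistent with the bit-collision explanation suggested above.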