Abstract
Approaches for predicting PROTAC cell permeability are of major interest because they can reduce the resource-demanding synthesis and testing of poorly permeable PROTACs. We report a comprehensive investigation of the scope and limitations of machine learning-based binary classification models developed using simple 2D descriptors for large and structurally diverse sets of CRBN and VHL PROTACs. After construction and internal validation, the models were used to predict the permeability of blinded sets of PROTACs. For the VHL PROTAC set, k-nearest neighbor and random forest models predicted permeability with >80% accuracy (κ >0.57). Models retrained on the combined original training set and blinded set performed equally well for a second blinded VHL set. However, models for CRBN PROTACs were less successful, mainly because of the highly imbalanced nature of the CRBN datasets. We conclude that properly trained machine learning models can be integrated as effective filters in the PROTAC design process.