Machine learning model for predicting molecular activity using Molecular Descriptors and ElectroShape Descriptors - FL051
- Authors: Alessia Bono; Aras Asad; Marco Albanese; Richard Ian Cooper; Antonino Lauria; Gabriele La Monica; Federica Alamia; Annamaria Martorana; Paul William Finn
- Publication year: 2024
- Type: Conference proceedings contribution published in a volume
- OA Link: http://hdl.handle.net/10447/668603
Abstract
Machine Learning (ML) algorithms are revolutionizing pharmaceutical and biomedical research by analyzing large datasets to make predictions across various fields. A key challenge is learning useful molecular representations. Here, we present the development of an ML model designed to discriminate between active and inactive compounds towards biological targets. Our approach used Molecular Descriptors (MD), computed via MOLDESTO [1], and 4D ElectroShape Descriptors (4DES) [2] to represent small molecules in the Maximum Unbiased Validation (MUV) dataset. MDs provide features capturing various physicochemical properties of small molecules. 4DES, by contrast, are advanced descriptors that incorporate three-dimensional spatial and electronic properties, offering a richer representation for capturing subtle molecular differences. The MUV dataset, which includes assay data for 17 targets, minimizes analogue bias and artificial enrichment. We developed a Support Vector Machine (SVM) model and applied 5-fold cross-validation to assess its performance using metrics such as the Area Under the Curve (AUC) and the 1% Enrichment Factor (EF1%). Initial results showed that models using MD alone achieved an AUC of 0.759 ± 0.099, while 4DES alone achieved 0.635 ± 0.113. Combining MD and 4DES slightly improved performance, reaching an AUC of 0.763 ± 0.099. Further refinement through SHapley Additive exPlanations (SHAP) for feature selection improved the AUC values to 0.810 ± 0.082 for MD alone and 0.811 ± 0.084 for the combined descriptors. These findings underscore the importance of feature selection in enhancing model performance. Additionally, we performed a 3-fold cross-validation by splitting the MUV dataset into three groups based on target numerical codes and FASTA sequence similarity. This demonstrated the model’s robustness in discriminating between actives and inactives across different test scenarios.
Finally, we evaluated the model on the LIT-PCBA dataset, achieving promising performance. These findings highlight the effectiveness of our approach and its potential for advancing virtual screening in drug discovery.
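The two evaluation metrics named in the abstract, AUC and the 1% Enrichment Factor, can be illustrated with a minimal plain-Python sketch. This is not the authors' implementation: the function names `enrichment_factor` and `roc_auc` are hypothetical, and rounding the size of the top 1% up to a whole compound is an assumption made here for illustration.

```python
import math


def enrichment_factor(scores, labels, fraction=0.01):
    """Enrichment factor in the top `fraction` of the score-ranked list.

    EF = (actives in top / size of top) / (all actives / all compounds).
    Assumption: the top set size is rounded up to a whole compound.
    """
    n = len(scores)
    n_actives = sum(labels)
    if n == 0 or n_actives == 0:
        raise ValueError("need at least one compound and one active")
    # Small epsilon guards against float noise such as 2.0000000000000004
    n_top = max(1, math.ceil(fraction * n - 1e-9))
    # Rank compounds from highest to lowest predicted score
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    return (hits_top / n_top) / (n_actives / n)


def roc_auc(scores, labels):
    """AUC as the probability that a random active outscores a random inactive.

    Ties between an active and an inactive count as half a win.
    """
    actives = [s for s, y in zip(scores, labels) if y == 1]
    inactives = [s for s, y in zip(scores, labels) if y == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")
    wins = sum(
        1.0 if a > i else 0.5 if a == i else 0.0
        for a in actives
        for i in inactives
    )
    return wins / (len(actives) * len(inactives))
```

Both functions take a list of model scores (higher meaning more likely active) and a parallel list of 0/1 activity labels. Under random ranking the enrichment factor is expected to be about 1 and the AUC about 0.5, which gives a baseline against which the values reported above can be read.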