Andre P. Frade a, Patrick McCabe b and Richard I. Cooper *a
a Chemical Crystallography Laboratory, Department of Chemistry, University of Oxford, UK. E-mail: richard.cooper@chem.ox.ac.uk; Tel: +44 (0)1865 285000
b Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge, UK
First published on 12th March 2020
The performance of a model depends on the quality and information content of the data used to build it. By applying machine learning approaches to a standard chemical dataset, we developed a 4-class classification algorithm that predicts the hydrogen bond network dimensionality that a molecule would adopt in its crystal form with an accuracy of 59% (compared with a 25% random threshold), exclusively from two and lower dimensional molecular descriptors. Although better than random, the performance achieved by the model did not meet the standards for reliable application. The practical value of the model was improved by wrapping it in a confidence tool that increases model robustness, quantifies prediction trust, and allows one to operate a classifier at virtually any accuracy level. Using this tool, the accuracy of the model could be raised to 73% or 89%, with the compromise that only 34% or 8% of the test examples, respectively, could be predicted. We anticipate that the ability to adjust the performance of reliable 2D-based models to the requirements of their different applications may increase their practical value, making them suitable for tasks that range from initial virtual library filtering to profile-specific compound identification.
Most property prediction models are produced from feature vector representations.6 These consist of arrays of numbers representing chemical structure descriptors, such as molecular weight or number of hydrogen bond donors, which together build a molecule's profile. Different types of descriptors are available. Two and lower dimensional descriptors are those that can be rapidly derived from molecular formulas and diagrams at low computational cost. Despite their deterministic, unambiguous computation, these are often limited in their information content, usually lacking any information on the 3D spatial arrangement of the atoms.6 Given that molecules can exist in multiple conformations, these descriptors may be insufficient to fully describe a given property,7 especially one to which conformational flexibility is highly relevant.8,9 On the other hand, three and higher dimensional descriptors are able to capture the three-dimensional conformation of molecules and their interaction with the environment.6,10 Despite their high information content, these descriptors rely on the atomic coordinates of compounds, whose prediction is computationally expensive and cannot be guaranteed to correspond to the relevant conformation.1,3,11,12 This can considerably increase the runtime of the algorithm without adding any useful contribution,7 or even decrease model performance.13 The deterministic character and potential information content of descriptors are key factors to consider during descriptor selection, as they have implications for property description, and also for model performance, robustness and stability.9,14
The accuracies of models exclusively built from two and lower dimension descriptors tend to be lower, yet we believe that those performing reasonably better than random have an underestimated potential that is often left unexplored. Thus, we suggest the development of strategies that allow the exploitation of the prediction mechanism to provide valuable guidance on how to improve the model performance and its practical value.15,16
Hydrogen bond network dimensionality (HBND) describes how hydrogen-bond intermolecular interactions extend in a three-dimensional structure. The network expansion is guided by the set of available hydrogen bonding groups in a molecule and their allowed interactions.17 The resulting dimensionality is thought to be a major cause of anisotropic interactions in crystal structures due to its directional nature.18,19 Although its impact is not well characterised, dimensionalities often act as valuable complementary information to the study of properties that are directly influenced by slip plane arrangements in crystals, such as crystal stability, mechanical behaviour and tabletability performance.17,20,21
Bryant et al.17 recently described an automated method to assign the dimensionality of a hydrogen bond network from solved crystal structures, and have demonstrated its effectiveness on multiple drug systems by comparison with tabletability data. However, the reliance on solved crystal structures as input limits the large scale implementation of the tool. Crystals of compounds of interest are rarely available, obtaining them is resource and time consuming, and crystal structure predictions are computationally expensive and still not reliable enough for such application.3,11
Machine learning predictive models have been widely adopted as an alternative to experimental property determination, and 2D-based quantitative structure–property relationship (QSPR) models are particularly useful in the HBND context. The hydrogen bond network dimensionality problem can be formulated as a four-class classification task, and the four possible network dimensionality outcomes are schematically represented in Fig. 1. In this work we present the possibility of predicting hydrogen bond network dimensionality for any region of chemical space, so that the screening of large virtual libraries becomes feasible and reliable. We further develop and test a confidence measure that adds robustness to classification algorithms and quantifies the trust of each output prediction. The tool also enables one to adjust the compromise between accuracy and the fraction of examples that receive a prediction, to best suit the requirements of the context in which the model is used. This approach may enable additional 2D-based model applications, such as robust single molecule property prediction or production of structure–property relationship insights.
As expected, the random performance threshold was found to correspond to 25% accuracy. All models performed considerably better than random, providing evidence that the data is indeed informative of the property (ESI‡ B).
All optimized models achieved similar results (ESI‡ B). The multiclass implementation of the SVM RBF (radial basis function kernel) slightly outperformed the others, attaining a total accuracy of 59% on the test set. The corresponding confusion matrix can be seen in Fig. 3, left. The classifier is able to detect each class with an accuracy considerably higher than random (random accuracy value of 0.25). The model was further tested on the 51080 examples discarded during class size balancing, where the accuracy per class remained effectively unchanged, demonstrating the generalization capability of the model. Ultimately, these findings suggest that hydrogen bond network dimensionality can be approximately estimated from two dimensional molecular descriptors. We also notice that misclassified examples tend to be assigned to adjacent classes, suggesting that the definition of network dimensionalities is a continuum, and so there isn't a well-defined boundary between adjacent classes.
Fig. 3 Confusion matrix and learning curve results for SVM RBF models trained on two dimensional descriptors. The standard deviations of the learning curve values are indicated by the shaded areas.
The learning curve (Fig. 3, right) shows that the model tends to generalise well to unseen examples, suggesting that no overfitting has occurred during the training stage. The lack of convergence between training and cross validation score lines shows high variance and suggests that the model performance could be improved. Learning curves built from accuracy scores also provide upper bounds for how good a model can get using the set of descriptors considered. The upper bound corresponds to the accuracy at which both line scores would theoretically converge, which we estimate from the learning curve to be between 60% and 65%. As expected, these findings confirm the limitations of predicting three dimensional properties like hydrogen bond network dimensionality exclusively from two and lower dimensional molecular descriptors. Such datasets may be incomplete in scenarios where, for example, a single compound defined by a unique set of two and lower dimensional features may have the ability of adopting different packing arrangements (polymorphs) which may lead to different network dimensionalities in their crystal form.17
First we investigated the benefits of considering confidence thresholds. To determine how many predictions had only a small probability difference between their two most probable classes, the test set was first evaluated by the model with no confidence threshold and then subjected to a 5% confidence threshold. We found that whilst all examples are predicted when no confidence threshold is applied, only 90% of the test set could be confidently predicted when the 5% confidence threshold was applied. This means that 262 examples were being assigned to a given class with a difference of less than 5% between the top two probability estimates. Running models with very low confidence thresholds suggested that some of the correct answers that the model outputs when no threshold is implemented are in fact lucky guesses. This sensitivity implies that the performance of models with no minimum confidence restriction can rapidly decrease when faced with noisier datasets. Thus, we conclude that confidence threshold implementation is an efficient way to improve the robustness and reliability of a model.
We tested the effect of increasing confidence thresholds on the fraction of test examples that a model can predict with confidence and on the corresponding accuracy. The model was used to predict HBND for the complete test set under different confidence thresholds. The results are shown in Fig. 4. For each confidence threshold used, there is a pair of red and blue dots representing the percentage of test examples that were predicted with confidence and the corresponding prediction accuracy. Generally, as the confidence threshold increases, the accuracy of confident estimations increases considerably. Conversely, the percentage of test examples that the model is able to predict with confidence drops rapidly. For example, whilst the absence of a confidence restriction allowed the model to predict the complete test set with an accuracy of 59%, a 30% threshold enabled the model to output predictions for 34% of the test set with an accuracy of 73%, and a 60% threshold enabled predictions for 8% of the test set with an accuracy of 89%.
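The trade-off between accuracy and the fraction of examples predicted can be sketched in a few lines of Python. This is a minimal illustration and not the authors' released code: it assumes that confidence is measured as the difference between the two largest class probabilities, as described for the 5% threshold, and the function names `confidence_margin` and `coverage_and_accuracy`, as well as the toy probabilities, are hypothetical.

```python
def confidence_margin(probs):
    """Difference between the two largest class probabilities."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def coverage_and_accuracy(prob_rows, true_labels, threshold):
    """Return (fraction predicted, accuracy on those) for a margin threshold.

    Examples whose margin is below `threshold` are left unpredicted.
    Accuracy is None when no example passes the threshold.
    """
    kept = correct = 0
    for probs, label in zip(prob_rows, true_labels):
        if confidence_margin(probs) < threshold:
            continue  # abstain: the top two classes are too close
        kept += 1
        predicted = max(range(len(probs)), key=probs.__getitem__)
        correct += predicted == label
    coverage = kept / len(true_labels)
    accuracy = correct / kept if kept else None
    return coverage, accuracy


# Toy 4-class probabilities (made up for illustration), classes 0 to 3.
toy_probs = [
    [0.80, 0.10, 0.05, 0.05],  # margin about 0.70
    [0.40, 0.38, 0.12, 0.10],  # margin about 0.02
    [0.10, 0.55, 0.25, 0.10],  # margin about 0.30
    [0.30, 0.30, 0.20, 0.20],  # margin 0.00
]
toy_labels = [0, 1, 1, 2]

for t in (0.0, 0.05, 0.5):
    print(t, coverage_and_accuracy(toy_probs, toy_labels, t))
```

On this toy data, raising the threshold from 0 to 0.05 discards the two near-tied examples, both of which were misclassified, so coverage falls while accuracy on the remaining predictions rises.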
Fig. 4 Effect of confidence thresholds on the percentage of the test set predicted with confidence and the corresponding prediction accuracy.
When the number of confident predictions gets too small, meaningful statistics about the general performance of the model cannot be derived. As shown in Fig. 4, the prediction accuracies for confidence thresholds above 60% are obtained from small samples that are no longer a good representation of the original data distribution. In our case, 60% is therefore the maximum confidence threshold that can be adopted for the computation of a meaningful overall model performance. We stress that, despite this, the confidence associated with the outputs obtained at higher thresholds is still valid.
In summary, confidence thresholds make it possible to operate this model at any achievable level of accuracy; however, a compromise between accuracy and access to answers is required.
Finally, we used the confidence restriction to predict the test set over seven classification rounds of decreasing confidence thresholds. The idea was to feed into each classification round all the test examples that the model was not able to confidently predict in the previous round, so that the number of confident guesses could be maximised. From the previous results (Fig. 4), it seems reasonable to start the first round with the highest confidence threshold of 60%.
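The round-by-round procedure can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the published implementation: confidence is again taken to be the difference between the two largest class probabilities, and the function name `cascade_rounds` and the toy inputs are hypothetical.

```python
def cascade_rounds(prob_rows, thresholds):
    """Assign each example to the first round whose threshold its margin meets.

    `thresholds` must be in decreasing order, e.g. [0.6, 0.5, ..., 0.0].
    Returns one (round_number, predicted_class) pair per example, or None
    for examples that no round could predict.
    """
    results = [None] * len(prob_rows)
    pending = list(range(len(prob_rows)))  # indices not yet predicted
    for round_number, threshold in enumerate(thresholds, start=1):
        still_pending = []
        for i in pending:
            ranked = sorted(prob_rows[i], reverse=True)
            margin = ranked[0] - ranked[1]
            if margin >= threshold:
                predicted = prob_rows[i].index(ranked[0])
                results[i] = (round_number, predicted)
            else:
                still_pending.append(i)
        pending = still_pending  # only these are fed to the next round
    return results


# Toy 4-class probabilities (made up for illustration).
toy_probs = [
    [0.80, 0.10, 0.05, 0.05],  # margin about 0.70
    [0.40, 0.38, 0.12, 0.10],  # margin about 0.02
    [0.10, 0.60, 0.20, 0.10],  # margin about 0.40
    [0.30, 0.30, 0.20, 0.20],  # margin 0.00
]
print(cascade_rounds(toy_probs, [0.6, 0.3, 0.0]))
```

Because an example is only passed on when it fails the current threshold, a prediction made in a given round has a margin that lies between that round's threshold and the threshold of the round before it.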
We also note that the confidence threshold can be continuously decreased, as long as each round of classification still performs better than random. The results are shown in Table 1.
Confidence threshold | Confident predictions | Correct predictions | Round accuracy |
---|---|---|---|
60% (round 1) | 198 | 176 | 89% |
50% (round 2) | 131 | 99 | 76% |
40% (round 3) | 229 | 164 | 72% |
30% (round 4) | 323 | 206 | 64% |
20% (round 5) | 501 | 274 | 55% |
10% (round 6) | 567 | 284 | 50% |
0% (round 7) | 651 | 267 | 41% |
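As a worked check, the per-round and cumulative figures can be recomputed from the counts in Table 1. The sketch below assumes that the seven rounds together cover the whole test set, so that the test set size is the sum of the confident predictions; the name `cumulative_performance` is hypothetical.

```python
# (threshold in %, confident predictions, correct predictions), from Table 1.
TABLE_1 = [
    (60, 198, 176),
    (50, 131, 99),
    (40, 229, 164),
    (30, 323, 206),
    (20, 501, 274),
    (10, 567, 284),
    (0, 651, 267),
]


def cumulative_performance(rows):
    """Running coverage and accuracy after each round, as fractions."""
    # Assumption: every test example is predicted in exactly one round.
    total = sum(n_conf for _, n_conf, _ in rows)
    out = []
    seen = right = 0
    for threshold, n_conf, n_right in rows:
        seen += n_conf
        right += n_right
        out.append((threshold, seen / total, right / seen))
    return out


for threshold, coverage, accuracy in cumulative_performance(TABLE_1):
    print(f"down to {threshold:>2}%: coverage {coverage:.0%}, "
          f"cumulative accuracy {accuracy:.0%}")
```

Under this assumption, the first line reproduces the 8% coverage at 89% accuracy quoted for the 60% threshold, and the fourth line reproduces the 34% coverage at 73% accuracy quoted for the 30% threshold.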
As expected, gradually relaxing the confidence restriction enables the estimation of progressively less confident new answers at each round, which increases the fraction of the test set predicted. The true value of this approach is its ability to accommodate any number of rounds and threshold values, such that the number of confident answers can be maximised whilst controlling the overall accuracy. Likewise, the setup enables predictions to be discriminated by their level of trust. For a given round, the confidence associated with the output answers is known to lie between the confidence thresholds of the current and previous rounds. Moreover, fine tuning the confidence threshold step between rounds allows one to increase the discrimination between different levels of prediction trust. Ultimately, it becomes possible to quantify the trust associated with each prediction.
In conclusion, we believe that the confidence restriction tool offers the possibility of tailoring the performance of a given probability-generating classification model to the risk and cost requirements of each project.
The CSD was searched for all organic crystal structures of a single chemical component, excluding any metals, salts, and ions, as these present additional challenges11 that won't be addressed in this study. Entries with disorder, errors or incomplete information about crystal atomic coordinates or hydrogen bonds were discarded, as they would not provide enough information for accurate network dimensionality calculation. Of these, molecules with more than one crystal structure submitted to the database were removed. This step removes conflicting data where the compound is polymorphic and its different crystal arrangements are reported to have different network dimensionalities.17 Accounting for this scenario would result in a multi label classification task that will not be covered in this paper. In a few other cases, different submissions of the same crystal were calculated to have different network dimensionalities, which may relate to the quality of crystal data and sensitivity of the dimensionality calculation tool on the definition of a hydrogen bond interaction.
All entries meeting the above search criteria were subjected to hydrogen bond network dimensionality and numerical descriptor calculation. Label assignment was based on a modification to the method of Bryant et al.17 Dimensionality was calculated through the computation of the square roots of the eigenvalues of the covariance matrix of the atomic coordinates of the supramolecular structures that resulted from two different expansions of the network, through hydrogen bond intermolecular interactions, using methods from the CSD Python API. Ratios for each dimension before and after the expansion were calculated, to deduce the number of directions in which the network grew. One hundred and fifteen descriptors of two and lower dimensions were calculated for each molecule using the RDKit package. The full list of descriptors can be found in the ESI‡ A.
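The geometric idea behind the label assignment can be illustrated with a small, self-contained sketch. It is not the CSD Python API workflow used in this work: the hydrogen bond expansion is replaced by toy point sets, the cut-off `growth_ratio=1.5` is a hypothetical value chosen only for illustration, and pairing the sorted extents before and after expansion is a simplifying assumption. All function names are hypothetical.

```python
import math


def covariance_3x3(points):
    """Population covariance matrix of a list of (x, y, z) coordinates."""
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
             for j in range(3)] for i in range(3)]


def principal_extents(points):
    """Square roots of the covariance eigenvalues, largest first.

    Uses the closed-form trigonometric solution for symmetric 3x3 matrices.
    """
    a = covariance_3x3(points)
    off = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    q = (a[0][0] + a[1][1] + a[2][2]) / 3
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2 * off
    if p2 == 0:  # isotropic case: all three eigenvalues are equal
        eig = [q, q, q]
    else:
        p = math.sqrt(p2 / 6)
        b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)]
             for i in range(3)]
        det = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
               - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
               + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
        phi = math.acos(max(-1.0, min(1.0, det / 2))) / 3
        e1 = q + 2 * p * math.cos(phi)
        e3 = q + 2 * p * math.cos(phi + 2 * math.pi / 3)
        eig = [e1, 3 * q - e1 - e3, e3]
    return [math.sqrt(max(e, 0.0)) for e in eig]


def directions_grown(first, second, growth_ratio=1.5):
    """Count principal directions whose extent grew between two expansions.

    `growth_ratio` is a hypothetical cut-off, not a value from the paper.
    """
    before = principal_extents(first)
    after = principal_extents(second)
    return sum(1 for b, a in zip(before, after)
               if a > 1e-9 and a >= growth_ratio * b)


def cube(offset):
    """Eight corner points of a unit cube, standing in for one molecule."""
    ox, oy, oz = offset
    return [(ox + x, oy + y, oz + z)
            for x in (0, 1) for y in (0, 1) for z in (0, 1)]


def chain(n):
    """Toy 1D network: copies of the cube along x, from -n to n."""
    return [pt for i in range(-n, n + 1) for pt in cube((3 * i, 0, 0))]


def sheet(n):
    """Toy 2D network: copies of the cube on an x-y grid."""
    return [pt for i in range(-n, n + 1) for j in range(-n, n + 1)
            for pt in cube((3 * i, 3 * j, 0))]


print(directions_grown(chain(1), chain(2)))  # chain grows along one axis
print(directions_grown(sheet(1), sheet(2)))  # sheet grows along two axes
```

In this toy setting, the number of directions whose extent ratio exceeds the cut-off plays the role of the network dimensionality; a real assignment would depend on how hydrogen bonds are defined and on the thresholds adopted.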
The method has two main hyperparameters, which may greatly affect the final dataset visualization. Perplexity is responsible for the balance between conserving the local and global structure of the data, whilst the learning rate controls the step size of the optimisation procedure. Different hyperparameter values were tested, and although the arrangement of the points varies between projections, the general effect and overall conclusion are consistent. The visualization shown was produced with a learning rate of 10 and a perplexity of 40, which lies within the limits of 5 and 50 recommended by Hinton et al.24 Results were visualized under a colour scheme matching points to the class they belong to.
We report a 4 class classifier that estimates the hydrogen bond network dimensionality that organic compounds may produce in a crystal structure, with an accuracy of 59% (where 25% is random). The limitations of predicting three dimensional properties from two dimensional chemical information have been discussed.
Model performance could not be improved further with the data at hand, but we demonstrate that the model's practical use could be improved by increasing the confidence of its output predictions. The confidence restriction proved efficient in adding robustness to the model by filtering marginal classification events due to noise in data, which we suggest as a good practice to be adopted for any classifier that is capable of outputting probabilities. The system further allows one to adjust the model's performance, maximize the number of confident predictions and discriminate them according to level of prediction trust. Nevertheless, a compromise between accuracy and access to answers is required for the achievement of useful results.
We anticipate that the HBND classification model may be useful to the pharmaceutical sector to support the early identification of molecules with high chances of exhibiting low plasticity levels or poor tabletability performance,17 so precautions can be taken from the beginning of the drug development pipeline. More broadly, we envisage that the confidence restriction measure may be a useful complementary tool for increasing the practical value of any probability-generating classification algorithm.
Footnotes
† The model and confidence restriction measure codes are publicly available online. HBND model: https://github.com/APFrade/HBNDmodel. Confidence restriction: https://github.com/APFrade/ConfidenceMeasure.
‡ Electronic supplementary information (ESI) available. See DOI: 10.1039/d0ce00111b
This journal is © The Royal Society of Chemistry 2020