The challenge of building QSAR models
19 May 2020
Read time: 5 minutes
Building QSAR models is easy. Building useful QSAR models is harder. Building QSAR models you can trust can be really challenging, and this is where a lot of Lhasa’s effort is dedicated: creating models that can be used with confidence to make big decisions, such as those made during a regulatory submission that protect human safety where people may be exposed to a compound or its impurities!
With modern software tools and a dataset, it isn’t hard to automatically calculate thousands of descriptors, find those which appear to correlate with activity, and use them to build a model with a standard algorithm. Such a model might appear quite good, particularly if the endpoint is simple (ideally a single mechanistic cause of activity), the descriptors are relevant (they show strong correlations with activity) and the test set is favourable (a subset of the training set is great for making a model look good). Such a model might even be useful, for example to explore SAR within a tight series, or when the accuracy of an individual prediction is less important than ‘playing the odds’ to get more predictions right than wrong.
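To make the point about favourable test sets concrete, here is a minimal sketch using only the Python standard library. Everything in it is an illustrative assumption rather than a real workflow: the descriptors and activity labels are random numbers, the "standard algorithm" is a simple 1-nearest-neighbour classifier, and the function names are made up for this example. Because the data are pure noise, any apparent skill comes from how the model is tested, not from anything it has learned.

```python
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def top_correlated(X, y, k):
    """Indices of the k descriptor columns with the largest |correlation| to y."""
    n_cols = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_cols)]
    return sorted(range(n_cols), key=lambda j: scores[j], reverse=True)[:k]


def predict_1nn(train_X, train_y, query, cols):
    """Label of the training compound closest to the query on the chosen columns."""
    def dist(row):
        return sum((row[j] - query[j]) ** 2 for j in cols)
    nearest = min(range(len(train_X)), key=lambda i: dist(train_X[i]))
    return train_y[nearest]


def accuracy(train_X, train_y, test_X, test_y, cols):
    """Fraction of test compounds whose predicted label matches the true label."""
    hits = sum(predict_1nn(train_X, train_y, q, cols) == label
               for q, label in zip(test_X, test_y))
    return hits / len(test_y)


rng = random.Random(0)
n_compounds, n_descriptors = 100, 1000

# Random descriptors and random activity labels: there is nothing real to learn.
X = [[rng.gauss(0, 1) for _ in range(n_descriptors)] for _ in range(n_compounds)]
y = [rng.randint(0, 1) for _ in range(n_compounds)]

train_X, train_y = X[:60], y[:60]
heldout_X, heldout_y = X[60:], y[60:]

# Keep the 10 descriptors that appear to correlate best with activity.
cols = top_correlated(train_X, train_y, k=10)

# "Favourable" test set: a subset of the training compounds.
favourable_acc = accuracy(train_X, train_y, train_X[:20], train_y[:20], cols)
# Honest test set: compounds the model has never seen.
heldout_acc = accuracy(train_X, train_y, heldout_X, heldout_y, cols)

print(f"subset of training set: {favourable_acc:.2f}")
print(f"held-out compounds:     {heldout_acc:.2f}")
```

On the training subset the score is perfect by construction, since each compound's nearest neighbour is itself. On the held-out compounds the score should sit somewhere near chance, because the selected descriptors only correlated with activity by accident.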
The difference between these ‘quick’ models and those you can trust is massive. The latter demand much more rigour at every step of the process, and this takes time and expertise.
It starts with the collection and curation of high-quality data. This requires an understanding of the assays and experimental conditions used, which generally means returning to the primary literature source. Lhasa is fortunate to have expert scientists dedicated to this role.
Given a robust dataset, focus shifts to the creation of descriptors. For us, these are not automatically generated from a library, but are individually designed based upon a detailed understanding of the underlying biological mechanisms. At this stage we do not focus on correlation; instead we aim to replicate biological recognition through our choice of descriptors, so that our models perceive molecules in a similar way to the underlying molecular interaction. Doing this well takes a rare combination of expertise on the endpoint and outstanding cheminformatics skills, but the reward is an accurate, transparent model that is not so dependent on a perfect training set.
Next comes the algorithm for model building, and given the number of readily accessible algorithms it would be wrong to imagine this is the easy step. All algorithms have underlying assumptions, limitations and needs. At the very least, selection must consider the size, type and variation of the data available. Add to this the complexity of the endpoint (for toxicity there are often many different mechanistic causes and much biological variability to contend with) and the need for transparent predictions that come with an accurate measure of confidence, and an off-the-shelf algorithm will rarely be sufficient. For over 30 years Lhasa has dedicated considerable effort to developing the field of cheminformatic modelling, including the fundamental science of applicability, reliability and confidence. These are crucial if the model output is to be understood and trusted by the user, and if the decision made using it is to be scientifically robust and defensible.
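As a rough illustration of what ‘applicability’ can mean in practice, the sketch below shows one generic, textbook-style check: a query compound is treated as inside the model’s domain only if it is sufficiently similar to at least one training compound. This is not a description of Lhasa’s methods. The feature names, the 0.5 threshold and the function names are all hypothetical, and real systems use far richer structural representations.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_similarity(query, training_set):
    """Similarity of the query to its most similar training compound."""
    return max(tanimoto(query, t) for t in training_set)


def in_domain(query, training_set, threshold=0.5):
    """True if the query is close enough to the training data (assumed threshold)."""
    return nearest_similarity(query, training_set) >= threshold


# Hypothetical structural features for three training compounds.
training_set = [
    {"aromatic_ring", "nitro", "halogen"},
    {"aromatic_ring", "primary_amine"},
    {"epoxide", "alkyl_chain"},
]

similar_query = {"aromatic_ring", "nitro"}   # shares features with the training data
novel_query = {"sulfonamide", "thiophene"}   # shares nothing with the training data

for name, query in [("similar", similar_query), ("novel", novel_query)]:
    sim = nearest_similarity(query, training_set)
    print(f"{name}: nearest similarity {sim:.2f}, in domain: {in_domain(query, training_set)}")
```

A prediction for the novel compound would be flagged as outside the domain, which tells the user that the model has no comparable evidence to draw on, whatever the raw prediction says.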
So if you need a quick QSAR model to understand the activity within a series, or to select the compounds with the best chance of (in)activity to test, there has never been a greater availability of cheap or even free tools to help you do that. If, however, you are looking for a prediction that you want to evaluate and trust because the decision you make on the back of it is significant and must stand up to scrutiny, then pay attention to your choice of model and all the steps that have led to its creation.