We can apply a lot of what we know about trusting one another when building and using in silico models
One of the most important ingredients for an effective team is trust, and a critical phase of team building is starting to develop that sense of trust. To be strong, trust has to be earnt and regularly reinforced; it is rarely simply given! The same is true when using in silico models, and by mirroring the natural approaches (the intuitive tests that we apply either consciously or subconsciously) we can understand how and when a model can be effectively used. From the perspective of a model creator, we can see how to expose information that will allow a user to explore and to know when to accept a model’s prediction. This information isn’t the typical statistical analysis of performance against a test set that we regularly see. In fact, only when the consequences of being wrong are low is an approach of ‘playing the odds’ a good one to use. Generally, when faced with a decision to accept and act on the output of an in silico model, past performance is of limited comfort. At that instant, history fades and the only question is ‘Can I accept this particular prediction and act?’ Past performance against a dataset that may have no clear relationship to the compound in question is not going to be convincing.
This thinking has been inspirational in our development of in silico models and the science of confidence we have been developing at Lhasa Limited. Unless a model is infallible, nobody should blindly trust it; it is much better to approach any output with caution, applying tests to explore the reliability, robustness and appropriateness of a model before using it.
Two of our models are widely used to address a regulatory guideline around the risk of mutagenic impurities within a drug product (ICH M7). Derek Nexus is an expert (rule-based) system and Sarah Nexus is a statistical (machine learning) approach, and both are used in parallel. Wrangling the outputs from 2 models into a decision requires ‘expert review’ and we’ve spent a lot of time discussing and testing how our members (from both industry and regulators) tackle this problem. We then published ‘best practice’ papers to capture and share this thinking and developed a set of approaches to make this process easier within our software.
Exploring the process of expert review when using 2 different models can mirror human experience. Imagine asking 2 friends for an opinion and they disagree: which opinion do you accept? You might look at their past performance. One friend has often been wrong, so maybe it would be better to dismiss that advice. That’s the statistical option: you could gamble that the advice is normally wrong, but what if this time it is right? So, while statistical performance may cause you to look more carefully, it is rarely decisive to the end user despite the focus that model builders often put upon it. What if you asked each friend how confident they were in their decision? That may help, but you’ll probably want to test whether that confidence is well-placed. A friend who can accurately assess their ability to be right and tell you how certain they are is an asset, because they can tell you how much to trust them. Sadly, while most models provide a measure of confidence, many fail to demonstrate that this measure correlates with accuracy. This can be hard to do well, but because it is important we have invested research into developing measures of confidence that work, subsequently publishing both the science and the performance of our models against these measures.
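One simple way to test whether a confidence measure is well-placed is to group held-out predictions by their reported confidence and check that accuracy rises with it. The Python sketch below illustrates only that check. The function name `accuracy_by_confidence`, the equal-width bins and the sample records are all assumptions made for illustration; this is not the published Lhasa methodology.

```python
from collections import defaultdict


def accuracy_by_confidence(records, n_bins=5):
    """Group predictions into equal-width confidence bins and report accuracy per bin.

    records: iterable of (confidence, was_correct), with confidence in [0, 1].
    Returns a list of (bin_low, bin_high, count, accuracy) for the non-empty bins.
    """
    bins = defaultdict(list)
    for confidence, was_correct in records:
        # Clamp the index so that a confidence of exactly 1.0 lands in the top bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append(bool(was_correct))

    table = []
    for index in sorted(bins):
        outcomes = bins[index]
        accuracy = sum(outcomes) / len(outcomes)
        table.append((index / n_bins, (index + 1) / n_bins, len(outcomes), accuracy))
    return table


# Hypothetical validation results: (reported confidence, prediction was correct).
results = [
    (0.95, True), (0.90, True), (0.85, True), (0.55, True),
    (0.50, False), (0.15, False), (0.10, True), (0.05, False),
]

table = accuracy_by_confidence(results, n_bins=2)
for low, high, count, accuracy in table:
    print(f"confidence {low:.1f}-{high:.1f}: n={count}, accuracy={accuracy:.2f}")
```

If accuracy does not increase from the low-confidence bins to the high-confidence ones, the confidence measure is telling the user little about when to trust the model.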
As we continue to question our friends in order to build confidence in their opinions, a good question to ask is ‘Did you get predictions like this one right last time?’ Now we are moving from a global measure of accuracy to a local one. For this to work with a computer model, the key is understanding the context and specifically what ‘similar’ means. This is not always easy but can be done, and it was the inspiration behind functionality within Derek where we tackled the challenge of separating out different reasons for negative predictions. If the model can only report that ‘I don’t see a reason for a positive outcome, therefore it is negative’ then the user may not be terribly convinced (how do you know the limits of the model?). Whereas if a model reports ‘I’ve seen similar situations and been right in those cases’ then that is much more convincing, speaking to both the applicability of the model and its local performance. Being specific is almost always more compelling, and this work gave us ‘misclassified features’, where the model identifies similar compounds that it has mispredicted. This immediately and honestly describes a form of uncertainty equivalent to a human expert saying ‘I don’t see any reason for activity but must admit to having been wrong with some similar compounds’. Giving a model something close to self-awareness is likely to increase the end user’s trust. The other type of negative prediction that Derek can give is ‘negative with unclassified features’. In this case, whilst there are no reasons for a positive prediction of activity, a particular feature may be in an unusual context, and so Derek offers a specific case of uncertainty to the end user. With an endpoint like genotoxicity, which is mostly driven by reaction with DNA, these warnings are often readily dismissed by a user with an understanding of chemical reactivity [1].
Ultimately the best questions to ask both friends are ‘why?’ and ‘where is your evidence?’. It is much easier to trust and accept an opinion if you have evidence that the friend understands the domain and has used, and can provide, evidence to draw a conclusion. With computer models, users can look for evidence that the model (or the model builder) understands the area, can explain reasons behind a prediction and can furnish relevant supporting data. In doing so, it is important to also expose contradictory evidence. If the model can suggest a reason to dismiss that contradictory evidence, then we are getting close to emulating a human expert. Sarah does this by showing training compounds that, whilst relevant to a prediction, would have also fired an alert in Derek for a different reason. This doesn’t mean that it is justifiable to immediately dismiss a positive compound because there is an alternative explanation for activity [2]; however, the user may then reconsider the weight to place on that evidence.
As your friend starts to detail the evidence you may well ask ‘how certain are you in your facts?’, a question that is also easy to replicate for a computer model. Within Lhasa, data extracted from public sources is captured in a database and internally peer reviewed for accuracy. This doesn’t mean that we can be certain that the result is immutable, but having an expert review the evidence to draw a conclusion is a powerful benefit of expert curation, particularly in those cases where non-standard experimental conditions were used. Within Sarah, it is possible to review that underlying data or return to the primary literature; in the case of multiple determinations, the user can see if all agreed. For the very specific area of Ames mutagenicity, data is generated using multiple bacterial/Salmonella strains and it is possible to view the strain profile of both the individual compound and the collection of compounds within a hypothesis. This allows the user to quickly assess the reliability of any key data points.
If a predictive computer model is to be useful, the user must be able to answer key questions to establish sufficient trust in the outcome in order to accept and act upon the result. What types of evidence can help a user start to trust a model? The following shows the types of questions that a user might ask and how we answer them in Derek Nexus, an expert (rule-based) system, and Sarah Nexus, a statistical (machine learning) system.
How often have you been right?
- Derek Nexus - Historical performance against external training sets. In the case of Derek this performance is reported at the single-alert level rather than globally: if an alert fires, the user wants to know how accurate that single alert is, not the performance across all alerts.
- Sarah Nexus - Historical performance against external training sets.
How confident are you in your prediction?
- Derek Nexus - Derek uses the logic of argumentation and has different reasoning levels with very specific defined meanings starting at certain (known to the model) and moving down to equivocal (there is evidence both for and against a prediction). We’ve demonstrated the relationship between this measure of confidence and accuracy.
- Sarah Nexus - A measure of confidence which looks at nearest neighbours and asks key questions: how similar are the neighbours (less similar neighbours increase uncertainty) and how consistent is the data for those neighbours (if similar compounds show different activities then there must be uncertainty over which side of the ‘activity cliff’ a compound may sit). Using circular fingerprints to capture the context of a fragment gives a measure that correlates well with accuracy.
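The two questions asked of the nearest neighbours (how similar are they, and do they agree?) can be illustrated with a toy calculation. The sketch below is not Sarah’s actual confidence measure, which is not reproduced here. It assumes that each compound is represented by a set of integer feature identifiers standing in for circular fingerprint bits, uses Tanimoto similarity, and combines mean similarity and agreement by simple multiplication; the names `tanimoto` and `neighbour_confidence` are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of feature identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def neighbour_confidence(query, training, k=3):
    """Predict activity from the k most similar training compounds.

    training: list of (feature_set, is_active).
    Returns (predicted_active, confidence) with confidence in [0, 1].
    Confidence falls when neighbours are less similar or disagree.
    """
    scored = [(tanimoto(query, features), is_active) for features, is_active in training]
    neighbours = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

    total_similarity = sum(similarity for similarity, _ in neighbours)
    if total_similarity == 0:
        return None, 0.0  # nothing similar enough to say anything

    active_weight = sum(similarity for similarity, is_active in neighbours if is_active)
    p_active = active_weight / total_similarity

    mean_similarity = total_similarity / len(neighbours)  # how similar are the neighbours?
    consistency = abs(2 * p_active - 1)                   # do the neighbours agree?
    return p_active >= 0.5, mean_similarity * consistency


# Hypothetical data: sets of integers stand in for fingerprint bits.
query = {1, 2, 3, 4}
close_and_consistent = [({1, 2, 3, 4}, True), ({1, 2, 3, 5}, True), ({1, 2, 3}, True)]
close_but_conflicting = [({1, 2, 3, 4}, True), ({1, 2, 3, 5}, False), ({1, 2, 3}, True)]
distant = [({1, 9, 10}, True), ({2, 11, 12}, True), ({3, 13, 14}, True)]

for name, data in [("consistent", close_and_consistent),
                   ("conflicting", close_but_conflicting),
                   ("distant", distant)]:
    print(name, neighbour_confidence(query, data))
```

All three cases predict activity, but the conflicting and distant neighbourhoods yield lower confidence than the close, consistent one, which is the behaviour the questions above are meant to capture.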
Have you been wrong with similar predictions?
- Derek Nexus - In Derek, negative predictions are divided into 3 classes: no concern (we’ve seen all parts of the query compound in negative compounds and see no reasons to fire an alert), misclassified (we’ve seen a feature in a compound that we’d have falsely predicted as negative [3]) and unclassified (we’ve not seen a feature before but have no reason to believe it to be positive [4]).
- Sarah Nexus - Currently we’d recommend testing similar proprietary compounds and seeing how well the model performs.
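The three classes of negative prediction can be pictured as a set comparison between the features of the query compound and the features the model has seen before. The sketch below is a simplified illustration under that assumption. The function `describe_negative`, the use of plain string labels as features and the way the two warnings are combined are all hypothetical, and do not describe how Derek is implemented.

```python
def describe_negative(query_features, seen_in_negatives, seen_in_false_negatives):
    """Qualify a negative prediction (no alert fired) for a query compound.

    query_features: features found in the query compound.
    seen_in_negatives: features seen in correctly predicted negative compounds.
    seen_in_false_negatives: features seen in compounds wrongly predicted negative.
    """
    query = set(query_features)
    misclassified = query & set(seen_in_false_negatives)
    # Unclassified features are those the model has never seen in any context.
    unclassified = query - set(seen_in_negatives) - set(seen_in_false_negatives)

    warnings = []
    if misclassified:
        warnings.append("misclassified")
    if unclassified:
        warnings.append("unclassified")
    label = "negative with " + " and ".join(warnings) + " features" if warnings else "no concern"

    return {
        "label": label,
        "misclassified": sorted(misclassified),
        "unclassified": sorted(unclassified),
    }


# Hypothetical feature labels, for illustration only.
seen_in_negatives = {"phenyl", "ester", "amide"}
seen_in_false_negatives = {"fused ring A"}

print(describe_negative({"phenyl", "ester"}, seen_in_negatives, seen_in_false_negatives))
print(describe_negative({"phenyl", "fused ring A"}, seen_in_negatives, seen_in_false_negatives))
print(describe_negative({"phenyl", "rare ring B"}, seen_in_negatives, seen_in_false_negatives))
```

Returning the flagged features alongside the label matters: it is the specific feature, not the label alone, that lets an expert decide whether the warning can be dismissed.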
What are your reasons for this prediction?
- Derek Nexus - Expert rule-based systems offer real opportunities here since they can provide a mechanistic (chemical and biological) explanation, describe limits of knowledge, areas of uncertainty and limits of the model (including exclusion criteria). This is the closest you can get to talking to a real expert (or team of experts). Each alert in Derek is a mini peer-reviewed article, complete with original references allowing the user to retrace the thinking of our internal experts.
- Sarah Nexus - Generally statistical systems find correlations and not causations. This does then require the user to make that distinction, although a well-designed system will use relevant and appropriate descriptors and algorithms reducing the probability of exposing daft correlations. Sarah uses a bespoke algorithm which allows the creation of local models. This is powerful for several reasons – often there is not a single reason for activity and so finding a global correlation is likely to be unconvincing. Showing the user the local model instantly describes why in terms of ‘this feature is present in more active than inactive compounds’.
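The statement ‘this feature is present in more active than inactive compounds’ amounts to a count over the training compounds that contain the feature. The sketch below shows that count only; it is not Sarah’s bespoke algorithm. The function `hypothesis_summary`, the string fragment labels and the training records are invented for illustration.

```python
def hypothesis_summary(fragment, training_set):
    """Count active and inactive training compounds that contain a fragment.

    training_set: list of (fragments_in_compound, is_active).
    """
    active = inactive = 0
    for fragments, is_active in training_set:
        if fragment in fragments:
            if is_active:
                active += 1
            else:
                inactive += 1

    total = active + inactive
    return {
        "fragment": fragment,
        "active": active,
        "inactive": inactive,
        # None when no training compound contains the fragment.
        "fraction_active": active / total if total else None,
    }


# Made-up training records: (fragments present, observed activity).
training_set = [
    ({"aromatic nitro", "phenyl"}, True),
    ({"aromatic nitro", "pyridine"}, True),
    ({"aromatic nitro", "carboxylic acid"}, False),
    ({"phenyl", "ester"}, False),
]

summary = hypothesis_summary("aromatic nitro", training_set)
print(summary)
```

Because the summary is local to one fragment, the user can inspect exactly which compounds lie behind the numbers rather than being asked to accept a single global correlation.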
Where is the evidence for your prediction?
- Derek Nexus - Supporting data is critical in building trust in a model’s output – it allows the user to independently assess the evidence and compare to their own conclusions. We are privileged to have seen proprietary data when writing Derek alerts and can’t expose the details of this, although we work with the donors to see how much transparency we can give with this additional knowledge – at worst Derek will provide a small number of public examples and describe the context of the data that cannot be shared.
- Sarah Nexus - A statistical model can show all the underlying data which of course is highly transparent. Given enormous training sets it is critical to present this data in an organised and digestible manner. Sarah’s hypotheses (local models) together with a similarity measure derived from circular fingerprints allow the most relevant compounds to be prominent by identifying compounds containing the same fragment in the most similar environments.
How certain are you in your underlying data?
- Derek Nexus - The data against which Derek alerts have been created are extracted by a specialist team and each is peer reviewed (before being stored in our database, Vitic). The protocols of proprietary data are assessed before being used. When an alert writer draws upon this data, any concerns in the completeness or consistency of the data can be assessed while writing the alert. Alerts themselves have all been independently peer reviewed.
- Sarah Nexus - Sarah draws upon the same curated publicly-sourced dataset that Derek uses. Given the limited reproducibility of the Ames study (historically accepted to be around 85%), Sarah takes a conservative view and will use a positive call for those compounds with multiple and conflicting determinations but still allow the user to review these results. Strain information on each compound is available to allow the user to make a judgement as to the certainty in the case of incomplete strain or metabolic activation data being available.
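The conservative treatment of conflicting determinations described above can be written as a small rule: any positive result gives an overall positive call, and the conflict is flagged rather than hidden. The sketch below is an illustration of that rule only, with a hypothetical function name and record layout, and is not taken from Sarah’s code.

```python
def overall_ames_call(determinations):
    """Combine repeated Ames determinations for one compound.

    determinations: iterable of booleans (True = positive result).
    A single positive result is enough for an overall positive call; the
    individual results are kept so that the user can review them.
    """
    calls = [bool(result) for result in determinations]
    if not calls:
        return {"call": None, "conflicting": False, "determinations": []}

    return {
        "call": "positive" if any(calls) else "negative",
        "conflicting": len(set(calls)) > 1,
        "determinations": calls,
    }


# Hypothetical compound tested three times with one positive result.
print(overall_ames_call([False, True, False]))
print(overall_ames_call([False, False]))
```

Keeping the `conflicting` flag and the raw determinations in the output reflects the point made above: the call is conservative, but the user can still see why and judge the evidence for themselves.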
By considering the questions humans may ask one another to establish trust, we can improve the transparency and acceptability of a computer model. Early computer models were often ‘black boxes’ but with skilful design and by ensuring critical questions can be answered by the user, we can move towards models that enable confident, defensible and safe decisions to be made.
Acknowledgement goes to Alessandro Giuliani, Istituto Superiore di Sanità, for the inspiring conversation in Rome.
[1] Actually, this is one of the more frustrating features for us, as we may well have seen this within proprietary datasets but can’t tell you without revealing something confidential.
[2] After all, a compound could be active for more than one reason. Some model builders have taken the decision to either remove those compounds from the training set or to reclassify them before building a statistical model in order to improve model performance, but such an approach risks undermining trust.
[3] There are many reasons why a compound predicted to be negative may be positive: there may be something an alert writer didn’t know (a truly wrong prediction) or it may be that the original training data is questionable (we are conservative in our approach and treat a compound tested multiple times but only positive once as positive; if we don’t have a strong reason to dismiss a result, we would rather tell you than ignore it).
[4] Often this functionality highlights unusual contexts around the feature of interest, such as a rare ring system.