Data Dos and Don’ts In Building Statistical Models For Ames Mutagenicity
The ICH M7 guidance on the detection and control of potentially mutagenic impurities in drug substances places statistically based QSAR systems in a key position in the decision-making process.
It is, therefore, extremely important that any predictions made by models used in this role show high levels of accuracy and transparency. The underlying algorithms used to build these models undoubtedly play a major role in determining both. However, the data used to train them will also have a significant effect on how well they perform.
Consequently, the quality and quantity of these data should be carefully considered before any model is built. With this in mind, we investigated the aspects of training data that can affect statistical model performance and how they can be optimised to improve a model.
The statistical system Sarah Nexus, produced by Lhasa Limited, was used in these studies, along with a large training set built from publicly available Ames mutagenicity data, in order to make the findings as general as possible. Two key aspects of the data sets used to build the models were investigated. Firstly, the structural representation used to define each tested substance was considered. In addition, the quality of the biological results associated with each substance in the training set was also