Building models of bacterial mutagenicity from biased training data
Models for predicting bacterial mutagenicity are now widely used by pharmaceutical sponsors to assess the genotoxic potential of impurities in pharmaceutical products. Models built using machine learning (ML) techniques are commonly trained on balanced datasets; here, that means equal numbers of compounds that are positive and negative for mutagenicity. Building accurate ML models from biased training data, in which the numbers of positive and negative compounds are unequal, can be a challenge. Sarah Nexus is a program for predicting bacterial mutagenicity that uses a self-organising hierarchical network (SOHN). Hitherto, SOHN models have been built using data that have little bias; if models are instead built using biased training data, there is a need to ensure that the model learns sufficiently well about the minority class. If the dataset is biased towards negative compounds, a model that fails to do so would have depressed sensitivity for mutagenicity.
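To make the point about depressed sensitivity concrete, the following is a minimal sketch of how sensitivity and specificity are computed from binary predictions. It does not use Sarah Nexus or a SOHN model; the function name `sensitivity_specificity`, the toy labels, and the 10:90 class ratio are all hypothetical and chosen only for illustration.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = mutagenic."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # mutagens correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # mutagens missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-mutagens correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-mutagens wrongly flagged
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical evaluation set biased towards negatives: 10 positives, 90 negatives.
y_true = [1] * 10 + [0] * 90

# Hypothetical predictions from a model that leans towards the majority class:
# it finds 4 of the 10 positives and wrongly flags 2 of the 90 negatives.
y_pred = [1] * 4 + [0] * 6 + [1] * 2 + [0] * 88

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
sens, spec = sensitivity_specificity(y_true, y_pred)

print(f"accuracy={accuracy:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this invented example the overall accuracy looks high because negatives dominate, while sensitivity, the fraction of mutagenic compounds correctly identified, is low. That gap is why performance on the minority class needs to be checked separately when the training data are biased.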
Presented by Chris Barber at SOT, San Diego, USA; 22nd - 26th March 2015.