Lhasa Limited shared knowledge shared progress

Sarah Model Builder

Shared knowledge, shared progress - working together for mutual benefit


  • Model Builder - The model builder feature means users can now duplicate the Lhasa model and supplement it with their own data, or build an entirely new model. These custom models can then be exported and shared with other people.
    • Custom models can be fine-tuned for a specific set of data using the heat map function. External Validation produces a heat map which shows the optimum Sensitivity and Equivocal settings for the particular chemical space represented by the external dataset.
    • New data added to a model, or used to create a brand new model, is standardised. Following standardisation, all conflicting and duplicate structures are detected and dealt with.
  • Dataset - The dataset used to train the Lhasa model in Sarah has been expanded to over 9,000 compounds, and now includes the CFSAN and Vitic Expert Call datasets.
  • Improved Structural Reasoning - The training dataset has been improved using new structure standardisation and normalisation techniques. These improvements include:
    • Resonance forms and tautomers have been standardised.
    • Only counterions and salts from a set list will be removed from mixtures. This will prevent the removal of important components of mixtures.
  • Improved Machine Learning - Rules have been added to Sarah allowing the creation of new hypotheses in the self-organising hypothesis network (SOHN). During fragmentation, atomic environment information is captured for each atom of each structure in the training set. This information can be used to create more general hypotheses during the building of a model. The generality of these hypotheses means they can be utilised to make predictions for a variety of query compounds and hence improve the accuracy of these predictions. Figure 1 shows an example of a generalised aromatic hypothesis.

Figure 1: A screenshot of a generalised aromatic hypothesis


  • Enhanced Predictivity for Proprietary Chemical Space - The Sarah model builder enables users to supplement Sarah’s known chemical space with their own proprietary data to enhance predictivity within the chemical space of the user's query compound.
  • Reduced Bias - Duplicating the Sarah model and then adding your own data to create a new model can reduce bias in a smaller dataset.
  • Generate the Best Model for a Specific Dataset
    • Using the heat map function, the optimum Sensitivity and Equivocal parameters can be found for a custom model. This means the model can be fine-tuned to produce the best overall settings for a particular dataset to be tested within Sarah.
    • Standardisation and validation of the data within a model lead to fewer duplicates in a training set. Duplicates can artificially strengthen signals, so reducing them gives experts a more accurate prediction.
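
The duplicate handling described above can be illustrated with a short Python sketch. It is not Sarah's implementation. It assumes each record has already been standardised to a single structure key (here a hypothetical canonical SMILES string) and carries a "positive" or "negative" result. Structures with conflicting results are removed and repeated identical results are collapsed to one entry. The function name and record layout are assumptions made for illustration.

```python
from collections import defaultdict


def resolve_duplicates(records):
    """Collapse duplicate structures and drop conflicting ones.

    `records` is an iterable of (structure_key, result) pairs. The key is
    assumed to be an already-standardised identifier (hypothetical format).
    """
    results_by_structure = defaultdict(set)
    for structure_key, result in records:
        results_by_structure[structure_key].add(result)

    kept, conflicts = {}, []
    for structure_key, results in results_by_structure.items():
        if len(results) == 1:
            # Same result reported one or more times: keep a single entry.
            kept[structure_key] = next(iter(results))
        else:
            # Contradictory results: remove the structure entirely.
            conflicts.append(structure_key)
    return kept, sorted(conflicts)


records = [
    ("c1ccccc1N", "positive"),
    ("c1ccccc1N", "positive"),  # duplicate with the same result
    ("CCO", "negative"),
    ("CC(=O)O", "positive"),
    ("CC(=O)O", "negative"),    # conflicting results
]
kept, conflicts = resolve_duplicates(records)
print(kept)
print(conflicts)
```

Counting each structure once means a repeated compound cannot contribute more than one vote to any signal derived from the training set.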




Model Builder

There are five simple steps to take when building a model in Sarah Nexus:

  • Add New Data - Import a dataset to base the model on, or import data to supplement an existing model.
  • Conflict/Duplicate - The data is standardised and then checked for compounds with the same or conflicting results. Those with conflicting results are removed, and those with the same results are reduced to a single entry. If you are creating a brand new model, the standardisation techniques can be changed from the defaults.
  • Model Settings - The user can name the model and decide which reasoning type is used (weighted / single most confident / conservative). The Equivocal and Sensitivity settings can also be defined.
  • Results & Cross-Validation - This tab details the hypotheses used within the model, a summary of the constraints, the structures used in the dataset, and the cross-validation results.
  • External Validation - Users can test their model against external datasets. The results are displayed as a colour-coded heat map, which shows the best Equivocal and Sensitivity settings.
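
The idea behind the External Validation heat map can be sketched as a grid search in Python. This is a simplified illustration, not Sarah's algorithm. It assumes each external compound has a signed score between -1 and 1 (hypothetical), that the Sensitivity setting acts as a decision threshold on that score, and that the Equivocal setting is the half-width of a band around the threshold inside which no call is made. Balanced accuracy is an assumed choice of metric for colouring each cell.

```python
def classify(score, threshold, equivocal_band):
    """Return a call for one compound under the assumed decision rule."""
    if abs(score - threshold) < equivocal_band:
        return "equivocal"
    return "positive" if score > threshold else "negative"


def balanced_accuracy(scored, threshold, equivocal_band):
    """Balanced accuracy over the non-equivocal predictions, or None."""
    tp = tn = fp = fn = 0
    for score, actual in scored:
        predicted = classify(score, threshold, equivocal_band)
        if predicted == "equivocal":
            continue  # equivocal calls are left out of the metric
        if predicted == "positive":
            if actual == "positive":
                tp += 1
            else:
                fp += 1
        else:
            if actual == "negative":
                tn += 1
            else:
                fn += 1
    if tp + fn == 0 or tn + fp == 0:
        return None  # metric undefined when one class has no predictions
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def heat_map(scored, thresholds, bands):
    """One cell per (threshold, band) pair, as in a colour-coded grid."""
    return {(t, b): balanced_accuracy(scored, t, b)
            for t in thresholds for b in bands}


# Hypothetical external dataset: (score, known result).
scored = [(0.8, "positive"), (0.3, "positive"), (0.1, "negative"),
          (-0.2, "positive"), (-0.5, "negative"), (-0.9, "negative")]
grid = heat_map(scored, thresholds=[-0.25, 0.0, 0.25], bands=[0.0, 0.25])
best = max((cell for cell in grid if grid[cell] is not None), key=grid.get)
print(best, grid[best])
```

A wider equivocal band tends to raise the score on the remaining compounds while leaving more compounds without a call, so a real comparison would also track how many predictions each cell covers.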

The following charts are displayed for both the cross-validation and the external validation:

  • Performance - A pie chart showing the distribution of the results into five categories: true positive, true negative, false positive, false negative, outside domain. (Figure 1)
  • ROC (Receiver Operating Characteristic) - A plot of the true positive rate against the false positive rate, showing the trade-off between sensitivity and specificity. (Figure 2)
  • Accuracy - This plot shows the accuracy, sensitivity and specificity against the confidence. (Figure 3)
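
The quantities behind these charts follow the standard definitions, shown here as a short Python sketch. The function names and example numbers are hypothetical and are not taken from Sarah. Compounds outside the domain are assumed to be excluded before counting, and the ROC sweep assumes every compound has a distinct confidence score.

```python
def performance_summary(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy from the four outcome counts."""
    return {
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


def roc_points(scored):
    """ROC curve as (false positive rate, true positive rate) pairs.

    `scored` holds (confidence_in_positive, is_positive) pairs. The
    threshold is lowered one compound at a time, from most to least
    confident.
    """
    positives = sum(1 for _, is_positive in scored if is_positive)
    negatives = len(scored) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, is_positive in sorted(scored, key=lambda pair: -pair[0]):
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


summary = performance_summary(tp=40, tn=45, fp=5, fn=10)
curve = roc_points([(0.9, True), (0.7, False), (0.6, True), (0.2, False)])
print(summary)
print(curve)
```

Each step along the ROC curve corresponds to one more compound being called positive, which is why a gain in sensitivity is paid for with a rise in the false positive rate.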

The following screenshots show the Performance, ROC and Accuracy graphs from the cross-validation of Sarah Model 1.1.19.

Figure 1: Performance Pie Chart for the Cross-Validation of Sarah Model 1.1.19

Figure 2: ROC graph for the Cross-Validation of Sarah Model 1.1.19

Figure 3: Accuracy Graph for the Cross-Validation of Sarah Model 1.1.19


© 2017 Lhasa Limited | Registered office: Granary Wharf House, 2 Canal Wharf, Leeds, LS11 5PS, UK Tel: +44 (0)113 394 6020
VAT number 396 8737 77 | Lhasa Limited is registered as a charity (290866)| Company Registration Number 01765239 (England and Wales).
