How accurate is the Clinical Trial Risk Tool?

Introduction

People often ask us how the Clinical Trial Risk Tool was trained. Does it just throw documents into ChatGPT? Or, conversely, is it purely an expert system, where we have painstakingly crafted keyword-matching rules to look for important snippets of information in unstructured documents?

Most of the tool is built using machine learning techniques. We either hand-annotated the training data ourselves or took it from public sources.

How We Trained the Models inside the Clinical Trial Risk Tool

The different models inside the Clinical Trial Risk Tool have been trained on real data, mostly taken from clinical trial repositories such as ClinicalTrials.gov (the US trial database maintained by the National Library of Medicine). Sometimes we used data which we tagged ourselves, and sometimes we used data points provided by investigators when they uploaded their protocols to the repository.

Where possible, we have used simple approaches such as rule-based methods (expert systems) or small statistical models such as Naive Bayes. For example, classifying a protocol into a disease area is a task which can be completed relatively easily with a Naive Bayes text classifier. We have resorted to generative AI only very rarely. Sometimes we have used more complex models, or ensemble models which combine more than one machine learning approach, or a rule-based approach together with a machine learning approach that refines it.

We have kept track of the number of documents used to train each model, so that training can be reproduced. The first iteration of the Clinical Trial Risk Tool has been published in Gates Open Research, and more precise documentation of how each model was trained is featured in our publication [1].

Where applicable, we have validated models on an independent test data set, or we have used cross-validation: we train a model on the majority of the data set and withhold a portion to test it, then rotate the withheld portion and repeat, and finally train a new model on the entire data set. We report metrics such as accuracy, but we primarily work with the Area Under the ROC Curve (AUC). You can see a breakdown of the individual models, accuracies and AUCs below.
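As a rough illustration of this evaluation procedure (not the tool's actual training code, which is linked per model below), here is how accuracy and AUC might be computed on a held-out test set with scikit-learn; the data is synthetic.

```python
# Illustrative only: computing accuracy and AUC on a held-out test set with
# scikit-learn. The data is synthetic; each model's real training code is
# linked in the sections below.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
# AUC is computed from the predicted probability of the positive class.
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```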

Condition

We identify the condition or disease using a Naive Bayes text classifier. This was trained on 1,025 protocols from ClinicalTrials.gov belonging to the following categories:

CANCER, CF, COVID, DIABETES, EDD, HIV, HYPERTENSION, INFLUENZA, MAL, MND, MS, NTD, OBESITY, PNE, POL, SICKLE, STROKE, TB, other

We used n-fold (leave-one-out) cross-validation: we trained a machine learning model on all the documents bar one, validated it on the remaining document, and repeated this for every document. We achieved an AUC (area under the curve) of 99% and an accuracy of 88%.
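For a concrete picture, a minimal sketch of a bag-of-words Naive Bayes classifier evaluated with leave-one-out cross-validation might look like the following; the protocol texts and labels are placeholders, and the real training script is in our repository.

```python
# Minimal sketch (placeholder data): a bag-of-words Naive Bayes text classifier
# evaluated with leave-one-out cross-validation, as for the condition model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

protocol_texts = [
    "A phase 2 trial of antiretroviral therapy in adults living with HIV ...",
    "A randomised trial of a shortened tuberculosis treatment regimen ...",
    "A study of seasonal malaria chemoprevention in young children ...",
]
conditions = ["HIV", "TB", "MAL"]  # one label per protocol

clf = make_pipeline(CountVectorizer(), MultinomialNB())

# Leave one document out, train on the rest, test on the held-out document.
scores = cross_val_score(clf, protocol_texts, conditions, cv=LeaveOneOut())
print("Mean accuracy:", scores.mean())
```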

Above: ROC (Receiver Operating Characteristic) curve of the Condition classifier

Above: Confusion matrix of the Condition classifier

SAP

We identify whether or not the protocol contains a complete Statistical Analysis Plan (SAP) using a Naive Bayes classifier operating at word level on the text of the whole document. In addition, candidate pages which are likely to be part of the SAP are highlighted to the user using a Naive Bayes classifier operating on the text of each page individually. This model was trained on 33 documents with 578 pages individually annotated.

This model achieved 85% accuracy and 87% AUC.
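The two-level idea can be sketched roughly as follows, assuming two separately trained Naive Bayes pipelines (one per document, one per page); the function and variable names are illustrative, not the tool's actual API.

```python
# Illustrative structure only (the classifiers must be fitted before use): one
# Naive Bayes pipeline judges the whole document, another scores each page so
# that likely SAP pages can be highlighted to the user.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

doc_clf = make_pipeline(CountVectorizer(), MultinomialNB())   # whole-document labels
page_clf = make_pipeline(CountVectorizer(), MultinomialNB())  # per-page labels

def analyse_sap(pages, doc_clf, page_clf):
    """Return (document-level SAP decision, indices of likely SAP pages)."""
    has_sap = doc_clf.predict([" ".join(pages)])[0]
    page_probs = page_clf.predict_proba(pages)[:, 1]
    likely_sap_pages = [i for i, p in enumerate(page_probs) if p > 0.5]
    return has_sap, likely_sap_pages
```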

More information is available in our publication: https://gatesopenresearch.org/articles/7-56/v1 [1].

The training code for the SAP classifier is in our open source repository on Github: https://github.com/fastdatascience/clinical_trial_risk/blob/main/train/train_sap_classifier.py

Effect estimate

We identify the presence or absence of an estimated effect size by using a set of regular expressions and patterns to pick out candidate effect sizes, and then a multinomial Naive Bayes classifier to prioritise them. It was trained on 58 hand-annotated documents, where the effect estimate (typically a percentage) was identified and tagged by a human, and it achieved 73% accuracy and 95% AUC on validation data consisting of 15 documents.

A rule-based component written in spaCy identifies candidate values for the effect estimate from the numeric substrings present in the document. These can be presented as percentages, fractions, or other surface forms. A weighted Naive Bayes classifier is then applied to a window of 20 tokens around each candidate number found in the document, and the highest-ranking effect estimate candidates are returned. The values are displayed to the user, but only the binary presence or absence of an effect estimate enters into the risk calculation. [1]
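A drastically simplified sketch of this candidate-and-classify approach is below: a spaCy tokeniser flags numeric tokens and extracts a roughly 20-token window around each one, which would then be scored by the classifier. The real component uses much richer rules.

```python
# Simplified sketch: flag numeric tokens as effect estimate candidates and
# extract a ~20-token window around each, ready to be scored by a classifier.
import spacy

nlp = spacy.blank("en")  # tokeniser only; the real component uses richer rules

def candidate_windows(text, half_window=10):
    doc = nlp(text)
    for tok in doc:
        if tok.like_num:
            start = max(tok.i - half_window, 0)
            end = min(tok.i + half_window, len(doc))
            yield tok.text, doc[start:end].text

text = "The study is powered to detect a 30% reduction in incidence."
for value, context in candidate_windows(text):
    print(value, "->", context)
```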

The training code for the effect estimate classifier is in our open source repository on Github: https://github.com/fastdatascience/clinical_trial_risk/blob/main/train/train_effect_estimate_classifier.py

Sample size

We hand-tagged 282 documents with the sample size and wrote a set of manual rules to find candidate sample sizes (typically numbers). A rule-based component written in spaCy identifies candidate values for the sample size from the numeric substrings present in the document. These values are then passed to a random forest classifier, which ranks them by their likelihood of being the true sample size, and identifies any substrings such as “per arm” or “per cohort”, which can then be used to multiply by the number of arms if applicable. The model reached 69% accuracy at identifying the number of subjects exactly and 71% accuracy at finding it within a 10% margin. [1]
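The sketch below illustrates, in simplified form, how candidate numbers might be turned into feature rows for a random forest ranker; the specific features shown are illustrative.

```python
# Simplified sketch: each number found by the spaCy tokeniser becomes a row of
# engineered features; a random forest fitted on hand-tagged examples would
# then rank the candidates. The features shown here are illustrative.
import spacy
from sklearn.ensemble import RandomForestClassifier

nlp = spacy.blank("en")

def sample_size_candidates(text):
    doc = nlp(text)
    rows = []
    for tok in doc:
        if tok.like_num:
            window = doc[max(tok.i - 5, 0): tok.i + 5].text.lower()
            rows.append({
                "value": tok.text,
                "near_subject_word": int("participants" in window or "subjects" in window),
                "per_arm_or_cohort": int("per arm" in window or "per cohort" in window),
            })
    return rows

ranker = RandomForestClassifier(random_state=0)  # would be fitted on labelled candidates
print(sample_size_candidates("A total of 120 participants will be enrolled, 60 per arm."))
```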

The training code for the sample size classifier is in our open source repository on Github: https://github.com/fastdatascience/clinical_trial_risk/blob/main/train/train_num_subjects_classifier.py

Countries of investigation

We took 9,540 protocols from ClinicalTrials.gov and used these as our main benchmark to train an ensemble of machine learning models to identify the country or countries of investigation. The model combined rule-based components, to pick out explicitly mentioned country names, with machine learning, to identify which of the countries mentioned are really countries of investigation (and not just parts of citations or other false positives). The model achieved an AUC of 87% on a held-out test dataset of 2,385 protocols. [1]

We also relied on the Country Named Entity Recognition Python library, developed by Fast Data Science [2].
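As a quick example of the rule-based first stage, country mentions can be pulled out of protocol text with that library; the call below assumes its find_countries helper (see the library documentation for the exact interface), before a downstream model decides which mentions are genuine countries of investigation.

```python
# Hedged sketch: we assume the library's find_countries helper here, which
# returns matched countries together with their positions in the text.
from country_named_entity_recognition import find_countries

text = "Participants will be recruited at sites in Kenya and Uganda."
for country, match in find_countries(text):
    print(country.name, "found at position", match.start())
```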

Above: ROC curve of the country ensemble model

The training code for the country classifiers is in our open source repository on Github:

Simulation

The Clinical Trial Risk Tool identifies whether simulation was used for sample size determination. The model is a Random Forest operating on engineered features extracted from the document. It was trained on 49 documents and evaluated on the same 49 using 49-fold (leave-one-out) cross-validation: each document was excluded from the training set in turn, the model was retrained on the remainder, and the held-out document was used for validation. The model achieved 94% accuracy and 98% AUC when validated in this way.
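A minimal sketch of this evaluation setup, with random stand-in features, might look as follows.

```python
# Minimal sketch with random stand-in data: a random forest on engineered
# document features, scored with leave-one-out (49-fold) cross-validation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((49, 10))            # 49 documents x 10 engineered features
y = rng.integers(0, 2, size=49)     # 1 = simulation used for sample size determination

clf = RandomForestClassifier(random_state=0)
print("Leave-one-out accuracy:", cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())
```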

Above: Feature importances for the simulation model

The training code for the simulation classifier is in our open source repository on Github:

Number of arms

The number of arms is identified using an AI model, specifically an ensemble of machine learning and rule-based components built with the NLP library spaCy and a scikit-learn Random Forest. If the number of arms cannot be identified exactly, the neural network component attempts to get it approximately right by assigning the number of arms to a bin.
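For illustration, the binning fallback could look something like the sketch below; the bin boundaries shown are assumptions made for this example, not the tool's actual bins.

```python
# Illustration of the binning fallback; the bin boundaries are assumptions made
# for this example, not the tool's actual bins.
def arms_to_bin(num_arms: int) -> str:
    if num_arms <= 1:
        return "1 arm"
    if num_arms == 2:
        return "2 arms"
    if num_arms <= 4:
        return "3-4 arms"
    return "5+ arms"

print(arms_to_bin(3))  # -> "3-4 arms"
```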

The models were trained on 9,538 protocols from ClinicalTrials.gov and achieved 58% accuracy when evaluated on a separate 1,085 protocols.

Above: ROC curve of the arms neural network model

The training code for the separate “number of arms” classifiers is in our open source repository on Github:

Phase

The trial phase is extracted from the text using an ensemble of a convolutional neural network text classifier, implemented using the NLP library spaCy, and a rule-based pattern matching algorithm combined with a rule-based feature extraction stage and a random forest binary classifier, implemented using scikit-learn. Both models in the ensemble output an array of probabilities, which are averaged to produce a final array. The phase candidate returned by the ensemble model is the maximum likelihood value.
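The averaging step itself is simple; here is a minimal sketch with made-up probability arrays.

```python
# Minimal sketch of the ensemble step with made-up probabilities: average the
# two models' outputs and return the maximum likelihood phase.
import numpy as np

phases = ["Phase 1", "Phase 2", "Phase 3", "Phase 4"]
cnn_probs = np.array([0.10, 0.60, 0.25, 0.05])       # spaCy CNN text classifier
rule_rf_probs = np.array([0.05, 0.50, 0.40, 0.05])   # rule-based features + random forest

ensemble_probs = (cnn_probs + rule_rf_probs) / 2
print(phases[int(np.argmax(ensemble_probs))])  # -> "Phase 2"
```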

Both models were trained on 9,538 protocols from ClinicalTrials.gov and evaluated on a separate 1,085 protocols, on which the ensemble model achieved 75% accuracy.

Above: ROC curve of the phase convolutional neural network model

Above: Confusion matrix of the phase convolutional neural network model

The training code for the separate phase classifiers is in our open source repository on Github:

Drugs under investigation

The drug names are identified using the Drug Named Entity Recognition Python library [3], and then a machine learning model is used to identify whether a particular drug is under investigation or not.

The machine learning model was trained on 10,000 protocols from ClinicalTrials.gov, which yielded 262,018 drug mentions. The model achieved 97% accuracy at identifying the drugs under investigation when evaluated on 79,056 separate drug name mentions.
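A hedged sketch of the first stage is shown below; it assumes the library's find_drugs helper, which takes a list of tokens, and simply prints whatever matches it returns. The second-stage classifier that decides whether each drug is under investigation is not shown.

```python
# Hedged sketch of the first stage only: find drug mentions with the Drug Named
# Entity Recognition library (we assume its find_drugs helper, which takes a
# list of tokens).
from drug_named_entity_recognition import find_drugs

tokens = "Participants will receive dolutegravir or matching placebo daily".split()
for match in find_drugs(tokens):
    print(match)
```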

Vaccine

The Clinical Trial Risk Tool determines the likelihood of the trial being a vaccine trial using a Naive Bayes model which was trained on 145 protocols from ClinicalTrials.gov and validated on 15 protocols. The AI achieved 87% accuracy and 86% AUC.
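As a sketch, the likelihood reported to the user can come straight from the classifier's predicted probability; the pipeline below is placeholder structure only and would need to be fitted on labelled protocols first.

```python
# Placeholder structure only (must be fitted on labelled protocols first): the
# likelihood shown to the user can be read off the model's predicted probability
# for the "vaccine trial" class.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

vaccine_clf = make_pipeline(CountVectorizer(), MultinomialNB())
# After fitting: p_vaccine = vaccine_clf.predict_proba([protocol_text])[0, 1]
```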

Above: ROC curve of the vaccine model

Above: Confusion matrix of the vaccine model

References

  1. Wood TA and McNair D. Clinical Trial Risk Tool: software application using natural language processing to identify the risk of trial uninformativeness [version 1; peer review: 1 approved with reservations]. Gates Open Res 2023, 7:56 https://doi.org/10.12688/gatesopenres.14416.1

  2. Wood, T.A., Country Named Entity Recognition [Computer software], Version 0.4, accessed at https://fastdatascience.com/country-named-entity-recognition/, Fast Data Science Ltd (2022)

  3. Wood, T.A., Drug Named Entity Recognition [Computer software], Version 1.0.3, accessed at https://fastdatascience.com/drug-named-entity-recognition-python-library, Fast Data Science Ltd (2024)
