To develop the Clinical Trial Risk Tool, we conducted a quality control exercise on its components. Each parameter fed into the complexity model is trained and evaluated independently. For an overview of how the tool works, please read this blog post.
I used two datasets to train and evaluate the tool:

* a small dataset of protocols which I labelled manually, and
* a larger dataset of protocols derived from ClinicalTrials.gov.
By combining the two datasets, I was able to get some of the advantages of a large dataset and some of the advantages of a smaller, more accurate one.
For validation on the manual dataset, I used cross-validation. For validation on the ClinicalTrials.gov dataset, I split by the third digit of each trial's NCT ID: trials with digits 0–7 were used for training, those with digit 8 were used for validation, and those with digit 9 were held out as a future test set.
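As an illustration, this split can be implemented in a few lines. The following is a minimal sketch, assuming each record is a dict carrying an `nct_id` string such as `NCT01234567`; the function name and data layout are mine, not the tool's.

```python
# Minimal sketch of the NCT-ID-based split (illustrative; assumes each
# record is a dict with an "nct_id" field such as "NCT01234567").
def split_by_nct_id(records):
    train, val, test = [], [], []
    for record in records:
        # The NCT ID is "NCT" followed by digits; the third digit is at index 5.
        third_digit = int(record["nct_id"][5])
        if third_digit <= 7:
            train.append(record)   # digits 0-7: training
        elif third_digit == 8:
            val.append(record)     # digit 8: validation
        else:
            test.append(record)    # digit 9: held-out future test set
    return train, val, test
```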
Below are the validation scores on the small manually labelled dataset (about 100 protocols were labelled for most components, and 300 for the number of subjects). You can reproduce my experiments using the notebooks from this folder.
| Component | Accuracy – manual validation dataset | AUC – manual validation dataset | Technique |
|-----------|--------------------------------------|---------------------------------|-----------|
| Condition (HIV vs TB) | 88% | 100% | Naive Bayes |
| SAP | 85% | 87% | Naive Bayes |
| Effect Estimate | 73% | 95% | Naive Bayes |
| Number of Subjects | 69% (71% within 10% margin) | N/A | Rule based combined with Random Forest |
| Simulation | 94% | 98% | Naive Bayes |
As a sanity check, I also trained a baseline Naive Bayes classifier for some of these components, to check that the models outperform a reasonable baseline.
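For concreteness, a baseline of this kind can be assembled with scikit-learn. This is a minimal sketch, not the tool's exact pipeline, assuming `texts` (a list of protocol strings) and `labels` have already been loaded:

```python
# Minimal baseline sketch (not the tool's exact pipeline): a bag-of-words
# Naive Bayes classifier evaluated with cross-validation, as on the
# manually labelled dataset. Assumes `texts` and `labels` are loaded.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

baseline = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
scores = cross_val_score(baseline, texts, labels, cv=5, scoring="accuracy")
print(f"Baseline accuracy: {scores.mean():.0%} (+/- {scores.std():.0%})")
```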
| Component | Accuracy – ClinicalTrials.gov validation dataset | Baseline Accuracy (Naive Bayes) – ClinicalTrials.gov validation dataset | Technique |
|-----------|--------------------------------------------------|--------------------------------------------------------------------------|-----------|
| Phase | 75% | 45% | Ensemble – rule based + random forest |
| SAP | 82% | N/A | Naive Bayes |
| Number of Subjects | 13% | 6% | Rule based combined with Random Forest |
| Number of Arms | 58% | 52% | Ensemble |
| Countries of Investigation | AUC 87% | N/A | Ensemble – rule based + random forest + Naive Bayes |
I found the ClinicalTrials.gov value for the sample size to be especially inaccurate, which explains the very low performance of the model on that field.
By far the most difficult model was the number of subjects (sample size) classifier.
I designed this component as a stage of manually defined features that identify candidate sample sizes (numeric values in the text), combined with a random forest which uses these features to select the most likely candidate. Below is a plot of the feature importances of the random forest model.
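To make the two-stage design concrete, here is a simplified sketch; the regular expression, keyword list, and feature set are illustrative assumptions, not the tool's actual code.

```python
# Simplified sketch of the two-stage sample size extractor (illustrative,
# not the tool's actual code). Stage 1 finds numeric candidates and
# computes hand-crafted features; stage 2 scores them with a random forest.
import re
from sklearn.ensemble import RandomForestClassifier

KEYWORDS = ("sample size", "participants", "subjects", "enrolled", "recruit")

def candidate_features(text):
    """Yield (value, feature_vector) for each numeric candidate in the text."""
    for match in re.finditer(r"\b\d{2,6}\b", text):
        value = int(match.group())
        context = text[max(0, match.start() - 100):match.end() + 100].lower()
        features = [
            value,                                  # magnitude of the number
            sum(kw in context for kw in KEYWORDS),  # keywords near the number
            match.start() / max(len(text), 1),      # relative position in document
        ]
        yield value, features

# Training would label each candidate 1 if it equals the true sample size,
# else 0, and fit the random forest on the candidate feature vectors.
# At prediction time, the candidate with the highest probability wins:
def predict_sample_size(model: RandomForestClassifier, text: str):
    candidates = list(candidate_features(text))
    if not candidates:
        return None
    probs = model.predict_proba([f for _, f in candidates])[:, 1]
    return candidates[int(probs.argmax())][0]
```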
Similarly, the simulation classifier is a random forest that uses manually defined keyword features, sketched below:
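A minimal sketch of what such keyword features might look like (the keyword list is an illustrative assumption, not the tool's actual feature set):

```python
# Illustrative keyword features for the simulation classifier (the keyword
# list is an assumption, not the tool's actual feature set).
SIMULATION_KEYWORDS = ["simulation", "simulate", "monte carlo", "bootstrap"]

def simulation_features(text: str) -> list[int]:
    """Count occurrences of each keyword; these counts feed the random forest."""
    lower = text.lower()
    return [lower.count(kw) for kw in SIMULATION_KEYWORDS]
```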
For each of the components, we also plotted a confusion matrix.
Confusion matrix for the baseline (Naive Bayes) phase extractor on the ClinicalTrials.gov validation dataset
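A plot like this can be reproduced with scikit-learn. Here is a minimal sketch, assuming `y_true` and `y_pred` hold the validation labels and the model's predictions:

```python
# Minimal sketch of plotting a confusion matrix for one component
# (assumes `y_true` and `y_pred` are validation labels and predictions).
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

disp = ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
disp.ax_.set_title("Phase extractor: Naive Bayes baseline (ClinicalTrials.gov)")
plt.show()
```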
Since each component is designed differently, validating the tool's overall performance for clinical trial risk assessment has been complex.
However, I have provided some Jupyter notebooks in the repository to run the validation and reproduce my results.
There is still much scope for improvement in several components, especially the sample size extractor.
Some parameters, such as simulation, were not available in the ClinicalTrials.gov dataset and so could only be trained and validated on the manually labelled dataset. We hope to annotate more data for these areas.