We have developed a tool allowing researchers to analyse HIV and TB Clinical Trial Protocols and identify risk factors using Natural Language Processing. The tool allows a user to upload a Clinical Trial Protocol in PDF format, and the tool will generate a risk assessment of the trial. You can find example protocols by searching on ClinicalTrials.gov.

Details of this proof of concept

  • The POC stage is limited to 2 pathologies: HIV and TB.
  • The current prototype is designed for trials in low- and middle-income countries (LMICs).
  • Phase 1, 2, 3 and 4 trials are included.
  • The project was coded in Python. A future project could involve porting it to R.

How to use the tool

The tool allows a user to upload a trial protocol in PDF format. The tool processes the PDF into plain text and identifies features which indicate high or low risk of uninformativeness.

At present the tool supports the following features:

  • Pathology
  • Phase
  • Is SAP present?
  • Effect estimate disclosed?
  • Number of subjects?
  • Number of arms?
  • Countries of investigation
  • Trial uses simulation for sample size?

The features are then passed into a scoring formula which scores the protocol from 0 to 100, and then the protocol is flagged as HIGH, MEDIUM or LOW risk.
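As a rough illustration of this scoring step, the sketch below combines a few extracted features into a 0 to 100 score and maps it to a risk band. The feature names, the weights, the thresholds of 40 and 70, and the convention that a higher score means lower risk are all assumptions made up for this example; they are not the tool's actual formula.

```python
from dataclasses import dataclass

# Hypothetical weights, NOT the tool's real formula. Points are awarded for
# features assumed here to make a trial more likely to be informative.
WEIGHTS = {
    "has_sap": 30,
    "has_effect_estimate": 30,
    "uses_simulation": 10,
    "sample_size_tertile": 15,  # multiplied by the tertile index 0, 1 or 2
}


@dataclass
class Features:
    has_sap: bool
    has_effect_estimate: bool
    uses_simulation: bool
    sample_size_tertile: int  # 0 = smallest third, 2 = largest third


def score_protocol(f: Features) -> int:
    """Return a score from 0 to 100 (assumption: higher means lower risk)."""
    score = 0
    score += WEIGHTS["has_sap"] * int(f.has_sap)
    score += WEIGHTS["has_effect_estimate"] * int(f.has_effect_estimate)
    score += WEIGHTS["uses_simulation"] * int(f.uses_simulation)
    score += WEIGHTS["sample_size_tertile"] * f.sample_size_tertile
    # Clamp in case the weights are changed and no longer sum to 100.
    return max(0, min(100, score))


def risk_level(score: int, low_threshold: int = 70, high_threshold: int = 40) -> str:
    """Map a score to HIGH, MEDIUM or LOW using hypothetical thresholds."""
    if score >= low_threshold:
        return "LOW"
    if score >= high_threshold:
        return "MEDIUM"
    return "HIGH"
```

For example, `risk_level(score_protocol(Features(True, False, False, 1)))` scores the protocol and returns its band under these made-up weights.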

How the tool works

The Protocol Analysis Tool runs on Python and uses the packages Plotly Dash, scikit-learn, spaCy and NLTK. The tool runs as a web app in the user’s browser. It is packaged as a Docker container and has been deployed to the cloud as a Microsoft Azure Web App.

PDFs are converted to text using the library Tika, developed by Apache.
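A minimal sketch of this conversion step is shown below. It assumes the `tika` Python bindings (which need Java available to start the Tika server); the text above names only the Tika library, so the choice of binding is an assumption. The function names `pdf_to_text` and `normalise_whitespace` are hypothetical and are not taken from the project.

```python
import re


def pdf_to_text(pdf_path: str) -> str:
    """Extract plain text from a PDF with Apache Tika."""
    # Imported lazily so the helper below can be used without Tika installed.
    from tika import parser

    parsed = parser.from_file(pdf_path)
    # Tika returns None for "content" when no text could be extracted.
    return parsed.get("content") or ""


def normalise_whitespace(text: str) -> str:
    """Collapse runs of spaces and blank lines left over from the PDF layout."""
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()
```

Cleaning the whitespace before classification is a design choice of this sketch, not something the tool is known to do.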

All third-party components are open source and there are no closed source dependencies.

A list of the accuracy scores of the various components is provided here.

Very quick guide to running the tool on your computer

Download this repository from the GitHub link and unzip it on your computer.

Alternatively, if you are using Git on the command line:

  1. Install Git.
  2. Install Git LFS. This is important: if you only have regular Git installed, only part of the repository will be cloned, because the large binary files are stored in Git LFS (Large File Storage).
  3. Run this command: git clone git@github.com:fastdatascience/clinical_trial_risk.git

Now you have the source code. You can edit it in your favourite IDE, or alternatively run it with Docker:

  1. Install Docker.
  2. Install Docker Compose.
  3. Open a command line or Terminal window. Change folder to where you downloaded and unzipped the repository, and go to the folder front_end. Run the command: docker-compose up
  4. You can now view the tool in your browser

Tool architecture


Each parameter is identified in the document by a stand-alone component. The majority of these components use machine learning but three (Phase, Number of Subjects and Countries) use a combined rule-based + machine learning ensemble approach. For example, identifying phase was easier to achieve using a list of key words and phrases, rather than a machine learning approach.
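The keyword-based part of such a component might look like the sketch below, which counts mentions of each phase in the text. The pattern and the function names are hypothetical, and the sketch covers only the rule-based half: it picks the most frequently mentioned phase instead of ranking candidates with a trained model.

```python
import re
from collections import Counter

# Hypothetical pattern: the word "phase" followed by a Roman or Arabic numeral.
# Longer Roman numerals come first so "III" is not read as "I".
PHASE_PATTERN = re.compile(r"\bphase\s+(iv|iii|ii|i|[1-4])\b", re.IGNORECASE)
ROMAN_TO_ARABIC = {"i": 1, "ii": 2, "iii": 3, "iv": 4}


def candidate_phases(text: str) -> Counter:
    """Count how often each phase number is mentioned in the text."""
    counts = Counter()
    for match in PHASE_PATTERN.finditer(text):
        token = match.group(1).lower()
        phase = ROMAN_TO_ARABIC.get(token) or int(token)
        counts[phase] += 1
    return counts


def most_likely_phase(text: str):
    """Return the most frequently mentioned phase, or None if none is found."""
    counts = candidate_phases(text)
    return counts.most_common(1)[0][0] if counts else None
```

Note that this simple pattern does not match suffixed forms such as "Phase 2b", which a fuller keyword list would need to handle.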

How the components work

  • Pathology (condition) – this is a Naïve Bayes classifier operating on the text of the whole document on word level. It classifies documents into HIV, TB, or Other. It treats HIV and TB as mutually exclusive, although in the next stage of the project more pathologies could be covered and the tool could assign a document to multiple pathologies. To develop this, protocols were manually tagged as HIV, TB or other and the tool learnt which words are indicative of which pathology.
  • Phase – this is a rule-based tool using the NLP library spaCy, combined with a random forest model to rank candidate phases.
  • SAP – this is a two-stage classifier consisting of a Naïve Bayes classifier operating on the text of each page individually, and a random forest classifier which takes the output of the first classifier and categorises the entire document as SAP or not SAP.
  • Effect Estimate – this is a weighted Naïve Bayes classifier which is applied to a window of 20 tokens around each number found in the document.
  • Number of subjects – this is a machine learning and rule-based tool using the NLP library spaCy and scikit-learn Random Forest.
  • Number of arms – this is a machine learning and rule-based tool using the NLP library spaCy and scikit-learn Random Forest.
  • Countries of investigation – this is a rule-based tool using regular expressions.
  • Simulation for sample size – This is a Naïve Bayes classifier operating on the text of each page individually. If a page contains information about simulation used for sample size, the classifier classifies that page as 1, otherwise as 0. If any page in the whole document is classified as class 1, then the protocol is considered to have used simulation for sample size determination. Although trials may use simulation at various points, the data tagged for simulation includes only trials using simulation specifically for sample size planning. Trials using simulation for later stages of statistical analysis are excluded.
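Two of the mechanisms described above can be sketched in plain Python: collecting a window of tokens around each number (as in the Effect Estimate component) and flagging a document when any single page is classified as positive (as in the Simulation component). This is a sketch under stated assumptions: tokens are split on whitespace, a window of 20 tokens is read as 10 tokens on each side of the number, and the `classify_page` argument is a stand-in for the trained Naïve Bayes classifier. All names are hypothetical.

```python
import re

# A token counts as a number if it is an integer or decimal, optionally a percentage.
NUMBER = re.compile(r"\d+(?:\.\d+)?%?")


def windows_around_numbers(text: str, half_width: int = 10) -> list:
    """Return the tokens surrounding each number found in the text."""
    tokens = text.split()
    windows = []
    for i, token in enumerate(tokens):
        # Ignore punctuation attached to the number, e.g. "(30)" or "0.5,".
        if NUMBER.fullmatch(token.strip(".,;()")):
            start = max(0, i - half_width)
            windows.append(tokens[start : i + half_width + 1])
    return windows


def document_uses_simulation(pages: list, classify_page) -> bool:
    """Flag the document if any page is classified as class 1."""
    return any(classify_page(page) == 1 for page in pages)


def keyword_page_classifier(page: str) -> int:
    """Stand-in for the trained page classifier: a simple keyword check."""
    return int("simulation" in page.lower())
```

In the tool each window and each page would be passed to a trained classifier; the keyword check here only makes the example runnable.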

Sample size tertiles

The default sample size tertiles were derived from a sample of 21 trials in LMICs, but have been rounded and manually adjusted based on statistics from ClinicalTrials.gov data.

The tertiles were first calculated using the training dataset, but for a number of phase and pathology combinations the data was too sparse, so tertile values had to be taken from ClinicalTrials.gov instead. The ClinicalTrials.gov data dump from 28 Feb 2022 was used.
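For illustration, tertile cut-offs for one phase and pathology combination could be computed as in the sketch below. The sample sizes in the usage note are made-up numbers, not data from the project, and the function names are hypothetical. The rounding and manual adjustment described above are not reproduced.

```python
from statistics import quantiles


def tertile_cutoffs(sample_sizes: list) -> tuple:
    """Return the two cut points that split the data into three equal groups."""
    lower, upper = quantiles(sample_sizes, n=3)
    return lower, upper


def tertile_of(n_subjects: int, cutoffs: tuple) -> int:
    """Return 0, 1 or 2 for the lower, middle or upper tertile."""
    lower, upper = cutoffs
    if n_subjects < lower:
        return 0
    if n_subjects < upper:
        return 1
    return 2
```

With invented sample sizes such as `[30, 50, 80, 120, 200, 350, 500, 900, 1500]`, `tertile_of(200, tertile_cutoffs(sizes))` places a 200-subject trial in the middle tertile.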

Future work

Future development work on this project could include:

  • Broadening the scope to more pathologies.
  • Support for multi-document protocols (e.g. Protocol and SAP in separate PDFs)
  • Support for processing of multiple documents at the same time.
  • Expanding the scope to cover cost, complexity or other metrics, and implementing further features.
  • Improving accuracy and coverage of some of the existing features.
  • If the number of sites or cohorts can be identified, this allows the number of subjects to be calculated where not explicitly stated.
  • If the NCT # is found in the protocol, sample size data can be retrieved from ClinicalTrials.gov API.

A list of candidate features for future work is given below:

  • Number of sites
  • Primary duration
  • Number of primary endpoints
  • Prevalence estimate not disclosed
  • Is a master protocol or a subset or derivative of a master protocol
  • Is part of a platform trial
  • Number of visits
  • Duration of trial
  • Multiple sites in a single country trial
  • Number of countries with at least one site
  • Uses model-informed drug development
  • Tertile of primary duration
  • Patient consortium or trial consortium prominently involved
  • Is an adaptive design
  • Takes place in a hospital
  • phase-in-domain
  • Recency of protocol vs today’s date
  • Recent dates in prevalence/burden citations
  • Indicates intention or willingness to make changes at interim
  • Number of trial sites in entire trial /
  • Number of procedures
  • Includes analysis of real-world data
  • More than 1 drug in the intervention cocktail
  • Number of mentions of the word policy
  • Case report form pages – all trial
  • Case report form pages per variable
  • Duration of follow up (in months)
  • External sponsorship
  • Non-standard endpoint
  • Trial uses cluster sampling
  • No trial database used
  • High number of follow-up appointments
  • Strict recruitment criteria (age, medical history)
  • Crossover design
  • Multiple consents, tests and forms for participants to fill out
  • Multiple randomisation steps
  • Extended investigational treatment or lengthy regimen until progression
  • Low disease prevalence
  • Trial takes place in hospital
  • Trial is a platform trial
  • Trial has sub-studies
  • Trial used model informed approach
  • Complex age criteria in recruitment
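One of the future-work items above, finding the NCT number in the protocol, could start from a regular expression like the one sketched below. ClinicalTrials.gov identifiers consist of "NCT" followed by eight digits. The function name is hypothetical, and the lookup against the ClinicalTrials.gov API is deliberately left out.

```python
import re

# ClinicalTrials.gov identifiers: the letters NCT followed by exactly eight digits.
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")


def find_nct_ids(text: str) -> list:
    """Return the distinct NCT identifiers in order of first appearance."""
    # dict.fromkeys removes duplicates while keeping the original order.
    return list(dict.fromkeys(NCT_PATTERN.findall(text)))
```

A protocol may cite other trials, so a production version would need a rule for choosing which identifier belongs to the protocol itself.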


We have identified the potential for natural language processing to extract data from protocols at BMGF. Both machine learning and rule-based methods have a huge potential for this problem, and machine learning models wrapped inside a user-friendly GUI make the power of AI evident and accessible to stakeholders throughout the organisation.

With the protocol analysis tool, it is possible to explore protocols and systematically identify risk factors very quickly.
