We have developed a tool that allows researchers to analyse HIV and TB clinical trial protocols and identify risk factors using Natural Language Processing. The user uploads a clinical trial protocol in PDF format, and the tool generates a risk assessment of the trial. You can find example protocols by searching on ClinicalTrials.gov.
The tool converts the PDF into plain text and identifies features which indicate a high or low risk of uninformativeness.
At present the tool supports the following features:
The features are then passed into a scoring formula which scores the protocol from 0 to 100, and then the protocol is flagged as HIGH, MEDIUM or LOW risk.
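As a rough illustration of the scoring step, the sketch below combines a few features into a 0 to 100 score and then maps the score to a risk flag. The feature names, the weights, the thresholds, and the assumption that a higher score means lower risk are all hypothetical and are not taken from the tool itself.

```python
from typing import Dict

# Hypothetical weights, for illustration only (NOT the tool's real coefficients).
EXAMPLE_WEIGHTS: Dict[str, float] = {
    "has_analysis_plan": 30.0,
    "is_multi_country": 20.0,
    "large_sample_size": 50.0,
}


def score_protocol(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of feature values, clamped to the range 0..100."""
    raw = sum(weight * features.get(name, 0.0) for name, weight in weights.items())
    return max(0.0, min(100.0, raw))


def flag_risk(score: float, high_below: float = 40.0, medium_below: float = 50.0) -> str:
    """Map a score to a flag. Assumes a higher score means lower risk;
    both thresholds are made-up defaults."""
    if score < high_below:
        return "HIGH"
    if score < medium_below:
        return "MEDIUM"
    return "LOW"


example = {"has_analysis_plan": 1.0, "large_sample_size": 1.0}
example_score = score_protocol(example, EXAMPLE_WEIGHTS)
example_flag = flag_risk(example_score)
```

Keeping the weights and thresholds as plain parameters makes it easy to swap in different values without touching the scoring logic.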
The Protocol Analysis Tool is written in Python and uses the packages Plotly Dash, scikit-learn, spaCy and NLTK. The tool runs as a web app in the user’s browser. It is packaged as a Docker container and has been deployed to the cloud as a Microsoft Azure Web App.
PDFs are converted to text using the library Tika, developed by Apache.
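A minimal sketch of the PDF-to-text step, assuming the Python bindings for Tika (the tika package, which needs a Java runtime) are used. The function names pdf_to_text and clean_text are illustrative, and the whitespace clean-up is an assumption rather than the tool's actual pre-processing.

```python
import re
from typing import Optional


def clean_text(raw: Optional[str]) -> str:
    """Collapse runs of spaces/tabs and excess blank lines in extracted text."""
    if not raw:
        return ""  # Tika can return no content, e.g. for image-only PDFs
    text = re.sub(r"[ \t]+", " ", raw)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def pdf_to_text(pdf_path: str) -> str:
    """Extract plain text from a PDF via Apache Tika."""
    # Imported lazily so that clean_text can be used without Tika installed.
    from tika import parser  # pip install tika

    parsed = parser.from_file(pdf_path)
    return clean_text(parsed.get("content"))
```

Calling pdf_to_text("protocol.pdf") would start (or connect to) a local Tika server on first use, so the first call can be noticeably slower than later ones.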
All third-party components are open source and there are no closed source dependencies.
A list of the accuracy scores of the various components is provided here.
Download this repository from the GitHub link as in the below screenshot, and unzip it on your computer.
Alternatively, if you are using Git on the command line, clone the repository instead.
Now you have the source code. You can edit it in your favourite IDE, or alternatively run it with Docker:
Open a terminal in the folder front_end. Run the command: docker-compose up

Each parameter is identified in the document by a stand-alone component. The majority of these components use machine learning, but three (Phase, Number of Subjects and Countries) use an ensemble that combines rule-based and machine learning approaches. For example, identifying phase was easier to achieve using a list of keywords and phrases than with a purely machine learning approach.
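The rule-based half of the phase component could look something like the sketch below. It is only an illustration of keyword matching: the pattern, the majority-vote rule and the name identify_phase are assumptions, and the tool's real keyword list and ensemble logic are not shown here.

```python
import re
from collections import Counter
from typing import Optional

# Matches e.g. "Phase 2", "phase III", "PHASE IV" (a deliberately small keyword set).
PHASE_PATTERN = re.compile(r"\bphase\s+(1|2|3|4|i{1,3}|iv)\b", re.IGNORECASE)
ROMAN_TO_INT = {"i": 1, "ii": 2, "iii": 3, "iv": 4}


def identify_phase(text: str) -> Optional[int]:
    """Return the most frequently mentioned phase, or None if no phase is mentioned."""
    counts: Counter = Counter()
    for match in PHASE_PATTERN.finditer(text):
        token = match.group(1).lower()
        phase = int(token) if token.isdigit() else ROMAN_TO_INT[token]
        counts[phase] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Taking the most frequent mention is one simple way to avoid being misled by a single reference to an earlier study, such as "following the Phase 2 trial", inside a Phase 3 protocol.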
The default sample size tertiles were derived from a sample of 21 trials in LMICs, but have been rounded and manually adjusted based on statistics from ClinicalTrials.gov data.
The tertiles were first calculated using the training dataset, but for a number of phase and pathology combinations the data was too sparse, so tertile values were taken from ClinicalTrials.gov instead. The ClinicalTrials.gov data dump from 28 Feb 2022 was used.
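The tertile calculation with a fallback for sparse data can be sketched as follows. The minimum number of trials (min_trials) and the fallback cut-off values in the example are hypothetical, as is the choice of Python's default quantile method.

```python
import bisect
import statistics
from typing import List, Sequence


def tertile_cutoffs(
    sample_sizes: Sequence[float],
    fallback: Sequence[float],
    min_trials: int = 5,  # hypothetical sparsity threshold
) -> List[float]:
    """Return the two cut points that split sample sizes into three groups.

    Falls back to externally supplied cut points (e.g. derived from
    ClinicalTrials.gov) when too few trials are available.
    """
    if len(sample_sizes) < min_trials:
        return list(fallback)
    return statistics.quantiles(sample_sizes, n=3)


def tertile_index(n_subjects: float, cutoffs: Sequence[float]) -> int:
    """Return 0 (lowest), 1 (middle) or 2 (highest) tertile for a sample size."""
    return bisect.bisect_right(cutoffs, n_subjects)


training_sizes = [10, 20, 30, 40, 50, 60, 70, 80, 90]
cutoffs = tertile_cutoffs(training_sizes, fallback=[100, 500])
```

In practice a separate set of cut-offs would be kept for each phase and pathology combination, with the fallback values used only where the training data is too sparse.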
Future development work on this project could include:
We have identified the potential for natural language processing to extract data from protocols at BMGF. Both machine learning and rule-based methods show considerable potential for this problem, and wrapping machine learning models in a user-friendly GUI makes the power of AI evident and accessible to stakeholders throughout the organisation.
With the protocol analysis tool, it is possible to explore protocols and systematically identify risk factors very quickly.
On 8 October, Thomas Wood of Fast Data Science presented the Clinical Trial Risk Tool, along with the Harmony project, at the AI and Deep Learning for Enterprise (AI|DL) meetup sponsored by Daemon. You can now watch the recording of the live stream on AI|DL’s YouTube channel below: The Clinical Trial Risk Tool leverages natural language processing to identify risk factors in clinical trial protocols. The initial prototype Clinical Trial Risk Tool is online at https://app.