Clinical Trial Risk

NLP in pharma: modelling clinical trials risk from unstructured text

Before regulators can approve a new drug, the pharmaceutical company developing it must take it through several phases of clinical trials.

Before a trial is run, the drug developer writes a document called a protocol. This contains key information such as how long the trial will run, what the risks to participants are, and what kind of treatment is being investigated.

The problem is that a protocol can be up to 200 pages long, and its structure varies from one document to the next.

We have developed and trained an AI tool using natural language processing (NLP) to predict 8 output variables from a clinical trial protocol.

This lets pharma companies and regulators analyse and quantify large numbers of clinical trial protocols, which supports more accurate risk estimation.

The technique can be extended to other industries where large unstructured or semi-structured documents are the norm.

AI has great potential to revolutionise many aspects of the pharmaceutical industry, from pre-clinical stages such as in silico drug discovery through to clinical trials and aftermarket monitoring of key opinion leaders (KOLs).
Natural language processing has found uses in the pre-clinical, clinical and KOL stages of the drug development lifecycle.

Natural Language Processing to predict Complexity and Cost in the Pharmaceutical Industry

Fast Data Science Ltd (developers of the Clinical Trial Risk Tool) have also worked on a similar natural language processing project for a pharmaceutical company which needed to predict the overall complexity and cost of running a clinical trial. We developed a web-based tool which allowed a non-technical user to drag and drop a PDF file of a clinical trial protocol. The tool converted the PDF to raw text and extracted a number of key properties of the trial, such as the number of subjects, location, number of visits, type of treatment, and so on. These properties were fed into a risk model which rated the trial as low, medium or high complexity in three dimensions (subject score, study score, and site score).
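The extraction and rating steps described above can be illustrated with a minimal Python sketch. It is not the tool's actual implementation. The regular expressions, the function and class names, the numeric thresholds, and the pairing of each score with a single extracted property are all assumptions made for illustration, and the text is assumed to have already been converted from PDF.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical patterns: real protocols phrase these details in many ways,
# so a production system would need something far more robust.
SUBJECTS_RE = re.compile(r"(\d[\d,]*)\s+(?:subjects|participants|patients)", re.IGNORECASE)
VISITS_RE = re.compile(r"(\d+)\s+(?:study\s+)?visits", re.IGNORECASE)
SITES_RE = re.compile(r"(\d+)\s+(?:study\s+)?(?:sites|centres|centers)", re.IGNORECASE)


@dataclass
class TrialProperties:
    num_subjects: Optional[int]
    num_visits: Optional[int]
    num_sites: Optional[int]


def first_int(pattern, text):
    """Return the first number matched by the pattern, or None if absent."""
    match = pattern.search(text)
    return int(match.group(1).replace(",", "")) if match else None


def extract_properties(text):
    """Pull a few key properties out of the raw protocol text."""
    return TrialProperties(
        num_subjects=first_int(SUBJECTS_RE, text),
        num_visits=first_int(VISITS_RE, text),
        num_sites=first_int(SITES_RE, text),
    )


def band(value, medium_from, high_from):
    """Map a number to low/medium/high using illustrative cut-offs."""
    if value is None:
        return "unknown"
    if value >= high_from:
        return "high"
    if value >= medium_from:
        return "medium"
    return "low"


def rate_trial(props):
    """Rate complexity in three dimensions. The thresholds are invented."""
    return {
        "subject score": band(props.num_subjects, 100, 1000),
        "study score": band(props.num_visits, 5, 15),
        "site score": band(props.num_sites, 5, 50),
    }


sample = (
    "Approximately 1,200 participants will be enrolled at 12 sites. "
    "Each participant attends 8 study visits."
)
props = extract_properties(sample)
print(props)
print(rate_trial(props))
```

Keeping extraction separate from scoring means the scoring rules can be adjusted without touching the text-processing step, and a property that cannot be found is reported as "unknown" rather than silently rated as low.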