Curating Truth with AI: Using Causal Data to Empower Drug Discovery

Chen Americana
Mar 5, 2021

Speaker Introduction - Dr Daniel Jamieson

Dr Daniel Jamieson is the CEO and founder of Biorelate, which he started during his PhD in computational biology at the University of Manchester. Having supported the successful identification of drug repurposing opportunities with Pfizer, Dr Jamieson is now focused on growing Biorelate into a world-leading enterprise that helps pioneering companies in their mission to develop life-saving innovations. He co-developed Galactic AI™, a supercomputing platform that automatically curates biomedical research to dramatically improve the understanding of a particular research area in drug discovery. By connecting obfuscated evidence, Biorelate aims to accelerate the development of important new therapies.

Current Pain Point of Biomedical Research Curation

Drug discovery has long been highly dependent on research publications. Nonetheless, most biomedical research curation still relies heavily on manual processes. Curating a single article is estimated to cost 219 USD on average and, given that there are about 100M potential biomedical articles in the public domain, curating all of them would cost 21.9 billion USD, even before accounting for the fact that global scientific output doubles every 9 years. The challenge does not lie only in identifying the “keywords” in research articles: manual curation is also incapable of capturing the cause-and-effect relationships between data across the massive amount of documented research.
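The cost estimate above is simple arithmetic, and a short sketch makes it easy to see how the bill grows. The figures come from the talk; the function names are illustrative, and the projection assumes (as a simplification) that the article corpus itself doubles every 9 years.

```python
# Back-of-the-envelope curation cost, using the figures quoted in the talk.
COST_PER_ARTICLE_USD = 219      # estimated manual curation cost per article
ARTICLES_TODAY = 100_000_000    # about 100M biomedical articles in the public domain
DOUBLING_PERIOD_YEARS = 9       # scientific output doubles roughly every 9 years


def curation_cost(articles: int, cost_per_article: int = COST_PER_ARTICLE_USD) -> int:
    """Total manual curation cost in USD for a given number of articles."""
    return articles * cost_per_article


def projected_articles(years_ahead: float, articles_now: int = ARTICLES_TODAY) -> float:
    """Assumption: the corpus itself doubles every DOUBLING_PERIOD_YEARS."""
    return articles_now * 2 ** (years_ahead / DOUBLING_PERIOD_YEARS)


if __name__ == "__main__":
    today = curation_cost(ARTICLES_TODAY)
    later = curation_cost(round(projected_articles(9)))
    print(f"Today:      {today / 1e9:.1f} billion USD")
    print(f"In 9 years: {later / 1e9:.1f} billion USD")
```

Under that doubling assumption, the manual curation bill doubles along with the literature, which is the core argument for automating the process.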

The problem of curation in drug discovery involves three primary stakeholders: data owners, typically pharmaceutical companies that try to understand both the data they own and the data available in the public domain; data scientists, who want to run algorithms on the data; and biologists, at the end of the chain, who try to develop critical insights from the algorithm outputs to answer research questions. Accelerating and automating the curation process would be extremely beneficial for all stakeholders in the drug discovery chain.

Artificial Intelligence as a Solution to Drug Curation

The solution provided by Biorelate was inspired by Dr Jamieson’s previous work on discovering solutions for chronic pain conditions. He set out to extract every cause-and-effect relationship ever documented in the context of chronic pain, capturing 93.2K molecular interactions and assembling them into a network for analysis. The network was then used to explore potential drug targets, find repurposing opportunities and validate relevant hypotheses. A total of 66 compounds were selected for novel validation, and the results show a hit rate as high as 42%, double that of the 107 compounds selected through expert opinion.

Biorelate’s Galactic AI solution aims to generalize this idea and apply it to all fields of drug discovery and biomedical research. Using deep learning and natural language processing, Galactic AI first performs concept annotation: the algorithm takes a particular concept and normalizes it against the gold-standard ontology in the public domain. In total, 6.3 billion concepts in the biomedical domain are annotated this way.
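To illustrate what "normalizing a concept against an ontology" means, here is a deliberately naive sketch. It uses plain dictionary matching rather than deep learning, and the ontology, its identifiers (`ONT:0001`, `ONT:0002`) and the function names are all hypothetical; it is not how Galactic AI is implemented.

```python
from typing import Dict, List, Tuple

# Hypothetical mini-ontology: identifier -> (preferred label, synonyms).
TOY_ONTOLOGY: Dict[str, Tuple[str, List[str]]] = {
    "ONT:0001": ("lung cancer", ["lung carcinoma", "cancer of the lung"]),
    "ONT:0002": ("chronic pain", ["persistent pain"]),
}


def build_lookup(ontology: Dict[str, Tuple[str, List[str]]]) -> Dict[str, str]:
    """Map every lower-cased label and synonym to its concept identifier."""
    lookup: Dict[str, str] = {}
    for concept_id, (label, synonyms) in ontology.items():
        for surface in [label, *synonyms]:
            lookup[surface.lower()] = concept_id
    return lookup


def annotate(text: str, lookup: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (surface form, concept id) pairs in order of first appearance.

    Only the first occurrence of each surface form is reported.
    """
    lowered = text.lower()
    hits = []
    for surface, concept_id in lookup.items():
        start = lowered.find(surface)
        if start != -1:
            hits.append((start, surface, concept_id))
    return [(surface, concept_id) for _, surface, concept_id in sorted(hits)]


if __name__ == "__main__":
    lookup = build_lookup(TOY_ONTOLOGY)
    print(annotate("Persistent pain is common in lung carcinoma patients.", lookup))
```

The point of normalization is that "lung carcinoma" and "lung cancer" end up as the same node, so evidence from differently worded articles can be pooled.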

The more novel part comes at the next stage: capturing the cause-and-effect relationships between these concepts. Over 153 million causal interactions between genes, chemicals, drugs, cells, phenotypes and more can be established automatically, and each causal interaction is given a confidence score that measures its likely precision. Galactic AI has achieved a very high confidence score of 95%, a level of performance sometimes considered better than that of manual processes.
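One way to picture a scored causal interaction is as a small record that can be filtered by confidence. The sketch below is an assumed layout for illustration only: the field names, the relation vocabulary and the placeholder concepts (`DRUG_X`, `GENE_A`, and so on) are hypothetical and do not reflect Galactic AI's actual schema.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class CausalInteraction:
    """Hypothetical record layout for one extracted cause-and-effect statement."""
    source: str        # normalized concept that acts (e.g. a drug or gene)
    relation: str      # direction of the effect, e.g. "increases" or "decreases"
    target: str        # normalized concept that is acted upon
    confidence: float  # estimated precision of the extraction, from 0.0 to 1.0
    document_id: str   # article the statement was extracted from


def filter_by_confidence(
    interactions: Iterable[CausalInteraction], threshold: float
) -> List[CausalInteraction]:
    """Keep only interactions whose confidence is at or above the threshold."""
    return [i for i in interactions if i.confidence >= threshold]


# Placeholder data, not real biology.
EXAMPLES = [
    CausalInteraction("DRUG_X", "decreases", "GENE_A", 0.97, "DOC-1"),
    CausalInteraction("GENE_A", "increases", "PHENOTYPE_P", 0.62, "DOC-2"),
    CausalInteraction("GENE_B", "increases", "GENE_A", 0.91, "DOC-3"),
]

if __name__ == "__main__":
    for interaction in filter_by_confidence(EXAMPLES, threshold=0.9):
        print(interaction.source, interaction.relation, interaction.target)
```

Keeping the source document on every record is what lets a researcher trace a high-confidence edge back to the evidence behind it.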

The improvement in efficiency is dramatic: over 30 million articles can be routinely auto-curated in just under 6 hours, which makes documented biomedical research much more accessible to researchers.

How can we use the data?

The interactions between concepts can be used through several channels:

Galactic Web Search. The Galactic Web allows you to search for a concept live. For example, if you would like to research a particular protein, you can first retrieve all of the relationships it has with other concepts, which gives you an idea of how this protein regulates different areas of biology.
Investigating relationships between specific concepts. By connecting the data together, an outward-branching network can be built to analyse all the pathways that connect a few specific concepts, such as all relationships between two specific proteins.
Comprehensive overview. Rather than taking a node and looking outward, we can also take a top-down approach: look at a specific disease area as a whole and construct a network of all the cause-and-effect relationships that have been documented in its context. By deploying ranking algorithms, valuable information can be obtained, such as the number of published documents that record a relationship, the type of the relationship, and the degree of relationship between various concepts.
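The second channel, finding all pathways that connect two concepts, can be sketched as a path search over a directed graph. The example below is a minimal illustration under stated assumptions: the edges and document counts are invented placeholders, and ranking paths by their weakest-supported link is just one plausible heuristic, not necessarily the one Biorelate uses.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Toy directed edges: (source, target, number of supporting documents).
EDGES = [
    ("PROTEIN_A", "PROTEIN_B", 12),
    ("PROTEIN_B", "PROTEIN_C", 3),
    ("PROTEIN_A", "PROTEIN_D", 5),
    ("PROTEIN_D", "PROTEIN_C", 8),
    ("PROTEIN_C", "PROTEIN_E", 1),
]

Graph = Dict[str, Dict[str, int]]


def build_graph(edges: List[Tuple[str, str, int]]) -> Graph:
    """Adjacency map: graph[source][target] = supporting document count."""
    graph: Graph = defaultdict(dict)
    for source, target, documents in edges:
        graph[source][target] = documents
    return graph


def simple_paths(graph: Graph, start: str, goal: str, max_hops: int = 4) -> List[List[str]]:
    """Depth-first enumeration of cycle-free paths with at most max_hops edges."""
    paths = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            paths.append(path)
            continue
        if len(path) - 1 >= max_hops:
            continue
        for neighbour in graph.get(node, {}):
            if neighbour not in path:  # never revisit a node on the same path
                stack.append((neighbour, path + [neighbour]))
    return paths


def weakest_link_support(graph: Graph, path: List[str]) -> int:
    """Document count of the least-supported edge along the path."""
    return min((graph[a][b] for a, b in zip(path, path[1:])), default=0)


def ranked_paths(graph: Graph, start: str, goal: str, max_hops: int = 4) -> List[List[str]]:
    """Paths from start to goal, best-supported first."""
    paths = simple_paths(graph, start, goal, max_hops)
    return sorted(paths, key=lambda p: weakest_link_support(graph, p), reverse=True)


if __name__ == "__main__":
    graph = build_graph(EDGES)
    for path in ranked_paths(graph, "PROTEIN_A", "PROTEIN_C"):
        print(" -> ".join(path), "| weakest link:", weakest_link_support(graph, path))
```

The `max_hops` limit matters in practice: in a network with millions of edges, unbounded path enumeration grows combinatorially.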

How does it actually work? Lung Cancer Case Study

To make this more concrete, let’s look at an application in the field of lung cancer. In the first stage, every documented cause-and-effect relationship in the context of lung cancer is automatically identified and annotated, based on 54 related child concepts in the disease ontology. This adds up to 202K protein-protein interactions and 71K distinct directed interactions involving 7.8K distinct proteins.

To identify potential novel lung cancer targets, we then build a classifier that ranks proteins using graph features to find the ones most likely to be lung cancer targets. We overlay existing lung cancer drug targets from ChEMBL, a database of bioactive molecules with drug-like properties, onto the same knowledge graph and rank proteins by their likelihood of representing a lung cancer drug target. The scorers are typical scoring systems from graph theory, including centrality measures, Enrichment, Max Relevance and Max Confidence. The best metric, eigenvector centrality, achieved an 18% average precision. Because a positive prediction here means a known lung cancer target, 18% is actually quite high: some of the false positives may in fact be true targets for which no drug has yet been licensed on the market.
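The ranking-and-evaluation step can be sketched in a few lines. This is an illustration under assumptions, not the classifier from the talk: it treats the graph as undirected, computes eigenvector centrality by plain power iteration, and scores the resulting ranking with average precision against a set of "known targets". The protein names and the known-target set are placeholders.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def eigenvector_centrality(
    edges: Iterable[Tuple[str, str]], iterations: int = 200, tol: float = 1e-10
) -> Dict[str, float]:
    """Power iteration on an undirected graph (a simplifying assumption)."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    nodes = sorted(neighbours)
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Multiply by (A + I); the identity shift keeps the eigenvectors
        # unchanged but avoids oscillation on bipartite graphs.
        new = {n: score[n] + sum(score[m] for m in neighbours[n]) for n in nodes}
        norm = math.sqrt(sum(v * v for v in new.values()))
        new = {n: v / norm for n, v in new.items()}
        delta = sum(abs(new[n] - score[n]) for n in nodes)
        score = new
        if delta < tol:
            break
    return score


def average_precision(ranked: List[str], positives: Set[str]) -> float:
    """Mean of precision@k taken at each rank k where a positive appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in positives:
            hits += 1
            total += hits / rank
    return total / len(positives) if positives else 0.0


# Placeholder graph and "known targets"; not real lung cancer data.
TOY_EDGES = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P4", "P5")]
KNOWN_TARGETS = {"P1", "P5"}

if __name__ == "__main__":
    scores = eigenvector_centrality(TOY_EDGES)
    ranking = sorted(scores, key=lambda n: (-scores[n], n))
    print("ranking:", ranking)
    print("average precision:", round(average_precision(ranking, KNOWN_TARGETS), 3))
```

Average precision rewards rankings that place known targets near the top, which is why a highly ranked protein that is not yet a known target counts against the score even if it later turns out to be a valid one.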

Using the ranking outcome, researchers can focus on studying the most promising potential targets, biomarkers and disease mechanisms, narrowing the scope and accelerating the research process.

In the next step of the investigation, we ask the question: can we predict novel targets?

To do this, we can exclude the 55.1K relationships published after 2010 and use the algorithm to predict viable lung cancer targets that emerged after 2010. A Mann-Whitney U test comparing the algorithm’s output with reality shows that the targets of drugs approved after 2011 tend to be scored higher (p = 0.08), which supports the viability of the novel lung cancer target prediction algorithm.
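The time-sliced validation can be sketched as two steps: hold out everything published after the cutoff, then test whether later-approved targets received higher scores than the rest. The code below is a minimal illustration: the scores are invented, the helper names are hypothetical, and the p-value uses a normal approximation without continuity or tie correction, which is rough for samples this small.

```python
import math
from typing import List, Sequence, Tuple


def published_up_to(
    relationships: Sequence[Tuple[str, str, int]], cutoff_year: int
) -> List[Tuple[str, str, int]]:
    """Keep (source, target, year) relationships published by the cutoff year."""
    return [r for r in relationships if r[2] <= cutoff_year]


def rank_with_ties(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U for x, one-sided p-value that x tends to be larger than y)."""
    n1, n2 = len(x), len(y)
    ranks = rank_with_ties(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean) / sd
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the standard normal
    return u1, p


if __name__ == "__main__":
    # Invented scores from a model trained only on pre-cutoff relationships.
    later_approved_targets = [0.9, 0.8, 0.7]
    all_other_proteins = [0.1, 0.2, 0.3, 0.4]
    u, p = mann_whitney_u(later_approved_targets, all_other_proteins)
    print(f"U = {u}, one-sided p = {p:.3f}")
```

Holding out the post-cutoff literature is what makes this a fair test: the model cannot score a protein highly simply because the paper announcing it as a target is already in the graph.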

How does the curated data support drug discovery, and why is it important for the future?

These insights can be applied at different stages by the different entities involved in drug discovery. Small to medium-sized pharmaceutical companies running early-phase drug discovery projects have less technical capability to carry out AI-based data curation in house. Providing them with this data helps them understand how the drugs they are developing could be optimized, saving money further downstream and steering them away from routes that are more likely to fail. By looking at the existing literature curated by the algorithm, they can make more informed selections of R&D targets, understand the gaps, trends and opportunities, and better identify their research focus.

Many larger pharmaceutical companies are building more contextually aware knowledge, and this is where relationship and knowledge maps constructed by AI technology can come into play. These maps provide more comprehensive insights into existing data, help companies investigate potential repurposing and repositioning opportunities, and find the top-ranked indications impacted by their own drugs, fueling more efficient discovery processes.

As more academics, researchers and pharmaceutical companies contribute to the body of thought and discovery in the biomedical field, developing an automated approach to extract valuable data from this documentation should be prioritised. Causal interaction data can be very powerful for exploring the wealth of biomedical knowledge we have amassed.