Having one disease may increase your risk for another disease in the future. (Tomatheart/Shutterstock)
- Researchers identified over 8,000 causal relationships between diseases by mining scientific literature and validating the findings with real-world patient data from the UK Biobank.
- These disease-to-disease links were used to build a directed acyclic graph (DAG) that improves genetic risk predictions, especially for diseases lacking their own genetic data.
- The study’s approach could help doctors better anticipate complications, refine risk scores, and even repurpose treatments by understanding how one disease leads to another.
THUWAL, Saudi Arabia — Did you know that smoking could lead to lung cancer? Or that untreated diabetes might cause blindness? This is what doctors call causal relationships between conditions. Documenting which diseases directly cause others has been a major challenge for medical researchers. Now, new international research has changed that.
In the study published in Bioinformatics, researchers developed an automated method for extracting causal relationships between diseases from scientific literature and created a map showing which conditions lead to others. This knowledge is already improving how scientists calculate genetic risk scores that predict one’s likelihood of developing specific diseases.
The Domino Effect of Disease
Most people know that Type 2 diabetes can lead to complications. The exact sequence is less familiar: diabetes causes hyperglycemia, which causes microvascular disease, which ultimately results in diabetic retinopathy. That chain illustrates the domino effect one condition can have. Understanding these chains helps doctors anticipate problems before they develop and potentially intervene earlier.
The research team used sophisticated text mining techniques to scour thousands of medical journal abstracts. They weren’t just looking for diseases that commonly occur together (comorbidities) but specifically for statements asserting that one disease directly causes another. The team identified 8,191 unique causal relationships spanning 1,860 different disease categories.
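To make the idea of pattern-based extraction concrete, here is a minimal sketch in Python. It is not the study's pipeline: the cue phrases, the tiny disease list, and the function name `extract_causal_pairs` are all assumptions chosen for illustration. It only shows how a statement asserting causation can be separated from a statement of mere co-occurrence.

```python
import re

# Hypothetical cue phrases; the study's actual lexical patterns are not listed in this article.
CAUSAL_CUES = r"(?:causes|leads to|results in)"

# Toy disease dictionary (assumption); a real one would be far larger.
DISEASES = [
    "type 2 diabetes",
    "hyperglycemia",
    "microvascular disease",
    "diabetic retinopathy",
]

# Longest names first so multi-word names win over shorter overlapping ones.
_names = "|".join(re.escape(d) for d in sorted(DISEASES, key=len, reverse=True))
PATTERN = re.compile(rf"\b({_names})\s+{CAUSAL_CUES}\s+({_names})\b", re.IGNORECASE)


def extract_causal_pairs(sentence):
    """Return (cause, effect) pairs for simple 'X <cue> Y' statements."""
    return [(m.group(1).lower(), m.group(2).lower()) for m in PATTERN.finditer(sentence)]


print(extract_causal_pairs("Hyperglycemia leads to microvascular disease."))
print(extract_causal_pairs("Type 2 diabetes often co-occurs with hyperglycemia."))
```

The second sentence yields no pairs because it reports a comorbidity rather than a causal claim, which is the distinction the researchers were after.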

Untreated diabetes cases can lead to complications and the development of other diseases. (© Feng Yu – stock.adobe.com)
To validate their findings, they cross-referenced them with real-world patient data from the UK Biobank, a massive database containing health information from over 500,000 participants. They checked whether diseases that supposedly had causal relationships showed statistical connections in actual patients and whether the timing of diagnoses matched expectations (cause preceding effect).
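The timing check can be pictured with a short Python sketch. The patient records below are invented toy data, and `fraction_cause_first` is a hypothetical helper, not the study's code. It computes, among patients diagnosed with both conditions, the share whose presumed cause was diagnosed first.

```python
from datetime import date


def fraction_cause_first(records, cause, effect):
    """Among patients with both diagnoses, return the share whose cause
    diagnosis is dated strictly before the effect diagnosis."""
    both = [r for r in records.values() if cause in r and effect in r]
    if not both:
        return None  # no patient has both diagnoses
    return sum(r[cause] < r[effect] for r in both) / len(both)


# Invented example records: patient id -> {disease: first diagnosis date}
records = {
    "p1": {"diabetes": date(2010, 3, 1), "retinopathy": date(2015, 6, 1)},
    "p2": {"diabetes": date(2012, 1, 1), "retinopathy": date(2011, 1, 1)},
    "p3": {"diabetes": date(2009, 5, 1)},
    "p4": {"diabetes": date(2008, 5, 1), "retinopathy": date(2018, 2, 1)},
}

print(fraction_cause_first(records, "diabetes", "retinopathy"))
```

A share well above one half is consistent with the proposed direction of causation, though diagnosis dates can lag the true onset of a disease, so this is supporting evidence rather than proof.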
Better Risk Prediction
Researchers then transformed their findings into a mathematical structure called a directed acyclic graph (DAG). This allowed scientists to perform causal inference, a sophisticated form of analysis that goes beyond mere correlation to understand true cause-and-effect relationships.
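According to the paper summary, the graph was built by ranking relationships by confidence score and dropping those that would create a cycle. A minimal Python sketch of that kind of greedy construction follows. The function name `build_dag`, the example scores, and the tie-handling are assumptions, and the study's implementation may differ in its details.

```python
def build_dag(scored_edges):
    """Keep edges in descending score order, skipping any that would close a cycle.

    scored_edges: iterable of (cause, effect, score) tuples.
    Returns the kept (cause, effect) edges in the order they were added.
    """
    graph = {}  # node -> set of direct effects
    kept = []

    def reaches(start, target):
        # Depth-first search over the edges kept so far.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph.get(node, ()))
        return False

    for cause, effect, _score in sorted(scored_edges, key=lambda e: e[2], reverse=True):
        # Adding cause -> effect creates a cycle exactly when effect already reaches cause.
        if reaches(effect, cause):
            continue
        graph.setdefault(cause, set()).add(effect)
        kept.append((cause, effect))
    return kept


# Invented scores for illustration only.
edges = [
    ("diabetes", "hyperglycemia", 0.9),
    ("hyperglycemia", "retinopathy", 0.8),
    ("retinopathy", "diabetes", 0.3),  # would close a cycle, so it is dropped
]
print(build_dag(edges))
```

Processing the highest-scoring links first means that when two claims contradict each other on direction, the better-supported one survives.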
When the researchers added their disease map to genetic risk scores, which estimate your chances of getting a disease based on your DNA, they found it made predictions more accurate. For example, combining risk scores for related conditions, like heart disease and the problems it can lead to, helped them better predict who might develop heart issues.
Untangling the Complex Web
Doctors could use this map of diseases to predict risks for conditions lacking extensive genetic data by analyzing the genetic risks of diseases that cause them. This method also helps untangle a common problem in genetics called pleiotropy, where one gene appears to influence several different conditions that don’t seem connected.
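As a rough illustration of borrowing risk from upstream conditions, the Python sketch below walks a small disease graph to find everything that can lead to a target disease, then averages whatever genetic risk scores exist for those causes. The averaging rule, the function names, and the numbers are assumptions made for this example; the study's actual way of combining scores is not described here.

```python
def all_causes(dag, disease):
    """Return every disease with a directed path into `disease`.

    dag maps each disease to the set of diseases it directly causes.
    """
    parents = {}
    for cause, effects in dag.items():
        for effect in effects:
            parents.setdefault(effect, set()).add(cause)

    found, stack = set(), [disease]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def proxy_risk(dag, disease, prs):
    """Average the available risk scores of upstream causes (hypothetical rule)."""
    scores = [prs[c] for c in all_causes(dag, disease) if c in prs]
    return sum(scores) / len(scores) if scores else None


dag = {
    "diabetes": {"hyperglycemia"},
    "hyperglycemia": {"microvascular disease"},
    "microvascular disease": {"retinopathy"},
}
prs = {"diabetes": 1.2, "hyperglycemia": 0.4}  # invented standardized scores

print(sorted(all_causes(dag, "retinopathy")))
print(proxy_risk(dag, "retinopathy", prs))
```

Here retinopathy has no score of its own, so the estimate comes entirely from the two upstream conditions that do.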
By uncovering hidden links between diseases, the AI-powered tool created by KAUST researchers reveals how treating one illness could help prevent another. (Credit: © 2025 KAUST)

The research team found that many genetic variants previously thought to independently influence multiple diseases actually follow causal chains: a variant affects one disease, which in turn causes another. With that knowledge, researchers could develop more targeted treatments that address the root cause rather than just the symptoms.
This method can automatically analyze thousands of gene-disease combinations, which could change how we understand the links between our genes and different health conditions.
All the data, including the disease dictionary, full network of relationships, and the disease graph, are freely available through GitHub, allowing other researchers to build upon this foundation.
Diseases are intricately intertwined. By mapping the causal connections between conditions, scientists now have a powerful new tool to improve risk prediction, understand disease chain reactions, and potentially create more effective treatments that address the true origins of illnesses rather than waiting until complications develop.
The researchers developed a method to identify causal relationships between diseases from scientific literature. They used lexical patterns to extract statements asserting that one disease causes another from PubMed abstracts. Diseases were mapped to International Classification of Diseases (ICD-10-CM) codes for standardization. To validate these relationships, they employed multiple measures: statistical correlation in UK Biobank patient data, dependence testing, temporal sequence of diagnosis dates, number of literature mentions, and confirmation from GPT-4, a large language model. These measures were combined into a confidence score for each relationship. They then created a directed acyclic graph (DAG) by ranking relationships by their scores and removing those creating cycles, resulting in a network with 1,860 nodes (diseases) and 7,589 edges (causal relationships).
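The scoring step can be sketched in a few lines of Python. The article says the measures are weighted equally, so the sketch takes a plain average. The measure names and the assumption that each measure has already been scaled to the range 0 to 1 are illustrative choices, not details taken from the paper.

```python
# Hypothetical names for the five validation measures described above.
MEASURES = (
    "association",          # statistical correlation in patient data
    "dependence",           # dependence test result
    "temporal_order",       # cause diagnosed before effect
    "literature_mentions",  # how often the claim appears in abstracts
    "llm_confirmation",     # agreement from a large language model
)


def confidence(measures):
    """Equal-weight average of measures, each assumed to be scaled to [0, 1]."""
    missing = [m for m in MEASURES if m not in measures]
    if missing:
        raise ValueError(f"missing measures: {missing}")
    return sum(measures[m] for m in MEASURES) / len(MEASURES)


# Invented values for one candidate relationship.
example = {
    "association": 1.0,
    "dependence": 1.0,
    "temporal_order": 0.0,
    "literature_mentions": 0.0,
    "llm_confirmation": 0.5,
}
print(confidence(example))  # 0.5
```

Equal weighting is the simplest choice, and the limitations discussed in the paper summary note that it may not be the optimal one.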
The study identified 8,191 unique causal relationships between diseases from 16,808 sentences in medical literature. When compared to random disease pairs, the causal relationships showed significantly stronger statistical associations in real patient data. The temporal sequence analysis confirmed that in most relationships, the cause was diagnosed before the effect. Manual expert validation estimated an 84% accuracy rate for the mined relationships. When applied to polygenic risk scores (PRS), the causal network improved prediction accuracy. For example, combining the genetic risk score for coronary heart disease with those of its effects (heart failure, myocardial infarction, and angina) increased prediction performance by 7.9% to 22.9%. The network also helped identify genetic variants whose effects on certain diseases were mediated through other conditions, clarifying complex genetic relationships.
The research acknowledges several limitations. Publication bias may exist in the literature, causing some diseases and relationships to be over- or under-represented. The text mining approach may generate false positives from complex semantic structures in scientific text and false negatives from missed disease mentions or undetected linguistic patterns. The validation primarily used UK Biobank data, which has known demographic limitations. Additionally, the approach works only for binary disease outcomes rather than quantitative traits, and the scoring system weights all measures equally, which may not be optimal.
The research was conducted using the UK Biobank Resource under Application Number 31224 and supported by funding from King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research under various award numbers. Additional support came from the SDAIA–KAUST Center of Excellence in Data Science and Artificial Intelligence, the KAUST Center of Excellence for Smart Health, and the KAUST Center of Excellence for Generative AI. The research using UK Biobank data was approved by the Institutional Bioethics Committee at KAUST.
The paper, titled “Causal relationships between diseases mined from the literature improve the use of polygenic risk scores,” was published in Bioinformatics (Volume 40, Issue 11, Article btae639) on October 26, 2024. The authors are Sumyyah Toonsi, Iris Ivy Gauran, Hernando Ombao, Paul N. Schofield, and Robert Hoehndorf, who are affiliated with King Abdullah University of Science and Technology and the University of Cambridge. The data is open access and available through GitHub.