PLOS BLOGS The Official PLOS Blog

Treating diseases more precisely: Disentangling large-scale disease association data

Note: PLOS is delighted to once again partner with the Einstein Foundation Award for Promoting Quality in Research. The awards program honors researchers whose work reflects rigor, reliability, robustness, and transparency. The Einstein Foundation received dozens of stellar submissions. We asked this year’s finalists to write about their research in the run-up to the ceremony on March 14th in Berlin. This is the fourth blog in our 5-part series.

Author: Dr. David Blumenthal, Friedrich-Alexander-Universität, Faculty of Engineering, Department Artificial Intelligence in Biomedical Engineering

Disease association databases (DisGeNET [1], OMIM [2], DrugBank [3], etc.) containing disease-gene, disease-symptom, or disease-drug links are used extensively across the subfields of data-centric biomedicine. However, these databases annotate diseases using our current, mainly organ- or symptom-based disease definitions (UMLS concept unique identifiers (CUIs), MONDO identifiers, ICD-10 codes, etc.), which are often umbrella terms for several as yet unknown causal molecular mechanisms with similar phenotypic effects [4]. Consequently, widely used large-scale disease association databases are systematically biased towards reproducing our current disease ontologies. They are therefore of limited use for data-driven precision medicine approaches, which aim at a more fine-grained understanding of disease subtypes [5].

In view of this bias, one option would be simply not to use large-scale disease association databases for data-centric precision medicine. However, given the vast amount of aggregated knowledge they contain, this would be very unsatisfactory. In the proposed project, we therefore aim to salvage large-scale disease association databases for precision medicine by disentangling the associations for ill-defined umbrella disease terms into subsets of associations for which there is solid evidence that they describe endotypes corresponding to disjoint molecular mechanisms.

Consider the example of DisGeNET, an extremely widely used database containing associations between diseases and genes and between diseases and genetic variants. In DisGeNET, diseases are annotated using UMLS CUIs, and disease links are defined mainly by aggregating the results of genome-wide association studies (GWAS). Yet, for mechanistically ill-defined “umbrella” CUIs that do not correspond to distinct molecular mechanisms, the underlying aggregated GWAS most likely investigated patients suffering from distinct, unknown disease subtypes. In DisGeNET, the genetic associations for all of these different subtypes are merged into the single set of associations for the umbrella CUI.

In the proposed project, our aim is to develop a network-based computational approach to reverse such unspecific aggregations, which affect virtually any large-scale disease association database. As input, the envisaged algorithm will accept a set of at least two disease association databases containing different types of association data (e.g., disease-gene and disease-symptom associations) and a potentially ill-defined disease term D contained in all input databases. As output, it will return subsets of the term’s associations in the input databases and a P-value that quantifies the evidence that these subsets indeed contain associations corresponding to distinct (possibly unknown) subtypes of the input disease D. Instead of starting with the unspecific associations for disease D, downstream precision medicine approaches can then make use of the uncovered subsets, potentially leading to more specific and promising hypotheses about the molecular mechanisms driving complex diseases.
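
To make the input and output described above more concrete, here is a minimal Python sketch of what such an interface could look like. It is not the algorithm envisaged in the project, whose design is not specified here. Everything in it is an assumption made for illustration: the names `Database`, `DisentanglementResult`, and `disentangle`, the extra `links` input (a hypothetical list of gene-symptom links from some independent source), the use of connected components as a stand-in for a network-based partition, and the simple permutation test used to obtain an empirical P-value.

```python
import random
from dataclasses import dataclass

# Hypothetical representation: disease term -> set of associated entities.
Database = dict[str, set[str]]


@dataclass
class DisentanglementResult:
    subsets: dict[str, list[set[str]]]  # database name -> candidate subsets
    p_value: float                      # empirical permutation P-value


def connected_components(nodes, edges):
    """Connected components of an undirected graph, via union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


def disentangle(databases, term, links, n_perm=999, seed=0):
    """Toy stand-in: split the associations of `term` along network components."""
    if len(databases) < 2:
        raise ValueError("need at least two association databases")
    missing = [name for name, db in databases.items() if term not in db]
    if missing:
        raise ValueError(f"{term!r} is missing from: {missing}")

    assoc = {name: db[term] for name, db in databases.items()}
    nodes = set().union(*assoc.values())
    # Keep only links whose endpoints are both associated with the term.
    edges = [(a, b) for a, b in links if a in nodes and b in nodes]
    comps = connected_components(nodes, edges)

    # Assumed null model: shuffle the second endpoints of the links and count
    # how often the shuffled network is at least as fragmented as the observed one.
    rng = random.Random(seed)
    left = [a for a, _ in edges]
    right = [b for _, b in edges]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(right)
        if len(connected_components(nodes, zip(left, right))) >= len(comps):
            hits += 1
    p_value = (1 + hits) / (1 + n_perm)

    subsets = {
        name: [c & items for c in comps if c & items]
        for name, items in assoc.items()
    }
    return DisentanglementResult(subsets, p_value)


# Toy data (entirely made up): an umbrella term "D" in two databases.
genes = {"D": {"G1", "G2", "G3", "G4"}}
symptoms = {"D": {"S1", "S2", "S3", "S4"}}
links = [("G1", "S1"), ("G1", "S2"), ("G2", "S1"), ("G2", "S2"),
         ("G3", "S3"), ("G3", "S4"), ("G4", "S3"), ("G4", "S4")]

result = disentangle({"genes": genes, "symptoms": symptoms}, "D", links)
print(result.subsets)
print(result.p_value)
```

On this toy input, the gene and symptom associations of D each fall into two groups, because the made-up links form two unconnected clusters. A real approach would need a more robust partitioning method and a carefully justified null model; the sketch is only meant to show the shape of the inputs and outputs.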

[1] J. Piñero et al., Nucleic Acids Res., vol. 48, no. D1, pp. D845–D855, 2020, doi: 10.1093/nar/gkz1021.

[2] J. S. Amberger et al., Nucleic Acids Res., vol. 47, no. D1, pp. D1038–D1043, 2019, doi: 10.1093/nar/gky1151.

[3] D. S. Wishart et al., Nucleic Acids Res., vol. 46, no. D1, pp. D1074–D1082, 2018, doi: 10.1093/nar/gkx1037.

[4] C. Nogales et al., Trends Pharmacol. Sci., vol. 43, no. 2, pp. 136–150, 2022, doi: 10.1016/

[5] S. Sadegh et al., Nat. Commun., vol. 14, no. 1, p. 1662, 2023, doi: 10.1038/s41467-023-37349-4.
