To capture hierarchical organization, a particularly promising direction in computer science has been the development of the ontology, a model that divides its object into a set of fundamental entities and relationships among those entities. Ontologies arise from a branch of philosophy known as metaphysics, which is concerned with the nature of what exists and the categories into which the world’s objects naturally fall. Ontologies build upon and extend network models in two key ways: ‘entities’ refer not only to elemental objects but also to any meaningful grouping of objects, and ‘relationships’ refer not only to direct connections but also to nested structures, such as one entity being a part or type of another. Thus, ontologies explicitly allow for a higher-order organization of knowledge that is missing from raw networks. They have been key to building powerful knowledge representation and reasoning systems in many domains, including biomedicine.
Ontologies became very influential in cell biology through the development of the Gene Ontology (GO). GO is a major resource of knowledge about genes, gene products, and the hierarchy of cellular components, molecular functions and biological processes in which they participate. Entities in GO (called GO terms) are hierarchical groupings of other entities. For example, the biological process of “DNA replication elongation” is a type of “DNA strand elongation” and is a part of the more general process of “DNA replication” (Figure 1). The GO resource is presently very large, with nearly 35,000 GO terms connected by 65,000 hierarchical term-term relations, describing more than 80 different species. The impact of GO is hard to overstate – just try to think of a single modern ‘omics’ analysis that does not use GO to validate a novel data set or approach, or to generate new mechanistic hypotheses. In a sense, GO is the most universal, and universally accepted, model of a cell that we currently have.
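To make the two relation types concrete, the following is a minimal sketch of how the “DNA replication elongation” example could be stored and queried. It is an illustration only, not how the GO resource itself is implemented: the relation labels `is_a` and `part_of` stand for “type of” and “part of”, and the function names and the gene label `geneX` are hypothetical.

```python
from collections import defaultdict

# Typed, directed relations written as (child term, relation, parent term).
# The three terms come from the example in the text; the storage layout
# is an assumption made for illustration.
RELATIONS = [
    ("DNA replication elongation", "is_a", "DNA strand elongation"),
    ("DNA replication elongation", "part_of", "DNA replication"),
]


def build_parents(relations):
    """Index the relations by child term for upward traversal."""
    parents = defaultdict(list)
    for child, relation, parent in relations:
        parents[child].append((relation, parent))
    return parents


def ancestors(term, parents):
    """Return every term reachable from `term` via is_a or part_of."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct_annotations, parents):
    """Extend each gene's direct terms with all of their ancestor terms."""
    full = {}
    for gene, terms in direct_annotations.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[gene] = expanded
    return full


PARENTS = build_parents(RELATIONS)

# "geneX" is a placeholder label, not a real annotation.
ANNOTATIONS = propagate({"geneX": {"DNA replication elongation"}}, PARENTS)
```

Because a term may have more than one parent, the structure is a directed acyclic graph rather than a tree, and a gene annotated to a specific term is implicitly a member of every more general term above it.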
We will expand on this preliminary work to introduce a system for organizing molecular interactions and cancer ‘omics data as a genomics-driven, crowd-sourced Gene Ontology of Human Cancer. Our goal is to address several parallel challenges in cancer genomics and bioinformatics:
- The need to move beyond clustering to recognize the multi-scale structure embedded in data
- The need to improve ontologies of gene function in their scalability, consistency and coverage
- The continued need to provide cancer biomedicine with an accurate map of hallmark pathways and processes that drive tumor initiation and progression
Cancer is a disease that is not only complex (driven by a combination of genes) but also wildly heterogeneous (gene combinations vary greatly between tumors of the same type). Major projects such as The Cancer Genome Atlas (TCGA) have released data for thousands of tumors, each profiled for somatic mutations, copy number, methylation, and mRNA/miRNA expression. In addition, cancer centers are beginning to routinely measure somatic mutations and other ‘omics profiles for patients, although as yet only a small fraction of this information is actionable.
We are continuously developing methodologies that construct data-driven gene ontologies, and we will apply these methods to the multiple layers of genome-scale data that are presently available for thousands of patient tumors and hundreds of cell lines. We will also explore methods by which data-driven gene ontologies might evolve over time with each new ‘omics dataset that is published. The enticing possibility is that the wealth of human cancer data could be analyzed to assemble a human cancer gene ontology systematically, with less and less reliance on back curation of the literature. Ultimately, the desired outcome is to enable a shift from using ontologies to evaluate data to using data to construct and evaluate ontologies: that is, from a regime in which the ontology is viewed as the gold standard to one in which it is the major result.
Figure 1. From ontologies to active ontologies. A subset of the Gene Ontology [10] (left) alongside a subset of an active ontology for event planning [32] (right). Red relationships and entities indicate dynamic computation.