Installation
GO3 provides Python bindings for a Rust implementation of Gene Ontology semantic similarity.
Install from PyPI:
pip install go3
Optional visualization dependencies:
pip install "go3[viz]"
Requirements
Python >= 3.7 (pre-built wheels are available for Linux, macOS, and Windows on common architectures).
A GO ontology file in OBO format (for example go-basic.obo).
A GO annotation file in GAF format for your organism.
If you call go3.load_go_terms() without a path, GO3 downloads go-basic.obo automatically.
Minimal workflow
import go3
# 1) Load the ontology graph into memory
go3.load_go_terms("go-basic.obo")
# 2) Parse annotations and build a gene-to-GO mapping
annots = go3.load_gaf("goa_human.gaf")
# 3) Compute annotation counts and IC values for every term
counter = go3.build_term_counter(annots)
# 4) Compute term-to-term similarity
sim = go3.semantic_similarity("GO:0006397", "GO:0008380", "lin", counter)
print(sim)
# 5) Compute gene-to-gene similarity (using BP namespace)
score = go3.compare_genes("TP53", "BRCA1", "BP", "lin", "bma", counter)
print(score)
Core concepts
The typical GO3 pipeline has three stages:
Load the ontology: load_go_terms parses an OBO file and caches the full GO directed acyclic graph (DAG) in memory, including parent/child edges, depths, and ancestor sets.
Load annotations: load_gaf parses a GAF file to build a mapping from genes (db_object_symbol) to their annotated GO terms. Obsolete terms are automatically remapped via replaced_by or consider fields when possible.
Build the counter: build_term_counter walks the annotations and computes per-term counts with ancestor propagation, then derives IC values for every term. The resulting TermCounter is passed to similarity functions.
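To make "counts with ancestor propagation" concrete, here is a minimal pure-Python sketch on a toy DAG. It is not GO3's implementation: the `PARENTS` graph, the helper names, and the choice to count one unit per gene-term annotation are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy DAG: term -> direct parents (not real GO data).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],
}


def ancestors(term, parents=PARENTS):
    """Return the term itself plus every term reachable via parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def propagate_counts(gene_to_terms, parents=PARENTS):
    """Count annotations per term, crediting each annotation to all ancestors."""
    counts = Counter()
    for terms in gene_to_terms.values():
        for term in terms:
            for t in ancestors(term, parents):
                counts[t] += 1
    return counts


# Hypothetical gene-to-term mapping.
toy_annotations = {"gene1": ["C"], "gene2": ["A"], "gene3": ["B"]}
toy_counts = propagate_counts(toy_annotations)
print(toy_counts["root"], toy_counts["A"], toy_counts["B"], toy_counts["C"])
# prints: 3 2 2 1
```

Because every annotation also counts toward each ancestor, the root accumulates all three annotations while the leaf term `C` keeps only its own.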
Information Content (IC)
Information Content quantifies how specific a GO term is within its namespace. It is defined as:
$$\mathrm{IC}(t) = -\log\left(\frac{\mathrm{count}(t)}{\mathrm{total}(\mathrm{namespace})}\right)$$
where count(t) is the number of annotations for term t (including propagated annotations from descendant terms) and total(namespace) is the total annotation count for that namespace.
Intuitively, rare terms have high IC (they are more informative), while the root term – which is an ancestor of every term – has IC close to zero.
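As a worked illustration of that intuition, the sketch below evaluates the negative log of the ratio count(t) / total(namespace) for a few made-up counts. The natural logarithm and the `information_content` helper are assumptions; the base GO3 uses internally is not stated here.

```python
import math


def information_content(count, total):
    """Return log(total / count), which equals -log(count / total).

    The natural logarithm is an assumption made for this sketch.
    """
    if count <= 0 or total <= 0:
        raise ValueError("count and total must be positive")
    return math.log(total / count)


# Hypothetical annotation counts within one namespace.
toy_total = 1000
for label, count in [("root", 1000), ("mid-level", 100), ("rare", 1)]:
    print(label, round(information_content(count, toy_total), 3))
# prints:
# root 0.0
# mid-level 2.303
# rare 6.908
```

A term annotated as often as the namespace total carries no information, while a term seen once in a thousand annotations scores highest.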
MICA (Most Informative Common Ancestor)
Many IC-based methods compare two terms by finding their Most Informative Common Ancestor (MICA): the common ancestor with the highest IC. Since IC grows with specificity, the MICA is the most specific term that subsumes both query terms.
For example, the Resnik similarity of two terms is simply IC(MICA), and Lin similarity normalizes it by the sum of individual ICs.
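The following sketch shows how MICA, Resnik, and Lin fit together on a toy example. The IC values, ancestor sets, and function names are hypothetical, and Lin is written in its standard textbook form, 2 × IC(MICA) / (IC(t1) + IC(t2)); GO3's own implementation may handle edge cases differently.

```python
# Hypothetical IC values and ancestor sets for a toy DAG (not real GO data).
# Each ancestor set includes the term itself.
IC = {"root": 0.0, "A": 1.0, "B": 1.5, "C": 3.0, "D": 4.0}
ANCESTORS = {
    "C": {"C", "A", "root"},
    "D": {"D", "A", "B", "root"},
}


def mica(t1, t2):
    """Return the common ancestor with the highest IC, or None if there is none."""
    common = ANCESTORS[t1] & ANCESTORS[t2]
    return max(common, key=lambda t: IC[t]) if common else None


def resnik(t1, t2):
    """Resnik similarity: the IC of the MICA (0.0 without a common ancestor)."""
    m = mica(t1, t2)
    return IC[m] if m is not None else 0.0


def lin(t1, t2):
    """Lin similarity: 2 * IC(MICA) normalized by the sum of both terms' ICs."""
    denom = IC[t1] + IC[t2]
    return 2.0 * resnik(t1, t2) / denom if denom > 0 else 0.0


print(mica("C", "D"))    # prints: A
print(resnik("C", "D"))  # prints: 1.0
```

Here `C` and `D` share the ancestors `A` and `root`; `A` has the higher IC, so it is the MICA, and the Lin score is 2 × 1.0 / (3.0 + 4.0).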
Namespaces
GO3 uses standard GO sub-ontologies:
BP: Biological Process
MF: Molecular Function
CC: Cellular Component
For gene-level APIs, select namespace explicitly via the ontology argument.
Troubleshooting
| Problem | Solution |
|---|---|
| | Verify the file path is correct. Omit the path in |
| | Gene names must match the |
| Similarity is always 0.0 | Terms may be in different namespaces, one or both IDs may be invalid, or the terms may have no common ancestor. Verify both terms belong to the same namespace. |
| Wrong namespace in | The |
| Which OBO file to use? | Use |
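When a similarity comes back as 0.0, one cheap first check is whether both term IDs are well formed. The helper below is a hypothetical sketch, not part of GO3: it only tests the standard `GO:` prefix followed by seven digits and cannot tell whether the term actually exists in the loaded ontology.

```python
import re

# GO identifiers consist of the prefix "GO:" followed by exactly seven digits.
GO_ID_PATTERN = re.compile(r"GO:\d{7}")


def is_well_formed_go_id(term_id):
    """Return True if term_id looks like a GO identifier (syntax check only)."""
    return bool(GO_ID_PATTERN.fullmatch(term_id))


for candidate in ["GO:0006397", "GO:6397", "go:0006397", "GO:0008380 "]:
    print(repr(candidate), is_well_formed_go_id(candidate))
```

Only the first candidate passes; the others fail on a missing zero padding, a lowercase prefix, and trailing whitespace respectively.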
Next steps
Examples for end-to-end usage patterns
Semantic Similarity for available methods and formulas
Choosing a Similarity Method for picking the right method
Performance Guide for throughput-oriented workflows
Benchmarks for reproducible comparisons