Examples¶
This page shows practical GO3 workflows from initialization to large-scale comparisons.
Setup¶
import go3
# Load the GO directed acyclic graph from an OBO file.
# This caches the ontology globally -- call once per session.
go3.load_go_terms("go-basic.obo")
# Parse a GAF file to build a gene-to-GO mapping.
# The returned list contains one GAFAnnotation per valid line.
annots = go3.load_gaf("goa_human.gaf")
# Build annotation counts with ancestor propagation and compute IC values.
# The counter object is required by IC-based similarity methods.
counter = go3.build_term_counter(annots)
Term-to-term similarity¶
go1 = "GO:0006397" # mRNA processing
go2 = "GO:0008380" # RNA splicing
# These two BP terms are closely related (splicing is part of mRNA processing),
# so we expect high similarity.
# Available methods: resnik, lin, jc, simrel, iccoef, graphic, wang, topoicsim
sim = go3.semantic_similarity(go1, go2, "lin", counter)
print("Lin similarity:", sim) # e.g. ~0.71
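For readers unfamiliar with the "lin" method, the standard Lin formula is twice the IC of the most informative common ancestor divided by the sum of the two terms' IC values. The sketch below works on hand-picked toy IC values and does not call GO3. The function name `lin_similarity` and the zero-denominator convention are assumptions of this sketch and may not match GO3 internals.

```python
def lin_similarity(ic_a: float, ic_b: float, ic_mica: float) -> float:
    """Lin similarity from the IC of two terms and of their
    most informative common ancestor (MICA)."""
    denom = ic_a + ic_b
    if denom == 0.0:
        # Convention chosen for this sketch: two zero-information terms score 0.0.
        return 0.0
    return 2.0 * ic_mica / denom


# Hypothetical IC values, chosen only for illustration.
toy_ic = {"term_a": 6.0, "term_b": 7.0, "mica": 4.5}
sim_example = lin_similarity(toy_ic["term_a"], toy_ic["term_b"], toy_ic["mica"])
print(f"Lin similarity (toy values): {sim_example:.3f}")  # 0.692
```

Because the MICA can never be more informative than either term, the score stays in the range 0 to 1.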
Batch term similarity¶
list1 = ["GO:0006397", "GO:0008380", "GO:0008150"]
list2 = ["GO:0008380", "GO:0006397", "GO:0009987"]
# Computes similarity for each aligned pair: (list1[0], list2[0]), (list1[1], list2[1]), ...
# Much faster than looping over semantic_similarity for large lists.
scores = go3.batch_similarity(list1, list2, "resnik", counter)
print(scores) # list of 3 floats
The two lists must have the same length.
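Because the batch call scores aligned pairs, an all-vs-all comparison needs the pairs expanded into two equal-length lists first. One possible way to do that with the standard library is sketched below. The helper name `aligned_pair_lists` is hypothetical and not part of GO3.

```python
from itertools import combinations


def aligned_pair_lists(terms):
    """Expand a list of terms into two equal-length lists that
    together cover every unordered pair exactly once."""
    pairs = list(combinations(terms, 2))
    left = [a for a, _ in pairs]
    right = [b for _, b in pairs]
    return left, right


terms = ["GO:0006397", "GO:0008380", "GO:0009987"]
left, right = aligned_pair_lists(terms)
for a, b in zip(left, right):
    print(a, b)
```

The resulting `left` and `right` lists have the same length by construction, so they could be passed as the two list arguments of `go3.batch_similarity`.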
Term-set similarity¶
terms_a = ["GO:0006397", "GO:0008380"]
terms_b = ["GO:0008380", "GO:0009987"]
# Groupwise strategies aggregate pairwise term similarities into a single set-level score.
sim_bma = go3.termset_similarity(terms_a, terms_b, term_similarity="lin", groupwise="bma", counter=counter)
sim_max = go3.termset_similarity(terms_a, terms_b, term_similarity="lin", groupwise="max", counter=counter)
sim_avg = go3.termset_similarity(terms_a, terms_b, term_similarity="lin", groupwise="avg", counter=counter)
sim_h = go3.termset_similarity(terms_a, terms_b, term_similarity="lin", groupwise="hausdorff", counter=counter)
sim_gic = go3.termset_similarity(terms_a, terms_b, term_similarity="lin", groupwise="simgic", counter=counter)
print(sim_bma, sim_max, sim_avg, sim_h, sim_gic)
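To make the groupwise strategies more concrete, the sketch below aggregates a toy pairwise similarity matrix (rows for the first term set, columns for the second) using common textbook definitions of max, average, and best-match average. These definitions are assumptions of the sketch. GO3's exact variants, in particular for BMA, may differ, and the matrix values are invented.

```python
def groupwise_scores(matrix):
    """Aggregate a pairwise similarity matrix into set-level scores."""
    row_best = [max(row) for row in matrix]        # best match for each row term
    col_best = [max(col) for col in zip(*matrix)]  # best match for each column term
    flat = [value for row in matrix for value in row]
    return {
        "max": max(flat),
        "avg": sum(flat) / len(flat),
        # BMA variant used here: mean over all best matches in both directions.
        "bma": (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best)),
    }


# Toy pairwise similarities, for illustration only.
pairwise = [
    [0.5, 0.25],
    [1.0, 0.25],
]
toy_scores = groupwise_scores(pairwise)
print(toy_scores)  # {'max': 1.0, 'avg': 0.5, 'bma': 0.6875}
```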
Gene-to-gene similarity¶
# Compare two genes using their BP annotations.
# GO3 looks up each gene's annotated GO terms, then applies
# pairwise term similarity + groupwise aggregation.
sim = go3.compare_genes("BRCA1", "CASP8", "BP", "lin", "bma", counter)
print(sim) # e.g. ~0.30
Batch gene similarity¶
pairs = [("TP53", "BRCA1"), ("EGFR", "AKT1"), ("GSDME", "NLRP1")]
# Batch gene comparison is parallelized internally -- much faster than
# calling compare_genes in a Python loop for hundreds/thousands of pairs.
scores = go3.compare_gene_pairs_batch(pairs, "BP", "lin", "bma", counter)
print(scores) # list of 3 floats
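Since the batch call returns one score per input pair in the same order, ranking the pairs is a matter of zipping and sorting. A minimal sketch follows. The helper `rank_pairs` is hypothetical and the scores are made up rather than produced by GO3.

```python
def rank_pairs(pairs, scores, top_n=None):
    """Return (pair, score) tuples sorted from most to least similar."""
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    ranked = sorted(zip(pairs, scores), key=lambda item: item[1], reverse=True)
    return ranked if top_n is None else ranked[:top_n]


toy_pairs = [("TP53", "BRCA1"), ("EGFR", "AKT1"), ("GSDME", "NLRP1")]
toy_pair_scores = [0.42, 0.61, 0.18]  # invented values, not real GO3 output

ranked = rank_pairs(toy_pairs, toy_pair_scores)
for (gene_a, gene_b), score in ranked:
    print(f"{gene_a}-{gene_b}: {score:.2f}")
```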
Ontology traversal¶
# Retrieve a term object by its GO ID
term = go3.get_term_by_id("GO:0006397")
print(term.name) # "mRNA processing"
print(term.namespace) # "biological_process"
print(term.depth) # maximum distance to root
print(term.level) # minimum distance to root
# Get all ancestors (includes the term itself)
ancs = go3.ancestors("GO:0006397")
print(len(ancs), "ancestors (including self)")
# Find the deepest common ancestor of two terms
dca = go3.deepest_common_ancestor("GO:0006397", "GO:0008380")
print("DCA:", dca)
# Common ancestors shared by both terms
common = go3.common_ancestor("GO:0006397", "GO:0008380")
print(len(common), "common ancestors")
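The traversal concepts above (ancestors including self, common ancestors, deepest common ancestor) can be illustrated on a tiny hand-built DAG. Everything in this sketch is hypothetical, including the toy term names and all three helper functions. It shows the ideas only and is not GO3's implementation.

```python
# Toy DAG: each term maps to its direct parents. Not the real GO graph.
toy_parents = {
    "root": [],
    "A": ["root"],
    "B": ["A"],
    "C": ["A"],
}


def ancestors_with_self(term, parents):
    """Collect the term and every term reachable by following parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def depth(term, parents):
    """Longest path from the term up to a root (root has depth 0)."""
    direct = parents[term]
    return 0 if not direct else 1 + max(depth(p, parents) for p in direct)


def deepest_common_ancestor(term_a, term_b, parents):
    """Common ancestor with the greatest depth, or None if there is none."""
    shared = ancestors_with_self(term_a, parents) & ancestors_with_self(term_b, parents)
    if not shared:
        return None
    # sorted() makes tie-breaking deterministic.
    return max(sorted(shared), key=lambda t: depth(t, parents))


toy_common = ancestors_with_self("B", toy_parents) & ancestors_with_self("C", toy_parents)
toy_dca = deepest_common_ancestor("B", "C", toy_parents)
print(sorted(toy_common), toy_dca)
```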
Inspecting IC values¶
# term_ic returns the Information Content for a single term.
# Higher IC means the term is more specific (fewer annotations).
ic_specific = go3.term_ic("GO:0006397", counter) # mRNA processing
ic_general = go3.term_ic("GO:0008150", counter) # biological_process (root)
print(f"IC(mRNA processing) = {ic_specific:.4f}") # high value (specific term)
print(f"IC(biological_process) = {ic_general:.4f}") # near 0 (root term)
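The relationship between annotation counts and IC can be sketched with the usual definition, IC(t) = -log(count(t) / total), applied to propagated counts. The counts below are invented. The natural log and the choice of the root count as the total are assumptions of this sketch, and GO3 may normalize differently.

```python
import math


def information_content(term_counts, total):
    """IC per term from propagated annotation counts.
    Uses log(total / count), which equals -log(count / total)."""
    return {
        term: math.log(total / count)
        for term, count in term_counts.items()
        if count > 0  # terms without annotations have undefined IC here
    }


# Invented propagated counts: the root accumulates every annotation.
toy_counts = {"root": 1000, "mid": 100, "leaf": 10}
toy_ic_values = information_content(toy_counts, total=toy_counts["root"])

for term, value in toy_ic_values.items():
    print(f"IC({term}) = {value:.4f}")
```

With these toy counts the root has an IC of exactly 0, and each tenfold drop in annotation count adds about 2.30 to the IC.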
Distance matrix for downstream analysis¶
genes = ["BRCA1", "CASP8", "TP53", "EGFR", "AKT1"]
# Build an all-vs-all distance matrix suitable for clustering or embedding.
# distance_transform converts similarity to distance (see Utilities docs).
ordered_genes, dist = go3.gene_distance_matrix(
    genes,
    ontology="BP",
    similarity="lin",
    groupwise="bma",
    counter=counter,
    distance_transform="auto",
)
print(ordered_genes)
print(len(dist), len(dist[0])) # 5x5 matrix
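Clustering routines such as `scipy.cluster.hierarchy.linkage` expect a condensed distance vector (the upper triangle, row by row) rather than a square matrix. The sketch below shows that conversion in plain Python on a toy matrix. It assumes the matrix is symmetric and indexable as `dist[i][j]`. The helper name `to_condensed` is hypothetical, and `scipy.spatial.distance.squareform` performs the same conversion for NumPy arrays.

```python
def to_condensed(square):
    """Flatten the strict upper triangle of a square matrix, row by row."""
    n = len(square)
    return [square[i][j] for i in range(n) for j in range(i + 1, n)]


# Toy symmetric distance matrix, for illustration only.
toy_dist = [
    [0.0, 0.3, 0.8],
    [0.3, 0.0, 0.5],
    [0.8, 0.5, 0.0],
]
condensed = to_condensed(toy_dist)
print(condensed)  # [0.3, 0.8, 0.5]
```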
Embedding helpers (t-SNE / UMAP)¶
# Requires go3[viz] extras
genes = ["BRCA1", "CASP8", "TP53", "EGFR", "AKT1"]
# t-SNE embedding from GO semantic similarity distances
genes, emb_tsne = go3.tsne_genes(
    genes,
    "BP",
    "lin",
    "bma",
    counter,
    perplexity=2.0,  # must be < number of genes
    n_iter=500,
    random_state=42,
)
# UMAP embedding from the same distances
genes, emb_umap = go3.umap_genes(
    genes,
    "BP",
    "lin",
    "bma",
    counter,
    n_neighbors=3,  # must be < number of genes
    random_state=42,
)
Quick plotting helpers¶
genes, emb, fig, ax = go3.plot_tsne_genes(
    genes=["BRCA1", "CASP8", "TP53", "EGFR", "AKT1"],
    ontology="BP",
    similarity="lin",
    groupwise="bma",
    counter=counter,
    perplexity=2.0,
    n_iter=500,
    random_state=42,
    annotate="auto",  # label points that don't overlap
    title="GO3 t-SNE",
)
# Reuse generic plotting for custom embeddings
fig2, ax2 = go3.plot_embedding(emb, genes=genes, annotate="all", title="Custom plot")
Thread control¶
# Initialize the Rayon thread pool before heavy batch workloads.
# Call once at startup -- the pool size cannot be changed after first use.
go3.set_num_threads(8)
Basic error handling patterns¶
# Unknown gene in compare_genes -> ValueError
try:
go3.compare_genes("FAKE_GENE", "BRCA1", "BP", "lin", "bma", counter)
except ValueError as exc:
print("compare_genes error:", exc)
# A missing term or a cross-namespace term pair does not raise; it returns a similarity of 0.0
print(go3.semantic_similarity("GO:9999999", "GO:0006397", "lin", counter)) # 0.0
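When scoring many gene pairs one at a time, it can be convenient to turn the `ValueError` into a missing value instead of aborting the loop. The wrapper below is one possible pattern. `compare_or_none` and the stand-in `fake_compare` are hypothetical, and in real use the callable would wrap `go3.compare_genes` with the remaining arguments fixed.

```python
def compare_or_none(compare_fn, gene_a, gene_b):
    """Call compare_fn and return None if it rejects a gene with ValueError."""
    try:
        return compare_fn(gene_a, gene_b)
    except ValueError:
        return None


def fake_compare(gene_a, gene_b):
    """Stand-in for a real comparison; raises ValueError for unknown genes."""
    known = {"BRCA1", "CASP8"}
    for gene in (gene_a, gene_b):
        if gene not in known:
            raise ValueError(f"unknown gene: {gene}")
    return 0.3  # invented score


safe_hit = compare_or_none(fake_compare, "BRCA1", "CASP8")
safe_miss = compare_or_none(fake_compare, "FAKE_GENE", "BRCA1")
print(safe_hit, safe_miss)  # 0.3 None
```

Returning `None` keeps unknown genes distinguishable from a genuine similarity of 0.0.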
End-to-end notebook: Parkinson gene panel¶
An end-to-end analysis is provided in scripts/Supplementary Notebook S2.ipynb
(view on GitHub).
It applies GO3 to the Genomics England Parkinson Disease and Complex Parkinsonism
gene panel and walks through a full functional-genomics pipeline:
1. Load the ontology and annotations.
2. Quantify redundancy via all-vs-all term similarity (go3.batch_similarity).
3. Cluster semantically overlapping BP terms with hierarchical clustering on the Lin distance matrix and select the highest-IC representative per cluster (~48% reduction in term count).
4. Compute gene-level similarity with go3.gene_distance_matrix (Lin + BMA) and rank the most functionally similar gene pairs.
5. Visualize the functional landscape via go3.plot_tsne_genes.
The resulting groups recover known biology: the PINK1/PRKN/PARK7 mitophagy module, the GCH1/TH/SPR dopamine-synthesis axis, and metal-ion-transport genes (SLC30A10/SLC39A14/FTL). This demonstrates how GO3 can condense a large, redundant enrichment output into an interpretable summary in a single notebook.
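The representative-selection step of the pipeline (keeping the highest-IC term in each cluster) can be sketched independently of the clustering itself. The cluster labels and IC values below are invented, and `pick_representatives` is a hypothetical helper rather than code from the notebook.

```python
def pick_representatives(cluster_labels, ic_by_term):
    """For each cluster label, keep the member term with the highest IC."""
    best = {}
    for term, label in cluster_labels.items():
        current = best.get(label)
        if current is None or ic_by_term[term] > ic_by_term[current]:
            best[label] = term
    return best


# Invented cluster assignments and IC values, for illustration only.
toy_labels = {"term_1": 0, "term_2": 0, "term_3": 1, "term_4": 1, "term_5": 1}
toy_term_ic = {"term_1": 3.2, "term_2": 5.1, "term_3": 4.0, "term_4": 2.5, "term_5": 3.9}

representatives = pick_representatives(toy_labels, toy_term_ic)
print(representatives)  # {0: 'term_2', 1: 'term_3'}
```

Here five toy terms collapse to two representatives, the same kind of reduction the notebook reports for the real panel.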