Examples¶

This page shows practical GO3 workflows from initialization to large-scale comparisons.

Setup¶

import go3

# Load the GO directed acyclic graph from an OBO file.
# This caches the ontology globally -- call once per session.
go3.load_go_terms("go-basic.obo")

# Parse a GAF file to build a gene-to-GO mapping.
# The returned list contains one GAFAnnotation per valid line.
annots = go3.load_gaf("goa_human.gaf")

# Build annotation counts with ancestor propagation and compute IC values.
# The counter object is required by IC-based similarity methods.
counter = go3.build_term_counter(annots)

Term-to-term similarity¶

go1 = "GO:0006397"  # mRNA processing
go2 = "GO:0008380"  # RNA splicing

# These two BP terms are closely related (both are kinds of RNA processing),
# so we expect high similarity.
# Available methods: resnik, lin, jc, simrel, iccoef, graphic, wang, topoicsim
sim = go3.semantic_similarity(go1, go2, "lin", counter)
print("Lin similarity:", sim)  # e.g. ~0.71
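For orientation, Resnik and Lin scores are conventionally defined from the information content (IC) of the most informative common ancestor (MICA). The sketch below uses made-up IC values rather than GO3 output, and the helper names are hypothetical; GO3's internals may differ in details such as edge-case handling.

```python
def resnik(ic_mica: float) -> float:
    """Resnik similarity: the IC of the most informative common ancestor."""
    return ic_mica


def lin(ic_a: float, ic_b: float, ic_mica: float) -> float:
    """Lin similarity: 2 * IC(MICA) / (IC(a) + IC(b)); 0.0 if both ICs are 0."""
    denom = ic_a + ic_b
    return 0.0 if denom == 0 else 2.0 * ic_mica / denom


# Illustrative IC values only (not taken from a real annotation set).
ic_a, ic_b, ic_mica = 6.0, 8.0, 5.0
print("Resnik:", resnik(ic_mica))
print("Lin:", round(lin(ic_a, ic_b, ic_mica), 3))
```

Resnik is unbounded and depends only on the shared ancestor, while Lin normalizes by the two terms' own IC and so stays within [0, 1].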

Batch term similarity¶

list1 = ["GO:0006397", "GO:0008380", "GO:0008150"]
list2 = ["GO:0008380", "GO:0006397", "GO:0009987"]

# Computes similarity for each aligned pair: (list1[0], list2[0]), (list1[1], list2[1]), ...
# Much faster than looping over semantic_similarity for large lists.
scores = go3.batch_similarity(list1, list2, "resnik", counter)
print(scores)  # list of 3 floats

The two lists must have the same length.
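The aligned-pair behavior can be pictured as a zip over the two lists with a length check up front. The helper below is a hypothetical pure-Python stand-in that accepts any pairwise scoring function; it is not GO3's implementation and has none of its speed.

```python
from typing import Callable, Sequence


def aligned_scores(
    a: Sequence[str],
    b: Sequence[str],
    score: Callable[[str, str], float],
) -> list:
    """Score (a[0], b[0]), (a[1], b[1]), ... and reject mismatched lengths."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    return [score(x, y) for x, y in zip(a, b)]


def toy_score(x: str, y: str) -> float:
    """Toy scorer for illustration: 1.0 for identical IDs, else 0.0."""
    return 1.0 if x == y else 0.0


demo = aligned_scores(["T1", "T2", "T3"], ["T1", "T9", "T3"], toy_score)
print(demo)
```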

Term-set similarity¶

terms_a = ["GO:0006397", "GO:0008380"]
terms_b = ["GO:0008380", "GO:0009987"]

# Groupwise strategies aggregate pairwise term similarities into a single set-level score.
sim_bma = go3.termset_similarity(terms_a, terms_b, term_similarity="lin", groupwise="bma", counter=counter)
sim_max = go3.termset_similarity(terms_a, terms_b, term_similarity="lin", groupwise="max", counter=counter)
sim_avg = go3.termset_similarity(terms_a, terms_b, term_similarity="lin", groupwise="avg", counter=counter)
sim_h   = go3.termset_similarity(terms_a, terms_b, term_similarity="lin", groupwise="hausdorff", counter=counter)
sim_gic = go3.termset_similarity(terms_a, terms_b, term_similarity="lin", groupwise="simgic", counter=counter)

print(sim_bma, sim_max, sim_avg, sim_h, sim_gic)
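To make the aggregation step concrete, the sketch below computes three of the strategies from a small pairwise similarity matrix. The values are made up, the function name is hypothetical, and the BMA variant shown (the average of the mean row maximum and the mean column maximum) is an assumption; other definitions of BMA exist and GO3 may use a different one.

```python
def groupwise_scores(matrix):
    """matrix[i][j] is the similarity between term i of set A and term j of set B."""
    flat = [value for row in matrix for value in row]
    row_best = [max(row) for row in matrix]          # best match for each A term
    col_best = [max(col) for col in zip(*matrix)]    # best match for each B term
    return {
        "max": max(flat),
        "avg": sum(flat) / len(flat),
        "bma": (sum(row_best) / len(row_best) + sum(col_best) / len(col_best)) / 2,
    }


# Made-up pairwise similarities for two sets of two terms each.
toy = [
    [0.7, 0.2],
    [1.0, 0.3],
]
scores = groupwise_scores(toy)
print({name: round(value, 2) for name, value in scores.items()})
```

On this toy matrix "max" rewards a single strong match, "avg" is pulled down by the weak pairs, and "bma" sits in between.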

Gene-to-gene similarity¶

# Compare two genes using their BP annotations.
# GO3 looks up each gene's annotated GO terms, then applies
# pairwise term similarity + groupwise aggregation.
sim = go3.compare_genes("BRCA1", "CASP8", "BP", "lin", "bma", counter)
print(sim)  # e.g. ~0.30

Batch gene similarity¶

pairs = [("TP53", "BRCA1"), ("EGFR", "AKT1"), ("GSDME", "NLRP1")]

# Batch gene comparison is parallelized internally -- much faster than
# calling compare_genes in a Python loop for hundreds/thousands of pairs.
scores = go3.compare_gene_pairs_batch(pairs, "BP", "lin", "bma", counter)
print(scores)  # list of 3 floats
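For all-vs-all comparisons, the list of pairs can be generated with the standard library instead of being written by hand. This is a small sketch; the gene list is only an example, and the resulting `pairs` has the same list-of-tuples shape used above.

```python
from itertools import combinations

genes = ["TP53", "BRCA1", "EGFR", "AKT1"]

# Every unordered pair exactly once: n * (n - 1) / 2 pairs for n genes.
pairs = list(combinations(genes, 2))
print(len(pairs))
print(pairs[0])

# The list can then be passed to the batch call shown above, e.g.
# scores = go3.compare_gene_pairs_batch(pairs, "BP", "lin", "bma", counter)
```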

Ontology traversal¶

# Retrieve a term object by its GO ID
term = go3.get_term_by_id("GO:0006397")
print(term.name)       # "mRNA processing"
print(term.namespace)  # "biological_process"
print(term.depth)      # maximum distance to root
print(term.level)      # minimum distance to root

# Get all ancestors (includes the term itself)
ancs = go3.ancestors("GO:0006397")
print(len(ancs), "ancestors (including self)")

# Find the deepest common ancestor of two terms
dca = go3.deepest_common_ancestor("GO:0006397", "GO:0008380")
print("DCA:", dca)

# Common ancestors shared by both terms
common = go3.common_ancestor("GO:0006397", "GO:0008380")
print(len(common), "common ancestors")
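The traversal concepts above (ancestors including self, depth versus level, deepest common ancestor) can be illustrated on a toy DAG. The graph, term names, and helpers below are hypothetical and written in plain Python; they mirror the definitions in the comments above, not GO3's implementation.

```python
# Toy DAG: each term maps to its direct parents.
parents = {
    "root": [],
    "A": ["root"],
    "B": ["A"],
    "C": ["A", "root"],
    "D": ["B", "C"],
}


def ancestors(term):
    """All ancestors of a term, including the term itself."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def depth(term):
    """Maximum distance to the root (longest path)."""
    return 0 if not parents[term] else 1 + max(depth(p) for p in parents[term])


def level(term):
    """Minimum distance to the root (shortest path)."""
    return 0 if not parents[term] else 1 + min(level(p) for p in parents[term])


common = ancestors("B") & ancestors("C")
deepest = max(common, key=depth)
print(sorted(common), deepest, depth("D"), level("D"))
```

Term "D" shows why depth and level can differ: its longest path to the root has three edges, while the shortcut through "C" gives a shortest path of two.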

Inspecting IC values¶

# term_ic returns the Information Content for a single term.
# Higher IC means the term is more specific (fewer annotations after ancestor propagation).
ic_specific = go3.term_ic("GO:0006397", counter)   # mRNA processing
ic_general  = go3.term_ic("GO:0008150", counter)    # biological_process (root)

print(f"IC(mRNA processing) = {ic_specific:.4f}")   # high value (specific term)
print(f"IC(biological_process) = {ic_general:.4f}") # near 0 (root term)
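The relationship between annotation counts and IC can be sketched on a toy ontology. This assumes the common definition IC(t) = log(N / n_t), where n_t is the propagated count of term t and N is the count at the root; the natural logarithm, the toy graph, and the counts are all assumptions for illustration, and GO3 may use a different base or normalization.

```python
import math

# Toy DAG (term -> direct parents) and direct annotation counts.
parents = {"root": [], "A": ["root"], "B": ["A"], "C": ["A"]}
direct = {"root": 0, "A": 2, "B": 5, "C": 1}


def ancestors(term):
    """All ancestors of a term, including the term itself."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Ancestor propagation: an annotation to a term also counts for its ancestors.
propagated = {term: 0 for term in parents}
for term, count in direct.items():
    for anc in ancestors(term):
        propagated[anc] += count

total = propagated["root"]
ic = {t: math.log(total / c) for t, c in propagated.items() if c > 0}
for term in ("root", "A", "B", "C"):
    print(term, propagated[term], f"{ic[term]:.3f}")
```

The root collects every annotation, so its IC is 0, while the rarely annotated leaf "C" gets the highest value.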

Distance matrix for downstream analysis¶

genes = ["BRCA1", "CASP8", "TP53", "EGFR", "AKT1"]

# Build an all-vs-all distance matrix suitable for clustering or embedding.
# distance_transform converts similarity to distance (see Utilities docs).
ordered_genes, dist = go3.gene_distance_matrix(
    genes,
    ontology="BP",
    similarity="lin",
    groupwise="bma",
    counter=counter,
    distance_transform="auto",
)

print(ordered_genes)
print(len(dist), len(dist[0]))  # 5x5 matrix
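Once a labeled distance matrix is available, ranking the closest pairs takes only a few lines. The sketch below assumes the matrix is a square, symmetric nested sequence indexed as `dist[i][j]`, as the `len(dist[0])` call above suggests; the helper name and toy values are hypothetical.

```python
def closest_pairs(labels, dist, top=3):
    """Return the `top` smallest off-diagonal distances as (distance, a, b) tuples."""
    pairs = [
        (dist[i][j], labels[i], labels[j])
        for i in range(len(labels))
        for j in range(i + 1, len(labels))  # upper triangle only
    ]
    return sorted(pairs)[:top]


# Made-up symmetric distances for three placeholder genes.
labels = ["G1", "G2", "G3"]
toy = [
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.5],
    [0.9, 0.5, 0.0],
]
ranked = closest_pairs(labels, toy, top=2)
print(ranked)
```

With real data, `labels` would be `ordered_genes` and `toy` would be `dist` from the call above.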

Embedding helpers (t-SNE / UMAP)¶

# Requires go3[viz] extras
genes = ["BRCA1", "CASP8", "TP53", "EGFR", "AKT1"]

# t-SNE embedding from GO semantic similarity distances
genes, emb_tsne = go3.tsne_genes(
    genes,
    "BP",
    "lin",
    "bma",
    counter,
    perplexity=2.0,   # must be < number of genes
    n_iter=500,
    random_state=42,
)

# UMAP embedding from the same distances
genes, emb_umap = go3.umap_genes(
    genes,
    "BP",
    "lin",
    "bma",
    counter,
    n_neighbors=3,    # must be < number of genes
    random_state=42,
)

Quick plotting helpers¶

genes, emb, fig, ax = go3.plot_tsne_genes(
    genes=["BRCA1", "CASP8", "TP53", "EGFR", "AKT1"],
    ontology="BP",
    similarity="lin",
    groupwise="bma",
    counter=counter,
    perplexity=2.0,
    n_iter=500,
    random_state=42,
    annotate="auto",  # label points that don't overlap
    title="GO3 t-SNE",
)

# Reuse generic plotting for custom embeddings
fig2, ax2 = go3.plot_embedding(emb, genes=genes, annotate="all", title="Custom plot")

Thread control¶

# Initialize the Rayon thread pool before heavy batch workloads.
# Call once at startup -- the pool size cannot be changed after first use.
go3.set_num_threads(8)
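Rather than hard-coding the thread count, it can be derived from the machine. This is one possible heuristic, not a GO3 recommendation: the `pick_threads` helper is hypothetical and simply leaves one core free for the rest of the system.

```python
import os


def pick_threads(reserve: int = 1) -> int:
    """Use all detected cores minus `reserve`, but never fewer than one."""
    available = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, available - reserve)


n_threads = pick_threads()
print(n_threads)

# Then, once at startup and before any batch work:
# go3.set_num_threads(n_threads)
```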

Basic error handling patterns¶

# Unknown gene in compare_genes -> ValueError
try:
    go3.compare_genes("FAKE_GENE", "BRCA1", "BP", "lin", "bma", counter)
except ValueError as exc:
    print("compare_genes error:", exc)

# A missing term or a cross-namespace pair returns 0.0 instead of raising
print(go3.semantic_similarity("GO:9999999", "GO:0006397", "lin", counter))  # 0.0
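When scoring many genes in a loop, it can be convenient to turn the ValueError into a missing value so one unknown gene does not abort the run. The wrapper below is a hypothetical helper, and `fake_compare` is a stand-in used only to keep the sketch runnable without loaded annotations.

```python
from typing import Callable, Optional


def score_or_none(fn: Callable[..., float], *args) -> Optional[float]:
    """Call fn(*args); return None instead of propagating ValueError."""
    try:
        return fn(*args)
    except ValueError:
        return None


def fake_compare(gene_a: str, gene_b: str) -> float:
    """Stand-in for go3.compare_genes: raises for an unknown gene."""
    if "FAKE_GENE" in (gene_a, gene_b):
        raise ValueError(f"unknown gene in pair ({gene_a}, {gene_b})")
    return 0.3


print(score_or_none(fake_compare, "BRCA1", "CASP8"))
print(score_or_none(fake_compare, "FAKE_GENE", "BRCA1"))
```

With GO3 loaded, the call would look like `score_or_none(go3.compare_genes, "FAKE_GENE", "BRCA1", "BP", "lin", "bma", counter)`. Returning None keeps failures distinguishable from a genuine score of 0.0.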

End-to-end notebook: Parkinson gene panel¶

An end-to-end analysis is provided in scripts/Supplementary Notebook S2.ipynb. It applies GO3 to the Genomics England Parkinson Disease and Complex Parkinsonism gene panel and walks through a full functional-genomics pipeline:

1. Load the ontology and annotations.
2. Quantify redundancy via all-vs-all term similarity (go3.batch_similarity).
3. Cluster semantically overlapping BP terms with hierarchical clustering on the Lin distance matrix and select the highest-IC representative per cluster (~48% reduction in term count).
4. Compute gene-level similarity with go3.gene_distance_matrix (Lin + BMA) and rank the most functionally similar gene pairs.
5. Visualize the functional landscape via go3.plot_tsne_genes.

The resulting groups recover known biology: the PINK1/PRKN/PARK7 mitophagy module, the GCH1/TH/SPR dopamine-synthesis axis, and metal-ion-transport genes (SLC30A10/SLC39A14/FTL). This demonstrates how GO3 can condense a large, redundant enrichment output into an interpretable summary in a single notebook.
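The representative-selection step of the pipeline (keep the highest-IC term per cluster) can be sketched independently of the clustering itself. The cluster labels, term names, and IC values below are made up, and the helper is hypothetical rather than code from the notebook; in practice the labels would come from the hierarchical clustering and the IC values from go3.term_ic.

```python
def pick_representatives(cluster_of, ic):
    """For each cluster label, keep the member term with the highest IC."""
    best = {}
    for term, cluster in cluster_of.items():
        if cluster not in best or ic[term] > ic[best[cluster]]:
            best[cluster] = term
    return best


# Made-up clustering of five placeholder terms into two clusters.
cluster_of = {"T1": 0, "T2": 0, "T3": 1, "T4": 1, "T5": 1}
ic = {"T1": 3.2, "T2": 5.1, "T3": 7.0, "T4": 6.4, "T5": 2.2}

reps = pick_representatives(cluster_of, ic)
print(reps)
print(f"kept {len(reps)} of {len(cluster_of)} terms")
```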