Examples
========

This page shows practical GO3 workflows from initialization to large-scale comparisons.

Setup
-----

.. code-block:: python

   import go3

   # Load the GO directed acyclic graph from an OBO file.
   # This caches the ontology globally -- call once per session.
   go3.load_go_terms("go-basic.obo")

   # Parse a GAF file to build a gene-to-GO mapping.
   # The returned list contains one GAFAnnotation per valid line.
   annots = go3.load_gaf("goa_human.gaf")

   # Build annotation counts with ancestor propagation and compute IC values.
   # The counter object is required by IC-based similarity methods.
   counter = go3.build_term_counter(annots)

Term-to-term similarity
-----------------------

.. code-block:: python

   go1 = "GO:0006397"  # mRNA processing
   go2 = "GO:0008380"  # RNA splicing

   # These two BP terms are closely related (splicing is part of mRNA processing),
   # so we expect high similarity.
   # Available methods: resnik, lin, jc, simrel, iccoef, graphic, wang, topoicsim
   sim = go3.semantic_similarity(go1, go2, "lin", counter)
   print("Lin similarity:", sim)  # e.g. ~0.71

Batch term similarity
---------------------

.. code-block:: python

   list1 = ["GO:0006397", "GO:0008380", "GO:0008150"]
   list2 = ["GO:0008380", "GO:0006397", "GO:0009987"]

   # Computes similarity for each aligned pair: (list1[0], list2[0]), (list1[1], list2[1]), ...
   # Much faster than looping over semantic_similarity for large lists.
   scores = go3.batch_similarity(list1, list2, "resnik", counter)
   print(scores)  # list of 3 floats

The two lists must have the same length.

Term-set similarity
-------------------

.. code-block:: python

   terms_a = ["GO:0006397", "GO:0008380"]
   terms_b = ["GO:0008380", "GO:0009987"]

   # Groupwise strategies aggregate pairwise term similarities into a single set-level score.
   sim_bma = go3.termset_similarity(terms_a, terms_b, term_similarity="lin", groupwise="bma", counter=counter)
   sim_max = go3.termset_similarity(terms_a, terms_b, term_similarity="lin", groupwise="max", counter=counter)
   sim_avg = go3.termset_similarity(terms_a, terms_b, term_similarity="lin", groupwise="avg", counter=counter)
   sim_h = go3.termset_similarity(terms_a, terms_b, term_similarity="lin", groupwise="hausdorff", counter=counter)
   sim_gic = go3.termset_similarity(terms_a, terms_b, term_similarity="lin", groupwise="simgic", counter=counter)

   print(sim_bma, sim_max, sim_avg, sim_h, sim_gic)

Gene-to-gene similarity
-----------------------

.. code-block:: python

   # Compare two genes using their BP annotations.
   # GO3 looks up each gene's annotated GO terms, then applies
   # pairwise term similarity + groupwise aggregation.
   sim = go3.compare_genes("BRCA1", "CASP8", "BP", "lin", "bma", counter)
   print(sim)  # e.g. ~0.30

Batch gene similarity
---------------------

.. code-block:: python

   pairs = [("TP53", "BRCA1"), ("EGFR", "AKT1"), ("GSDME", "NLRP1")]

   # Batch gene comparison is parallelized internally -- much faster than
   # calling compare_genes in a Python loop for hundreds/thousands of pairs.
   scores = go3.compare_gene_pairs_batch(pairs, "BP", "lin", "bma", counter)
   print(scores)  # list of 3 floats

Ontology traversal
------------------

.. code-block:: python

   # Retrieve a term object by its GO ID
   term = go3.get_term_by_id("GO:0006397")
   print(term.name)       # "mRNA processing"
   print(term.namespace)  # "biological_process"
   print(term.depth)      # maximum distance to root
   print(term.level)      # minimum distance to root

   # Get all ancestors (includes the term itself)
   ancs = go3.ancestors("GO:0006397")
   print(len(ancs), "ancestors (including self)")

   # Find the deepest common ancestor of two terms
   dca = go3.deepest_common_ancestor("GO:0006397", "GO:0008380")
   print("DCA:", dca)

   # Common ancestors shared by both terms
   common = go3.common_ancestor("GO:0006397", "GO:0008380")
   print(len(common), "common ancestors")

Inspecting IC values
--------------------

.. code-block:: python

   # term_ic returns the Information Content for a single term.
   # Higher IC means the term is more specific (fewer annotations).
   ic_specific = go3.term_ic("GO:0006397", counter)  # mRNA processing
   ic_general = go3.term_ic("GO:0008150", counter)   # biological_process (root)

   print(f"IC(mRNA processing) = {ic_specific:.4f}")    # high value (specific term)
   print(f"IC(biological_process) = {ic_general:.4f}")  # near 0 (root term)

Distance matrix for downstream analysis
---------------------------------------

.. code-block:: python

   genes = ["BRCA1", "CASP8", "TP53", "EGFR", "AKT1"]

   # Build an all-vs-all distance matrix suitable for clustering or embedding.
   # distance_transform converts similarity to distance (see Utilities docs).
   ordered_genes, dist = go3.gene_distance_matrix(
       genes,
       ontology="BP",
       similarity="lin",
       groupwise="bma",
       counter=counter,
       distance_transform="auto",
   )

   print(ordered_genes)
   print(len(dist), len(dist[0]))  # 5x5 matrix

Embedding helpers (t-SNE / UMAP)
--------------------------------

.. code-block:: python

   # Requires go3[viz] extras
   genes = ["BRCA1", "CASP8", "TP53", "EGFR", "AKT1"]

   # t-SNE embedding from GO semantic similarity distances
   genes, emb_tsne = go3.tsne_genes(
       genes, "BP", "lin", "bma", counter,
       perplexity=2.0,  # must be < number of genes
       n_iter=500,
       random_state=42,
   )

   # UMAP embedding from the same distances
   genes, emb_umap = go3.umap_genes(
       genes, "BP", "lin", "bma", counter,
       n_neighbors=3,  # must be < number of genes
       random_state=42,
   )

Quick plotting helpers
----------------------

.. code-block:: python

   genes, emb, fig, ax = go3.plot_tsne_genes(
       genes=["BRCA1", "CASP8", "TP53", "EGFR", "AKT1"],
       ontology="BP",
       similarity="lin",
       groupwise="bma",
       counter=counter,
       perplexity=2.0,
       n_iter=500,
       random_state=42,
       annotate="auto",  # label points that don't overlap
       title="GO3 t-SNE",
   )

   # Reuse generic plotting for custom embeddings
   fig2, ax2 = go3.plot_embedding(emb, genes=genes, annotate="all", title="Custom plot")

Thread control
--------------

.. code-block:: python

   # Initialize the Rayon thread pool before heavy batch workloads.
   # Call once at startup -- the pool size cannot be changed after first use.
   go3.set_num_threads(8)

Basic error handling patterns
-----------------------------

.. code-block:: python

   # Unknown gene in compare_genes -> ValueError
   try:
       go3.compare_genes("FAKE_GENE", "BRCA1", "BP", "lin", "bma", counter)
   except ValueError as exc:
       print("compare_genes error:", exc)

   # Missing term or cross-namespace term pairs return similarity 0.0
   print(go3.semantic_similarity("GO:9999999", "GO:0006397", "lin", counter))  # 0.0

End-to-end notebook: Parkinson gene panel
-----------------------------------------

An end-to-end analysis is provided in ``scripts/Supplementary Notebook S2.ipynb``
(`view on GitHub `_). It applies GO3 to the Genomics England
*Parkinson Disease and Complex Parkinsonism* gene panel and walks through a
full functional-genomics pipeline:

1. **Load** the ontology and annotations.
2. **Quantify redundancy** via all-vs-all term similarity (``go3.batch_similarity``).
3. **Cluster** semantically overlapping BP terms with hierarchical clustering on
   the Lin distance matrix and select the highest-IC representative per cluster
   (~48% reduction in term count).
4. **Compute gene-level similarity** with ``go3.gene_distance_matrix`` (Lin + BMA)
   and rank the most functionally similar gene pairs.
5. **Visualize** the functional landscape via ``go3.plot_tsne_genes``.

The resulting groups recover known biology: the PINK1/PRKN/PARK7 mitophagy
module, the GCH1/TH/SPR dopamine-synthesis axis, and metal-ion-transport genes
(SLC30A10/SLC39A14/FTL). This demonstrates how GO3 can condense a large,
redundant enrichment output into an interpretable summary in a single notebook.
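
The clustering step of the pipeline (step 3) can be illustrated with a small,
self-contained sketch. This is not the notebook's code: the helper names
``average_linkage_clusters`` and ``pick_representatives``, the cut threshold,
the term labels, and all numeric values are hypothetical and chosen only for
illustration. It uses plain-Python average linkage so that it runs without
extra dependencies; a library routine such as SciPy's hierarchical clustering
would normally take its place. In a real analysis the distances would be
derived from Lin similarities (for example as ``1 - similarity``, which is an
assumption here) and the IC values would come from ``go3.term_ic``.

.. code-block:: python

   from itertools import combinations


   def average_linkage_clusters(labels, dist, threshold):
       """Merge the closest clusters until the best merge exceeds threshold."""
       clusters = [[i] for i in range(len(labels))]

       def linkage(a, b):
           # Average pairwise distance between the members of two clusters.
           return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

       while len(clusters) > 1:
           # Find the pair of clusters with the smallest average distance.
           (x, y), best = min(
               (
                   ((a, b), linkage(clusters[a], clusters[b]))
                   for a, b in combinations(range(len(clusters)), 2)
               ),
               key=lambda item: item[1],
           )
           if best > threshold:
               break
           # combinations() yields a < b, so deleting index y leaves x valid.
           clusters[x] = clusters[x] + clusters[y]
           del clusters[y]
       return [[labels[i] for i in members] for members in clusters]


   def pick_representatives(clusters, ic):
       """Keep the highest-IC (most specific) term from each cluster."""
       return [max(members, key=lambda term: ic[term]) for members in clusters]


   # Toy inputs (hypothetical): four placeholder terms, a symmetric distance
   # matrix, and made-up IC values.
   terms = ["T1", "T2", "T3", "T4"]
   toy_dist = [
       [0.00, 0.10, 0.80, 0.90],
       [0.10, 0.00, 0.70, 0.85],
       [0.80, 0.70, 0.00, 0.20],
       [0.90, 0.85, 0.20, 0.00],
   ]
   toy_ic = {"T1": 3.2, "T2": 5.1, "T3": 4.0, "T4": 2.5}

   clusters = average_linkage_clusters(terms, toy_dist, threshold=0.5)
   representatives = pick_representatives(clusters, toy_ic)
   print(clusters)         # [['T1', 'T2'], ['T3', 'T4']]
   print(representatives)  # ['T2', 'T3']

Choosing the highest-IC member keeps the most specific term of each group,
which matches the selection rule described in step 3; the threshold controls
how aggressively redundant terms are collapsed.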