Utilities and Advanced APIs

Threading

set_num_threads(n_threads)

Controls the maximum number of internal threads used by GO3 batch operations.

import go3
go3.set_num_threads(8)

IC lookup

term_ic(go_id, counter) returns the Information Content for one term.

ic = go3.term_ic("GO:0006397", counter)
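
For intuition, Information Content is commonly defined as the negative log of a term's annotation probability, so rarer terms score higher. The sketch below illustrates that textbook definition with invented counts. The function name `information_content` and the counts are hypothetical, and this is not GO3's implementation, which may differ in log base or in how annotation counts are propagated.

```python
import math


def information_content(term_count: int, total_count: int) -> float:
    """Textbook IC: -ln(term_count / total_count), written as ln(total / term)."""
    if term_count <= 0 or total_count <= 0:
        raise ValueError("counts must be positive")
    if term_count > total_count:
        raise ValueError("term_count cannot exceed total_count")
    return math.log(total_count / term_count)


# Invented counts, for illustration only (not taken from any annotation file).
root_ic = information_content(1000, 1000)  # a term that annotates everything
rare_ic = information_content(10, 1000)    # a term that annotates 1% of genes
```

In this sketch a term covering every annotation carries no information (IC of 0), while the rarer term gets a strictly higher value.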

Gene distance matrices

gene_distance_matrix(genes=None, ontology="BP", similarity="lin", groupwise="bma", counter=..., distance_transform="auto")

Returns (gene_order, distance_matrix).
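
The returned pair can be consumed with plain Python. The sketch below assumes that row and column `i` of the matrix correspond to `gene_order[i]`. The matrix values are placeholders standing in for a real result, and the helpers `distance` and `nearest_neighbor` are hypothetical, not part of GO3.

```python
# Placeholder values standing in for a real result; not actual GO3 output.
gene_order = ["BRCA1", "TP53", "EGFR"]
distance_matrix = [
    [0.0, 0.3, 0.8],
    [0.3, 0.0, 0.7],
    [0.8, 0.7, 0.0],
]

# Map each gene name to its row/column index (assumed to follow gene_order).
index = {gene: i for i, gene in enumerate(gene_order)}


def distance(a: str, b: str) -> float:
    """Look up the distance between two genes by name."""
    return distance_matrix[index[a]][index[b]]


def nearest_neighbor(gene: str) -> str:
    """Return the other gene with the smallest distance to `gene`."""
    i = index[gene]
    candidates = [
        (distance_matrix[i][j], other)
        for j, other in enumerate(gene_order)
        if j != i
    ]
    return min(candidates)[1]
```

Keeping the name-to-index map next to the matrix avoids relying on the order of the input gene list, which may differ from `gene_order`.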

Distance transforms

The distance_transform parameter controls how similarity scores are converted to distances for clustering and embedding algorithms. Available options:

| Transform | Formula | When to use |
|---|---|---|
| auto | depends on method | Recommended default. Automatically selects the best transform: one_minus for normalized methods (lin, wang, simrel, topoicsim), max_minus for unbounded methods (resnik, jc). |
| one_minus | 1 - sim | Use for normalized methods that produce values in [0, 1]. Self-distance is 0, maximum distance is 1. |
| max_minus | max(all_sims) - sim | Use for unbounded methods (e.g., Resnik) where the range is not fixed. Produces distances relative to the highest observed similarity. |
| reciprocal | 1 / (1 + sim) | Always valid regardless of method range, but produces a non-linear mapping. Useful as a fallback when other transforms are not appropriate. |
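
The formulas above can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not GO3's implementation: the function names, the `NORMALIZED` set, and the sample similarity values are all hypothetical, and similarities are assumed to arrive as a square list of lists.

```python
# Methods described above as normalized to [0, 1].
NORMALIZED = {"lin", "wang", "simrel", "topoicsim"}


def one_minus(sims):
    """1 - sim, for similarities in [0, 1]."""
    return [[1.0 - s for s in row] for row in sims]


def max_minus(sims):
    """max(all_sims) - sim, relative to the highest observed similarity."""
    top = max(max(row) for row in sims)
    return [[top - s for s in row] for row in sims]


def reciprocal(sims):
    """1 / (1 + sim), valid for any non-negative similarity."""
    return [[1.0 / (1.0 + s) for s in row] for row in sims]


def auto(sims, similarity):
    """Pick one_minus for normalized methods, max_minus otherwise."""
    if similarity in NORMALIZED:
        return one_minus(sims)
    return max_minus(sims)


# Made-up similarity values, for illustration only.
lin_sims = [[1.0, 0.25], [0.25, 1.0]]
resnik_sims = [[6.0, 2.0], [2.0, 4.0]]

lin_dist = auto(lin_sims, "lin")           # [[0.0, 0.75], [0.75, 0.0]]
resnik_dist = auto(resnik_sims, "resnik")  # [[0.0, 4.0], [4.0, 2.0]]
```

Note that in this sketch max_minus leaves a non-zero self-distance for any gene whose self-similarity is below the global maximum (2.0 for the second gene here), which is worth checking before passing such distances to an algorithm that expects a zero diagonal.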

Embedding APIs

These helpers build low-dimensional embeddings from distances derived from GO semantic similarity.

tsne_genes(genes, ontology, similarity, groupwise, counter, ...)

Computes a t-SNE embedding from gene-level semantic similarity distances.

Key parameters:

  • perplexity (float) – Controls the balance between local and global structure. Higher values consider more neighbors per point. Must be less than the number of genes. A starting point is perplexity ~ sqrt(n_genes).

  • n_iter (int) – Number of optimization iterations. Default is 1000; 500 is often sufficient for exploration.

  • n_components (int) – Dimensionality of the output embedding (usually 2).

  • random_state (int) – Seed for reproducibility.

  • distance_transform (str) – How to convert similarity to distance (see above).
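
The perplexity guidance above (start near sqrt(n_genes), stay below n_genes) can be captured in a small helper. `suggest_perplexity` is a hypothetical convenience function, not part of GO3, and the clamping bounds are one reasonable choice rather than a rule from the library.

```python
import math


def suggest_perplexity(n_genes: int) -> float:
    """Return sqrt(n_genes), clamped to the range [1, n_genes - 1]."""
    if n_genes < 2:
        raise ValueError("need at least 2 genes to compute an embedding")
    upper = n_genes - 1  # keep perplexity strictly below the number of genes
    return float(min(max(math.sqrt(n_genes), 1.0), upper))


small = suggest_perplexity(5)    # about 2.24, close to the 2.0 used for 5 genes
large = suggest_perplexity(100)  # 10.0
```

The result would then be passed as the `perplexity` argument of tsne_genes and adjusted after inspecting the embedding.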

umap_genes(genes, ontology, similarity, groupwise, counter, ...)

Computes a UMAP embedding from gene-level semantic similarity distances.

Key parameters:

  • n_neighbors (int) – Number of nearest neighbors to consider when constructing the graph. Smaller values emphasize local structure; larger values capture more global patterns. Must be less than the number of genes. The default of 15 is a reasonable starting point for exploratory analysis.

  • min_dist (float) – Minimum distance between embedded points. Smaller values produce tighter clusters; larger values spread points more evenly.

  • random_state (int) – Seed for reproducibility.

  • distance_transform (str) – How to convert similarity to distance (see above).

plot_embedding(embedding, ...)

Creates a scatter plot from a 2D embedding array.

  • genes (list[str]) – Gene labels for each point.

  • annotate – Label display mode: "all" labels every point, "auto" labels only non-overlapping points, None disables labels.

  • title (str) – Plot title.

  • categories (list[str], optional) – Categorical labels for coloring points by group.

Returns (fig, ax) matplotlib objects.

plot_tsne_genes(...) / plot_umap_genes(...)

Convenience wrappers that combine embedding computation and plotting in one call. They accept the same parameters as tsne_genes / umap_genes plus the plotting parameters from plot_embedding.

Both return (gene_order, embedding, fig, ax).

Minimal embedding example

import go3

# Assumes `counter` is a TermCounter that was built beforehand.
genes = ["BRCA1", "CASP8", "TP53", "EGFR", "AKT1"]

genes, emb = go3.tsne_genes(
    genes,
    ontology="BP",
    similarity="lin",
    groupwise="bma",
    counter=counter,
    perplexity=2.0,
    random_state=42,
)

fig, ax = go3.plot_embedding(emb, genes=genes, annotate="auto", title="GO embedding")

API reference

compare_gene_pairs_batch(pairs, ontology, similarity, groupwise, counter)

Compute semantic similarity for a batch of gene pairs.

Parameters:
  • pairs (list of (str, str)) – Gene pairs for which to compute semantic similarity.

  • ontology (str) – Name of the subontology of GO to use: BP, MF or CC.

  • similarity (str) – Name of the similarity method.

  • groupwise (str) – Combination method to generate the similarities between genes. Options: "bma", "max", "avg", "hausdorff", "simgic".

  • counter (TermCounter) – Precomputed IC values.

Returns:

List of similarity scores.

Return type:

list of float

Raises:

ValueError – If similarity or groupwise is unknown.
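
A common way to build the `pairs` argument is to enumerate every unordered pair from a gene list. The sketch below uses only the standard library. The helper `scores_to_lookup` is hypothetical, and it assumes the returned scores follow the order of `pairs`, which the description above does not state explicitly. The call to GO3 is shown as a comment because it requires a prepared `counter`.

```python
from itertools import combinations

genes = ["BRCA1", "CASP8", "TP53", "EGFR", "AKT1"]

# Every unordered pair once: n * (n - 1) / 2 pairs for n genes.
pairs = list(combinations(genes, 2))

# With a TermCounter available, the batch call would look like:
# scores = go3.compare_gene_pairs_batch(pairs, "BP", "lin", "bma", counter)


def scores_to_lookup(pairs, scores):
    """Index scores by unordered gene pair (assumes scores align with pairs)."""
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    return {frozenset(pair): score for pair, score in zip(pairs, scores)}


# Placeholder scores standing in for a real result; not actual GO3 output.
placeholder_scores = [0.1 * i for i in range(len(pairs))]
lookup = scores_to_lookup(pairs, placeholder_scores)
```

Keying the lookup by `frozenset` makes it independent of the order in which the two genes are given.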

gene_distance_matrix(genes=None, ontology='BP', similarity='lin', groupwise='bma', counter=None, distance_transform='auto')

Compute a gene-to-gene distance matrix using GO semantic similarity.

Parameters:
  • genes (Optional[list[str]]) – List of genes to include. If None, uses all genes with annotations.

  • ontology (str) – Name of the subontology of GO to use: BP, MF or CC.

  • similarity (str) – Name of the similarity method.

  • groupwise (str) – Combination method to generate the similarities between genes. Options: "bma", "max", "avg", "hausdorff", "simgic".

  • counter (TermCounter) – Precomputed IC values.

  • distance_transform (str) – How to convert similarity to distance. Options: "auto", "one_minus", "reciprocal", "max_minus".

Returns:

Tuple with the gene order and a square distance matrix.

Return type:

(list[str], list[list[float]])

plot_embedding(embedding, genes=None, labels=None, title=None, annotate='auto', max_labels=200, figsize=Ellipsis, s=18.0, alpha=0.85, ax=None)

Plot a 2D embedding with matplotlib.

plot_tsne_genes(genes=None, ontology='BP', similarity='lin', groupwise='bma', counter=None, distance_transform='auto', n_components=2, perplexity=30.0, n_iter=1000, random_state=None, labels=None, title=None, annotate='auto', max_labels=200, figsize=Ellipsis, s=18.0, alpha=0.85, ax=None)

Compute t-SNE embeddings and plot them with matplotlib.

plot_umap_genes(genes=None, ontology='BP', similarity='lin', groupwise='bma', counter=None, distance_transform='auto', n_components=2, n_neighbors=15, min_dist=0.1, random_state=None, labels=None, title=None, annotate='auto', max_labels=200, figsize=Ellipsis, s=18.0, alpha=0.85, ax=None)

Compute UMAP embeddings and plot them with matplotlib.

set_num_threads(n_threads)

Configure the maximum number of threads that rayon (the Rust data-parallelism library) will use.

Parameters:

n_threads (int) – Number of threads to use. If 0, uses all available cores.

term_ic(go_id, counter)

Compute the Information Content (IC) of a GO term.

Parameters:
  • go_id (str) – GO term identifier.

  • counter (TermCounter) – Precomputed term counter with IC values.

Returns:

The IC of the GO term.

Return type:

float

termset_similarity(terms1, terms2, term_similarity='lin', groupwise='bma', counter=None)

Compute semantic similarity between two sets of GO terms.

Parameters:
  • terms1 (list of str) – First list of GO term IDs.

  • terms2 (list of str) – Second list of GO term IDs.

  • term_similarity (str) – Name of the pairwise similarity method.

  • groupwise (str) – Groupwise combination method. Options: "bma", "max", "avg", "hausdorff", "simgic".

  • counter (TermCounter) – Precomputed IC values.

Returns:

Similarity score.

Return type:

float
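
To make the idea of a groupwise combination concrete, the sketch below implements one common formulation of best-match average (BMA): each term is matched to its most similar term in the other set, and the best-match scores are averaged over both directions. This is an illustration under assumptions, not GO3's code. The library may use a different BMA variant, and the term IDs, the `toy_sim` function, and its values are invented.

```python
from typing import Callable, Sequence


def best_match_average(
    terms1: Sequence[str],
    terms2: Sequence[str],
    sim: Callable[[str, str], float],
) -> float:
    """Average of best-match similarities, taken in both directions."""
    if not terms1 or not terms2:
        return 0.0
    # Best partner in terms2 for every term in terms1, and vice versa.
    row_best = [max(sim(a, b) for b in terms2) for a in terms1]
    col_best = [max(sim(a, b) for a in terms1) for b in terms2]
    return (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best))


# Invented pairwise similarities between hypothetical terms T1..T3.
_TOY = {frozenset(("T1", "T2")): 0.5, frozenset(("T2", "T3")): 0.25}


def toy_sim(a: str, b: str) -> float:
    """Symmetric toy similarity: 1.0 for identical terms, table lookup otherwise."""
    if a == b:
        return 1.0
    return _TOY.get(frozenset((a, b)), 0.0)


score = best_match_average(["T1", "T2"], ["T2", "T3"], toy_sim)  # 0.6875
```

Here the best matches are 0.5 and 1.0 for the first set and 1.0 and 0.25 for the second, giving (0.5 + 1.0 + 1.0 + 0.25) / 4 = 0.6875.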

tsne_genes(genes=None, ontology='BP', similarity='lin', groupwise='bma', counter=None, distance_transform='auto', n_components=2, perplexity=30.0, n_iter=1000, random_state=None)

Compute t-SNE embeddings from a gene list using a precomputed distance matrix.

umap_genes(genes=None, ontology='BP', similarity='lin', groupwise='bma', counter=None, distance_transform='auto', n_components=2, n_neighbors=15, min_dist=0.1, random_state=None)

Compute UMAP embeddings from a gene list using a precomputed distance matrix.