Visualization (t-SNE / UMAP)¶

GO3 can build gene-to-gene distance matrices from GO semantic similarity and use them as input for t-SNE and UMAP embeddings.

Install extras¶

pip install "go3[viz]"

End-to-end example¶

import go3

go3.load_go_terms("go-basic.obo")
annots = go3.load_gaf("goa_human.gaf")
counter = go3.build_term_counter(annots)

genes = ["TP53", "BRCA1", "EGFR", "AKT1", "CASP8"]

# 1) Distance matrix from GO similarity
ordered_genes, dist = go3.gene_distance_matrix(
    genes,
    ontology="BP",
    similarity="lin",
    groupwise="bma",
    counter=counter,
    distance_transform="auto",
)

# 2) Embeddings on the GO-based distance (built from the gene list; `dist` from step 1 is not passed in)
ordered_genes, emb_tsne = go3.tsne_genes(
    genes,
    "BP",
    "lin",
    "bma",
    counter,
    distance_transform="auto",
    perplexity=2.0,
    n_iter=500,
    random_state=42,
)

ordered_genes, emb_umap = go3.umap_genes(
    genes,
    "BP",
    "lin",
    "bma",
    counter,
    distance_transform="auto",
    n_neighbors=3,
    min_dist=0.1,
    random_state=42,
)
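Because both functions return ordered_genes alongside the embedding, it is safest to look up coordinates through that list rather than the input list. The following is a minimal sketch, assuming the embedding is an array-like with one (x, y) row per gene, aligned with ordered_genes. The shape is not stated above, embedding_to_dict is a hypothetical helper, and the values are made-up placeholders rather than GO3 output.

```python
# Placeholder stand-ins for the values returned by go3.tsne_genes / go3.umap_genes.
# Assumption: one (x, y) row per gene, in the same order as ordered_genes.
ordered_genes = ["AKT1", "BRCA1", "CASP8", "EGFR", "TP53"]
emb = [
    [0.5, 1.2],
    [-1.0, 0.3],
    [2.1, -0.7],
    [0.0, 0.0],
    [-0.4, 1.9],
]


def embedding_to_dict(ordered_genes, emb):
    """Map each gene symbol to its (x, y) coordinates (hypothetical helper)."""
    if len(ordered_genes) != len(emb):
        raise ValueError("ordered_genes and emb must have the same length")
    return {
        gene: (float(row[0]), float(row[1]))
        for gene, row in zip(ordered_genes, emb)
    }


coords = embedding_to_dict(ordered_genes, emb)
print(coords["TP53"])
```

Row indexing is the only operation used on emb, so the same helper should also work if the embedding is a NumPy array.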

Plot helpers¶

ordered_genes, emb, fig, ax = go3.plot_tsne_genes(
    genes,
    "BP",
    "lin",
    "bma",
    counter,
    perplexity=2.0,
    n_iter=500,
    random_state=42,
    annotate="auto",
    title="GO3 t-SNE",
)

ordered_genes, emb_u, fig_u, ax_u = go3.plot_umap_genes(
    genes,
    "BP",
    "lin",
    "bma",
    counter,
    n_neighbors=3,
    min_dist=0.1,
    random_state=42,
    annotate="auto",
    title="GO3 UMAP",
)

Example output from the plot helpers:

[Figure: t-SNE helper example]

[Figure: UMAP helper example]

Understanding key parameters¶

Perplexity (t-SNE)¶

Perplexity roughly controls how many neighbors each point considers when computing the embedding. It balances attention between local and global structure:

Low perplexity (5–10): focuses on very local neighborhoods. Good for revealing tight clusters but may miss broader patterns.
High perplexity (30–50): considers more neighbors, capturing larger-scale structure but potentially merging distinct small clusters.
Rule of thumb: start with perplexity ~ sqrt(n_genes). For 100 genes, try 10; for 1000 genes, try ~30.
Hard constraint: perplexity must be strictly less than the number of genes.
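The rule of thumb and the hard constraint above can be combined in a small helper. This is a sketch only: suggest_perplexity and clamp_perplexity are hypothetical names, not part of GO3, and capping at n_genes - 1 is one simple way to stay strictly below the gene count.

```python
import math


def suggest_perplexity(n_genes):
    """Starting perplexity from the sqrt(n_genes) rule of thumb."""
    if n_genes < 2:
        raise ValueError("need at least 2 genes")
    # sqrt(n) < n for every n >= 2, so the result always satisfies the constraint.
    return math.sqrt(n_genes)


def clamp_perplexity(requested, n_genes):
    """Cap a requested perplexity so it stays strictly below n_genes."""
    if n_genes < 2:
        raise ValueError("need at least 2 genes")
    return float(min(requested, n_genes - 1))


print(suggest_perplexity(100))   # 10.0
print(clamp_perplexity(30, 5))   # 4.0
```

For very small gene sets, such as the five genes in the end-to-end example, the suggested value falls below the usual 5–10 "low" range, which is consistent with the perplexity=2.0 used there.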

n_neighbors (UMAP)¶

n_neighbors controls the size of the local neighborhood used to construct the UMAP graph:

Small n_neighbors (5–10): emphasizes fine-grained local structure. Good for finding small, tight clusters.
Large n_neighbors (30–50): captures more global topology. The embedding reflects broader relationships at the cost of local detail.
Rule of thumb: start with n_neighbors ~ 15 for exploratory analysis. Increase if the embedding looks too fragmented; decrease if clusters appear merged.
Hard constraint: n_neighbors must be strictly less than the number of genes.

min_dist (UMAP)¶

min_dist controls how tightly points are allowed to pack together:

Small min_dist (0.0–0.1): allows dense clusters with clear separation.
Large min_dist (0.5–1.0): spreads points more evenly, which can improve readability for large datasets.
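When comparing several UMAP settings, it can help to generate the candidate combinations up front and drop those that violate n_neighbors < number of genes. The sketch below assumes nothing about GO3 beyond the n_neighbors and min_dist keyword names used earlier; umap_grid and its default values are illustrative choices.

```python
from itertools import product


def umap_grid(n_genes,
              n_neighbors_values=(5, 15, 30, 50),
              min_dist_values=(0.0, 0.1, 0.5)):
    """Return keyword dicts for settings that respect n_neighbors < n_genes."""
    return [
        {"n_neighbors": k, "min_dist": d}
        for k, d in product(n_neighbors_values, min_dist_values)
        if k < n_genes
    ]


# With 20 genes, only n_neighbors = 5 and 15 are valid.
for params in umap_grid(n_genes=20):
    print(params)
```

Each dict could then be unpacked into a call such as go3.umap_genes(..., **params), keeping random_state fixed so that differences between runs come from the parameters only.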

Distance transforms¶

gene_distance_matrix supports:

auto
one_minus
max_minus
reciprocal

auto picks the transform from the similarity measure and is usually the best choice:

normalized similarities (lin, wang, simrel, topoicsim) use a 1 - sim style transform
non-normalized similarities use a max-based transform

See the Utilities page for detailed descriptions of each transform.
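To make the auto rule concrete, here is an illustrative sketch of what the one_minus and max_minus names suggest. It is not GO3's implementation: the formulas are assumptions inferred from the names, the matrices are made-up values, and reciprocal is left out because its exact form cannot be inferred from this page.

```python
# Similarity measures listed above as normalized.
NORMALIZED = {"lin", "wang", "simrel", "topoicsim"}


def one_minus(sim):
    """Assumed form: d = 1 - s, for similarities bounded by 1."""
    return [[1.0 - s for s in row] for row in sim]


def max_minus(sim):
    """Assumed form: d = max(S) - s, using the largest entry of the matrix."""
    top = max(max(row) for row in sim)
    return [[top - s for s in row] for row in sim]


def auto_transform(sim, similarity):
    """Pick a transform by similarity name, mirroring the rule described above."""
    if similarity in NORMALIZED:
        return one_minus(sim)
    return max_minus(sim)


lin_sim = [[1.0, 0.25], [0.25, 1.0]]
resnik_sim = [[8.0, 2.0], [2.0, 6.0]]
print(auto_transform(lin_sim, "lin"))        # [[0.0, 0.75], [0.75, 0.0]]
print(auto_transform(resnik_sim, "resnik"))  # [[0.0, 6.0], [6.0, 2.0]]
```

Note that under this naive max-based reading a gene's distance to itself is not necessarily zero (2.0 for the second gene above), so a real implementation may treat the diagonal differently.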

Parameter constraints¶

tsne_genes: perplexity < number_of_genes
umap_genes: n_neighbors < number_of_genes
both require at least 2 genes
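These constraints can be checked before any embedding call so that problems surface with a clear message. The helper below is a hypothetical pre-flight check written for this page; GO3 may perform its own validation and report errors differently.

```python
def check_embedding_params(n_genes, perplexity=None, n_neighbors=None):
    """Raise ValueError if the constraints listed above are violated."""
    if n_genes < 2:
        raise ValueError(f"need at least 2 genes, got {n_genes}")
    if perplexity is not None and not perplexity < n_genes:
        raise ValueError(
            f"perplexity ({perplexity}) must be < number of genes ({n_genes})"
        )
    if n_neighbors is not None and not n_neighbors < n_genes:
        raise ValueError(
            f"n_neighbors ({n_neighbors}) must be < number of genes ({n_genes})"
        )


# The settings from the end-to-end example pass silently.
check_embedding_params(n_genes=5, perplexity=2.0, n_neighbors=3)

# A perplexity equal to the gene count is rejected.
try:
    check_embedding_params(n_genes=5, perplexity=5.0)
except ValueError as exc:
    print(exc)
```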

Compare multiple settings¶

The repository includes a sweep demo script:

python scripts/embedding_sweep_demo.py --n-genes 80 --embed both

Custom sweep:

python scripts/embedding_sweep_demo.py \
  --compare both \
  --sweep-ontologies BP,MF,CC \
  --sweep-similarities resnik,lin,wang,topoicsim \
  --distance-transform auto \
  --out-prefix embedding_sweep
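If the sweep is launched from Python rather than a shell, the same command can be assembled as an argument list. This sketch uses only the flags shown in the custom sweep above; build_sweep_command is a hypothetical helper, and the script itself is not executed here.

```python
import shlex


def build_sweep_command(ontologies, similarities, out_prefix,
                        compare="both", distance_transform="auto"):
    """Assemble the argument list for the custom sweep shown above."""
    return [
        "python", "scripts/embedding_sweep_demo.py",
        "--compare", compare,
        "--sweep-ontologies", ",".join(ontologies),
        "--sweep-similarities", ",".join(similarities),
        "--distance-transform", distance_transform,
        "--out-prefix", out_prefix,
    ]


cmd = build_sweep_command(
    ontologies=["BP", "MF", "CC"],
    similarities=["resnik", "lin", "wang", "topoicsim"],
    out_prefix="embedding_sweep",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root.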