# Visualization (t-SNE / UMAP)

GO3 can build gene-to-gene distance matrices from GO semantic similarity and use them as precomputed input for t-SNE and UMAP embeddings.

## Install extras

```bash
pip install "go3[viz]"
```

## End-to-end example

```python
import go3

go3.load_go_terms("go-basic.obo")
annots = go3.load_gaf("goa_human.gaf")
counter = go3.build_term_counter(annots)

genes = ["TP53", "BRCA1", "EGFR", "AKT1", "CASP8"]

# 1) Distance matrix from GO similarity
ordered_genes, dist = go3.gene_distance_matrix(
    genes,
    ontology="BP",
    similarity="lin",
    groupwise="bma",
    counter=counter,
    distance_transform="auto",
)

# 2) Embeddings (each helper builds the distance matrix internally from the same arguments)
ordered_genes, emb_tsne = go3.tsne_genes(
    genes,
    "BP",
    "lin",
    "bma",
    counter,
    distance_transform="auto",
    perplexity=2.0,
    n_iter=500,
    random_state=42,
)

ordered_genes, emb_umap = go3.umap_genes(
    genes,
    "BP",
    "lin",
    "bma",
    counter,
    distance_transform="auto",
    n_neighbors=3,
    min_dist=0.1,
    random_state=42,
)
```
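
Before embedding, it can help to sanity-check the matrix returned by `gene_distance_matrix`. The sketch below uses a hypothetical helper, `check_distance_matrix`, which is not part of GO3. It assumes `dist` is a square, row-indexable structure (a nested list or a NumPy array) and checks the properties a precomputed distance matrix is normally expected to have: finite, non-negative, symmetric, with a zero diagonal. The values are made up for illustration and are not real GO3 output.

```python
import math


def check_distance_matrix(labels, dist, tol=1e-9):
    """Return a list of problems found; an empty list means none were found.

    Hypothetical helper for illustration, not part of GO3.
    """
    n = len(labels)
    if len(dist) != n or any(len(row) != n for row in dist):
        return [f"expected a {n}x{n} matrix"]

    problems = []
    for i in range(n):
        for j in range(i, n):
            a, b = dist[i][j], dist[j][i]
            pair = f"({labels[i]}, {labels[j]})"
            if not (math.isfinite(a) and math.isfinite(b)):
                problems.append(f"non-finite value at {pair}")
            elif a < -tol or b < -tol:
                problems.append(f"negative distance at {pair}")
            elif abs(a - b) > tol:
                problems.append(f"asymmetric entries at {pair}")
            elif i == j and abs(a) > tol:
                problems.append(f"non-zero self-distance at {pair}")
    return problems


# Stand-in values, not real GO3 output.
toy_genes = ["TP53", "BRCA1", "EGFR"]
toy_dist = [
    [0.0, 0.4, 0.7],
    [0.4, 0.0, 0.6],
    [0.7, 0.6, 0.0],
]
print(check_distance_matrix(toy_genes, toy_dist))  # []
```

In real use, the same call would take the `ordered_genes` and `dist` values from step 1 of the example above.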

## Plot helpers

```python
ordered_genes, emb, fig, ax = go3.plot_tsne_genes(
    genes,
    "BP",
    "lin",
    "bma",
    counter,
    perplexity=2.0,
    n_iter=500,
    random_state=42,
    annotate="auto",
    title="GO3 t-SNE",
)

ordered_genes, emb_u, fig_u, ax_u = go3.plot_umap_genes(
    genes,
    "BP",
    "lin",
    "bma",
    counter,
    n_neighbors=3,
    min_dist=0.1,
    random_state=42,
    annotate="auto",
    title="GO3 UMAP",
)
```

Example output using plot helpers:

![t-SNE helper example](../../../imgs/plot_helper_tsne_example.png)

![UMAP helper example](../../../imgs/plot_helper_umap_example.png)

## Understanding key parameters

### Perplexity (t-SNE)

**Perplexity** roughly controls how many neighbors each point considers when computing the embedding. It balances attention between local and global structure:

- **Low perplexity** (5--10): focuses on very local neighborhoods. Good for revealing tight clusters but may miss broader patterns.
- **High perplexity** (30--50): considers more neighbors, capturing larger-scale structure but potentially merging distinct small clusters.
- **Rule of thumb**: start with `perplexity ~ sqrt(n_genes)`. For 100 genes, try 10; for 1000 genes, try ~30.
- **Hard constraint**: `perplexity` must be strictly less than the number of genes.
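
The square-root rule of thumb is easy to tabulate. The sketch below uses a hypothetical helper, `suggested_perplexity`, which is not part of GO3; it applies only the heuristic and rounds to one decimal place. Because `sqrt(n) < n` for any `n >= 2`, the suggested value always satisfies the hard constraint.

```python
import math


def suggested_perplexity(n_genes):
    """Starting perplexity from the sqrt(n_genes) rule of thumb (hypothetical helper)."""
    if n_genes < 2:
        raise ValueError("at least 2 genes are required")
    return round(math.sqrt(n_genes), 1)


for n in (5, 100, 1000):
    print(n, suggested_perplexity(n))
# 5 2.2
# 100 10.0
# 1000 31.6
```

For a five-gene set the heuristic gives about 2.2, close to the `perplexity=2.0` used in the end-to-end example.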

### n_neighbors (UMAP)

**n_neighbors** controls the size of the local neighborhood used to construct the UMAP graph:

- **Small n_neighbors** (5--10): emphasizes fine-grained local structure. Good for finding small, tight clusters.
- **Large n_neighbors** (30--50): captures more global topology. The embedding reflects broader relationships at the cost of local detail.
- **Rule of thumb**: start with `n_neighbors ~ 15` for exploratory analysis. Increase if the embedding looks too fragmented; decrease if clusters appear merged. For gene sets of 15 or fewer genes, pick a smaller value so the hard constraint below still holds.
- **Hard constraint**: `n_neighbors` must be strictly less than the number of genes.

### min_dist (UMAP)

**min_dist** controls how tightly points are allowed to pack together:

- **Small min_dist** (0.0--0.1): allows dense clusters with clear separation.
- **Large min_dist** (0.5--1.0): spreads points more evenly, which can improve readability for large datasets.

## Distance transforms

`gene_distance_matrix` accepts the following values for `distance_transform`:

- `auto`
- `one_minus`
- `max_minus`
- `reciprocal`

`auto` is usually the best choice. It selects a transform based on the similarity measure:

- normalized similarities (`lin`, `wang`, `simrel`, `topoicsim`) use a `1 - sim` style transform
- non-normalized similarities use a max-based transform

See the [Utilities](../utilities.rst) page for detailed descriptions of each transform.
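
As a rough illustration of how `auto` chooses a transform, here is a minimal sketch on plain nested lists. The formulas are assumptions inferred from the transform names (`1 - sim` and `max(sim) - sim`), not GO3's actual implementation; the Utilities page remains the authoritative description. `reciprocal` is left out because its exact form is not stated on this page.

```python
# Similarity names this page lists as normalized.
NORMALIZED = {"lin", "wang", "simrel", "topoicsim"}


def one_minus(sim):
    """Assumed form: distance = 1 - similarity."""
    return [[1.0 - s for s in row] for row in sim]


def max_minus(sim):
    """Assumed form: distance = max(similarity) - similarity."""
    top = max(max(row) for row in sim)
    return [[top - s for s in row] for row in sim]


def auto(sim, similarity):
    """Pick a transform from the similarity name, mirroring the rule described above."""
    transform = one_minus if similarity in NORMALIZED else max_minus
    return transform(sim)


# Made-up similarity values for two genes.
lin_like = [[1.0, 0.25], [0.25, 1.0]]
resnik_like = [[4.0, 1.5], [1.5, 4.0]]

print(auto(lin_like, "lin"))        # [[0.0, 0.75], [0.75, 0.0]]
print(auto(resnik_like, "resnik"))  # [[0.0, 2.5], [2.5, 0.0]]
```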

## Parameter constraints

- `tsne_genes`: `perplexity < number_of_genes`
- `umap_genes`: `n_neighbors < number_of_genes`
- both require at least 2 genes
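
These constraints can be checked before starting a long run. The helper below, `check_embedding_params`, is hypothetical and not part of GO3; it only encodes the three rules listed above and makes no claim about how GO3 itself reports violations.

```python
def check_embedding_params(n_genes, perplexity=None, n_neighbors=None):
    """Raise ValueError if the settings break the constraints listed above."""
    if n_genes < 2:
        raise ValueError("at least 2 genes are required")
    if perplexity is not None and not perplexity < n_genes:
        raise ValueError(
            f"perplexity={perplexity} must be < number of genes ({n_genes})"
        )
    if n_neighbors is not None and not n_neighbors < n_genes:
        raise ValueError(
            f"n_neighbors={n_neighbors} must be < number of genes ({n_genes})"
        )


# The settings used in the end-to-end example pass silently.
check_embedding_params(5, perplexity=2.0, n_neighbors=3)

# The common UMAP starting point of 15 neighbors is too large for 5 genes.
try:
    check_embedding_params(5, n_neighbors=15)
except ValueError as err:
    print(err)
```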

## Compare multiple settings

The repository includes a sweep demo script:

```bash
python scripts/embedding_sweep_demo.py --n-genes 80 --embed both
```

Custom sweep:

```bash
python scripts/embedding_sweep_demo.py \
  --compare both \
  --sweep-ontologies BP,MF,CC \
  --sweep-similarities resnik,lin,wang,topoicsim \
  --distance-transform auto \
  --out-prefix embedding_sweep
```
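
A comparable sweep can also be set up from Python. The sketch below only enumerates the setting combinations; it is not how `embedding_sweep_demo.py` is implemented, and `groupwise="bma"` is an assumption carried over from the end-to-end example.

```python
from itertools import product

ontologies = ["BP", "MF", "CC"]
similarities = ["resnik", "lin", "wang", "topoicsim"]

# One configuration per (ontology, similarity) pair.
grid = [
    {
        "ontology": ontology,
        "similarity": similarity,
        "groupwise": "bma",  # assumption: same groupwise method as earlier examples
        "distance_transform": "auto",
    }
    for ontology, similarity in product(ontologies, similarities)
]

print(len(grid))  # 12
for cfg in grid[:2]:
    print(cfg)
```

Each `cfg` could then be passed to `go3.tsne_genes` or `go3.umap_genes` in the argument order shown in the end-to-end example, with one embedding collected per configuration.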