# Choosing a Similarity Method

This guide helps you select the right term-level similarity method and groupwise aggregation strategy for your use case.

## Term-level methods

### IC-based methods (Resnik, Lin, JC, SimRel)

Use these when you have a **representative annotation corpus** (e.g., a well-curated GAF file for your organism).

IC-based methods rely on annotation frequencies to determine how informative each term is. They work well when:

- Your annotation corpus is large and representative of the biological domain you're studying.
- You want scores that reflect functional specificity relative to known biology.
- You need well-studied methods with established benchmarks in the literature.

**Trade-offs**: scores depend on the annotation corpus. Different GAF versions or organisms produce different IC distributions and therefore different similarity scores. If your corpus is small or biased toward well-studied genes, IC estimates may be unreliable for less-annotated terms.
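
To make the corpus dependence concrete, here is a minimal sketch of how IC is commonly estimated from annotation frequencies, using IC(t) = -ln p(t). The function name, the toy ontology, and the gene identifiers are all hypothetical; this illustrates the idea and is not this library's API.

```python
import math


def information_content(direct_annotations, ancestors):
    """Return {term: IC} with IC(t) = -ln(p(t)).

    direct_annotations: term -> set of genes annotated directly to that term.
    ancestors: term -> set of ancestor terms (the term itself excluded).
    Both structures are hypothetical stand-ins for a parsed GAF file and GO graph.
    """
    # Propagate each annotation to every ancestor of the annotated term.
    propagated = {}
    for term, genes in direct_annotations.items():
        for t in {term} | ancestors.get(term, set()):
            propagated.setdefault(t, set()).update(genes)

    total = len(set().union(*propagated.values()))
    # Terms with no annotated genes have undefined IC and are skipped.
    return {t: -math.log(len(g) / total) for t, g in propagated.items() if g}


toy_ancestors = {"A": {"root"}, "B": {"A", "root"}, "C": {"root"}}
small_corpus = {"A": {"g1", "g2"}, "B": {"g3"}, "C": {"g4"}}
# Same ontology, but a corpus with many more genes annotated to B.
biased_corpus = {"A": {"g1", "g2"}, "B": {"g3", "g5", "g6", "g7"}, "C": {"g4"}}

ic_small = information_content(small_corpus, toy_ancestors)
ic_biased = information_content(biased_corpus, toy_ancestors)
print(ic_small["B"], ic_biased["B"])  # IC of B drops as B becomes more frequent
```

The ontology is identical in both runs; only the annotation counts differ, yet the IC of term `B` changes. Any IC-based similarity score that involves `B` changes with it.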

| Method   | Range   | Best for                                                   |
|----------|---------|------------------------------------------------------------|
| `resnik` | >= 0    | IC of the most informative common ancestor (MICA); good for ranking, not for comparing across datasets |
| `lin`    | [0, 1]  | Normalized; most widely used IC-based method               |
| `jc`     | >= 0    | Distance-derived; sensitive to small IC differences        |
| `simrel` | [0, 1]  | Like Lin but penalizes shallow (uninformative) MICAs       |
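
The four measures can be sketched from their textbook definitions, given the IC of each term and of their most informative common ancestor (MICA). The function names are hypothetical, and two details are assumptions rather than this library's documented behavior: the Jiang-Conrath distance is converted to a similarity as 1 / (1 + d), and Lin returns 0 when both terms have zero IC.

```python
import math


def resnik(ic_a, ic_b, ic_mica):
    # Resnik: the IC of the MICA, unnormalized.
    return ic_mica


def lin(ic_a, ic_b, ic_mica):
    # Lin: MICA IC normalized by the IC of both terms.
    denom = ic_a + ic_b
    return 2.0 * ic_mica / denom if denom > 0 else 0.0  # assumed convention for 0/0


def jc_distance(ic_a, ic_b, ic_mica):
    # Jiang-Conrath is defined as a distance.
    return ic_a + ic_b - 2.0 * ic_mica


def jc_similarity(ic_a, ic_b, ic_mica):
    # Assumed conversion to a similarity; other conversions exist.
    return 1.0 / (1.0 + jc_distance(ic_a, ic_b, ic_mica))


def simrel(ic_a, ic_b, ic_mica):
    # SimRel: Lin scaled by (1 - p(MICA)), where p = exp(-IC) for IC = -ln p.
    return lin(ic_a, ic_b, ic_mica) * (1.0 - math.exp(-ic_mica))


# Two term pairs with the same Lin score but MICAs of different depth.
shallow = (0.2, 0.2, 0.1)  # (IC of a, IC of b, IC of MICA)
deep = (4.0, 4.0, 2.0)
print(lin(*shallow), lin(*deep))
print(simrel(*shallow), simrel(*deep))
```

Both pairs have the same Lin score, but SimRel scores the pair with the shallow MICA much lower, which is the penalty described in the table.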

### Topological methods (Wang)

Use Wang when **annotation data may be biased or unavailable**.

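Wang similarity uses only the GO graph structure (edge types and path weights) to compute similarity. It assigns a weight of 0.8 to `is_a` edges and 0.6 to `part_of` edges. Each ancestor's contribution to a term is the product of the edge weights along its best path to that term, so contributions decay as paths get longer.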
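
A minimal sketch of this computation on a toy graph follows. Each ancestor of a term receives an S-value equal to the largest product of edge weights on any path up from the term, and the similarity is the shared S-value mass divided by the total. The graph, term names, and function names are hypothetical and are not this library's API.

```python
IS_A, PART_OF = 0.8, 0.6  # edge weights stated above


def s_values(term, parents):
    """Return {ancestor: S-value} for `term`, including the term itself.

    parents: term -> list of (parent_term, edge_weight) pairs.
    """
    s = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, weight in parents.get(child, []):
            candidate = weight * s[child]
            # Keep the best (largest) contribution over all paths.
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                stack.append(parent)  # propagate the improved value upward
    return s


def wang(a, b, parents):
    sa, sb = s_values(a, parents), s_values(b, parents)
    common = sa.keys() & sb.keys()
    shared = sum(sa[t] + sb[t] for t in common)
    return shared / (sum(sa.values()) + sum(sb.values()))


# Toy graph: B is_a A, C part_of A, A is_a root.
toy_parents = {
    "A": [("root", IS_A)],
    "B": [("A", IS_A)],
    "C": [("A", PART_OF)],
}
print(wang("B", "C", toy_parents))
```

No annotation data appears anywhere in the computation, which is why the score is stable across annotation versions.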

**Best for**:
- Organisms with sparse or biased annotations.
- Comparing results across different annotation versions without IC-induced variation.
- When you want a stable, corpus-independent baseline.

**Trade-off**: does not capture functional specificity from real annotation data.

### Hybrid methods (GraphIC, TopoICSim)

Use these when you want a single score that reflects **both IC and topology**.

- **GraphIC**: combines IC values with graph-structural features.
- **TopoICSim**: weights topology-aware paths by IC, producing a normalized score in [0, 1]. It is more discriminating than purely IC-based methods for terms with similar IC but different graph positions.

**Best for**:
- Situations where both graph structure and annotation specificity matter.
- Getting a single method that balances multiple sources of information.

## Quick decision table

| Use case                                    | Recommended method |
|---------------------------------------------|--------------------|
| General-purpose, well-annotated organism     | `lin`              |
| Ranking by shared function (no normalization needed) | `resnik`   |
| Sparse or biased annotations                | `wang`             |
| Maximum discrimination between similar terms | `topoicsim`       |
| Penalize shallow common ancestors            | `simrel`           |
| Corpus-independent baseline                  | `wang`             |

## Groupwise strategy selection

When comparing genes (or term sets), individual pairwise term similarities must be aggregated into a single score. The choice of groupwise strategy affects what aspect of functional overlap you measure.

| Strategy    | Key          | What it measures                              | Best for                          |
|-------------|--------------|-----------------------------------------------|-----------------------------------|
| BMA         | `bma`        | Average of best matches from both directions  | Balanced comparison; most common  |
| Maximum     | `max`        | Highest pairwise similarity                   | "Do these genes share any function?" |
| Average     | `avg`        | Mean of all pairwise similarities             | Overall functional overlap        |
| SimGIC      | `simgic`     | IC-weighted Jaccard of shared ancestor sets   | IC-aware set overlap; good for clustering |
| Hausdorff   | `hausdorff`  | Worst-case best-match                         | Worst-case guarantee; conservative |

**General recommendation**: start with **BMA** (`bma`). It is the most widely used strategy and provides a balanced view of functional similarity. Use **SimGIC** (`simgic`) when you want IC-weighted set overlap, which is especially useful for clustering and enrichment-adjacent tasks. Use **Maximum** (`max`) for a permissive "any shared function" signal.
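
The strategies in the table can be sketched as follows, starting from a matrix of pairwise term similarities (rows are terms of gene 1, columns are terms of gene 2). The function names are hypothetical. Two definitions are assumptions: BMA is taken as the mean of the two directional best-match averages, and Hausdorff as the smallest best-match score in either direction. This library may define either differently.

```python
def best_matches(sim):
    # Best match for each row term and for each column term.
    row_best = [max(row) for row in sim]
    col_best = [max(col) for col in zip(*sim)]
    return row_best, col_best


def bma(sim):
    row_best, col_best = best_matches(sim)
    return 0.5 * (sum(row_best) / len(row_best) + sum(col_best) / len(col_best))


def maximum(sim):
    return max(max(row) for row in sim)


def average(sim):
    values = [v for row in sim for v in row]
    return sum(values) / len(values)


def hausdorff(sim):
    # The worst of the best matches, taken over both directions.
    row_best, col_best = best_matches(sim)
    return min(min(row_best), min(col_best))


def simgic(ancestors_a, ancestors_b, ic):
    # IC-weighted Jaccard index over the two ancestor-closed term sets.
    union_ic = sum(ic[t] for t in ancestors_a | ancestors_b)
    shared_ic = sum(ic[t] for t in ancestors_a & ancestors_b)
    return shared_ic / union_ic if union_ic > 0 else 0.0


# Toy input: gene 1 has three terms, gene 2 has two.
toy_sim = [
    [0.9, 0.2],
    [0.4, 0.6],
    [0.1, 0.3],
]
for name, fn in [("bma", bma), ("max", maximum), ("avg", average), ("hausdorff", hausdorff)]:
    print(name, round(fn(toy_sim), 3))
```

On this toy matrix, `max` is driven entirely by the single strong pair and `hausdorff` by the poorly matched third term. `bma` and `avg` fall in between.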

## Namespace guidance

The Gene Ontology has three namespaces. Choose based on the biological question:

- **BP (Biological Process)**: largest namespace with the most annotations. Best for characterizing overall gene function and pathway involvement. Most commonly used in literature benchmarks.
- **MF (Molecular Function)**: describes biochemical activity (e.g., kinase activity, binding). More specific and smaller than BP. Best for molecular-level questions.
- **CC (Cellular Component)**: describes subcellular localization (e.g., nucleus, membrane). Smallest namespace. Best for localization-related analyses.

If unsure, start with **BP**. It provides the broadest functional characterization and, as the largest namespace, the most annotations to compare.