# Choosing a Similarity Method

This guide helps you select the right term-level similarity method and groupwise aggregation strategy for your use case.

## Term-level methods

### IC-based methods (Resnik, Lin, JC, SimRel)

Use these when you have a **representative annotation corpus** (e.g., a well-curated GAF file for your organism). IC-based methods rely on annotation frequencies to determine how informative each term is. They work well when:

- Your annotation corpus is large and representative of the biological domain you're studying.
- You want scores that reflect functional specificity relative to known biology.
- You need well-studied methods with established benchmarks in the literature.

**Trade-offs**: scores depend on the annotation corpus. Different GAF versions or organisms produce different IC distributions and therefore different similarity scores. If your corpus is small or biased toward well-studied genes, IC estimates may be unreliable for less-annotated terms.

| Method   | Range  | Best for                                                                   |
|----------|--------|----------------------------------------------------------------------------|
| `resnik` | >= 0   | Raw IC of shared ancestor; good for ranking, not comparing across datasets |
| `lin`    | [0, 1] | Normalized; most widely used IC-based method                               |
| `jc`     | >= 0   | Distance-derived; sensitive to small IC differences                        |
| `simrel` | [0, 1] | Like Lin but penalizes shallow (uninformative) MICAs                       |

### Topological methods (Wang)

Use Wang when **annotation data may be biased or unavailable**. Wang similarity uses only the GO graph structure (edge types and path weights) to compute similarity. It assigns weights of 0.8 for `is_a` edges and 0.6 for `part_of` edges, decaying as paths get longer.

**Best for**:

- Organisms with sparse or biased annotations.
- Comparing results across different annotation versions without IC-induced variation.
- When you want a stable, corpus-independent baseline.

**Trade-off**: does not capture functional specificity from real annotation data.
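The edge-weight decay that Wang similarity relies on can be illustrated with a minimal sketch. The toy graph, the term names, and the helpers `s_values` and `wang_similarity` are hypothetical and not part of any library's API; only the 0.8 and 0.6 weights come from the text above. The sketch assumes the usual formulation of Wang similarity: each ancestor's contribution to a term is the best product of edge weights along any path up from that term, and two terms are compared over the ancestors they share.

```python
# Edge weights as stated in this guide.
EDGE_WEIGHTS = {"is_a": 0.8, "part_of": 0.6}

# Hypothetical toy DAG: child -> list of (parent, relation).
PARENTS = {
    "root": [],
    "A": [("root", "is_a")],
    "B": [("A", "is_a")],
    "C": [("A", "part_of")],
}


def s_values(term, parents=PARENTS):
    """Contribution of `term` and each of its ancestors to `term`'s semantics.

    The term itself contributes 1.0; an ancestor contributes the largest
    product of edge weights over all paths leading up to it.
    """
    s = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, relation in parents[child]:
            candidate = s[child] * EDGE_WEIGHTS[relation]
            # Keep only the best path; revisit the parent if it improved.
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                stack.append(parent)
    return s


def wang_similarity(term_a, term_b, parents=PARENTS):
    """Shared contributions divided by the total contributions of both terms."""
    s_a = s_values(term_a, parents)
    s_b = s_values(term_b, parents)
    shared = s_a.keys() & s_b.keys()
    numerator = sum(s_a[t] + s_b[t] for t in shared)
    denominator = sum(s_a.values()) + sum(s_b.values())
    return numerator / denominator


print(wang_similarity("B", "C"))  # siblings under A, reached by different edge types
print(wang_similarity("B", "B"))  # a term compared with itself
```

No annotation file appears anywhere in the computation, which is why the score stays the same across annotation releases as long as the graph itself is unchanged.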
### Hybrid methods (GraphIC, TopoICSim)

Use these when you want **both IC and topology**.

- **GraphIC**: combines IC values with graph-structural features.
- **TopoICSim**: weights topology-aware paths by IC, producing a normalized score in [0, 1]. It is more discriminating than purely IC-based methods for terms with similar IC but different graph positions.

**Best for**:

- Situations where both graph structure and annotation specificity matter.
- Getting a single method that balances multiple sources of information.

## Quick decision table

| Use case                                             | Recommended method |
|------------------------------------------------------|--------------------|
| General-purpose, well-annotated organism             | `lin`              |
| Ranking by shared function (no normalization needed) | `resnik`           |
| Sparse or biased annotations                         | `wang`             |
| Maximum discrimination between similar terms         | `topoicsim`        |
| Penalize shallow common ancestors                    | `simrel`           |
| Corpus-independent baseline                          | `wang`             |

## Groupwise strategy selection

When comparing genes (or term sets), individual pairwise term similarities must be aggregated into a single score. The choice of groupwise strategy affects what aspect of functional overlap you measure.

| Strategy  | Key         | What it measures                             | Best for                                  |
|-----------|-------------|----------------------------------------------|-------------------------------------------|
| BMA       | `bma`       | Average of best matches from both directions | Balanced comparison; most common          |
| Maximum   | `max`       | Highest pairwise similarity                  | "Do these genes share any function?"      |
| Average   | `avg`       | Mean of all pairwise similarities            | Overall functional overlap                |
| SimGIC    | `simgic`    | IC-weighted Jaccard of shared ancestor sets  | IC-aware set overlap; good for clustering |
| Hausdorff | `hausdorff` | Worst-case best-match                        | Worst-case guarantee; conservative        |

**General recommendation**: start with **BMA** (`bma`). It is the most widely used strategy and provides a balanced view of functional similarity.
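As a rough illustration of how the matrix-based strategies differ, the sketch below applies four of them to a toy table of pairwise term scores. The function names, term labels, and scores are all hypothetical. BMA is taken here as the pooled mean of the row-wise and column-wise best matches, and Hausdorff as the smallest of those best matches; both are assumptions about the exact formula, and a given library may compute them differently (for example, by averaging the two directional means). SimGIC is left out because it works on IC values and ancestor sets rather than on a pairwise score matrix.

```python
# Hypothetical pairwise term similarities between two annotated genes.
SCORES = {
    ("t1", "u1"): 0.9, ("t1", "u2"): 0.2, ("t1", "u3"): 0.4,
    ("t2", "u1"): 0.1, ("t2", "u2"): 0.5, ("t2", "u3"): 0.3,
}


def best_matches(terms_a, terms_b, scores):
    """Best score for every term in A against B, then for every term in B against A."""
    a_to_b = [max(scores[(a, b)] for b in terms_b) for a in terms_a]
    b_to_a = [max(scores[(a, b)] for a in terms_a) for b in terms_b]
    return a_to_b + b_to_a


def groupwise(terms_a, terms_b, scores, strategy="bma"):
    """Aggregate pairwise scores into one value (illustrative only)."""
    all_pairs = [scores[(a, b)] for a in terms_a for b in terms_b]
    if strategy == "max":
        return max(all_pairs)
    if strategy == "avg":
        return sum(all_pairs) / len(all_pairs)
    best = best_matches(terms_a, terms_b, scores)
    if strategy == "bma":
        # Assumption: pooled mean over best matches from both directions.
        return sum(best) / len(best)
    if strategy == "hausdorff":
        # Assumption: the weakest best match in either direction.
        return min(best)
    raise ValueError(f"unknown strategy: {strategy}")


gene_a, gene_b = ["t1", "t2"], ["u1", "u2", "u3"]
for key in ("bma", "max", "avg", "hausdorff"):
    print(key, groupwise(gene_a, gene_b, SCORES, key))
```

On this toy input, `max` is driven entirely by the single strong pair, while `avg` is pulled down by the many weak pairs; BMA sits between the two, which is the balance the recommendation above refers to.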
Use **SimGIC** when you want IC-weighted set overlap (especially useful for clustering and enrichment-adjacent tasks). Use **MAX** for a permissive "any shared function" signal.

## Namespace guidance

The Gene Ontology has three namespaces. Choose based on the biological question:

- **BP (Biological Process)**: largest namespace with the most annotations. Best for characterizing overall gene function and pathway involvement. Most commonly used in literature benchmarks.
- **MF (Molecular Function)**: describes biochemical activity (e.g., kinase activity, binding). More specific and smaller than BP. Best for molecular-level questions.
- **CC (Cellular Component)**: describes subcellular localization (e.g., nucleus, membrane). Smallest namespace. Best for localization-related analyses.

If unsure, start with **BP** -- it provides the broadest functional characterization and the most statistical power due to its size.