Choosing a Similarity Method
This guide helps you select the right term-level similarity method and groupwise aggregation strategy for your use case.
Term-level methods
IC-based methods (Resnik, Lin, JC, SimRel)
Use these when you have a representative annotation corpus (e.g., a well-curated GAF file for your organism).
IC-based methods rely on annotation frequencies to determine how informative each term is. They work well when:
- Your annotation corpus is large and representative of the biological domain you’re studying.
- You want scores that reflect functional specificity relative to known biology.
- You need well-studied methods with established benchmarks in the literature.
Trade-offs: scores depend on the annotation corpus. Different GAF versions or organisms produce different IC distributions and therefore different similarity scores. If your corpus is small or biased toward well-studied genes, IC estimates may be unreliable for less-annotated terms.
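The dependence on the corpus can be made concrete with a minimal sketch, assuming the common definition IC(t) = -ln p(t), where p(t) is the fraction of annotated genes that reach term t directly or through a descendant. The toy ontology, the gene names, and the `ancestors` and `information_content` helpers are hypothetical illustrations, not part of any library API.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms.
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "kinase": {"catalysis"},
}

# Hypothetical annotation corpus: gene -> directly annotated terms.
ANNOTATIONS = {
    "geneA": {"kinase"},
    "geneB": {"binding"},
    "geneC": {"binding"},
    "geneD": {"catalysis"},
}


def ancestors(term, parents):
    """Return the term itself plus every ancestor reachable via parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(annotations, parents):
    """Compute IC(t) = -ln(p(t)), propagating each annotation to all ancestors."""
    counts = {term: 0 for term in parents}
    for terms in annotations.values():
        covered = set()
        for term in terms:
            covered |= ancestors(term, parents)
        for term in covered:
            counts[term] += 1  # each gene counts at most once per term
    total = len(annotations)
    # Terms with no annotations have undefined IC and are left out.
    return {t: -math.log(c / total) for t, c in counts.items() if c > 0}


ic = information_content(ANNOTATIONS, PARENTS)
```

In this toy corpus the root has IC 0 because every gene reaches it, while `kinase`, reached by one gene in four, has the highest IC. Adding or removing annotations changes the counts and therefore every IC value, which is why scores from different GAF versions are not directly comparable.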
| Method | Range | Best for |
|---|---|---|
| Resnik | >= 0 | Raw IC of the most informative common ancestor (MICA); good for ranking, not for comparing across datasets |
| Lin | [0, 1] | Normalized; most widely used IC-based method |
| JC | >= 0 | Distance-derived; sensitive to small IC differences |
| SimRel | [0, 1] | Like Lin but penalizes shallow (uninformative) MICAs |
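The relationships between these four measures can be sketched from their textbook formulas. This is an illustration under stated assumptions, not this library's implementation: IC is taken as -ln p(t), so that p(MICA) = exp(-IC(MICA)) in SimRel. Jiang-Conrath is shown only as a distance, because libraries differ in how they convert it to a similarity. All function names are hypothetical.

```python
import math


def resnik(ic_mica):
    """Resnik: the IC of the most informative common ancestor (MICA)."""
    return ic_mica


def lin(ic_a, ic_b, ic_mica):
    """Lin: MICA IC normalized by the IC of the two terms."""
    denom = ic_a + ic_b
    return 2.0 * ic_mica / denom if denom > 0 else 0.0


def jc_distance(ic_a, ic_b, ic_mica):
    """Jiang-Conrath distance: 0 when both terms coincide with the MICA."""
    return ic_a + ic_b - 2.0 * ic_mica


def sim_rel(ic_a, ic_b, ic_mica):
    """SimRel: Lin scaled by (1 - p(MICA)), assuming IC = -ln p."""
    return lin(ic_a, ic_b, ic_mica) * (1.0 - math.exp(-ic_mica))


# Two specific terms sharing a deep, informative ancestor.
deep = (3.0, 4.0, 2.8)
# A shallow term compared with itself: Lin is maximal, SimRel is not.
shallow = (0.1, 0.1, 0.1)

lin_deep, rel_deep = lin(*deep), sim_rel(*deep)
lin_shallow, rel_shallow = lin(*shallow), sim_rel(*shallow)
```

With these inputs Lin scores the shallow self-comparison at 1.0, while SimRel scales it down to below 0.1, which is the penalty on uninformative MICAs described in the table.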
Topological methods (Wang)
Use Wang when annotation data may be biased or unavailable.
Wang similarity uses only the GO graph structure (edge types and path weights) to compute similarity. It assigns weights of 0.8 for is_a edges and 0.6 for part_of edges; an ancestor's contribution is the product of the weights along the best path to it, so contributions decay as paths get longer.
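The weighting scheme can be sketched on a toy graph. The sketch follows the published Wang formulation: each term in a term's ancestor graph gets an S-value equal to the best product of edge weights on a path down to that term, and two terms are compared by the S-values of their shared ancestors. The ontology and the `s_values` and `wang_similarity` names are hypothetical, and a real implementation may differ in detail.

```python
# Hypothetical toy ontology: term -> list of (parent, relation) edges.
EDGES = {
    "root": [],
    "process": [("root", "is_a")],
    "signaling": [("process", "is_a")],
    "kinase_cascade": [("signaling", "is_a")],
    "receptor_signaling": [("signaling", "part_of")],
}
WEIGHTS = {"is_a": 0.8, "part_of": 0.6}


def s_values(term, edges, weights):
    """Return the S-value of the term and each of its ancestors."""
    s = {term: 1.0}  # a term contributes fully to itself
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, relation in edges[child]:
            candidate = s[child] * weights[relation]
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                stack.append(parent)  # re-propagate the improved value upward
    return s


def wang_similarity(a, b, edges, weights):
    """Sum of S-values over shared ancestors, normalized by both totals."""
    sa = s_values(a, edges, weights)
    sb = s_values(b, edges, weights)
    shared = sa.keys() & sb.keys()
    numerator = sum(sa[t] + sb[t] for t in shared)
    return numerator / (sum(sa.values()) + sum(sb.values()))


score = wang_similarity("kinase_cascade", "receptor_signaling", EDGES, WEIGHTS)
```

For these two sibling terms the score is roughly 0.63. No annotation data enters the calculation, so the result changes only when the ontology itself changes.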
Best for:
- Organisms with sparse or biased annotations.
- Comparing results across different annotation versions without IC-induced variation.
- A stable, corpus-independent baseline.
Trade-off: does not capture functional specificity from real annotation data.
Hybrid methods (GraphIC, TopoICSim)
Use these when you want scores informed by both annotation-derived IC and graph topology.
- GraphIC: combines IC values with graph-structural features.
- TopoICSim: weights topology-aware paths by IC, producing a normalized score in [0, 1]. It is more discriminating than purely IC-based methods for terms with similar IC but different graph positions.
Best for:
- Situations where both graph structure and annotation specificity matter.
- Getting a single method that balances multiple sources of information.
Quick decision table
| Use case | Recommended method |
|---|---|
| General-purpose, well-annotated organism | Lin |
| Ranking by shared function (no normalization needed) | Resnik |
| Sparse or biased annotations | Wang |
| Maximum discrimination between similar terms | TopoICSim |
| Penalize shallow common ancestors | SimRel |
| Corpus-independent baseline | Wang |
Groupwise strategy selection
When comparing genes (or term sets), individual pairwise term similarities must be aggregated into a single score. The choice of groupwise strategy affects what aspect of functional overlap you measure.
| Strategy | Key | What it measures | Best for |
|---|---|---|---|
| BMA | bma | Average of best matches from both directions | Balanced comparison; most common |
| Maximum | | Highest pairwise similarity | “Do these genes share any function?” |
| Average | | Mean of all pairwise similarities | Overall functional overlap |
| SimGIC | | IC-weighted Jaccard of shared ancestor sets | IC-aware set overlap; good for clustering |
| Hausdorff | | Worst-case best-match | Worst-case guarantee; conservative |
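The matrix-based strategies in the table can be sketched as operations on a pairwise similarity matrix whose rows are one gene's terms and whose columns are the other's. This is an illustration under stated assumptions: BMA pools the row and column maxima before averaging (another common variant averages the two directional means instead), and Hausdorff is read as the smallest best-match score in either direction. The function names and the example matrix are hypothetical.

```python
def best_matches(matrix):
    """Return (row-wise maxima, column-wise maxima) of a similarity matrix."""
    row_best = [max(row) for row in matrix]
    col_best = [max(col) for col in zip(*matrix)]
    return row_best, col_best


def bma(matrix):
    """Best-match average: mean of best matches pooled from both directions."""
    row_best, col_best = best_matches(matrix)
    return (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best))


def maximum(matrix):
    """Highest pairwise similarity."""
    return max(max(row) for row in matrix)


def average(matrix):
    """Mean of all pairwise similarities."""
    values = [v for row in matrix for v in row]
    return sum(values) / len(values)


def hausdorff(matrix):
    """Worst-case best match: the weakest best-match score in either direction."""
    row_best, col_best = best_matches(matrix)
    return min(min(row_best), min(col_best))


# Hypothetical term similarities: 2 terms for gene 1, 3 terms for gene 2.
sims = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.2],
]
scores = {f.__name__: f(sims) for f in (bma, maximum, average, hausdorff)}
```

On this matrix the strategies spread out as expected: Maximum is 0.9 because one term pair matches well, BMA is about 0.56, Average is about 0.35, and Hausdorff drops to 0.2 because one of gene 2's terms has no good counterpart.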
General recommendation: start with BMA (bma). It is the most widely used strategy and provides a balanced view of functional similarity. Use SimGIC when you want IC-weighted set overlap (especially useful for clustering and enrichment-adjacent tasks). Use Maximum for a permissive “any shared function” signal.
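The IC-weighted set overlap that SimGIC measures can be sketched as a weighted Jaccard index. The sketch assumes each gene's term set has already been extended with all ancestors and that every term has an IC value. The `simgic` function, the IC values, and the term sets are hypothetical.

```python
def simgic(terms_a, terms_b, ic):
    """IC-weighted Jaccard index of two ancestor-closed term sets."""
    total = sum(ic[t] for t in terms_a | terms_b)
    if total == 0:
        return 0.0  # only zero-IC terms (e.g. the root) are present
    return sum(ic[t] for t in terms_a & terms_b) / total


# Hypothetical IC values and ancestor-closed annotation sets.
IC = {
    "root": 0.0,
    "process": 0.5,
    "signaling": 1.5,
    "kinase_cascade": 3.0,
    "transport": 2.0,
}
gene1 = {"root", "process", "signaling", "kinase_cascade"}
gene2 = {"root", "process", "signaling", "transport"}

score = simgic(gene1, gene2, IC)
```

Here the shared terms carry an IC of 2.0 out of 7.0 in the union, so the score is 2/7. An unweighted Jaccard index would give 3/5, because it counts the uninformative shared ancestors as heavily as the specific terms that differ.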
Namespace guidance
The Gene Ontology has three namespaces. Choose based on the biological question:
- BP (Biological Process): largest namespace with the most annotations. Best for characterizing overall gene function and pathway involvement. Most commonly used in literature benchmarks.
- MF (Molecular Function): describes biochemical activity (e.g., kinase activity, binding). More specific and smaller than BP. Best for molecular-level questions.
- CC (Cellular Component): describes subcellular localization (e.g., nucleus, membrane). Smallest namespace. Best for localization-related analyses.
If unsure, start with BP: it provides the broadest functional characterization and, as the largest and most heavily annotated namespace, the most statistical power.