Semantic Similarity¶
GO3 supports term-level, gene-level, and gene-set semantic similarity across multiple methods.
Background: IC and MICA¶
Most similarity methods in GO3 rely on Information Content (IC) and the Most Informative Common Ancestor (MICA). See Installation for full definitions.
In brief:
- IC(t) measures how specific a term is: IC(t) = -log(count(t) / total(namespace)). Rare terms have high IC.
- MICA(t1, t2) is the common ancestor of t1 and t2 with the highest IC, i.e., the most specific shared concept.
Methods that do not require IC (e.g., Wang) use graph topology instead.
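The two definitions can be illustrated on a toy graph. The following is a minimal sketch, not GO3's implementation: the term names, the counts, and the helper functions `ancestors`, `ic`, and `mica` are all hypothetical, and it assumes counts are already propagated to ancestors so that the root count equals total(namespace).

```python
import math

# Toy DAG (child -> parents); term names and counts are made up for illustration.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A", "B"],
}
# Annotation counts, assumed to be already propagated to ancestors.
COUNTS = {"root": 100, "A": 40, "B": 30, "C": 5, "D": 10}
TOTAL = COUNTS["root"]  # stands in for total(namespace)


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    """IC(t) = -log(count(t) / total); rarer terms score higher."""
    return -math.log(COUNTS[term] / TOTAL)


def mica(t1, t2):
    """Common ancestor of t1 and t2 with the highest IC."""
    common = ancestors(t1) & ancestors(t2)
    return max(common, key=ic)


print(mica("C", "D"), round(ic(mica("C", "D")), 3))
```

In this toy graph C and D share the ancestors A and root; A is rarer than root, so it is the MICA.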
Quick reference¶
| Method | Key | Family | Typical range | Description |
|---|---|---|---|---|
| Resnik | resnik | IC-based | | IC of the MICA; unbounded, corpus-dependent |
| Lin | lin | IC-based | [0, 1] | Normalized Resnik: 2 * IC(MICA) / (IC(t1) + IC(t2)) |
| Jiang-Conrath similarity | | IC-based | | Distance-derived: reciprocal transform of the Jiang-Conrath distance |
| SimRel | | IC-based | | Lin weighted by MICA relevance; penalizes shallow ancestors |
| Information Coefficient | | IC-based | | IC-based coefficient combining ancestor information |
| GraphIC | | Hybrid | | Combines IC values with graph-structural features |
| Wang | wang | Topological | | Weighted ancestor contributions from graph topology; no IC needed |
| TopoICSim | | Hybrid | [0, 1] | Topology-aware paths weighted by IC; bounded and normalized |
Groupwise strategies¶
When comparing sets of terms (or genes), a groupwise strategy aggregates pairwise term similarities into a single score.
| Strategy | Key | Description |
|---|---|---|
| Best Match Average | bma | Average of best matches from both directions; balanced and widely used |
| Maximum | max | Highest pairwise similarity; captures strongest shared function |
| Average | avg | Mean of all pairwise similarities; measures overall functional overlap |
| Hausdorff | hausdorff | Worst-case best-match distance; guarantees a minimum similarity level |
| SimGIC | simgic | IC-weighted Jaccard of shared ancestor sets; set-based, not pairwise |
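To make the pairwise strategies concrete, here is a minimal sketch that aggregates a precomputed pairwise similarity matrix (rows are terms of the first set, columns are terms of the second). The function names `best_matches` and `groupwise` are hypothetical, the BMA variant shown (pooled mean of row and column best matches) is one common formulation, and Hausdorff is expressed as a similarity (the weakest best match); GO3's exact formulas may differ.

```python
def best_matches(matrix):
    """Best match for each row and each column of a pairwise similarity matrix."""
    row_best = [max(row) for row in matrix]
    col_best = [max(col) for col in zip(*matrix)]
    return row_best, col_best


def groupwise(matrix, strategy="bma"):
    """Aggregate a non-empty pairwise similarity matrix into one score."""
    row_best, col_best = best_matches(matrix)
    if strategy == "max":
        return max(row_best)
    if strategy == "avg":
        values = [v for row in matrix for v in row]
        return sum(values) / len(values)
    if strategy == "bma":
        # Pooled mean of best matches over both directions (one common BMA variant).
        return (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best))
    if strategy == "hausdorff":
        # Similarity analogue of the Hausdorff distance: the weakest best match.
        return min(min(row_best), min(col_best))
    raise ValueError(f"unknown strategy: {strategy}")


# Made-up similarities between 3 terms (rows) and 2 terms (columns).
sims = [
    [0.9, 0.2],
    [0.1, 0.6],
    [0.3, 0.4],
]
for name in ("bma", "max", "avg", "hausdorff"):
    print(name, round(groupwise(sims, name), 3))
```

SimGIC is left out because it operates on ancestor sets rather than on a pairwise matrix.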
Choosing a method¶
See Choosing a Similarity Method for a dedicated guide on selecting the right similarity method and groupwise strategy for your use case.
Term-level APIs¶
semantic_similarity(id1, id2, method, counter)
- Computes one score for one term pair.
- Raises ValueError if method is unknown.
batch_similarity(list1, list2, method, counter)
- Computes one score per aligned pair.
- Requires len(list1) == len(list2).
- Raises ValueError if list sizes differ or method is unknown.
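The "aligned pair" contract means element i of the first list is compared only with element i of the second, not with every element. The sketch below mimics that contract in plain Python; `batch_sketch` and `toy_score` are hypothetical stand-ins and do not call GO3.

```python
def batch_sketch(list1, list2, pair_score):
    """Score aligned pairs (list1[i], list2[i]); not an all-vs-all comparison."""
    if len(list1) != len(list2):
        raise ValueError("list1 and list2 must have the same length")
    return [pair_score(a, b) for a, b in zip(list1, list2)]


def toy_score(a, b):
    """Hypothetical scorer: 1.0 for identical IDs, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


scores = batch_sketch(
    ["GO:0000001", "GO:0000002"],
    ["GO:0000001", "GO:0000003"],
    toy_score,
)
print(scores)  # one score per aligned pair
```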
Set-level API¶
termset_similarity(terms1, terms2, term_similarity="lin", groupwise="bma", counter=...)
Groupwise strategies: bma, max, avg, hausdorff, simgic.
Notes:
- simgic is set-based and does not use the pairwise method in the same way as other strategies.
- For empty sets, GO3 returns 0.0.
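To show why simgic does not need a pairwise method, here is a minimal sketch of an IC-weighted Jaccard index over two ancestor-closed term sets. The function name `simgic_sketch`, the term names, and the IC values are hypothetical; the sketch returns 0.0 for empty input, matching the note above, but is not GO3's implementation.

```python
def simgic_sketch(ancestors1, ancestors2, ic):
    """IC-weighted Jaccard index of two ancestor-closed term sets."""
    union = ancestors1 | ancestors2
    total = sum(ic[t] for t in union)
    if total == 0:
        return 0.0  # covers empty sets and sets containing only zero-IC terms
    shared = ancestors1 & ancestors2
    return sum(ic[t] for t in shared) / total


# Made-up IC values and ancestor-closed sets.
toy_ic = {"root": 0.0, "A": 1.0, "B": 1.5, "C": 3.0, "D": 2.0}
set1 = {"root", "A", "C"}
set2 = {"root", "A", "B", "D"}
score = simgic_sketch(set1, set2, toy_ic)
print(round(score, 4))
```

Only the shared terms (root and A) contribute to the numerator, so specific terms that appear in one set but not the other pull the score down in proportion to their IC.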
Gene-level APIs¶
compare_genes(gene1, gene2, ontology, similarity, groupwise, counter)
- Ontology must be one of BP, MF, CC.
- Raises ValueError if either gene is missing from loaded annotations.
compare_gene_pairs_batch(pairs, ontology, similarity, groupwise, counter)
- Fast path for large pair lists.
- Missing/empty per-gene term mappings yield 0.0 for those pairs.
Gene-set APIs¶
compare_gene_sets(genes1, genes2, ontology="BP", similarity="lin", groupwise="bma", counter=...)
- Computes pairwise gene similarities between all genes in both sets.
- Aggregates the gene-by-gene similarity matrix with groupwise.
- Uses the same groupwise method internally for each gene-pair GO-term comparison unless term_groupwise is supplied.
- Direct gene-set aggregation supports bma, max, avg, and hausdorff.
- Raises ValueError if any gene in either input set is missing from loaded annotations.
- Genes with no terms in the requested ontology contribute zero pairwise similarity.
compare_gene_set_pairs_batch(pairs, ontology="BP", similarity="lin", groupwise="bma", counter=...)
- Fast path for many gene-set pairs.
- pairs should contain (genes1, genes2) items, where each side is a list of gene symbols.
- Missing genes are treated as unannotated in batch mode; pairs with no comparable annotated genes yield 0.0.
compare_gene_set_profiles(genes1, genes2, ontology="BP", similarity="lin", groupwise="bma", counter=...)
- Alternative profile-based method.
- Converts each gene set to a weighted GO-term profile, where each term weight is the number of genes annotated to that term.
- Compares the two weighted GO profiles directly with bma, max, avg, hausdorff, or weighted simgic.
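The profile construction described above can be sketched with a counter. This is an illustration only: `go_profile`, the gene symbols, and the annotation mapping are hypothetical, and the sketch does not cover how GO3 compares the resulting profiles.

```python
from collections import Counter


def go_profile(genes, gene_to_terms):
    """Weight each GO term by the number of genes in the set annotated to it."""
    profile = Counter()
    for gene in set(genes):
        # set() guards against counting a gene twice for the same term.
        profile.update(set(gene_to_terms.get(gene, ())))
    return profile


# Hypothetical annotations for illustration only.
annotations = {
    "GENE1": ["GO:0000001", "GO:0000002"],
    "GENE2": ["GO:0000002"],
    "GENE3": [],
}
profile = go_profile(["GENE1", "GENE2", "GENE3"], annotations)
print(dict(profile))
```

A term shared by many genes in the set gets a larger weight, so it influences the profile comparison more than a term annotated to a single gene.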
Practical behavior and edge cases¶
- Invalid or missing GO IDs in similarity calls generally return 0.0.
- Terms from different namespaces produce 0.0.
- For normalized methods (for example lin and wang), self-similarity is typically near 1.0.
Distance-oriented workflow¶
For clustering/embedding workflows, use:
- gene_distance_matrix
- gene_set_distance_matrix
- tsne_genes
- umap_genes
These functions convert similarity to distance using distance_transform rules (see Visualization (t-SNE / UMAP)).
Mathematical definitions¶
Resnik¶
The simplest IC-based measure: the similarity between two terms equals the IC of their MICA. The result is unbounded and depends on the annotation corpus.
Lin¶
Normalizes Resnik by the individual ICs, producing a score in [0, 1].
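As a worked illustration of Resnik and Lin, the sketch below takes precomputed IC values as plain numbers. The function names are hypothetical, Lin is written in its standard form 2 * IC(MICA) / (IC(t1) + IC(t2)), and returning 0.0 when both terms have zero IC is a convention chosen for this sketch rather than documented GO3 behavior.

```python
def resnik_sketch(ic_mica):
    """Resnik similarity is the IC of the MICA."""
    return ic_mica


def lin_sketch(ic_t1, ic_t2, ic_mica):
    """Lin similarity: 2 * IC(MICA) / (IC(t1) + IC(t2))."""
    denom = ic_t1 + ic_t2
    if denom == 0:
        return 0.0  # convention chosen for this sketch (both terms have zero IC)
    return 2 * ic_mica / denom


# Made-up IC values: two fairly specific terms with a shallow MICA.
print(resnik_sketch(1.0))         # unbounded, grows with corpus specificity
print(lin_sketch(3.0, 2.0, 1.0))  # 2 * 1.0 / 5.0
print(lin_sketch(3.0, 3.0, 3.0))  # a term compared with itself
```

Because the MICA can never be more specific than either term, IC(MICA) is at most min(IC(t1), IC(t2)), which keeps the Lin score at or below 1.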
Jiang-Conrath (distance-derived similarity)¶
Converts the JC distance into a similarity via the reciprocal transform.
SimRel¶
Combines Lin similarity with a relevance factor that penalizes shallow (low-IC) MICAs.
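The Jiang-Conrath and SimRel definitions can be sketched in the same style, again on precomputed IC values. Two assumptions are made here and are not confirmed by this page: the reciprocal transform is taken as 1 / (1 + distance), and the relevance factor is 1 - p(MICA) with p(MICA) = exp(-IC(MICA)), which presumes IC uses the natural logarithm. All function names are hypothetical.

```python
import math


def jc_distance(ic_t1, ic_t2, ic_mica):
    """Jiang-Conrath distance: IC(t1) + IC(t2) - 2 * IC(MICA)."""
    return ic_t1 + ic_t2 - 2 * ic_mica


def jc_similarity(ic_t1, ic_t2, ic_mica):
    """Reciprocal transform 1 / (1 + distance); the +1 is an assumption."""
    return 1.0 / (1.0 + jc_distance(ic_t1, ic_t2, ic_mica))


def simrel_sketch(ic_t1, ic_t2, ic_mica):
    """Lin similarity scaled by (1 - p(MICA)), with p(MICA) = exp(-IC(MICA))."""
    denom = ic_t1 + ic_t2
    if denom == 0:
        return 0.0  # convention chosen for this sketch
    lin = 2 * ic_mica / denom
    return lin * (1.0 - math.exp(-ic_mica))


# Made-up IC values.
print(jc_similarity(3.0, 2.0, 1.0))
print(round(simrel_sketch(3.0, 2.0, 1.0), 4))
```

With a shallow MICA (low IC), p(MICA) is close to 1 and the relevance factor shrinks the Lin score toward 0, which is the penalty on shallow ancestors described above.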
Wang¶
Wang similarity uses weighted ancestor contributions from GO graph topology and does not require annotation-based IC values.
Each ancestor a of a term t receives a semantic value S_A(a) based on the path from t to a. Edge weights decay by relationship type:
- is_a edges contribute a weight of 0.8
- part_of edges contribute a weight of 0.6
The semantic value of t itself is 1. The overall similarity between two terms is the ratio of their shared semantic contributions to their total semantic values. Because it relies only on graph structure, Wang is useful when annotation data may be biased or incomplete.
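The computation can be sketched on a toy graph. This is a minimal illustration under stated assumptions, not GO3's code: the DAG and the names `EDGES`, `s_values`, and `wang_sketch` are hypothetical, and when several paths reach the same ancestor the sketch keeps the largest product of edge weights, as in the original Wang formulation.

```python
IS_A, PART_OF = 0.8, 0.6  # edge weights quoted above

# Toy DAG: child -> list of (parent, edge weight). Term names are made up.
EDGES = {
    "root": [],
    "A": [("root", IS_A)],
    "B": [("root", IS_A)],
    "C": [("A", IS_A)],
    "D": [("A", PART_OF), ("B", IS_A)],
}


def s_values(term):
    """S-value of each ancestor of `term`: best product of edge weights on any path."""
    values = {term: 1.0}
    stack = [term]
    while stack:
        node = stack.pop()
        for parent, weight in EDGES[node]:
            candidate = values[node] * weight
            if candidate > values.get(parent, 0.0):
                values[parent] = candidate
                stack.append(parent)  # revisit so the improvement propagates upward
    return values


def wang_sketch(t1, t2):
    """Shared S-value mass divided by the total semantic value of both terms."""
    s1, s2 = s_values(t1), s_values(t2)
    shared = set(s1) & set(s2)
    numerator = sum(s1[a] + s2[a] for a in shared)
    return numerator / (sum(s1.values()) + sum(s2.values()))


print(round(wang_sketch("C", "D"), 4))
```

No annotation counts appear anywhere in the sketch, which is why the score is unaffected by how well or poorly a term is annotated.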
TopoICSim¶
TopoICSim combines topology-aware paths and IC-derived weights to produce a bounded similarity in [0, 1].
It identifies the longest-information-content path between two terms through their common ancestors, weighting path segments by the IC of intermediate nodes. This captures both the structural distance in the DAG and the specificity of the connecting path, making it more discriminating than purely IC-based or purely topological methods.
API reference¶
- batch_similarity(list1, list2, method, counter)¶
Compute pairwise semantic similarity in batch using a selected method.
- Parameters:
list1 (list of str) – First list of GO term IDs.
list2 (list of str) – Second list of GO term IDs.
method (str) – Name of the similarity method.
counter (TermCounter) – Precomputed IC values.
- Returns:
List of similarity scores.
- Return type:
list of float
- Raises:
ValueError – If input lists differ in length or method is unknown.
- compare_gene_pairs_batch(pairs, ontology, similarity, groupwise, counter)¶
Compute semantic similarity between genes in batches.
- Parameters:
pairs (list of (str, str)) – List of gene pairs for which to compute semantic similarity.
ontology (str) – Name of the subontology of GO to use: BP, MF or CC.
similarity (str) – Name of the similarity method.
groupwise (str) – Combination method to generate the similarities between genes. Options: “bma”, “max”, “avg”, “hausdorff”, “simgic”.
counter (TermCounter) – Precomputed IC values.
- Returns:
List of similarity scores.
- Return type:
list of float
- Raises:
ValueError – If similarity or groupwise is unknown.
- compare_genes(gene1, gene2, ontology, similarity, groupwise, counter)¶
Compute semantic similarity between genes.
- Parameters:
gene1 (str) – Gene symbol of the first gene.
gene2 (str) – Gene symbol of the second gene.
ontology (str) – Name of the subontology of GO to use: BP, MF or CC.
similarity (str) – Name of the similarity method.
groupwise (str) – Combination method to generate the similarities between genes. Options: “bma”, “max”, “avg”, “hausdorff”, “simgic”.
counter (TermCounter) – Precomputed IC values.
- Returns:
Similarity score.
- Return type:
float
- Raises:
ValueError – If similarity or groupwise is unknown, or if either gene is missing from loaded annotations.
- semantic_similarity(id1, id2, method, counter)¶
Compute semantic similarity between two GO terms using a selected method.
- Parameters:
id1 (str) – First GO term ID.
id2 (str) – Second GO term ID.
method (str) – Name of the similarity method. Options: “resnik”, “lin”, etc.
counter (TermCounter) – Precomputed IC values.
- Returns:
Similarity score.
- Return type:
float
- Raises:
ValueError – If the method is unknown.
- term_ic(go_id, counter)¶
Compute the Information Content (IC) of a GO term.
- Parameters:
go_id (str) – GO term identifier.
counter (TermCounter) – Precomputed term counter with IC values.
- Returns:
The IC of the GO term.
- Return type:
float
- termset_similarity(terms1, terms2, term_similarity='lin', groupwise='bma', counter=None)¶
Compute semantic similarity between two sets of GO terms.
- Parameters:
terms1 (list of str) – First list of GO term IDs.
terms2 (list of str) – Second list of GO term IDs.
term_similarity (str) – Name of the pairwise similarity method.
groupwise (str) – Groupwise combination method. Options: “bma”, “max”, “avg”, “hausdorff”, “simgic”.
counter (TermCounter) – Precomputed IC values.
- Returns:
Similarity score.
- Return type:
float