Semantic Similarity

GO3 supports term-level, gene-level, and gene-set semantic similarity across multiple methods.

Background: IC and MICA

Most similarity methods in GO3 rely on Information Content (IC) and the Most Informative Common Ancestor (MICA). See Installation for full definitions.

In brief:

  • IC(t) measures how specific a term is: IC(t) = -log(count(t) / total(namespace)). Rare terms have high IC.

  • MICA(t1, t2) is the common ancestor of t1 and t2 with the highest IC – i.e., the most specific shared concept.
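The two definitions above can be illustrated on a toy ontology. This is a minimal sketch in plain Python, not GO3 code: the term names, the `PARENTS` graph, and the `COUNTS` values are invented for illustration, and the counts are assumed to be already propagated to ancestors.

```python
import math

# Hypothetical toy ontology: child -> set of direct parents (not real GO data).
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "A1": {"A"},
    "A2": {"A"},
    "AB": {"A", "B"},
}

# Hypothetical annotation counts, assumed to include descendants' annotations.
COUNTS = {"root": 100, "A": 40, "B": 30, "A1": 10, "A2": 5, "AB": 8}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    """IC(t) = -log(count(t) / total), with the root count as the total."""
    return -math.log(COUNTS[term] / COUNTS["root"])


def mica(t1, t2):
    """Common ancestor of t1 and t2 with the highest IC."""
    return max(ancestors(t1) & ancestors(t2), key=ic)


print(mica("A1", "A2"), round(ic("A"), 3))
```

In this toy graph the rarer term `A1` has a higher IC than its parent `A`, and two terms that only meet at the root have a MICA with IC 0.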

Methods that do not require IC (e.g., Wang) use graph topology instead.

Quick reference

Similarity methods (method argument)

| Method | Key | Family | Typical range | Description |
|---|---|---|---|---|
| Resnik | resnik | IC-based | >= 0 | IC of the MICA; unbounded, corpus-dependent |
| Lin | lin | IC-based | [0, 1] | Normalized Resnik: 2*IC(MICA) / (IC(t1)+IC(t2)) |
| Jiang-Conrath similarity | jc | IC-based | (0, 1] | Distance-derived: 1 / (1 + IC(t1)+IC(t2) - 2*IC(MICA)) |
| SimRel | simrel | IC-based | [0, 1] | Lin weighted by MICA relevance: penalizes shallow ancestors |
| Information Coefficient | iccoef | IC-based | >= 0 | IC-based coefficient combining ancestor information |
| GraphIC | graphic | Hybrid | >= 0 | Combines IC values with graph-structural features |
| Wang | wang | Topological | [0, 1] | Weighted ancestor contributions from graph topology; no IC needed |
| TopoICSim | topoicsim | Hybrid | [0, 1] | Topology-aware paths weighted by IC; bounded and normalized |

Groupwise strategies

When comparing sets of terms (or genes), a groupwise strategy aggregates pairwise term similarities into a single score.

Groupwise strategies (groupwise argument)

| Strategy | Key | Description |
|---|---|---|
| Best Match Average | bma | Average of best matches from both directions; balanced and widely used |
| Maximum | max | Highest pairwise similarity; captures strongest shared function |
| Average | avg | Mean of all pairwise similarities; measures overall functional overlap |
| Hausdorff | hausdorff | Worst-case best-match distance; guarantees a minimum similarity level |
| SimGIC | simgic | IC-weighted Jaccard of shared ancestor sets; set-based, not pairwise |
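The four pairwise strategies can be sketched on a small similarity matrix. This is an illustration in plain Python, not GO3's implementation: the function name `aggregate` and the toy matrix are invented, and the exact BMA and Hausdorff variants shown here are assumptions, since libraries differ on the details.

```python
def aggregate(matrix, strategy):
    """Collapse a pairwise similarity matrix (rows: set 1, columns: set 2)."""
    if not matrix or not matrix[0]:
        return 0.0  # empty input on either side
    row_best = [max(row) for row in matrix]        # best match for each row term
    col_best = [max(col) for col in zip(*matrix)]  # best match for each column term
    if strategy == "max":
        return max(row_best)
    if strategy == "avg":
        values = [v for row in matrix for v in row]
        return sum(values) / len(values)
    if strategy == "bma":
        # Assumed variant: mean of the two directional best-match averages.
        return (sum(row_best) / len(row_best) + sum(col_best) / len(col_best)) / 2
    if strategy == "hausdorff":
        # Assumed similarity form: the weakest of all best matches.
        return min(min(row_best), min(col_best))
    raise ValueError(f"unknown strategy: {strategy}")


# Toy scores for 3 terms versus 2 terms (made-up numbers).
scores = [[0.9, 0.2], [0.4, 0.6], [0.1, 0.3]]
for name in ("max", "avg", "bma", "hausdorff"):
    print(name, round(aggregate(scores, name), 3))
```

On this matrix the third row term has no good partner, which pulls `hausdorff` down to its best match while `max` only reflects the single strongest pair.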

Choosing a method

See Choosing a Similarity Method for a dedicated guide on selecting the right similarity method and groupwise strategy for your use case.

Term-level APIs

semantic_similarity(id1, id2, method, counter)

  • Computes one score for one term pair.

  • Raises ValueError if method is unknown.

batch_similarity(list1, list2, method, counter)

  • Computes one score per aligned pair.

  • Requires len(list1) == len(list2).

  • Raises ValueError if list sizes differ or method is unknown.
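The aligned-pair contract (element i of the first list is scored against element i of the second, not all-against-all) can be sketched as follows. The helper `score_aligned_pairs` and the toy scorer are hypothetical stand-ins, not part of GO3.

```python
def score_aligned_pairs(list1, list2, score_pair):
    """Score list1[i] against list2[i]; reject lists of different length."""
    if len(list1) != len(list2):
        raise ValueError("input lists must have the same length")
    return [score_pair(a, b) for a, b in zip(list1, list2)]


def toy_score(a, b):
    """Placeholder scorer: 1.0 for identical IDs, otherwise 0.0."""
    return 1.0 if a == b else 0.0


print(score_aligned_pairs(["T1", "T2"], ["T1", "T3"], toy_score))
```

The result has one score per input position, so two lists of length n produce n scores rather than an n-by-n matrix.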

Set-level API

termset_similarity(terms1, terms2, term_similarity="lin", groupwise="bma", counter=...)

Groupwise strategies:

  • bma

  • max

  • avg

  • hausdorff

  • simgic

Notes:

  • simgic is set-based: it compares the IC-weighted ancestor sets of the two term sets directly, rather than aggregating pairwise term_similarity scores as the other strategies do.

  • For empty sets, GO3 returns 0.0.
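To make the set-based nature of simgic concrete, here is a minimal sketch of an IC-weighted Jaccard index over ancestor sets. The IC values, the `ANCESTORS` table, and the function names are invented for illustration and are not taken from GO3.

```python
# Hypothetical IC values and ancestor sets (each set includes the term itself).
IC = {"root": 0.0, "A": 1.0, "B": 1.5, "A1": 2.5, "AB": 3.0}
ANCESTORS = {
    "A1": {"A1", "A", "root"},
    "AB": {"AB", "A", "B", "root"},
    "B": {"B", "root"},
}


def expand(terms):
    """Union of the ancestor sets of all terms in the input."""
    expanded = set()
    for term in terms:
        expanded |= ANCESTORS[term]
    return expanded


def simgic(terms1, terms2):
    """Sum of IC over shared ancestors divided by sum of IC over all ancestors."""
    set1, set2 = expand(terms1), expand(terms2)
    total = sum(IC[t] for t in set1 | set2)
    if total == 0:
        return 0.0  # covers empty inputs
    return sum(IC[t] for t in set1 & set2) / total


print(simgic(["A1"], ["AB"]))
```

No pairwise term score is computed anywhere: only the overlap of the two expanded sets matters, which is why the pairwise method plays a different role for this strategy.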

Gene-level APIs

compare_genes(gene1, gene2, ontology, similarity, groupwise, counter)

  • Ontology must be one of BP, MF, CC.

  • Raises ValueError if either gene is missing from loaded annotations.

compare_gene_pairs_batch(pairs, ontology, similarity, groupwise, counter)

  • Fast path for large pair lists.

  • Missing/empty per-gene term mappings yield 0.0 for those pairs.

Gene-set APIs

compare_gene_sets(genes1, genes2, ontology="BP", similarity="lin", groupwise="bma", counter=...)

  • Computes pairwise gene similarities between all genes in both sets.

  • Aggregates the gene-by-gene similarity matrix with groupwise.

  • Uses the same groupwise method internally for each gene-pair GO-term comparison unless term_groupwise is supplied.

  • Direct gene-set aggregation supports bma, max, avg, and hausdorff.

  • Raises ValueError if any gene in either input set is missing from loaded annotations.

  • Genes with no terms in the requested ontology contribute zero pairwise similarity.
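The two-level aggregation described above (terms to gene pair, then gene pairs to gene sets) can be sketched as below. Everything here is a toy stand-in: the gene and term names, the `TERM_SIM` scores, and the BMA variant are assumptions, and the sketch reuses the same strategy at both levels as in the default behavior.

```python
# Hypothetical annotations and term-pair scores (not real GO data).
GENE_TERMS = {"G1": ["T1"], "G2": ["T2"], "G3": ["T1", "T2"], "G4": []}
TERM_SIM = {frozenset(("T1", "T2")): 0.5}


def term_sim(a, b):
    """Toy term similarity: 1.0 for identical terms, else a table lookup."""
    return 1.0 if a == b else TERM_SIM.get(frozenset((a, b)), 0.0)


def bma(matrix):
    """Best Match Average (assumed variant: mean of both directional means)."""
    if not matrix or not matrix[0]:
        return 0.0
    cols = list(zip(*matrix))
    row_mean = sum(max(row) for row in matrix) / len(matrix)
    col_mean = sum(max(col) for col in cols) / len(cols)
    return (row_mean + col_mean) / 2


def gene_sim(g1, g2):
    """Level 1: aggregate term-by-term scores for one gene pair."""
    for gene in (g1, g2):
        if gene not in GENE_TERMS:
            raise ValueError(f"gene not in annotations: {gene}")
    return bma([[term_sim(a, b) for b in GENE_TERMS[g2]] for a in GENE_TERMS[g1]])


def gene_set_sim(genes1, genes2):
    """Level 2: aggregate the gene-by-gene matrix with the same strategy."""
    return bma([[gene_sim(a, b) for b in genes2] for a in genes1])


print(gene_set_sim(["G1", "G2"], ["G3"]))
```

`G4` has no terms in this toy table, so every pair involving it scores 0.0, while a gene absent from the table raises `ValueError`.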

compare_gene_set_pairs_batch(pairs, ontology="BP", similarity="lin", groupwise="bma", counter=...)

  • Fast path for many gene-set pairs.

  • pairs should contain (genes1, genes2) items, where each side is a list of gene symbols.

  • Missing genes are treated as unannotated in batch mode; pairs with no comparable annotated genes yield 0.0.

compare_gene_set_profiles(genes1, genes2, ontology="BP", similarity="lin", groupwise="bma", counter=...)

  • Alternative profile-based method.

  • Converts each gene set to a weighted GO-term profile, where each term weight is the number of genes annotated to that term.

  • Compares the two weighted GO profiles directly with bma, max, avg, hausdorff, or weighted simgic.
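The profile construction step can be sketched with a `Counter`. The gene-to-term table and the helper name `term_profile` are hypothetical, and deduplicating repeated genes and skipping unknown genes are assumptions of this sketch rather than documented GO3 behavior.

```python
from collections import Counter

# Hypothetical gene -> GO-term annotations (placeholder IDs).
GENE_TERMS = {
    "G1": {"T1", "T2"},
    "G2": {"T2", "T3"},
    "G3": {"T1"},
}


def term_profile(genes, gene_terms):
    """Map each term to the number of distinct genes in the set annotated to it."""
    profile = Counter()
    for gene in set(genes):  # count each gene once, even if listed twice
        profile.update(gene_terms.get(gene, ()))
    return profile


profile = term_profile(["G1", "G2", "G3"], GENE_TERMS)
print(sorted(profile.items()))
```

Terms shared by many genes in the set receive larger weights, so they carry more influence when the two profiles are compared.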

Practical behavior and edge cases

  • Invalid or missing GO IDs in similarity calls generally return 0.0.

  • Terms from different namespaces produce 0.0.

  • For normalized methods (for example lin and wang), self-similarity is typically near 1.0.

Distance-oriented workflow

For clustering/embedding workflows, use:

  • gene_distance_matrix

  • gene_set_distance_matrix

  • tsne_genes

  • umap_genes

These functions convert similarity to distance using distance_transform rules (see Visualization (t-SNE / UMAP)).
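As a rough illustration of what a similarity-to-distance conversion involves, the sketch below shows two common transforms. The function name, the transform names `one_minus` and `reciprocal`, and the clamping choices are assumptions made for this example; they are not GO3's distance_transform options, which are documented on the visualization page.

```python
def similarity_to_distance(sim_matrix, transform="one_minus"):
    """Convert a square similarity matrix to distances with a zero diagonal."""
    size = len(sim_matrix)
    dist = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i == j:
                continue  # distance from an item to itself stays 0.0
            s = max(sim_matrix[i][j], 0.0)
            if transform == "one_minus":
                # Suits scores bounded in [0, 1], such as lin or wang.
                dist[i][j] = 1.0 - min(s, 1.0)
            elif transform == "reciprocal":
                # Suits unbounded scores, such as resnik.
                dist[i][j] = 1.0 / (1.0 + s)
            else:
                raise ValueError(f"unknown transform: {transform}")
    return dist


sim = [[1.0, 0.25], [0.25, 1.0]]
print(similarity_to_distance(sim))
```

Whatever transform is used, bounded and unbounded similarity methods usually need different rules, which is why the choice of method affects the clustering or embedding input.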

Mathematical definitions

Resnik

\[\mathrm{Sim}_{Resnik}(t_1, t_2) = IC(\mathrm{MICA}(t_1, t_2))\]

The simplest IC-based measure: the similarity between two terms equals the IC of their MICA. The result is unbounded and depends on the annotation corpus.

Lin

\[\mathrm{Sim}_{Lin}(t_1, t_2) = \frac{2\,IC(\mathrm{MICA}(t_1, t_2))}{IC(t_1)+IC(t_2)}\]

Normalizes Resnik by the individual ICs, producing a score in [0, 1].
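A short numeric sketch of the Resnik and Lin formulas, taking IC values as plain numbers. The function names and the example ICs are illustrative only; returning 0.0 when both term ICs are zero is an assumption of this sketch.

```python
def resnik(ic_mica):
    """Resnik similarity is the IC of the MICA."""
    return ic_mica


def lin(ic1, ic2, ic_mica):
    """Lin similarity: 2 * IC(MICA) / (IC(t1) + IC(t2))."""
    total = ic1 + ic2
    if total == 0:
        return 0.0  # assumed convention when both terms carry no information
    return 2.0 * ic_mica / total


# Made-up ICs: IC(t1) = 4.0, IC(t2) = 6.0, IC(MICA) = 3.0.
print(resnik(3.0), lin(4.0, 6.0, 3.0))
```

With these numbers Resnik reports the raw value 3.0, while Lin rescales it to 0.6; a term compared with itself has IC(MICA) equal to its own IC, which gives a Lin score of 1.0.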

Jiang-Conrath (distance-derived similarity)

\[d_{JC} = IC(t_1) + IC(t_2) - 2\,IC(\mathrm{MICA})\]
\[\mathrm{Sim}_{JC} = \frac{1}{1 + d_{JC}}\]

Converts the JC distance into a similarity via the reciprocal transform.
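The reciprocal transform can be checked with a few numbers. This is a direct transcription of the two equations above; the function name and the example ICs are illustrative.

```python
def jc_similarity(ic1, ic2, ic_mica):
    """Jiang-Conrath similarity: 1 / (1 + d_JC)."""
    distance = ic1 + ic2 - 2.0 * ic_mica  # d_JC, zero for identical terms
    return 1.0 / (1.0 + distance)


# Made-up ICs: IC(t1) = 4.0, IC(t2) = 6.0, IC(MICA) = 3.0 give d_JC = 4.0.
print(jc_similarity(4.0, 6.0, 3.0))
print(jc_similarity(5.0, 5.0, 5.0))
```

Because the MICA is an ancestor of both terms, its IC cannot exceed either term's IC, so d_JC is non-negative and the similarity stays in (0, 1], reaching 1 only when the distance is zero.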

SimRel

\[\mathrm{Sim}_{Rel} = \left(\frac{2\,IC(\mathrm{MICA})}{IC(t_1)+IC(t_2)}\right)\left(1-e^{-IC(\mathrm{MICA})}\right)\]

Combines Lin similarity with a relevance factor that penalizes shallow (low-IC) MICAs.
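The effect of the relevance factor is easiest to see next to plain Lin. The sketch below transcribes the formula above; the function name and the IC values are made up for illustration.

```python
import math


def simrel(ic1, ic2, ic_mica):
    """SimRel: Lin similarity times the relevance factor 1 - exp(-IC(MICA))."""
    total = ic1 + ic2
    if total == 0:
        return 0.0  # assumed convention when both terms carry no information
    lin_score = 2.0 * ic_mica / total
    return lin_score * (1.0 - math.exp(-ic_mica))


# Specific MICA: Lin would give 0.6, and the factor is close to 1.
print(round(simrel(4.0, 6.0, 3.0), 4))
# Shallow MICA: Lin would give 1.0, but the factor shrinks the score sharply.
print(round(simrel(0.1, 0.1, 0.1), 4))
```

Two very general terms can reach a Lin score of 1.0 simply because all three ICs are equal and small; the relevance factor keeps such matches from ranking alongside matches between specific terms.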

Wang

Wang similarity uses weighted ancestor contributions from GO graph topology and does not require annotation-based IC values.

Each ancestor a of a term t receives a semantic value S_A(a) based on the path from t to a. Edge weights decay by relationship type:

  • is_a edges contribute a weight of 0.8

  • part_of edges contribute a weight of 0.6

The semantic value of t itself is 1. The overall similarity between two terms is the ratio of their shared semantic contributions to their total semantic values. Because it relies only on graph structure, Wang is useful when annotation data may be biased or incomplete.
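The description above can be turned into a small sketch on a toy DAG. It follows the general form of the Wang measure (the best product of edge weights along any path gives each ancestor's contribution); the graph, the term names, and the function names are invented, and GO3's internal implementation may differ in detail.

```python
# Hypothetical toy DAG: child -> list of (parent, relation type).
EDGES = {
    "root": [],
    "P": [("root", "is_a")],
    "Q": [("root", "is_a")],
    "X": [("P", "is_a"), ("Q", "part_of")],
    "Y": [("P", "is_a")],
}
WEIGHTS = {"is_a": 0.8, "part_of": 0.6}


def s_values(term):
    """Semantic contribution of the term (1.0) and of each of its ancestors."""
    values = {term: 1.0}
    stack = [term]
    while stack:
        node = stack.pop()
        for parent, relation in EDGES[node]:
            candidate = values[node] * WEIGHTS[relation]
            # Keep the largest contribution reachable over any path.
            if candidate > values.get(parent, 0.0):
                values[parent] = candidate
                stack.append(parent)
    return values


def wang(t1, t2):
    """Shared semantic contributions divided by the total semantic values."""
    v1, v2 = s_values(t1), s_values(t2)
    shared = v1.keys() & v2.keys()
    numerator = sum(v1[t] + v2[t] for t in shared)
    return numerator / (sum(v1.values()) + sum(v2.values()))


print(round(wang("X", "Y"), 4))
```

In this graph `X` reaches the root by two routes, and the stronger is_a route (0.8 * 0.8) wins over the part_of route (0.6 * 0.8). No annotation counts appear anywhere in the calculation.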

TopoICSim

TopoICSim combines topology-aware paths and IC-derived weights to produce a bounded similarity in [0, 1].

It identifies the longest-information-content path between two terms through their common ancestors, weighting path segments by the IC of intermediate nodes. This captures both the structural distance in the DAG and the specificity of the connecting path, making it more discriminating than purely IC-based or purely topological methods.

API reference

batch_similarity(list1, list2, method, counter)

Compute pairwise semantic similarity in batch using a selected method.

Parameters:
  • list1 (list of str) – First list of GO term IDs.

  • list2 (list of str) – Second list of GO term IDs.

  • method (str) – Name of the similarity method.

  • counter (TermCounter) – Precomputed IC values.

Returns:

List of similarity scores.

Return type:

list of float

Raises:

ValueError – If input lists differ in length or method is unknown.

compare_gene_pairs_batch(pairs, ontology, similarity, groupwise, counter)

Compute semantic similarity between genes in batches.

Parameters:
  • pairs (list of (str, str)) – List of gene pairs for which to compute semantic similarity.

  • ontology (str) – Name of the subontology of GO to use: BP, MF or CC.

  • similarity (str) – Name of the similarity method.

  • groupwise (str) – Combination method to generate the similarities between genes. Options: “bma”, “max”, “avg”, “hausdorff”, “simgic”.

  • counter (TermCounter) – Precomputed IC values.

Returns:

List of similarity scores.

Return type:

list of float

Raises:

ValueError – If similarity or groupwise is unknown.

compare_genes(gene1, gene2, ontology, similarity, groupwise, counter)

Compute semantic similarity between genes.

Parameters:
  • gene1 (str) – Gene symbol of the first gene.

  • gene2 (str) – Gene symbol of the second gene.

  • ontology (str) – Name of the subontology of GO to use: BP, MF or CC.

  • similarity (str) – Name of the similarity method.

  • groupwise (str) – Combination method to generate the similarities between genes. Options: “bma”, “max”, “avg”, “hausdorff”, “simgic”.

  • counter (TermCounter) – Precomputed IC values.

Returns:

Similarity score.

Return type:

float

Raises:

ValueError – If similarity or groupwise is unknown, or if either gene is missing from the loaded annotations.

semantic_similarity(id1, id2, method, counter)

Compute semantic similarity between two GO terms using a selected method.

Parameters:
  • id1 (str) – First GO term ID.

  • id2 (str) – Second GO term ID.

  • method (str) – Name of the similarity method. Options: “resnik”, “lin”, “jc”, “simrel”, “iccoef”, “graphic”, “wang”, “topoicsim”.

  • counter (TermCounter) – Precomputed IC values.

Returns:

Similarity score.

Return type:

float

Raises:

ValueError – If the method is unknown.

term_ic(go_id, counter)

Compute the Information Content (IC) of a GO term.

Parameters:
  • go_id (str) – GO term identifier.

  • counter (TermCounter) – Precomputed term counter with IC values.

Returns:

The IC of the GO term.

Return type:

float

termset_similarity(terms1, terms2, term_similarity='lin', groupwise='bma', counter=None)

Compute semantic similarity between two sets of GO terms.

Parameters:
  • terms1 (list of str) – First list of GO term IDs.

  • terms2 (list of str) – Second list of GO term IDs.

  • term_similarity (str) – Name of the pairwise similarity method.

  • groupwise (str) – Groupwise combination method. Options: “bma”, “max”, “avg”, “hausdorff”, “simgic”.

  • counter (TermCounter) – Precomputed IC values.

Returns:

Similarity score.

Return type:

float