Architecture

This page describes GO3’s internal design and explains why it achieves 8–25x speedups over pure-Python GO semantic similarity libraries.

Overview

GO3 is a Rust core compiled into a CPython extension module via PyO3 and Maturin. All performance-critical code (parsing, graph traversal, IC computation, similarity calculations) runs in compiled Rust, while the Python API provides a familiar interface for scripting and integration.

Data flow

OBO file ──parse──▶ GO graph cache (terms, edges, depths)
                         │
                         ▼
GAF file ──parse──▶ gene-to-GO mapping ──propagate──▶ TermCounter (counts, IC)
                                                           │
                                                           ▼
                                               similarity / batch / matrix APIs
  1. OBO parsing: the OBO file is parsed into an in-memory directed acyclic graph. Each term stores its parents, children, depth, level, and metadata. Ancestor sets are precomputed and cached.

  2. GAF parsing: the GAF file is parsed to build a gene-to-GO mapping. Obsolete terms are resolved via replaced_by / consider. ND and NOT annotations are filtered.

  3. Counter construction: annotation counts are propagated up the DAG (each annotation increments the term and all its ancestors). IC is computed per term per namespace.

  4. Similarity computation: pairwise or batch similarity is computed using cached graphs, ancestor sets, and IC values.

Key design decisions

Global caches

GO3 maintains several global caches within the Python process:

  • Ontology graph: the full term DAG with parent/child edges.

  • Ancestor sets: precomputed is_a ancestor sets for every term, enabling O(1) lookups for common-ancestor queries.

  • DCA cache: deepest common ancestor results are cached to avoid redundant DAG traversals.

  • Gene-to-GO mapping: maps gene symbols to their annotated GO terms per namespace.

  • IC values: per-term Information Content, computed once by build_term_counter.

Caching trades memory for speed. For a typical human analysis, total memory usage is ~100–200 MB.

Rayon thread pool

Batch operations (batch_similarity, compare_gene_pairs_batch, gene_distance_matrix) use Rayon to parallelize pair evaluation across CPU cores. The thread pool is initialized once via set_num_threads and reused for all subsequent parallel work.

FxHashMap

GO3 uses FxHashMap (a fast, non-cryptographic hash map) for internal lookups. FxHashMap is significantly faster than the standard HashMap for string keys like GO IDs, reducing lookup overhead in hot loops.

Why it’s fast

The combination of these design choices eliminates the main bottlenecks in pure-Python implementations:

  • No interpreter overhead in hot loops: similarity computations, ancestor lookups, and IC queries all run in compiled Rust. Python is only involved at the API boundary.

  • Parallel pair evaluation: Rayon distributes independent pair computations across threads. For N pairs on T threads, wall time scales roughly as N/T.

  • O(1) cached lookups: precomputed ancestor sets and cached DCA results avoid redundant graph traversals. In pure-Python tools, these traversals are often the dominant cost.

  • Cache-friendly data structures: Rust’s contiguous memory layout and FxHashMap’s low overhead reduce cache misses compared to Python’s object-heavy, pointer-chasing data structures.