# Benchmarks

GO3 was benchmarked against five established libraries covering the most commonly used implementations of GO semantic similarity:

- **GOATOOLS** 1.3.11 (Python)
- **FastSemSim** 1.0.0 (Python)
- **GOSemSim** 2.36.0 (R / Bioconductor)
- **simona** 1.8.1 (R / Bioconductor)
- **TaxaGO** (Rust CLI; `semantic-similarity` binary, invoked with `--propagate-counts` to ensure counts are propagated to parent terms)

All workloads use the human GO annotation corpus, Biological Process sub-ontology, and Lin similarity. Gene-level benchmarks use the Best Match Average (BMA) groupwise strategy.

## Setup

| Item | Value |
|---|---|
| GO ontology | `releases/2026-03-25` (41,853 terms) |
| Annotations | GOC human GAF, 2026-03-28 (879,127 annotations after filtering NOT/ND and propagating) |
| Namespace | Biological Process (BP) |
| Term method | Lin |
| Gene method | Lin + BMA |
| Hardware | Apple M3 Pro (11 logical cores, 18 GB RAM) |
| OS / runtime | macOS 26.2, Python 3.12.2, R 4.5.1 |
| Threads | 8 (where the tool supports parallelism) |
| Protocol | 2 warmup + 5 timed repetitions per size group; medians with bootstrap 95% CI |

## Summary

| Workload | GO3 | Fastest alternative | Slowest alternative |
|---|---|---|---|
| Loading + IC | **1.53 s / 768 MB** | FastSemSim 5.44 s | simona 19.07 s |
| Term similarity, 5,050 pairs | **2.8 ms** | FastSemSim 10.4 ms (4× slower) | GOSemSim 1,217 s (~4×10⁵× slower) |
| Gene similarity, 100 pairs (BMA) | **1.02 s** | FastSemSim 2.39 s (2× slower) | TaxaGO 25.2 s (25× slower) |

GO3 is the fastest library in every workload tested. Absolute numbers depend on hardware and dataset versions; see the plots below for scaling behaviour.

## Loading and memory

![Loading time and peak memory](../../imgs/benchmark_loading_time_memory.png)

TaxaGO is excluded from the loading comparison because, as a standalone binary, its initialization semantics are not directly comparable to those of embeddable libraries.

GO3 achieves the fastest initialization (1.53 s). GOSemSim and FastSemSim use less peak memory (571 MB and 617 MB respectively) at the cost of substantially longer load times. GOATOOLS requires 2,659 MB, about 3.5× as much as GO3.

## Batch GO-term similarity

![Batch term similarity](../../imgs/benchmark_batch_similarity.png)

At 5,050 term pairs (Lin, BP), GO3 is 4× faster than FastSemSim, 24× faster than GOATOOLS, more than 6,000× faster than simona, and up to ~4×10⁵× faster than GOSemSim. TaxaGO's curve is approximately flat because its per-invocation startup cost (OBO load ~0.27 s) dominates the actual matrix computation for these term-set sizes.

## Batch gene similarity

![Batch gene similarity](../../imgs/benchmark_gene_batch_similarity.png)

At 100 gene pairs (Lin + BMA, BP), GO3 is 2× faster than FastSemSim, 5× faster than simona, 13× faster than GOATOOLS, 25× faster than GOSemSim, and 25–119× faster than TaxaGO depending on batch size.

## Reading the plots

- All runtime axes are log-scale.
- Shaded bands show bootstrap 95% confidence intervals over 5 runs.
- For very small inputs, fixed per-call overhead can dominate; the practical advantage appears on medium and large workloads.

## Numerical validation

Because every library uses a slightly different ancestor-traversal strategy and MICA selection, exact agreement between any two libraries is not expected.

A pairwise numerical comparison is provided in [`imgs/validation/Supplementary Material S1.pdf`](https://github.com/Mellandd/go3/blob/master/imgs/validation/Supplementary%20Material%20S1.pdf), generated by [`scripts/validate_cross_tool.py`](https://github.com/Mellandd/go3/blob/master/scripts/validate_cross_tool.py):

- **GO3 vs GOATOOLS**: near-perfect agreement (Pearson *r* > 0.97 at term and gene level). This confirms that GO3's IC and MICA pipeline matches the reference Python implementation.
- **FastSemSim / GOSemSim / simona**: moderate agreement with GO3 (*r* ≈ 0.63–0.95 depending on level and tool), due to alternative MICA-selection strategies documented in each tool's literature.
- **TaxaGO**: moderate divergence. The original analysis, run without `--propagate-counts`, gave *r* ≈ 0.20–0.48; the present analysis uses this flag to produce IC values that are directly comparable to those of all other tools.
- At the gene level, BMA aggregation smooths term-level discrepancies, so agreement is uniformly higher than at the term level.

## Methodology

### Compared libraries

All libraries are invoked through dedicated **runner adapters** under `scripts/runners/`. Each adapter:

- reports whether the tool is available on the system;
- isolates the tool (subprocess for R / CLI tools) so import/parse costs are fully included in the loading timings;
- receives the **same sampled inputs** for every size point, so no tool gets a different workload.

### Loading benchmark

Each run spawns a fresh process, so import and parse costs are paid on every repeat. Reported metrics:

- median wall-clock time (5 runs)
- median peak resident memory (RSS)

### Term-pair benchmark

The benchmark uses **closed term sets**: for each target pair count *P*, the smallest *N* with `C(N,2) ≥ P` is chosen, and every runner sees the same *N* terms. The *x*-axis reports the actual number of pairs, `C(N,2)`. This design is required to accommodate TaxaGO, which takes a term set and returns the full *N*×*N* similarity matrix.

### Gene-pair benchmark

Each size group uses a disjoint random sample of gene pairs, restricted to genes with at least 8 BP annotations. Every runner processes the same pair sets.

### All-vs-all gene benchmark

For each cohort size *g*, all `g(g−1)/2` pairs are generated. This workload reflects realistic quadratic scenarios: clustering, network construction, or cohort-level exploratory analyses.

### Fairness notes

- All libraries receive the same OBO and GAF inputs.
- TaxaGO is invoked with `--propagate-counts` so that annotation counts are propagated to parent terms, matching the IC computation strategy used by all other libraries.
- Gene-level BMA is not exposed natively by every library; where absent (e.g., TaxaGO), the adapter implements BMA on top of the library's term-pair output, and the reported time covers the full end-to-end pipeline.
- Random seeds are fixed (`seed=42`) so samples are reproducible.

## Reproducing the benchmarks

The orchestrator is `scripts/benchmark_all.py`. Discovery is automatic: every runner whose underlying tool is available on the system participates.

Default profile:

```bash
python scripts/benchmark_all.py --outdir imgs
```

Paper-ready profile (larger sizes, more repeats, SVG + PNG output):

```bash
python scripts/benchmark_all.py --paper-ready --outdir imgs
```

Restrict to specific libraries:

```bash
python scripts/benchmark_all.py --only go3,goatools,fastsemsim --outdir imgs
```

Exclude the heaviest tools:

```bash
python scripts/benchmark_all.py --exclude gosemsim,simona --outdir imgs
```

Regenerate plots from an existing results file (no recomputation):

```bash
python scripts/benchmark_all.py --replot imgs/benchmark_results.json --outdir imgs
```

### Output artifacts

- `imgs/benchmark_loading_time_memory.png`
- `imgs/benchmark_batch_similarity.png`
- `imgs/benchmark_gene_batch_similarity.png`
- `imgs/benchmark_all_vs_all_gene_similarity.png`
- `imgs/benchmark_results.json`: raw timings, medians, confidence intervals, and full experimental metadata (OBO/GAF versions, system info, runner capabilities).
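
### Worked example: closed term-set sizing

The closed-term-set rule from the term-pair benchmark (choose the smallest *N* with `C(N,2) ≥ P`) can be illustrated with a short sketch. The function name `closed_term_set_size` is hypothetical and is not taken from `scripts/benchmark_all.py`; the snippet only shows the arithmetic, not how the orchestrator actually implements it.

```python
import math


def closed_term_set_size(target_pairs: int) -> int:
    """Return the smallest N such that C(N, 2) >= target_pairs.

    Hypothetical helper for illustration only; it is not part of the
    GO3 benchmark scripts.
    """
    if target_pairs < 1:
        raise ValueError("target_pairs must be at least 1")
    n = 2  # two terms are the minimum needed to form one pair
    while math.comb(n, 2) < target_pairs:
        n += 1
    return n


if __name__ == "__main__":
    for target in (100, 1000, 5050):
        n = closed_term_set_size(target)
        # Print the target, the chosen N, and the actual pair count C(N, 2).
        print(target, n, math.comb(n, 2))
```

Under this rule a target of 5,050 pairs maps to *N* = 101, since C(101, 2) is exactly 5,050, whereas a target of 100 pairs would be rounded up to the 105 pairs of a 15-term set. This is why the *x*-axis reports `C(N,2)` and not the requested *P*.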