# Benchmarks

GO3 was benchmarked against five established libraries covering the most commonly used
implementations of GO semantic similarity:

- **GOATOOLS** 1.3.11 (Python)
- **FastSemSim** 1.0.0 (Python)
- **GOSemSim** 2.36.0 (R / Bioconductor)
- **simona** 1.8.1 (R / Bioconductor)
- **TaxaGO** (Rust CLI; `semantic-similarity` binary, invoked with `--propagate-counts` to ensure counts are propagated to parent terms)

All workloads use the human GO annotation corpus, Biological Process sub-ontology, and
Lin similarity. Gene-level benchmarks use the Best Match Average (BMA) groupwise strategy.

## Setup

| Item | Value |
|---|---|
| GO ontology | `releases/2026-03-25` (41,853 terms) |
| Annotations | GOC human GAF, 2026-03-28 (879,127 annotations after filtering NOT/ND and propagating) |
| Namespace | Biological Process (BP) |
| Term method | Lin |
| Gene method | Lin + BMA |
| Hardware | Apple M3 Pro (11 logical cores, 18 GB RAM) |
| OS / runtime | macOS 26.2, Python 3.12.2, R 4.5.1 |
| Threads | 8 (where the tool supports parallelism) |
| Protocol | 2 warmup + 5 timed repetitions per size group; medians with bootstrap 95% CI |

## Summary

| Workload | GO3 | Fastest alternative | Worst alternative |
|---|---|---|---|
| Loading + IC | **1.53 s / 768 MB** | FastSemSim 5.44 s | simona 19.07 s |
| Term similarity, 5,050 pairs | **2.8 ms** | FastSemSim 10.4 ms (4× slower) | GOSemSim 1,217 s (~4×10⁵× slower) |
| Gene similarity, 100 pairs (BMA) | **1.02 s** | FastSemSim 2.39 s (2× slower) | TaxaGO 25.2 s (25× slower) |

GO3 is the fastest library in every workload tested. Absolute numbers depend on
hardware and dataset versions; see the plots below for scaling behaviour.

## Loading and memory

![Loading time and peak memory](../../imgs/benchmark_loading_time_memory.png)

TaxaGO is excluded from the loading comparison because, as a standalone binary, its
initialization semantics are not directly comparable to embeddable libraries.

GO3 achieves the fastest initialization (1.53 s). GOSemSim and FastSemSim use less peak
memory (571 MB and 617 MB respectively) at the cost of substantially longer load times.
GOATOOLS requires 2,659 MB, about 3.5× GO3's 768 MB.

## Batch GO-term similarity

![Batch term similarity](../../imgs/benchmark_batch_similarity.png)

At 5,050 term pairs (Lin, BP), GO3 is about 4× faster than FastSemSim, 24× faster than
GOATOOLS, more than 6,000× faster than simona, and roughly 4×10⁵× faster than GOSemSim.
TaxaGO's curve is nearly flat because its per-invocation startup cost (OBO load, ~0.27 s)
dominates the matrix computation at these term-set sizes.

## Batch gene similarity

![Batch gene similarity](../../imgs/benchmark_gene_batch_similarity.png)

At 100 gene pairs (Lin + BMA, BP), GO3 is 2× faster than FastSemSim, 5× faster than
simona, 13× faster than GOATOOLS, and 25× faster than both GOSemSim and TaxaGO. Across
batch sizes, the gap to TaxaGO ranges from 25× to 119×.

## Reading the plots

- All runtime axes are log-scale.
- Shaded bands show bootstrap 95% confidence intervals over 5 runs.
- For very small inputs, fixed per-call overhead can dominate; the practical advantage
  appears on medium and large workloads.
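
The shaded bands can be reproduced from raw timings with a percentile bootstrap of the median. The sketch below is illustrative only: the function name, the resample count, and the choice of the percentile method are assumptions, not a description of what `scripts/benchmark_all.py` actually does.

```python
import random
import statistics


def bootstrap_median_ci(samples, n_boot=10_000, alpha=0.05, seed=42):
    """Return (median, lower, upper) using a percentile bootstrap.

    Assumption: the percentile method with n_boot resamples; the real
    benchmark script may use a different bootstrap variant.
    """
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(samples, k=n)) for _ in range(n_boot)
    )
    lower = medians[int((alpha / 2) * n_boot)]
    upper = medians[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.median(samples), lower, upper


# Five hypothetical timings in seconds (not measured data).
timings = [1.49, 1.51, 1.53, 1.56, 1.60]
median, lower, upper = bootstrap_median_ci(timings)
```

With only five timed repetitions per size point, the interval is necessarily coarse, so the bands are best read as a rough indication of run-to-run spread.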

## Numerical validation

Because every library uses a slightly different ancestor-traversal strategy and MICA
selection, exact agreement between any two libraries is not expected. A pairwise
numerical comparison is provided in
[`imgs/validation/Supplementary Material S1.pdf`](https://github.com/Mellandd/go3/blob/master/imgs/validation/Supplementary%20Material%20S1.pdf),
generated by [`scripts/validate_cross_tool.py`](https://github.com/Mellandd/go3/blob/master/scripts/validate_cross_tool.py):

- **GO3 vs GOATOOLS**: near-perfect agreement (Pearson *r* > 0.97 at term and gene
  level), indicating that GO3's IC and MICA pipeline matches the reference Python
  implementation.
- **FastSemSim / GOSemSim / simona**: moderate to high agreement with GO3
  (*r* ≈ 0.63–0.95 depending on level and tool); the differences stem from alternative
  MICA-selection strategies documented in each tool's literature.
- **TaxaGO**: showed weak agreement (*r* ≈ 0.20–0.48) in the original analysis, which
  ran without `--propagate-counts`. The present analysis uses this flag so that TaxaGO's
  IC values are directly comparable to those of the other tools.
- At the gene level, BMA aggregation smooths term-level discrepancies, so agreement is
  uniformly higher than at the term level.

## Methodology

### Compared libraries

All libraries are invoked through dedicated **runner adapters** under
`scripts/runners/`. Each adapter:

- reports whether the tool is available on the system;
- isolates the tool (subprocess for R / CLI tools) so import/parse costs are fully
  included in the loading timings;
- receives the **same sampled inputs** for every size point, so no tool gets a different
  workload.

### Loading benchmark

Each run spawns a fresh process, so import and parse costs are paid every repeat.
Reported metrics:

- median wall-clock time (5 runs)
- median peak resident memory (RSS)
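
One way to collect both metrics for a fresh process on a POSIX system is to spawn the command and read its resource usage from `os.wait4`. This is a minimal sketch under that assumption; the helper names are hypothetical and the project's runner adapters may measure time and memory differently.

```python
import os
import statistics
import sys
import time


def time_fresh_process(argv):
    """Run argv in a new process and return (wall_seconds, peak_rss_bytes)."""
    start = time.perf_counter()
    # posix_spawn does not search PATH, so argv[0] must be a full path.
    pid = os.posix_spawn(argv[0], argv, os.environ)
    # wait4 returns the resource usage of this specific child.
    _, status, usage = os.wait4(pid, 0)
    elapsed = time.perf_counter() - start
    if os.waitstatus_to_exitcode(status) != 0:
        raise RuntimeError(f"command failed: {argv}")
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    scale = 1 if sys.platform == "darwin" else 1024
    return elapsed, usage.ru_maxrss * scale


def median_load_cost(argv, repeats=5):
    """Median wall-clock time and median peak RSS over fresh runs."""
    runs = [time_fresh_process(argv) for _ in range(repeats)]
    return (
        statistics.median(t for t, _ in runs),
        statistics.median(m for _, m in runs),
    )
```

For example, `median_load_cost([sys.executable, "-c", "import json"])` reports the cost of starting an interpreter and importing one module, with no state carried over between repeats.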

### Term-pair benchmark

Uses **closed term sets**: for each target pair count *P*, the smallest *N* with
`C(N,2) ≥ P` is chosen, and every runner sees the same *N* terms. The reported *x*-axis
is the actual number of pairs, `C(N,2)`. Closed sets are needed to accommodate TaxaGO,
which takes a term set and returns the full *N*×*N* similarity matrix.
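
The set size for a given target can be computed directly from the inequality `C(N,2) ≥ P`. The function below is a small sketch with a hypothetical name; it is not taken from the benchmark scripts.

```python
import math


def closed_set_size(target_pairs: int) -> int:
    """Smallest N such that C(N, 2) >= target_pairs."""
    if target_pairs <= 0:
        return 0
    # N(N-1)/2 >= P gives N >= (1 + sqrt(1 + 8P)) / 2; start from the
    # integer floor of that bound and step up if it falls short.
    n = (1 + math.isqrt(1 + 8 * target_pairs)) // 2
    while math.comb(n, 2) < target_pairs:
        n += 1
    return n


n_terms = closed_set_size(5050)   # 101 terms
actual_pairs = math.comb(n_terms, 2)  # 5050 pairs
```

A target of 5,000 pairs also yields 101 terms, which is why the plotted *x* value (5,050) can differ from the requested pair count.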

### Gene-pair benchmark

Disjoint random samples of gene pairs (per size group), restricted to genes with at
least 8 BP annotations. Every runner processes the same pair sets.
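
A sampling scheme of this kind might look like the sketch below. It assumes that "disjoint" means no gene pair is shared between size groups, and it takes a hypothetical mapping from gene ID to its number of BP annotations; the actual sampler in the benchmark scripts may differ.

```python
import math
import random


def sample_disjoint_pairs(annotation_counts, sizes, min_annotations=8, seed=42):
    """Draw one list of unordered gene pairs per size, with no pair reused.

    annotation_counts: hypothetical dict of gene ID -> number of BP annotations.
    """
    rng = random.Random(seed)
    # Sort so the result depends only on the seed, not on dict ordering.
    eligible = sorted(
        g for g, count in annotation_counts.items() if count >= min_annotations
    )
    if sum(sizes) > math.comb(len(eligible), 2):
        raise ValueError("not enough eligible genes for the requested sizes")
    seen = set()
    groups = {}
    for size in sizes:
        pairs = []
        while len(pairs) < size:
            a, b = rng.sample(eligible, 2)
            key = (a, b) if a < b else (b, a)
            if key not in seen:  # reject pairs already used in any group
                seen.add(key)
                pairs.append(key)
        groups[size] = pairs
    return groups


# Toy input: gene "G<i>" has i annotations, so G8..G19 are eligible.
toy_counts = {f"G{i}": i for i in range(20)}
groups = sample_disjoint_pairs(toy_counts, sizes=[5, 10])
```

Because the seed is fixed, calling the function again with the same input returns the same pair lists, which is what lets every runner process identical workloads.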

### All-vs-all gene benchmark

For each cohort size *g*, all `g(g−1)/2` pairs are generated. This workload reflects
realistic quadratic scenarios: clustering, network construction, or cohort-level
exploratory analyses.

### Fairness notes

- All libraries receive the same OBO and GAF inputs.
- TaxaGO is invoked with `--propagate-counts` so that annotation counts are propagated
  to parent terms, matching the IC computation strategy used by all other libraries.
- Gene-level BMA is not exposed natively by every library; where absent (e.g., TaxaGO),
  the adapter implements BMA on top of the library's term-pair output, and the reported
  time covers the full end-to-end pipeline.
- Random seeds are fixed (`seed=42`) so samples are reproducible.
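
For tools that only return a term-by-term similarity matrix, an adapter-side BMA step can be sketched as follows. The definition used here (sum of row maxima plus sum of column maxima, divided by the total number of terms) is one common formulation and is an assumption; the function name is hypothetical and the project's adapters may use a different variant.

```python
def best_match_average(sim):
    """Best Match Average over a term-similarity matrix.

    sim[i][j] is the similarity between term i of gene A and term j of gene B.
    Assumed definition: (sum of row maxima + sum of column maxima) / (m + n).
    """
    if not sim or not sim[0]:
        raise ValueError("both genes need at least one annotated term")
    row_best = [max(row) for row in sim]        # best match for each term of A
    col_best = [max(col) for col in zip(*sim)]  # best match for each term of B
    return (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best))


# Toy matrix: gene A has 3 terms, gene B has 2 terms (values are made up).
toy_sim = [
    [1.0, 0.2],
    [0.4, 0.6],
    [0.0, 0.5],
]
score = best_match_average(toy_sim)  # (2.1 + 1.6) / 5
```

Timing such an adapter end to end, including the matrix computation and the aggregation, keeps the comparison with libraries that expose BMA natively consistent.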

## Reproducing the benchmarks

The orchestrator is `scripts/benchmark_all.py`. Discovery is automatic: every runner
whose underlying tool is available on the system participates.

Default profile:

```bash
python scripts/benchmark_all.py --outdir imgs
```

Paper-ready profile (larger sizes, more repeats, SVG + PNG output):

```bash
python scripts/benchmark_all.py --paper-ready --outdir imgs
```

Restrict to specific libraries:

```bash
python scripts/benchmark_all.py --only go3,goatools,fastsemsim --outdir imgs
```

Exclude the heaviest tools:

```bash
python scripts/benchmark_all.py --exclude gosemsim,simona --outdir imgs
```

Regenerate plots from an existing results file (no recomputation):

```bash
python scripts/benchmark_all.py --replot imgs/benchmark_results.json --outdir imgs
```

### Output artifacts

- `imgs/benchmark_loading_time_memory.png`
- `imgs/benchmark_batch_similarity.png`
- `imgs/benchmark_gene_batch_similarity.png`
- `imgs/benchmark_all_vs_all_gene_similarity.png`
- `imgs/benchmark_results.json` — raw timings, medians, confidence intervals, and full
  experimental metadata (OBO/GAF versions, system info, runner capabilities).