Annotations and IC

GO3 parses annotations in GAF (GO Annotation File) format and builds the term-level statistics that Information Content (IC) based similarity methods rely on.

Functions

load_gaf(path)

  • Parses a GAF file and caches gene-to-GO mappings.

  • Returns a list of GAFAnnotation objects.

build_term_counter(annotations)

  • Builds a TermCounter from parsed annotations.

  • Computes counts and Information Content (IC) by namespace.

Information Content formula

IC quantifies how specific a term is within its namespace:

\[IC(t) = -\log\!\left(\frac{\text{count}(t)}{\text{total}(\text{namespace})}\right)\]

  • count(t) is the number of annotations for term t, including annotations propagated from descendant terms. If a gene is annotated to a child term, every ancestor of that child term is also counted for that annotation.

  • total(namespace) is the total annotation count for the namespace (the value stored in counter.total_by_ns).

The total_by_ns dictionary has one entry per namespace (biological_process, molecular_function, cellular_component). Each value is that namespace's total propagated annotation count, which is the denominator in the IC formula.
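As a worked illustration of the formula, the sketch below computes IC for three terms from made-up counts. It does not call GO3; the numbers are invented, the helper name information_content is hypothetical, and the use of the natural logarithm is an assumption, since the formula above does not state a base.

```python
import math

# Toy propagated counts for one namespace (invented numbers, not real GOA data).
counts = {
    "GO:0008150": 1000,  # namespace root
    "GO:0009987": 400,   # a mid-level term
    "GO:0006915": 10,    # a specific term
}
# Stands in for counter.total_by_ns["biological_process"]; chosen equal to the
# root count purely for illustration.
total = 1000


def information_content(count: int, total: int) -> float:
    """Return -log(count / total), written as log(total / count).

    The natural logarithm is an assumption; GO3 may use another base.
    """
    if count <= 0 or total <= 0:
        raise ValueError("count and total must be positive")
    return math.log(total / count)


ic = {term: information_content(c, total) for term, c in counts.items()}
for term, value in ic.items():
    print(f"{term}: IC = {value:.3f}")
```

The fewer annotations a term has relative to its namespace total, the larger its IC; a term that accounts for the whole namespace gets an IC of zero.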

Filtering rules in load_gaf

During parsing, GO3 applies three filters:

  • ND (No biological Data available): annotations with evidence code ND are skipped. These indicate that no experimental or computational evidence exists for the gene product, so they carry no functional information and would only add noise to IC calculations.

  • NOT qualifier: annotations whose qualifier contains NOT are skipped. These indicate that a gene is explicitly not associated with a term.

  • Obsolete GO terms: handled automatically:

    • uses replaced_by when available

    • otherwise uses the first consider target when available

    • otherwise discards that annotation

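Because these rules determine which annotations are counted, they change IC values and every similarity score derived from them. Benchmark results are comparable with other tools only when those tools apply the same filters.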
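The filtering rules above can be sketched in plain Python. This is an illustration of the decision order described in this section, not GO3's actual implementation: the function names, the lookup tables, and the GO IDs in them are all hypothetical, and the NOT check assumes the qualifier column is a pipe-separated list as in the GAF format.

```python
from typing import Optional

# Hypothetical lookup tables with made-up IDs; a real parser would fill these
# from the ontology file's replaced_by and consider tags.
OBSOLETE = {"GO:0000001", "GO:0000002", "GO:0000003"}
REPLACED_BY = {"GO:0000001": "GO:0000010"}
CONSIDER = {"GO:0000002": ["GO:0000020", "GO:0000021"]}


def resolve_term(go_id: str) -> Optional[str]:
    """Return the term to use, or None if an obsolete term has no successor."""
    if go_id not in OBSOLETE:
        return go_id
    if go_id in REPLACED_BY:           # rule 1: replaced_by wins
        return REPLACED_BY[go_id]
    targets = CONSIDER.get(go_id, [])  # rule 2: first consider target
    return targets[0] if targets else None  # rule 3: discard


def keep_annotation(qualifier: str, go_id: str, evidence: str) -> Optional[str]:
    """Return the GO ID to record, or None if the annotation is skipped."""
    if evidence == "ND":
        return None
    if "NOT" in qualifier.split("|"):
        return None
    return resolve_term(go_id)
```

For example, keep_annotation("NOT|enables", "GO:0003674", "IDA") returns None, while an annotation to the made-up obsolete term GO:0000002 is redirected to GO:0000020.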

Annotation propagation

When build_term_counter computes counts, each annotation is propagated up the DAG: if a gene is annotated to term t, then t and all ancestors of t receive a count increment. This means that general (high-level) terms accumulate many counts and have low IC, while specific (leaf-level) terms have fewer counts and high IC.
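To make the propagation step concrete, here is a minimal sketch on a toy DAG. The term names ("root", "A", "B", "C"), the PARENTS table, and both helper functions are hypothetical. Counting each ancestor once per annotation, even when it is reachable through several paths, is an assumption about how GO3 behaves rather than something this page states.

```python
from collections import Counter

# Toy DAG: term -> direct parents (hypothetical term names).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],  # two parents, so "root" is reachable along two paths
}


def ancestors_and_self(term: str) -> set:
    """Collect the term plus every ancestor, each exactly once."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def propagate(direct_terms: list) -> Counter:
    """Add one count to the annotated term and to each of its ancestors."""
    counts = Counter()
    for term in direct_terms:
        counts.update(ancestors_and_self(term))
    return counts


# Three direct annotations: two to the leaf "C" and one to "A".
counts = propagate(["C", "C", "A"])
print(counts)
```

Here "root" and "A" each end up with 3 counts, while "B" and "C" have 2, which is why terms near the root get the lowest IC.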

Common evidence codes

GAF files record evidence codes that describe how an annotation was determined. GO3 does not filter by evidence code (other than ND), but knowing what they mean helps interpret annotation quality:

| Code | Name                                | Description                                                         |
|------|-------------------------------------|---------------------------------------------------------------------|
| IEA  | Inferred from Electronic Annotation | Computationally assigned, not curator-reviewed; most abundant code  |
| EXP  | Inferred from Experiment            | Supported by a published experiment                                 |
| IDA  | Inferred from Direct Assay          | Based on a direct experimental assay                                |
| IPI  | Inferred from Physical Interaction  | Based on a physical interaction experiment                          |
| IMP  | Inferred from Mutant Phenotype      | Based on observed mutant phenotype                                  |
| IGI  | Inferred from Genetic Interaction   | Based on a genetic interaction experiment                           |
| IEP  | Inferred from Expression Pattern    | Based on gene expression data                                       |
| TAS  | Traceable Author Statement          | Directly cited from a published paper                               |
| NAS  | Non-traceable Author Statement      | Author statement without a specific citation                        |
| ND   | No biological Data available        | Placeholder when no annotation exists; filtered out by GO3          |

For stricter filtering (e.g., excluding IEA to use only curated annotations), pre-filter your GAF file before passing it to load_gaf.
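One way to do this pre-filtering is a small standalone script. The sketch below assumes the standard tab-separated GAF 2.x layout, in which the evidence code is the seventh column and lines starting with "!" are header comments. The function name filter_gaf and the file names in the usage note are hypothetical.

```python
def filter_gaf(src_path: str, dst_path: str, excluded=frozenset({"IEA"})) -> int:
    """Copy a GAF file, dropping rows whose evidence code is in `excluded`.

    Returns the number of annotation rows written to dst_path.
    """
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.startswith("!"):
                dst.write(line)  # header/comment lines are copied unchanged
                continue
            if not line.strip():
                continue  # skip blank lines
            columns = line.rstrip("\n").split("\t")
            # Column 7 (index 6) holds the evidence code in GAF 2.x.
            if len(columns) > 6 and columns[6] in excluded:
                continue
            dst.write(line)
            kept += 1
    return kept
```

After running, for instance, filter_gaf("goa_human.gaf", "goa_human.curated.gaf"), the filtered file can be passed to load_gaf in place of the original.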

Example

import go3

go3.load_go_terms("go-basic.obo")
annotations = go3.load_gaf("goa_human.gaf")
counter = go3.build_term_counter(annotations)

print("Annotations:", len(annotations))
print("IC terms:", len(counter.ic))

Inspecting structures

ann = annotations[0]
print(ann.db_object_id, ann.go_term, ann.evidence)

# Raw annotation count for a term (after propagation)
print(counter.counts.get("GO:0008150", 0))

# Total annotations per namespace (denominator for IC)
print(counter.total_by_ns)

# IC value for a specific term
print(counter.ic.get("GO:0008150", 0.0))

Class reference

GAFAnnotation fields:

  • db_object_id

  • go_term

  • evidence

TermCounter fields:

  • counts – per-term annotation count (after ancestor propagation)

  • total_by_ns – total annotation count per namespace (used as IC denominator)

  • ic – per-term Information Content values

API reference

class GAFAnnotation

Bases: object

Struct representing a single annotation from a GAF file.

Fields

db_object_id (str)

The gene product identifier (e.g., UniProt ID).

go_term (str)

The GO term ID (e.g., GO:0008150).

evidence (str)

The evidence code for the annotation (e.g., IEA).

class TermCounter

Bases: object

Struct holding annotation counts and information content (IC) for GO terms.

Fields

counts (dict)

Mapping from GO term ID to annotation count.

total_by_ns (dict)

Mapping from namespace to total annotation count.

ic (dict)

Mapping from GO term ID to information content (IC).

build_term_counter(py_annotations)

Build a term counter (counts, IC) from GAF annotations.

Parameters:

py_annotations (list of GAFAnnotation) – List of GAFAnnotation Python objects.

Returns:

Struct with counts and IC values.

Return type:

TermCounter

load_gaf(path)

Load a GAF annotation file and cache the gene-to-GO mapping.

Parameters:

path (str) – Path to the GAF file.

Returns:

List of parsed GAF annotations.

Return type:

list of GAFAnnotation