Annotations and IC
GO3 parses GAF annotations and builds term-level statistics for IC-based similarity methods.
Functions
load_gaf(path)
Parses a GAF file and caches gene-to-GO mappings. Returns a list of GAFAnnotation objects.
build_term_counter(annotations)
Builds a TermCounter from parsed annotations. Computes counts and Information Content (IC) by namespace.
Information Content formula
IC quantifies how specific a term is within its namespace:
IC(t) = -log(count(t) / total(namespace))
- count(t) is the number of annotations for term t, including annotations propagated from descendant terms. If a gene is annotated to a child term, all of its ancestors also receive that count.
- total(namespace) is the total annotation count for the namespace (the value stored in counter.total_by_ns).
The total_by_ns dictionary maps each namespace (biological_process, molecular_function, cellular_component) to its total propagated annotation count. This is the denominator in the IC formula.
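As a worked illustration of the formula, the sketch below computes IC for two terms from toy counts. It does not call GO3; the function name `information_content`, the numbers, the use of the natural logarithm, and the choice to return 0.0 for unannotated terms are all assumptions made for the example.

```python
import math


def information_content(count: int, total: int) -> float:
    """Return -log(count / total) for one term (hypothetical helper).

    The natural log is an assumption; GO3 may use another base.
    """
    if count <= 0 or total <= 0:
        # Assumed convention for terms without annotations.
        return 0.0
    # log(total / count) is the same value as -log(count / total).
    return math.log(total / count)


# Toy numbers, not taken from any real GAF file.
total_bp = 1000  # plays the role of total_by_ns["biological_process"]
counts = {"GO:0008150": 1000, "GO:0006915": 10}

ic = {term: information_content(c, total_bp) for term, c in counts.items()}
for term, value in ic.items():
    print(term, round(value, 3))
```

A term whose count equals the namespace total gets an IC of zero, and the value grows as the term becomes rarer.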
Filtering rules in load_gaf
During parsing, GO3 applies three filters:
- ND (No biological Data): annotations with evidence code ND are skipped. These indicate that no experimental or computational evidence exists and would add noise to IC calculations.
- NOT qualifier: annotations whose qualifier contains NOT are skipped. These indicate that a gene is explicitly not associated with a term.
- Obsolete GO terms: handled automatically:
  - uses replaced_by when available
  - otherwise uses the first consider target when available
  - otherwise discards that annotation
These rules affect both downstream scores and benchmark comparability.
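The rules above can be sketched in plain Python as follows. This is an illustration of the decision order only, not GO3's actual parser: the `OBSOLETE` table, the made-up term IDs, and the function names `resolve_term` and `filter_annotation` are hypothetical.

```python
from typing import Optional

# Hypothetical obsolete-term table with made-up IDs:
# obsolete term -> (replaced_by or None, list of consider targets)
OBSOLETE = {
    "GO:9000001": ("GO:9000002", []),
    "GO:9000003": (None, ["GO:9000004", "GO:9000005"]),
    "GO:9000006": (None, []),
}


def resolve_term(go_id: str) -> Optional[str]:
    """Map an obsolete term to its substitute, or None if there is none."""
    if go_id not in OBSOLETE:
        return go_id  # term is current, keep it unchanged
    replaced_by, consider = OBSOLETE[go_id]
    if replaced_by:
        return replaced_by
    if consider:
        return consider[0]  # first consider target
    return None  # no substitute: the annotation is discarded


def filter_annotation(qualifier: str, go_id: str, evidence: str) -> Optional[str]:
    """Return the GO term to keep for this annotation, or None to skip it."""
    if evidence == "ND":
        return None
    if "NOT" in qualifier.split("|"):
        return None
    return resolve_term(go_id)
```

Returning the resolved term ID (rather than a boolean) keeps the skip decision and the obsolete-term substitution in one place.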
Annotation propagation
When build_term_counter computes counts, each annotation is propagated up the DAG: if a gene is annotated to term t, then t and all ancestors of t receive a count increment. This means that general (high-level) terms accumulate many counts and have low IC, while specific (leaf-level) terms have fewer counts and high IC.
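The propagation step can be illustrated on a toy DAG. The sketch below is independent of GO3: the `PARENTS` mapping, the term names, and the helper functions are hypothetical, and it assumes each annotation increments every distinct ancestor exactly once, even when an ancestor is reachable by more than one path.

```python
from collections import Counter

# Toy DAG with made-up term names: child -> direct parents.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],  # C reaches root through two paths
}


def ancestors(term: str) -> set:
    """Collect all distinct ancestors of a term (excluding the term itself)."""
    seen = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def propagate(annotated_terms: list) -> Counter:
    """Count each annotation for its own term and for every ancestor."""
    counts = Counter()
    for term in annotated_terms:
        counts[term] += 1
        for anc in ancestors(term):
            counts[anc] += 1
    return counts


# Three annotations: two to the leaf C, one to A.
counts = propagate(["C", "C", "A"])
print(dict(counts))
```

In this toy run the root collects all three annotations while the leaf C keeps only its own two, which is why general terms end up with low IC.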
Common evidence codes
GAF files record evidence codes that describe how an annotation was determined. GO3 does not filter by evidence code (other than ND), but knowing what they mean helps interpret annotation quality:
| Code | Name | Description |
|---|---|---|
| IEA | Inferred from Electronic Annotation | Computationally assigned, not curator-reviewed; most abundant code |
| EXP | Inferred from Experiment | Supported by a published experiment |
| IDA | Inferred from Direct Assay | Based on a direct experimental assay |
| IPI | Inferred from Physical Interaction | Based on a physical interaction experiment |
| IMP | Inferred from Mutant Phenotype | Based on observed mutant phenotype |
| IGI | Inferred from Genetic Interaction | Based on a genetic interaction experiment |
| IEP | Inferred from Expression Pattern | Based on gene expression data |
| TAS | Traceable Author Statement | Directly cited from a published paper |
| NAS | Non-traceable Author Statement | Author statement without a specific citation |
| ND | No biological Data | Placeholder when no annotation exists; filtered out by GO3 |
For stricter filtering (e.g., excluding IEA to use only curated annotations), pre-filter your GAF file before passing it to load_gaf.
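One way to do this pre-filtering is a small standalone script. The sketch below relies only on the GAF layout (tab-separated columns, comment lines starting with `!`, evidence code in the seventh column); the function name `filter_gaf` and the output file name in the usage note are hypothetical.

```python
def filter_gaf(src_path: str, dst_path: str, excluded=frozenset({"IEA"})) -> int:
    """Copy a GAF file, dropping rows whose evidence code is in `excluded`.

    Returns the number of annotation rows written.
    """
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.startswith("!"):
                dst.write(line)  # keep header and comment lines as-is
                continue
            cols = line.rstrip("\n").split("\t")
            # Evidence code is column 7 (index 6); rows that are too
            # short to have one are treated as malformed and dropped.
            if len(cols) > 6 and cols[6] not in excluded:
                dst.write(line)
                kept += 1
    return kept
```

For example, `filter_gaf("goa_human.gaf", "goa_human.curated.gaf")` would write a copy without IEA rows, and the resulting file can then be passed to load_gaf.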
Example
import go3
go3.load_go_terms("go-basic.obo")
annotations = go3.load_gaf("goa_human.gaf")
counter = go3.build_term_counter(annotations)
print("Annotations:", len(annotations))
print("IC terms:", len(counter.ic))
Inspecting structures
ann = annotations[0]
print(ann.db_object_id, ann.go_term, ann.evidence)
# Raw annotation count for a term (after propagation)
print(counter.counts.get("GO:0008150", 0))
# Total annotations per namespace (denominator for IC)
print(counter.total_by_ns)
# IC value for a specific term
print(counter.ic.get("GO:0008150", 0.0))
Class reference
GAFAnnotation fields:
- db_object_id
- go_term
- evidence
TermCounter fields:
- counts – per-term annotation count (after ancestor propagation)
- total_by_ns – total annotation count per namespace (used as IC denominator)
- ic – per-term Information Content values
API reference
- class GAFAnnotation
Bases: object
Struct representing a single annotation from a GAF file.
Fields
- db_object_id (str)
The gene product identifier (e.g., UniProt ID).
- go_term (str)
The GO term ID (e.g., GO:0008150).
- evidence (str)
The evidence code for the annotation (e.g., IEA).
- class TermCounter
Bases: object
Struct holding annotation counts and information content (IC) for GO terms.
Fields
- counts (dict)
Mapping from GO term ID to annotation count.
- total_by_ns (dict)
Mapping from namespace to total annotation count.
- ic (dict)
Mapping from GO term ID to information content (IC).
- build_term_counter(py_annotations)
Build a term counter (counts, IC) from GAF annotations.
- Parameters:
py_annotations (list of GAFAnnotation) – List of GAFAnnotation Python objects.
- Returns:
Struct with counts and IC values.
- Return type:
TermCounter
- load_gaf(path)
Load a GAF annotation file and cache the gene-to-GO mapping.
- Parameters:
path (str) – Path to the GAF file.
- Returns:
List of parsed GAF annotations.
- Return type:
list of GAFAnnotation