API reference

This page documents the public surface of LLM-PathwayCurator.
Most users should start with the CLI (llm-pathway-curator run ...). The Python API exists for integration and reproducible orchestration.


CLI

Primary entry point:

  • llm-pathway-curator run ...

The CLI runs the end-to-end pipeline:

EvidenceTable → distill → modules → claims → audit → report


Pipeline

End-to-end orchestration (recommended integration point).

llm_pathway_curator.pipeline

RunConfig dataclass

RunConfig(
    evidence_table,
    sample_card,
    outdir,
    force=False,
    seed=None,
    run_meta_name="run_meta.json",
    tau=None,
    k_claims=None,
    stress_evidence_dropout_p=None,
    stress_evidence_dropout_min_keep=None,
    stress_contradictory_p=None,
    stress_contradictory_max_extra=None,
)

Pipeline run configuration.

Parameters:

  • evidence_table (str) –

    Path to the input EvidenceTable TSV.

  • sample_card (str) –

    Path to the SampleCard JSON.

  • outdir (str) –

    Output directory path.

  • force (bool, default: False ) –

    If True, allow writing into a non-empty outdir.

  • seed (int | None, default: None ) –

    Random seed used for deterministic steps.

  • run_meta_name (str, default: 'run_meta.json' ) –

    File name for run metadata JSON written under outdir.

  • tau (float | None, default: None ) –

    Optional override for audit threshold tau. If None, uses card.audit_tau().

  • k_claims (int | None, default: None ) –

    Optional override for number of claims to propose.

  • stress_evidence_dropout_p (float | None, default: None ) –

    Probability for evidence gene dropout stress test.

  • stress_evidence_dropout_min_keep (int | None, default: None ) –

    Minimum number of genes to keep per term under dropout stress.

  • stress_contradictory_p (float | None, default: None ) –

    Probability to inject contradictory direction claims.

  • stress_contradictory_max_extra (int | None, default: None ) –

    Cap for number of injected contradictory rows.

Notes

This config is designed to be JSON-serializable via dataclasses.asdict.
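
As a minimal sketch of what JSON-serializability via dataclasses.asdict implies, the stand-in below mirrors some of the fields from the signature above. RunConfigSketch is a hypothetical name used only for illustration; it is not the library's class.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RunConfigSketch:
    """Hypothetical stand-in mirroring part of the RunConfig signature."""

    evidence_table: str
    sample_card: str
    outdir: str
    force: bool = False
    seed: Optional[int] = None
    run_meta_name: str = "run_meta.json"
    tau: Optional[float] = None
    k_claims: Optional[int] = None


cfg = RunConfigSketch("evidence_table.tsv", "sample_card.json", "out", seed=0)

# Every field is a plain str/bool/int/float/None, so the asdict() output
# round-trips through json without a custom encoder.
payload = json.dumps(asdict(cfg), sort_keys=True)
restored = RunConfigSketch(**json.loads(payload))
```

Keeping paths as plain strings (rather than Path objects) is what makes this round trip work without extra encoding logic.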

run_pipeline

run_pipeline(cfg, *, run_id=None)

Run the full LLM-PathwayCurator pipeline.

Parameters:

  • cfg (RunConfig) –

    Run configuration.

  • run_id (str | None, default: None ) –

    Optional explicit run id. If None, a run id is generated.

Returns:

  • RunResult

    Summary of the run, including artifact paths and meta_path.

Raises:

  • FileNotFoundError

    If required input files are missing.

  • IsADirectoryError

    If a required input path is a directory.

  • FileExistsError

    If outdir is non-empty and cfg.force is False.

  • RuntimeError

    If a required step produces zero rows.

  • Exception

    Any exception raised by underlying steps is propagated after writing run_meta status="error".

Notes

Step order: distill -> modules -> select_claims -> context_review -> stress -> audit -> report -> report_jsonl.

Artifacts and run metadata are written into cfg.outdir. The run_meta.json is updated at each step to support reproducibility and debugging.
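
The "update run metadata at each step, record status="error", then propagate" behavior described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the pipeline's code: run_steps, the "steps_done" key, and the "running"/"ok" status labels are hypothetical; only status="error" is named in this page.

```python
import json
import tempfile
from pathlib import Path


def run_steps(outdir, steps, run_meta_name="run_meta.json"):
    """Run (name, callable) steps, rewriting run metadata after each one."""
    meta_path = Path(outdir) / run_meta_name
    # "steps_done" is a hypothetical key; only "status" is named in the docs.
    meta = {"status": "running", "steps_done": []}
    meta_path.write_text(json.dumps(meta))
    try:
        for name, step in steps:
            step()
            meta["steps_done"].append(name)
            meta_path.write_text(json.dumps(meta))
    except Exception:
        # Record the failure first, then propagate the original exception.
        meta["status"] = "error"
        meta_path.write_text(json.dumps(meta))
        raise
    meta["status"] = "ok"  # hypothetical success label
    meta_path.write_text(json.dumps(meta))
    return meta_path


def failing_step():
    raise RuntimeError("step produced zero rows")


with tempfile.TemporaryDirectory() as tmp:
    try:
        run_steps(tmp, [("distill", lambda: None), ("modules", failing_step)])
    except RuntimeError:
        pass
    recorded = json.loads((Path(tmp) / "run_meta.json").read_text())
```

After a failed run, the metadata file still shows which steps completed, which is the property that makes it useful for debugging.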

Environment variables

Many behaviors can be controlled via environment variables, including:

  • Backend and modes: LLMPATH_BACKEND, LLMPATH_CLAIM_MODE
  • Context: LLMPATH_CONTEXT_* (gate/review/corpus/weights/rerank)
  • Stress: LLMPATH_STRESS_* (dropout/contradictory)


Contracts

EvidenceTable (TSV contract)

EvidenceTable is the normalized term × supporting-genes table used by all downstream stages. It is the stability boundary: if the EvidenceTable is valid, distill/modules/select/audit/report should not break.

llm_pathway_curator.schema

EvidenceTable schema gate for LLM-PathwayCurator. This module defines the tool-facing EvidenceTable contract (v1) that preserves term×gene relationships across enrichment analysis tools (ORA, fgsea/GSEA, etc.). It provides robust IO, conservative column aliasing, spec-owned evidence parsing (delegated to _shared), and provenance metadata (df.attrs) for auditability.

EvidenceTable dataclass

EvidenceTable(df)

Tool-facing EvidenceTable wrapper.

This class normalizes heterogeneous enrichment analysis outputs into a stable, auditable internal representation that preserves term×gene relationships.

Notes
  • The contract requires non-empty term_id, term_name, and evidence_genes.
  • Parsing/normalization of gene tokens is spec-owned by llm_pathway_curator._shared (e.g., parse_genes, clean_gene_token), to avoid contract drift.
  • Provenance and health summaries are recorded in df.attrs.

read_tsv classmethod

read_tsv(path, *, strict=False, drop_invalid=True)

Read and normalize an evidence table to the contract (v1).

This is the main schema gate that:

  • aliases common column variants to contract names
  • cleans required fields
  • parses evidence_genes via _shared.parse_genes
  • normalizes numeric fields (stat, qval, pval)
  • validates the term×gene contract
  • optionally computes q-values from p-values (BH) when q-values are missing
  • records provenance and health metrics in df.attrs

Parameters:

  • path (str) –

    Input evidence table path.

  • strict (bool, default: False ) –

    If True, the first invalid row raises ValueError. If False, invalid rows are marked (and optionally dropped). Default is False.

  • drop_invalid (bool, default: True ) –

    If True, drop rows with is_valid=False. Default is True. If False, keep invalid rows and rely on is_valid downstream.

Returns:

  • EvidenceTable

    Normalized EvidenceTable instance.

Raises:

  • ValueError

    If core required columns are missing after aliasing, or if strict=True and an invalid row is encountered.

Notes

Contract-required columns (core):

  • term_id
  • term_name
  • stat
  • evidence_genes

Output guarantees (post-normalization):

  • evidence_genes is a list-like object per row (and evidence_genes_str is TSV-safe)
  • direction is normalized (typically 'up', 'down', 'na')
  • df.attrs contains: contract_version, read_mode, aliasing, health

Examples:

>>> et = EvidenceTable.read_tsv("evidence_table.tsv")
>>> info = et.summarize()
>>> et.write_tsv("normalized_evidence_table.tsv")
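
The optional BH step mentioned for read_tsv refers to the Benjamini-Hochberg adjustment. The following is a minimal pure-Python sketch of that standard procedure; bh_qvalues is a hypothetical helper, and how the library handles missing or invalid p-values is not covered here.

```python
def bh_qvalues(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum enforces
    # monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals


qvals = bh_qvalues([0.01, 0.04, 0.03, 0.20])
```

Here the p-values 0.04 and 0.03 end up with the same q-value, because the adjusted value for the smaller p-value is capped by the one ranked above it.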

summarize

summarize()

Summarize the normalized EvidenceTable for logging and QA.

Returns:

  • dict[str, object]

    Summary dictionary including:

      • contract version
      • number of terms and sources
      • direction counts
      • evidence genes per term quantiles
      • q-value provenance counts
      • df.attrs['health'] and df.attrs['aliasing'] (if present)

write_tsv

write_tsv(path)

Write the normalized EvidenceTable to a TSV file.

This writer:

  • applies a small Excel formula-injection defense for common text fields
  • serializes evidence_genes as a TSV-friendly string column
  • emits a stable column order for reproducibility

Parameters:

  • path (str) –

    Output TSV path.

Notes
  • evidence_genes is written as a joined string under the column name evidence_genes (list form is dropped).
  • Normalized contract columns are emitted first; remaining columns are sorted.
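
A common way to defend against spreadsheet formula injection is to prefix risky cells with an apostrophe. The sketch below shows that general technique together with gene-list joining. It is an assumption about the approach, not the library's writer: defuse_cell, join_genes, the prefix set, and the comma separator are all hypothetical.

```python
RISKY_PREFIXES = ("=", "+", "-", "@")


def defuse_cell(value):
    """Prefix an apostrophe so spreadsheets treat the cell as text."""
    text = str(value)
    if text.startswith(RISKY_PREFIXES):
        return "'" + text
    return text


def join_genes(genes, sep=","):
    """Serialize a list of gene symbols as one TSV-safe string."""
    cleaned = (str(g).strip() for g in genes)
    return sep.join(g for g in cleaned if g)


row = {"term_name": "=SUM(A1:A2)", "evidence_genes": ["TP53", " MDM2 ", ""]}
safe_row = {
    "term_name": defuse_cell(row["term_name"]),
    "evidence_genes": join_genes(row["evidence_genes"]),
}
```

Such a defense should be limited to text fields; applying it to numeric columns would also rewrite negative numbers.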

Sample Card (study context contract)

The Sample Card is a structured record of study intent/context (e.g., condition/tissue/perturbation/comparison), used by proposal steps and context validity gates.

llm_pathway_curator.sample_card

SampleCard

Bases: BaseModel

SampleCard: tool-facing context and knob container.

Attributes:

  • condition, tissue, perturbation, comparison (str) –

    Core context keys normalized into stable strings.

  • notes (str or None) –

    Optional free-form notes for humans.

  • context_tokens_text (str or None) –

    Optional free-form text used to derive deterministic context tokens.

  • context_tokens_policy (dict[str, Any]) –

    Tokenization policy for deterministic context tokens.

  • context_tokens_meta (dict[str, Any]) –

    Optional metadata for provenance logging.

  • k_claims_value (int) –

    Top-level k_claims value (stored under JSON key "k_claims").

  • extra (dict[str, Any]) –

    Tool knobs and future-compatible fields. Flattened + alias-canonicalized.

Notes

Contract:

  • Core context keys are normalized strings; NA is represented by NA_TOKEN.
  • The neutral disease-like key is "condition" (legacy keys accepted on input).
  • k_claims is top-level only; it is not stored inside extra.
  • extra keeps unknown keys for forward compatibility.

apply_patch

apply_patch(patch)

Apply a patch dictionary and return a new SampleCard.

Parameters:

  • patch (dict[str, Any]) –

    Patch values. Core keys are applied at top-level. Other keys are merged into extra.

Returns:

  • SampleCard

    New SampleCard instance with patch applied.

Notes

Contract enforcement:

  • Never keeps k_claims or its aliases inside extra.
  • Accepts legacy disease-like keys to fill "condition" when missing.

audit_min_gene_overlap

audit_min_gene_overlap(default=1)

Get minimum gene overlap for evidence drift checks.

Parameters:

  • default (int, default: 1 ) –

    Default value, by default 1.

Returns:

  • int

    Minimum overlap threshold.

audit_tau

audit_tau(default=0.8)

Get audit stability tau.

Parameters:

  • default (float, default: 0.8 ) –

    Default value, by default 0.8.

Returns:

  • float

    Tau value used by the audit layer.

claim_mode

claim_mode(default='deterministic')

Get claim generation mode.

Parameters:

  • default (str, default: 'deterministic' ) –

    Default mode, by default "deterministic".

Returns:

  • str

    One of {"deterministic", "llm"}.

context_dict

context_dict()

Return core context keys as a dictionary.

Returns:

  • dict[str, str]

    Mapping from CORE_KEYS to their normalized values.

context_gate_mode

context_gate_mode(default='hard')

Get context gate mode for audit integration.

Parameters:

  • default (str, default: 'hard' ) –

    Default mode, by default "hard".

Returns:

  • str

    Gate mode normalized to {"off", "note", "hard"}.

context_key

context_key()

Build a stable composite context key string.

Returns:

  • str

    "condition|tissue|perturbation|comparison" using normalized fields.

context_tokens

context_tokens()

Compute deterministic context tokens used for anchoring.

Returns:

  • list[str]

    Deterministic token list.

Notes

Priority:

  1) context_tokens_text is tokenized via ctx_tokens_v1.
  2) Fallback: core context fields are concatenated and tokenized.

context_tokens_effective

context_tokens_effective()

Build a provenance payload for logging (pure function).

Returns:

  • dict[str, Any]

    Dictionary containing: version, tokens, n, signature, policy.

context_tokens_signature

context_tokens_signature()

Compute a stable short signature for current context tokens.

Returns:

  • str

    12-hex sha256-based signature.

context_tokens_version

context_tokens_version()

Get effective context tokenization version.

Returns:

  • str

    Policy version string (currently "ctx_tokens_v1").

enable_context_score_proxy

enable_context_score_proxy(default=False)

Get whether proxy context scoring is enabled.

Parameters:

  • default (bool, default: False ) –

    Default behavior, by default False.

Returns:

  • bool

    True if proxy context scoring is enabled.

from_json classmethod

from_json(path)

Load a SampleCard from a JSON file (tool contract).

Parameters:

  • path (str or Path) –

    Path to a JSON file containing a SampleCard object.

Returns:

  • SampleCard

    Parsed and normalized SampleCard instance.

Raises:

  • FileNotFoundError

    If the file does not exist.

  • ValueError

    If JSON is invalid or not a dict-like object.

Notes

Backward compatibility:

  • Accepts legacy disease-like keys and hoists into "condition".
  • Allows k_claims stored in extra or via aliases, but hoists to top-level.
  • Removes k_claims and its aliases from extra on load.
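
The hoisting rules above can be sketched on a plain dictionary. This is an illustration only: hoist_legacy_keys is a hypothetical helper, and the alias "n_claims" and the legacy key "disease" are assumed names, since the real alias lists are defined inside the library.

```python
# Hypothetical alias names; the real alias lists live in the library.
K_CLAIMS_ALIASES = ("k_claims", "n_claims")
LEGACY_CONDITION_KEYS = ("disease",)


def hoist_legacy_keys(raw):
    """Return a copy of a raw card dict with k_claims and condition hoisted."""
    card = dict(raw)
    extra = dict(card.get("extra") or {})

    # Top-level k_claims wins; extra is only a fallback.
    hoisted = card.get("k_claims")
    for alias in K_CLAIMS_ALIASES:
        value = extra.pop(alias, None)  # always removed from extra
        if hoisted is None and value is not None:
            hoisted = value
    if hoisted is not None:
        card["k_claims"] = int(hoisted)

    # A legacy disease-like key fills "condition" only when it is missing.
    for legacy in LEGACY_CONDITION_KEYS:
        value = card.pop(legacy, None)
        if value is not None and not card.get("condition"):
            card["condition"] = value

    card["extra"] = extra
    return card


card = hoist_legacy_keys(
    {"disease": "BRCA", "extra": {"n_claims": 5, "audit_tau": 0.9}}
)
```

Unknown keys such as audit_tau stay in extra untouched, which matches the forward-compatibility intent of the contract.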

hub_frac_thr

hub_frac_thr(default=0.5)

Get hub fraction threshold for ABSTAIN_HUB_BRIDGE gating.

Parameters:

  • default (float, default: 0.5 ) –

    Default threshold, by default 0.5.

Returns:

  • float

    Fraction clamped into [0, 1].

hub_term_degree

hub_term_degree(default=200)

Get hub gene degree threshold for hub-bridge gating.

Parameters:

  • default (int, default: 200 ) –

    Default threshold, by default 200.

Returns:

  • int

    Threshold (>= 1).

k_claims

k_claims(default=3)

Get number of claims to generate.

Parameters:

  • default (int, default: 3 ) –

    Default count, by default 3.

Returns:

  • int

    Number of claims (>= 1).

Notes

Top-level k_claims_value has priority. Extra is fallback only.

max_per_module

max_per_module(default=1)

Get maximum claims per module (diversity control).

Parameters:

  • default (int, default: 1 ) –

    Default value, by default 1.

Returns:

  • int

    Maximum per module (>= 1).

min_union_genes

min_union_genes(default=3)

Get minimum union evidence genes required for support.

Parameters:

  • default (int, default: 3 ) –

    Default minimum, by default 3.

Returns:

  • int

    Minimum union size (>= 1).

pass_notes

pass_notes(default=True)

Decide whether to emit compact notes for PASS rows.

Parameters:

  • default (bool, default: True ) –

    Default behavior, by default True.

Returns:

  • bool

    True if PASS rows may receive a short note (e.g., "ok").

preselect_tau_gate

preselect_tau_gate(default=False)

Get whether preselection should apply a tau gate.

Parameters:

  • default (bool, default: False ) –

    Default behavior, by default False.

Returns:

  • bool

    True if preselection tau gating is enabled.

stability_gate_mode

stability_gate_mode(default='hard')

Get stability gate mode.

Parameters:

  • default (str, default: 'hard' ) –

    Default mode, by default "hard".

Returns:

  • str

    Gate mode normalized to {"off", "note", "hard"}.

stress_gate_mode

stress_gate_mode(default='off')

Get stress gate mode for audit integration.

Parameters:

  • default (str, default: 'off' ) –

    Default mode, by default "off".

Returns:

  • str

    Gate mode normalized to {"off", "note", "hard"}.

strict_evidence_check

strict_evidence_check(default=False)

Get strict evidence linkage policy.

Parameters:

  • default (bool, default: False ) –

    Default behavior, by default False.

Returns:

  • bool

    If True, missing evidence linkage becomes schema violation in audit.

to_json

to_json(path, *, indent=2)

Serialize this SampleCard to JSON.

Parameters:

  • path (str or Path) –

    Output path.

  • indent (int, default: 2 ) –

    JSON indentation level, by default 2.

Returns:

  • None

    Writes the file.

Notes

Uses model_dump(by_alias=True) so the JSON key is "k_claims".

trust_input_survival

trust_input_survival(default=False)

Decide whether to trust survival values provided in inputs.

Parameters:

  • default (bool, default: False ) –

    Default behavior, by default False.

Returns:

  • bool

    True if tool should trust input survival rather than recomputing.

Claim schema (typed JSON)

Claims are schema-bounded decision objects with resolvable evidence links (term/module identifiers + hashes). Free-text narratives are not treated as evidence.

llm_pathway_curator.claim_schema

Typed, auditable claim schema for LLM-PathwayCurator.

This module defines strict Pydantic models for:

  • Evidence references (term IDs, optional gene IDs, module ID)
  • Typed claims (entity, direction, context keys)
  • Audit decisions (PASS/ABSTAIN/FAIL + reason codes)

Design:

  • Claim and evidence identifiers are tool-owned and deterministic.
  • Free-text evidence is disallowed; evidence must be referenced by IDs.
  • Optional context review fields are supported for audit gating.

Notes
  • Status vocabulary is intentionally strict to keep denominators auditable.
  • Gene ID casing is preserved for display; hashing follows tool-wide spec.

AuditedClaim

Bases: BaseModel

Stable audited container.

Attributes:

  • claim (Claim) –

    Typed claim object.

  • decision (Decision) –

    Mechanical decision and reason codes.

Notes

This object is intended as the unit of record for JSONL reports.

Claim

Bases: BaseModel

Typed claim with auditable evidence linkage.

Attributes:

  • claim_id (str) –

    Tool-owned stable identifier. If empty, it is filled deterministically.

  • entity (str) –

    Stable entity identifier (prefer IDs over free text).

  • direction ({'up', 'down', 'na'}) –

    Canonical direction token.

  • context_keys (list of {"condition", "tissue", "perturbation", "comparison"}) –

    Keys the claim is conditioned on. Values live in SampleCard.

  • evidence_ref (EvidenceRef) –

    Evidence reference (IDs only; no free-text evidence).

Optional context review fields

  • context_evaluated (bool) –

    Whether context relevance review was executed.

  • context_method ({"llm", "proxy", "none"}) –

    Method used for context review.

  • context_status ({"PASS", "WARN", "FAIL"} or None) –

    Result of context review.

  • context_reason (str or None) –

    Short reason (length-limited).

  • context_notes (str or None) –

    Additional notes (length-limited).

Notes

Invariants enforced:

  • If context_evaluated is False: method="none" and status/reason/notes are cleared.
  • If context_evaluated is True: method must be "llm" or "proxy" and status must be provided.
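
To make the two invariants concrete, here is a minimal sketch using a standard-library dataclass in place of the Pydantic model. ContextReview is a hypothetical stand-in, not the library's Claim class, and the real validators may differ in detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContextReview:
    """Hypothetical stand-in for the optional context review fields."""

    context_evaluated: bool = False
    context_method: str = "none"
    context_status: Optional[str] = None
    context_reason: Optional[str] = None
    context_notes: Optional[str] = None

    def __post_init__(self):
        if not self.context_evaluated:
            # Not evaluated: force method="none" and clear the result fields.
            self.context_method = "none"
            self.context_status = None
            self.context_reason = None
            self.context_notes = None
            return
        if self.context_method not in ("llm", "proxy"):
            raise ValueError("context_method must be 'llm' or 'proxy'")
        if self.context_status not in ("PASS", "WARN", "FAIL"):
            raise ValueError("context_status is required when evaluated")


cleared = ContextReview(
    context_evaluated=False, context_method="llm", context_status="PASS"
)
reviewed = ContextReview(
    context_evaluated=True, context_method="proxy", context_status="WARN"
)
```

In this sketch, stale review fields on an unevaluated claim are silently cleared, while an evaluated claim without a valid method or status is rejected.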

Decision

Bases: BaseModel

Mechanical audit decision for a claim.

Attributes:

  • status ({'PASS', 'ABSTAIN', 'FAIL'}) –

    Final decision label.

  • reason (str) –

    Reason code. Must be "ok" or one of ALL_REASONS.

  • details (dict) –

    Optional structured metadata for debugging or reporting.

Raises:

  • ValueError

    If reason is not in the allowed vocabulary.

EvidenceRef

Bases: BaseModel

Evidence reference container (strict, tool-friendly).

Attributes:

  • term_ids (list of str) –

    Required. One or more term UID strings that define evidence.

  • gene_set_hash (str) –

    Optional input. If missing/invalid, it is deterministically filled from gene_ids when available, else from term_ids as a fallback.

  • gene_ids (list of str) –

    Optional. Evidence genes for display and hashing (tool spec).

  • module_id (str) –

    Optional. Module identifier for module-level evidence.

Notes
  • gene_set_hash must be a 12-hex digest (sha256[:12]).
  • Extra fields are allowed to support non-breaking provenance flags (e.g., gene_set_hash_source).
  • Term IDs are not uppercased.
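
The fill rule for gene_set_hash (keep a valid 12-hex value, else derive from gene_ids, else from term_ids) can be sketched as follows. The canonical form used here (strip, de-duplicate, sort, comma-join, no case change) and the lowercasing of a supplied hash are assumptions for illustration; the actual hashing follows the tool-wide spec, and short_hash and fill_gene_set_hash are hypothetical names.

```python
import hashlib


def short_hash(tokens):
    """sha256 over sorted, de-duplicated tokens, truncated to 12 hex chars."""
    # Assumed canonical form: strip, drop empties, sort, join with commas.
    canon = ",".join(sorted({t.strip() for t in tokens if t.strip()}))
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()[:12]


def fill_gene_set_hash(term_ids, gene_ids=None, gene_set_hash=None):
    """Keep a valid 12-hex hash, else derive one from gene_ids or term_ids."""
    given = (gene_set_hash or "").strip().lower()
    if len(given) == 12 and all(c in "0123456789abcdef" for c in given):
        return given
    return short_hash(gene_ids) if gene_ids else short_hash(term_ids)


h_genes = fill_gene_set_hash(["GO:0006915"], gene_ids=["TP53", "BAX"])
h_reordered = fill_gene_set_hash(["GO:0006915"], gene_ids=["BAX", "TP53"])
h_fallback = fill_gene_set_hash(["GO:0006915"], gene_set_hash="not-a-hash")
```

Sorting before hashing makes the digest independent of gene order, which is what lets the hash act as a stable identity for a gene set.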

Core stages (A → B → C)

A) Stability distillation (evidence hygiene)

Generates stability proxies from supporting-gene perturbations (e.g., LOO/jackknife-like survival scores). This stage does not decide PASS/ABSTAIN/FAIL.

llm_pathway_curator.distill

distill_evidence

distill_evidence(evidence, card, *, seed=None)

Distill evidence into stability/provenance features (A-stage; deterministic).

This function performs evidence hygiene and produces per-term stability proxies without re-running enrichment. Two modes are supported:

  • evidence_perturb (default): perturb evidence genes deterministically and compute term survival as the fraction of perturbations that preserve evidence similarity.
  • replicates_proxy: compute proxy survival from replicate-stacked evidence tables (requires replicate_id; not true patient-level re-run LOO enrichment).

Parameters:

  • evidence (DataFrame) –

    Normalized EvidenceTable-like dataframe with required columns: term_id, term_name, source, stat, qval, direction, evidence_genes.

  • card (SampleCard) –

    Sample card controlling distill knobs under extra (namespaced as distill_*).

  • seed (int or None, default: None ) –

    Global seed for deterministic per-term perturbations.

Returns:

  • DataFrame

    Distilled table containing stable join keys (term_uid), TSV-friendly genes, survival fields, and knob provenance columns used by downstream modules/audit/report.

Raises:

  • ValueError

    If required columns are missing, stat is non-numeric, evidence_genes is empty, or replicates_proxy is requested but replicate requirements are not met.

Notes
  • This stage measures stability and records provenance; it does not decide PASS/ABSTAIN/FAIL.
  • Contract-critical: term×gene must be preserved post-masking (≥1 evidence gene per term).
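
As a rough illustration of the evidence_perturb idea (survival as the fraction of perturbations that preserve evidence similarity), the sketch below drops genes at random under a fixed seed and counts how often the Jaccard similarity to the original set stays above a cutoff. All names and parameters (term_survival, n_perturb, dropout_p, min_keep, sim_min) are hypothetical; the library's perturbation scheme and similarity measure may differ.

```python
import random


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def term_survival(genes, *, n_perturb=100, dropout_p=0.2, min_keep=1,
                  sim_min=0.7, seed=0):
    """Fraction of gene-dropout perturbations that keep Jaccard >= sim_min."""
    base = sorted(set(genes))
    floor = min(min_keep, len(base))
    rng = random.Random(seed)  # seeded, so repeated calls agree
    survived = 0
    for _ in range(n_perturb):
        kept = [g for g in base if rng.random() >= dropout_p]
        # Never mask a term down to zero genes: top up to the floor.
        if len(kept) < floor:
            pool = [g for g in base if g not in kept]
            kept += rng.sample(pool, floor - len(kept))
        if jaccard(set(kept), set(base)) >= sim_min:
            survived += 1
    return survived / n_perturb


genes = ["TP53", "BAX", "BCL2", "CASP3", "CASP9", "MDM2"]
s1 = term_survival(genes, seed=7)
s2 = term_survival(genes, seed=7)
```

The top-up step reflects the contract note above: every term keeps at least one evidence gene after masking.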

B) Evidence modules (term–gene factorization)

Constructs the term–gene bipartite graph and extracts evidence modules that preserve shared vs distinct support. This stage does not decide PASS/ABSTAIN/FAIL.

llm_pathway_curator.modules

ModuleOutputs dataclass

ModuleOutputs(modules_df, term_modules_df, edges_df)

Container for module factorization outputs.

Attributes:

  • modules_df (DataFrame) –

    Per-module summary table. One row per module_id. Contains stable hashes (terms/genes/content) and representative genes, plus optional survival fields if computed upstream.

  • term_modules_df (DataFrame) –

    Term-to-module assignment table. Contract: one module_id per term_uid.

  • edges_df (DataFrame) –

    Filtered term-by-gene edge table used for module construction. Columns: term_uid, gene_id, weight. Additional debug/provenance lives in edges_df.attrs.

attach_module_drift_stress_tag

attach_module_drift_stress_tag(
    distilled_df,
    drift_df,
    *,
    term_id_col="term_uid",
    stress_col="stress_tag",
    tag="module_drift",
)

Annotate terms with a stress tag when module assignment drifted.

Parameters:

  • distilled_df (DataFrame) –

    Distilled evidence table with term_id_col and an optional stress tag column.

  • drift_df (DataFrame) –

    Drift table containing term_id_col and module_drift (bool).

  • term_id_col (str, default: 'term_uid' ) –

    Term identifier column name (default "term_uid").

  • stress_col (str, default: 'stress_tag' ) –

    Column name used to store stress tags (default "stress_tag").

  • tag (str, default: 'module_drift' ) –

    Tag value to append when drift is detected (default "module_drift").

Returns:

  • DataFrame

    Copy of distilled_df with updated stress_col. Existing tags are preserved and the new tag is appended if missing.

Raises:

  • ValueError

    If required columns are missing.

Notes
  • Does not overwrite non-empty tags; it appends.
  • Tag splitting/joining is delegated to _shared.split_tags and _shared.join_tags.

attach_module_ids

attach_module_ids(
    evidence_df,
    term_modules_df,
    *,
    term_id_col="term_uid",
    modules_df=None,
)

Attach module identifiers to an evidence table by term_uid.

Parameters:

  • evidence_df (DataFrame) –

    Evidence table that includes term_id_col (typically "term_uid").

  • term_modules_df (DataFrame) –

    Term-to-module table with columns term_id_col and module_id.

  • term_id_col (str, default: 'term_uid' ) –

    Join key column name for term identifiers.

  • modules_df (DataFrame | None, default: None ) –

    Optional per-module table. If provided, module-level survival fields are joined onto each term row.

Returns:

  • DataFrame

    Copy of evidence_df with module_id, module_id_missing (bool), and, optionally, module survival columns if modules_df was provided.

Raises:

  • ValueError

    If required columns are missing.

build_term_gene_edges

build_term_gene_edges(
    evidence_df,
    *,
    term_id_col="term_uid",
    genes_col="evidence_genes",
)

Build term-by-gene bipartite edges from an evidence table.

Parameters:

  • evidence_df (DataFrame) –

    Evidence table containing at least a term identifier column and a gene evidence column.

  • term_id_col (str, default: 'term_uid' ) –

    Column name for the term identifier in evidence_df.

  • genes_col (str, default: 'evidence_genes' ) –

    Column name for evidence genes in evidence_df. Values can be list-like (preferred) or legacy scalar strings.

Returns:

  • DataFrame

    Edge table with columns:

      • term_uid : str
      • gene_id : str
      • weight : float

    The returned DataFrame also stores a small provenance dict under out.attrs["edges"].

Raises:

  • ValueError

    If required columns are missing.

Notes
  • Empty/invalid gene lists produce no edges and are dropped.
  • List-like gene inputs are processed via vectorized explode.
  • Scalar/string inputs are parsed via _shared.parse_genes.
  • Duplicate (term_uid, gene_id) edges are summed into a single row with weight equal to the multiplicity.
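
The edge-building rules in these notes can be sketched without pandas. This is a simplified illustration using plain tuples, not the vectorized implementation; build_edges is a hypothetical name.

```python
from collections import Counter


def build_edges(rows):
    """Build (term_uid, gene_id, weight) edges from (term_uid, genes) rows."""
    counts = Counter()
    for term_uid, genes in rows:
        for gene in genes or []:  # empty or missing gene lists add no edges
            gene = str(gene).strip()
            if gene:
                counts[(term_uid, gene)] += 1
    # Duplicate pairs collapse into one edge; weight is the multiplicity.
    return [(t, g, float(w)) for (t, g), w in sorted(counts.items())]


edges = build_edges([
    ("T1", ["TP53", "BAX", "TP53"]),
    ("T2", ["BAX"]),
    ("T3", []),
])
```

The term T3 has no genes and therefore contributes no edges, while the repeated TP53 under T1 yields a single edge with weight 2.0.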

compute_term_module_drift

compute_term_module_drift(
    baseline_term_modules_df,
    stressed_term_modules_df,
    *,
    term_id_col="term_uid",
)

Compute per-term drift of module assignment under stress.

Parameters:

  • baseline_term_modules_df (DataFrame) –

    Baseline term-to-module assignments.

  • stressed_term_modules_df (DataFrame) –

    Stressed term-to-module assignments.

  • term_id_col (str, default: 'term_uid' ) –

    Term identifier column name (default "term_uid").

Returns:

  • DataFrame

    Drift table with columns: term_uid, module_id_base, module_id_stress, module_drift (bool).

Raises:

  • ValueError

    If inputs do not have required columns or violate the one-term-one-module contract.
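
A minimal sketch of the drift comparison, using dictionaries so that the one-module-per-term contract holds by construction. term_module_drift is a hypothetical helper; treating a term that is missing on one side as drifted, and computing term_drift_rate as drifted terms over all terms, are assumptions.

```python
def term_module_drift(baseline, stressed):
    """Compare term_uid -> module_id mappings from two runs."""
    rows = []
    for term_uid in sorted(set(baseline) | set(stressed)):
        base = baseline.get(term_uid)
        stress = stressed.get(term_uid)
        rows.append({
            "term_uid": term_uid,
            "module_id_base": base,
            "module_id_stress": stress,
            # Assumption: a term missing on either side counts as drifted.
            "module_drift": base != stress,
        })
    return rows


drift = term_module_drift(
    {"T1": "M_a", "T2": "M_a", "T3": "M_b"},
    {"T1": "M_a", "T2": "M_c", "T3": "M_b"},
)
n_drift = sum(r["module_drift"] for r in drift)
term_drift_rate = n_drift / len(drift)
```

Because module_id is content-derived rather than positional, a changed identifier indicates that the module's content changed, not that components were merely renumbered.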

factorize_modules_connected_components

factorize_modules_connected_components(
    evidence_df,
    *,
    method="term_jaccard_cc",
    module_prefix="M",
    max_gene_term_degree=None,
    max_term_degree=None,
    hub_degree_quantile=0.995,
    min_shared_genes=3,
    jaccard_min=0.1,
    term_id_col="term_uid",
    genes_col="evidence_genes",
    sparsity_mode="auto",
    shared_pos_target=0.1,
    sparse_relax_min_shared_genes=2,
    sparse_relax_jaccard_min=0.02,
    pair_sample_max=200000,
    seed=42,
)

Factorize enrichment evidence into stable "evidence modules".

This constructs a term-by-gene bipartite graph from an evidence table and groups related terms into modules. Module identity is stable: module_id is derived from a content hash of (terms, genes).

Parameters:

  • evidence_df (DataFrame) –

    Evidence table containing term identifiers and evidence genes.

  • method (ModuleMethod, default: 'term_jaccard_cc' ) –

    Module construction method:

      • "term_jaccard_cc": connected components on a term-term graph derived from shared genes (recommended).
      • "bipartite_cc": connected components on the bipartite graph (legacy).

  • module_prefix (str, default: 'M' ) –

    Prefix prepended to the module_id (default "M").

  • max_gene_term_degree (int | None, default: None ) –

    If set, removes genes whose term-degree is strictly greater than this threshold before module construction.

  • max_term_degree (int | None, default: None ) –

    Deprecated alias for max_gene_term_degree.

  • hub_degree_quantile (float | None, default: 0.995 ) –

    If not None and explicit thresholds are not given, infer the hub degree threshold from the specified quantile of gene term-degree.

  • min_shared_genes (int, default: 3 ) –

    Minimum shared genes for term-term edges (term_jaccard_cc).

  • jaccard_min (float, default: 0.1 ) –

    Minimum Jaccard similarity for term-term edges (term_jaccard_cc).

  • term_id_col (str, default: 'term_uid' ) –

    Column name in evidence_df holding the term identifier. The pipeline convention is "term_uid".

  • genes_col (str, default: 'evidence_genes' ) –

    Column name in evidence_df holding evidence genes.

  • sparsity_mode (Literal['auto', 'off'], default: 'auto' ) –

    If "auto", relaxes thresholds for sparse graphs and may tighten thresholds to avoid giant-component collapse.

  • shared_pos_target (float, default: 0.1 ) –

    Target lower bound for P(shared_genes > 0) under auto sparsity tuning.

  • sparse_relax_min_shared_genes (int, default: 2 ) –

    Relaxed min_shared_genes used when sparsity is detected.

  • sparse_relax_jaccard_min (float, default: 0.02 ) –

    Relaxed jaccard_min used when sparsity is detected.

  • pair_sample_max (int, default: 200000 ) –

    Maximum number of term pairs sampled for sparsity diagnostics.

  • seed (int, default: 42 ) –

    Random seed for sampling-based diagnostics.

Returns:

  • ModuleOutputs

    Object containing:

      • modules_df: per-module summary table
      • term_modules_df: term_uid -> module_id assignments (one per term)
      • edges_df: filtered edge table used to build modules

Raises:

  • ValueError

    If an unknown method is requested, required columns are missing, or the term->module contract is violated.

Notes
  • Hub filtering and sparsity/giant-component heuristics are recorded in edges_df.attrs["modules"] for reproducibility and debugging.
  • module_id is stable and derived from module content, not from component numbering.
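
The core of the "term_jaccard_cc" method, as described above, is to link two terms when they share enough genes and their Jaccard similarity is high enough, then take connected components. The sketch below illustrates only that idea with a small union-find. jaccard_components is a hypothetical name, and hub filtering, sparsity tuning, and content-hash module IDs are left out.

```python
from itertools import combinations


def jaccard_components(term_genes, min_shared_genes=3, jaccard_min=0.1):
    """Group terms into connected components of a term-term overlap graph."""
    parent = {t: t for t in term_genes}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for a, b in combinations(sorted(term_genes), 2):
        shared = term_genes[a] & term_genes[b]
        union = term_genes[a] | term_genes[b]
        # Link two terms only when both thresholds are met.
        if (len(shared) >= min_shared_genes and union
                and len(shared) / len(union) >= jaccard_min):
            parent[find(a)] = find(b)

    groups = {}
    for t in sorted(term_genes):
        groups.setdefault(find(t), []).append(t)
    return sorted(groups.values())


modules = jaccard_components({
    "T1": {"A", "B", "C", "D"},
    "T2": {"B", "C", "D", "E"},
    "T3": {"X", "Y", "Z"},
    "T4": {"A", "X"},
})
```

T4 shares one gene with T1 and one with T3, so a plain bipartite connected-components pass would merge all four terms; the min_shared_genes threshold keeps T4 from acting as a bridge.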

filter_hub_genes

filter_hub_genes(
    edges, *, max_gene_term_degree=200, max_term_degree=None
)

Remove hub genes that connect too many terms (high gene term-degree).

Parameters:

  • edges (DataFrame) –

    Edge table with columns term_uid and gene_id.

  • max_gene_term_degree (int | None, default: 200 ) –

    Hub threshold. Genes with term-degree strictly greater than this value are removed. If None, no hub filtering is applied.

  • max_term_degree (int | None, default: None ) –

    Deprecated alias for max_gene_term_degree. If provided and max_gene_term_degree is None, it is used as the threshold.

Returns:

  • DataFrame

    Filtered edge table. Hub filter metadata is recorded in out.attrs["hub_filter"].

Raises:

  • ValueError

    If edges does not have the required columns.

Notes

The filter uses a strict condition: degree > threshold (not >=).
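
The strict comparison matters at the boundary: a gene whose term-degree equals the threshold is kept. A minimal sketch on plain (term_uid, gene_id) tuples, with drop_hub_genes as a hypothetical name:

```python
from collections import defaultdict


def drop_hub_genes(edges, max_gene_term_degree=200):
    """Remove edges whose gene touches more than the threshold of terms."""
    if max_gene_term_degree is None:
        return list(edges)  # no hub filtering
    terms_per_gene = defaultdict(set)
    for term_uid, gene_id in edges:
        terms_per_gene[gene_id].add(term_uid)
    # Strict comparison: a gene exactly at the threshold is kept.
    hubs = {g for g, terms in terms_per_gene.items()
            if len(terms) > max_gene_term_degree}
    return [(t, g) for t, g in edges if g not in hubs]


edges = [("T1", "UBC"), ("T2", "UBC"), ("T3", "UBC"),
         ("T1", "BAX"), ("T2", "BAX")]
kept = drop_hub_genes(edges, max_gene_term_degree=2)
```

With a threshold of 2, UBC (degree 3) is removed and BAX (degree exactly 2) survives.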

summarize_module_drift

summarize_module_drift(drift_df)

Summarize module drift statistics.

Parameters:

  • drift_df (DataFrame) –

    Output of compute_term_module_drift with required columns: term_uid, module_id_base, module_id_stress, module_drift.

Returns:

  • dict

    Summary metrics including:

      • n_terms_total, n_terms_drift, term_drift_rate
      • n_modules_base, n_modules_stress, n_modules_shared
      • module_churn_rate

C1) Proposal (deterministic baseline / LLM proposal-only)

Proposes typed, evidence-linked candidate claims from distilled evidence and modules. Final acceptance is not decided here.

llm_pathway_curator.select

select_claims

select_claims(
    distilled,
    card,
    *,
    k=50,
    mode=None,
    backend=None,
    claim_backend=None,
    review_backend=None,
    context_gate_mode="soft",
    context_review_mode="off",
    seed=None,
    outdir=None,
    **kwargs,
)

C1: Propose schema-locked pathway claims from distilled evidence.

Parameters:

  • distilled (DataFrame) –

    Distilled evidence table (optionally with module_id and context fields).

  • card (SampleCard) –

    Sample card providing context and selection knobs.

  • k (int, default: 50 ) –

    Number of claims to propose.

  • mode (str or None, default: None ) –

    "deterministic" or "llm". If None, resolved from env/card.

  • backend (BaseLLMBackend or None, default: None ) –

    Backend used for LLM claim proposal when mode="llm".

  • claim_backend (BaseLLMBackend or None, default: None ) –

    Reserved for role-based backends (currently not required here).

  • review_backend (BaseLLMBackend or None, default: None ) –

    Backend used for LLM context review (shortlist-only).

  • context_gate_mode (str, default: 'soft' ) –

    Public API legacy default is "soft". Canonical gate modes are off/note/hard; "soft" is ignored to preserve old behavior.

  • context_review_mode (str, default: 'off' ) –

    "off" or "llm". When "llm", fills pipeline-owned context fields before ranking / proposal.

  • seed (int or None, default: None ) –

    Seed for deterministic tie-breaks and optional stress probes.

  • outdir (str or None, default: None ) –

    Output directory for small caches and artifacts.

  • **kwargs (Any, default: {} ) –

    Forward-compatible extra arguments (ignored here).

Returns:

  • DataFrame

    Proposed claims table. Includes decision-grade claim_json that embeds EvidenceRef with gene_ids and gene_set_hash.

Notes

Selection-time context knobs (env):

  • LLMPATH_SELECT_CONTEXT_MODE = off|proxy|review
  • LLMPATH_SELECT_CONTEXT_GATE_MODE = off|note|hard

Pipeline-owned context review columns (if present) are never overwritten except when LLM review is requested and the existing method is not "llm".

llm_pathway_curator.llm_claims

LLM-based claim proposal for LLM-PathwayCurator.

This module proposes structured Claim objects from distilled evidence using an LLM backend. It is designed to be:
  • contract-driven (stable IDs, deterministic evidence linking),
  • robust across heterogeneous backends (OpenAI/Gemini/Ollama/local),
  • audit-grade (persist prompt/candidates/raw/meta artifacts).

Key ideas
  • Evidence identity is tool-owned (term_uid + gene_set_hash).
  • Context VALUES are prompt-facing; context KEYS are contract-facing.
  • FAIL decisions are never "promoted" by thresholding; gating affects non-FAIL.
Notes

This file contains many private helpers. Public entrypoints:
  • propose_claims_llm
  • claims_to_proposed_tsv

LLMClaimResult dataclass

LLMClaimResult(
    claims, raw_text, used_fallback, notes, meta
)

Container for LLM claim proposal results.

Attributes:

  • claims (list[Claim]) –

    Validated and post-processed claims. Empty if failure/fallback.

  • raw_text (str) –

    Raw JSON text persisted for audit/debug.

  • used_fallback (bool) –

    True if LLM output was unusable or a soft-error occurred.

  • notes (str) –

    Compact status note (e.g., "ok", "post_validate_failed: ...").

  • meta (dict[str, Any]) –

    Metadata used for reproducibility (k, top_n, hashes, backend class, etc.).

build_claim_prompt

build_claim_prompt(*, card, candidates, k)

Build a compact JSON-only prompt for proposing claims.

Parameters:

  • card (SampleCard) –

    Sample card providing context values and stable context keys.

  • candidates (DataFrame) –

    Candidate evidence rows (top_n pool) used as the ONLY selectable source. Expected columns include term_uid, term_id, term_name, direction, and optionally term_survival and gene_ids_suggest/evidence_genes.

  • k (int) –

    Target number of claims to request from the model.

Returns:

  • str

    Prompt string instructing the model to return valid JSON only.

Notes

The prompt enforces copy-exact rules for:
  • entity == term_id
  • evidence_ref.term_ids == [term_uid]

Context values are prompt-facing only; identity uses context KEYS.

claims_to_proposed_tsv

claims_to_proposed_tsv(
    *, claims, distilled_with_modules, card
)

Convert proposed claims into a flat TSV-like DataFrame for export.

Parameters:

  • claims (list[Claim]) –

    Proposed claims (typically from propose_claims_llm).

  • distilled_with_modules (DataFrame) –

    Distilled evidence table used to enrich exported rows with term metadata.

  • card (SampleCard) –

    Sample card providing context values (export columns).

Returns:

  • DataFrame

    Row-wise export with fields including: claim_id, entity, direction, context_keys, term_uid, module_id, gene_ids, term_ids, gene_set_hash, and serialized claim_json.

Notes

Context VALUES are exported as columns for convenience, but MUST NOT be baked into identity (claim_id / gene_set_hash).

propose_claims_llm

propose_claims_llm(
    *,
    distilled_with_modules,
    card,
    backend,
    k,
    seed=None,
    outdir=None,
    artifact_tag=None,
)

Propose claims via an LLM and write audit-grade artifacts.

Parameters:

  • distilled_with_modules (DataFrame) –

    Distilled evidence table with module information (or sufficient columns to derive term_uid).

    Must contain:

    • term_uid OR (source, term_id)
    • term_id, term_name, source

    Optional:

    • module_id, gene_set_hash
    • evidence_genes / evidence_genes_str / gene_ids_suggest
    • keep_term, term_survival, stat, context_score

  • card (SampleCard) –

    Sample card providing prompt context and contract keys.

  • backend (BaseLLMBackend) –

    LLM backend adapter.

  • k (int) –

    Target number of claims.

  • seed (int or None, default: None ) –

    Optional seed (best-effort; may be ignored).

  • outdir (str or None, default: None ) –

    Output directory for artifacts. If None, no artifacts are written.

  • artifact_tag (str or None, default: None ) –

    Optional tag to avoid overwriting per-call artifacts.

Returns:

  • LLMClaimResult

    Claims and metadata. On failure, claims may be empty and used_fallback True.

Raises:

  • ValueError

    If required columns are missing.

  • RuntimeError

    If LLM is required by contract and call/validation fails.

Notes

Artifacts (when outdir is set):
  • llm_claims.prompt.json
  • llm_claims.candidates.json
  • llm_claims.raw.json
  • llm_claims.meta.json

Tagged variants are written in addition when artifact_tag is provided.

C2) Mechanical audit (decider)

Assigns PASS/ABSTAIN/FAIL with precedence (FAIL > ABSTAIN > PASS) using predefined audit gates. Produces standardized reason codes and audit logs.

llm_pathway_curator.audit

audit_claims

audit_claims(claims, distilled, card, *, tau=None)

Mechanically audit claims against distilled evidence and sample context.

Parameters:

  • claims (DataFrame) –

    Claims table. Must include claim_json with Claim schema JSON.

  • distilled (DataFrame) –

    Distilled evidence table. Must provide term linkage via term_uid or (source, term_id). Evidence genes are read from evidence_genes or evidence_genes_str. Stability uses term_survival when available.

  • card (SampleCard) –

    Sample card providing audit knobs and gate modes.

  • tau (float or None, default: None ) –

    Override stability tau. If None, uses card.audit_tau().

Returns:

  • DataFrame

    Audited claims with status, reasons, and audit notes.

Raises:

  • ValueError

    If distilled cannot provide term linkage (missing required columns).

Notes

Status priority is: FAIL > ABSTAIN > PASS.

Major checks:
  • Linkage: term_id -> term_uid resolution; reject unknown/ambiguous terms.
  • Evidence identity: gene_set_hash match against computed union evidence genes.
  • Stability: term-level survival aggregation (min across referenced terms).
  • Under-support: minimum union evidence genes.
  • Hub-bridge: abstain when evidence is dominated by hub genes.
  • Context gate: uses claim schema context review, with optional proxy fallback.
  • Stress probes: optional internal dropout and contradiction probes and/or external stress columns; treated as ABSTAIN (inconclusive), not FAIL.
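The precedence rule (FAIL > ABSTAIN > PASS) can be illustrated with a small standalone sketch. It does not call audit_claims; combine_status is a hypothetical helper, and treating an empty gate list as PASS is an assumption of the sketch rather than documented behavior.

```python
from typing import Iterable

# Severity ranking that mirrors the documented precedence: FAIL > ABSTAIN > PASS.
_RANK = {"PASS": 0, "ABSTAIN": 1, "FAIL": 2}


def combine_status(gate_outcomes: Iterable[str]) -> str:
    """Return the most severe status among per-gate outcomes.

    Hypothetical helper. An empty sequence yields "PASS"; that default is an
    assumption made for this sketch.
    """
    worst = "PASS"
    for outcome in gate_outcomes:
        if outcome not in _RANK:
            raise ValueError(f"unknown status: {outcome!r}")
        if _RANK[outcome] > _RANK[worst]:
            worst = outcome
    return worst
```

Under this rule a single FAIL from any gate decides the claim, and an inconclusive stress probe (ABSTAIN) can only downgrade a PASS.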

llm_pathway_curator.audit_reasons

is_abstain_reason

is_abstain_reason(code)

Check whether a reason code is an ABSTAIN reason.

Parameters:

  • code (str) –

    Reason code string.

Returns:

  • bool

    True if code is in ABSTAIN_REASONS, otherwise False.

Notes

ABSTAIN_REASONS is part of the paper's reproducible output contract and should remain stable.

is_decision_reason

is_decision_reason(code)

Check whether a string is a valid decision reason code.

This includes the sentinel "ok" as well as all known FAIL/ABSTAIN reason codes.

Parameters:

  • code (str) –

    Decision reason code.

Returns:

  • bool

    True if code is "ok" or is included in ALL_REASONS, otherwise False.

is_fail_reason

is_fail_reason(code)

Check whether a reason code is a FAIL reason.

Parameters:

  • code (str) –

    Reason code string.

Returns:

  • bool

    True if code is in FAIL_REASONS, otherwise False.

Notes

FAIL_REASONS is part of the paper's reproducible output contract and should remain stable.

is_known_reason

is_known_reason(code)

Check whether a reason code is known by this module.

Parameters:

  • code (str) –

    Reason code string.

Returns:

  • bool

    True if code is in ALL_REASONS, otherwise False.

Notes

ALL_REASONS excludes "ok" by design. Use is_decision_reason() when you want to accept the "ok" sentinel.
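The relationship between the four predicates can be sketched as follows. The reason codes are placeholders invented for illustration (the real sets live in llm_pathway_curator.audit_reasons and are not reproduced here), and modeling ALL_REASONS as the union of the FAIL and ABSTAIN sets is an assumption.

```python
# Placeholder codes for illustration only; they are NOT the library's real codes.
FAIL_REASONS = frozenset({"example_fail_code"})
ABSTAIN_REASONS = frozenset({"example_abstain_code"})

# Assumption: ALL_REASONS is the union of both sets. "ok" is deliberately absent.
ALL_REASONS = FAIL_REASONS | ABSTAIN_REASONS


def is_known_reason(code: str) -> bool:
    """True only for FAIL/ABSTAIN codes; the "ok" sentinel is not a known reason."""
    return code in ALL_REASONS


def is_decision_reason(code: str) -> bool:
    """True for the "ok" sentinel and for every known reason code."""
    return code == "ok" or code in ALL_REASONS
```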

C3) Reporting (decision-grade outputs)

Writes decision objects (report.jsonl / report.md) and renders audit logs with provenance.

llm_pathway_curator.report

write_report

write_report(audit_log, distilled, card, outdir)

Write a human-facing markdown report and TSV artifacts.

Outputs
  • out/report.md (human-facing summary)
  • out/audit_log.tsv (canonicalized audit log)
  • out/distilled.tsv (stringified distilled evidence table)
  • out/risk_coverage.tsv (optional; when calibration functions exist)

Parameters:

  • audit_log (DataFrame) –

    Audit log DataFrame containing PASS/ABSTAIN/FAIL outcomes and supporting fields.

  • distilled (DataFrame) –

    Distilled evidence table DataFrame.

  • card (SampleCard) –

    SampleCard providing analysis context (condition/tissue/etc.).

  • outdir (str) –

    Output directory path.

Returns:

  • None
Notes
  • This function does NOT write report.jsonl. JSONL export is explicit via write_report_jsonl(...).
  • Gene symbol mapping in this report is DISPLAY-ONLY: it does not affect auditing or evidence identity.
  • The report remains best-effort and will fall back to a minimal report if required decision columns are missing.

write_report_jsonl

write_report_jsonl(
    audit_log,
    card,
    outdir,
    *,
    run_id,
    method=None,
    tau=None,
    condition=None,
    comparison=None,
    cancer=None,
    disease=None,
)

Write an audit-grade JSONL report artifact (out/report.jsonl).

This export is designed to be robust and reproducible: - Accepts claim_json or common fallbacks as the payload source. - If typed Claim validation fails, emits a minimal stub instead of crashing. - Missing metric columns do not crash the export (nulls are emitted).

Parameters:

  • audit_log (DataFrame) –

    Audit log DataFrame. Required columns: - status - claim JSON payload column (one of: claim_json, claim_json_str, claim_json_raw). If missing, the payload is synthesized from audit-log columns when possible.

  • card (SampleCard) –

    SampleCard used to supply context defaults and optional metadata.

  • outdir (str) –

    Output directory path.

  • run_id (str) –

    Run identifier string. If empty, a UTC timestamp is used.

  • method (str | None, default: None ) –

    Method label. Default is "llm-pathway-curator".

  • tau (float | None, default: None ) –

    Tau value to store in the JSONL. If None, resolves from card.

  • condition (str | None, default: None ) –

    Optional override for the condition label stored in JSONL.

  • comparison (str | None, default: None ) –

    Optional override for the comparison label stored in JSONL.

  • cancer (str | None, default: None ) –

    Backward-compatible alias for condition (discouraged for new use).

  • disease (str | None, default: None ) –

    Backward-compatible alias for condition (discouraged for new use).

Returns:

  • Path

    Path to the written report.jsonl.

Raises:

  • ValueError

    If required columns are missing and the claim payload cannot be synthesized.

Notes
  • This function does not write report.md. Use write_report for the human-facing markdown report.
  • Developer-only metadata can be enabled via LLMPATH_REPORT_INCLUDE_DEV_META.

Backends (proposal-only LLM)

LLM backends are used only for proposal steps (representative selection + typing) when enabled. Backends should support deterministic settings where possible and persist prompt/raw/meta artifacts.

llm_pathway_curator.backends

BaseLLMBackend

Bases: ABC

Backend-agnostic LLM interface.

This class defines a minimal contract for generating text or JSON strings.

Contract

Input
  • prompt : str

Output
  • json_mode=False: returns a single string (free-form). Implementations may return a human-readable error string on failure.
  • json_mode=True: must return either (a) a valid JSON string parseable by json.loads, or (b) a standardized soft error JSON string: {"error": {"message": "...", "type": "...", "retryable": true/false}}
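A caller consuming json_mode=True output has to distinguish a payload from a soft error. The sketch below shows one way to do that with the standard library only. parse_json_mode_output is a hypothetical helper, and the "invalid_json" error type it synthesizes for unparseable text is an assumption, not part of the documented contract.

```python
import json
from typing import Any, Optional, Tuple


def parse_json_mode_output(raw: str) -> Tuple[Any, Optional[dict]]:
    """Split a json_mode response into (payload, error).

    Returns (payload, None) on success, or (None, error_dict) when the string
    is a soft error payload or is not valid JSON.
    """
    try:
        obj = json.loads(raw)
    except (TypeError, ValueError) as exc:
        # Defensive branch: the error shape mirrors the soft error contract,
        # but the "invalid_json" type label is made up for this sketch.
        return None, {"message": str(exc), "type": "invalid_json", "retryable": False}
    if isinstance(obj, dict) and isinstance(obj.get("error"), dict):
        return None, obj["error"]
    return obj, None
```

A legitimate payload whose top-level "error" key holds an object would be misread as a soft error by this check, so real callers may want a stricter test.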

Notes

Convenience aliases are provided (invoke, call, complete, chat, and *_json helpers). Subclasses should implement generate.

call

call(prompt, **kwargs)

Alias for invoke.

Parameters:

  • prompt (str) –

    Input prompt string.

  • **kwargs (Any, default: {} ) –

    Optional keyword arguments.

Returns:

  • str

    Model output string.

chat

chat(messages, **kwargs)

Best-effort chat wrapper.

Parameters:

  • messages (Any) –

    Chat-like messages. Typically a list of dicts or strings. If a list is provided, the last element's "content" field (if dict) is used as prompt.

  • **kwargs (Any, default: {} ) –

    Optional keyword arguments passed to invoke.

Returns:

  • str

    Model output string.

Notes

This is intentionally lightweight and is not a full chat protocol implementation. It extracts a prompt and delegates to invoke.
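The prompt extraction described above can be sketched in a few lines. extract_prompt is a hypothetical standalone function, not the library's code; the handling of empty lists and of non-list inputs is an assumption.

```python
from typing import Any


def extract_prompt(messages: Any) -> str:
    """Reduce chat-like input to a single prompt string."""
    if isinstance(messages, (list, tuple)):
        if not messages:
            return ""  # assumption: an empty history becomes an empty prompt
        last = messages[-1]
        if isinstance(last, dict):
            # Documented rule: use the last element's "content" field.
            return str(last.get("content", ""))
        return str(last)
    # Assumption: anything that is not a list is treated as the prompt itself.
    return str(messages)
```

Earlier messages (for example a system prompt) are dropped by this reduction, which is why the wrapper is described as best-effort.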

chat_json

chat_json(prompt, **kwargs)

Generate JSON output from a prompt (chat-style helper).

Parameters:

  • prompt (str) –

    Input prompt string.

  • **kwargs (Any, default: {} ) –

    Optional keyword arguments (ignored except for future compatibility).

Returns:

  • str

    JSON string or standardized soft error JSON string.

complete

complete(prompt, **kwargs)

Alias for invoke.

Parameters:

  • prompt (str) –

    Input prompt string.

  • **kwargs (Any, default: {} ) –

    Optional keyword arguments.

Returns:

  • str

    Model output string.

complete_json

complete_json(prompt, **kwargs)

Generate JSON output from a prompt (completion-style helper).

Parameters:

  • prompt (str) –

    Input prompt string.

  • **kwargs (Any, default: {} ) –

    Optional keyword arguments (ignored except for future compatibility).

Returns:

  • str

    JSON string or standardized soft error JSON string.

generate abstractmethod

generate(prompt, json_mode=False)

Generate a completion for a given prompt.

Parameters:

  • prompt (str) –

    Input prompt string.

  • json_mode (bool, default: False ) –

    If True, the backend must return a JSON string (or a standardized soft error JSON). If False, free-form text is allowed.

Returns:

  • str

    Model output. See class-level contract for json_mode behavior.

Raises:

  • NotImplementedError

    If the backend does not implement this method.

generate_json

generate_json(prompt, **kwargs)

Generate JSON output from a prompt (explicit helper).

Parameters:

  • prompt (str) –

    Input prompt string.

  • **kwargs (Any, default: {} ) –

    Optional keyword arguments (ignored except for future compatibility).

Returns:

  • str

    JSON string or standardized soft error JSON string.

invoke

invoke(prompt, **kwargs)

Invoke the backend with a prompt (alias for generate).

Parameters:

  • prompt (str) –

    Input prompt string.

  • **kwargs (Any, default: {} ) –

    Optional keyword arguments. json_mode is recognized.

Returns:

  • str

    Model output string.

json

json(prompt, **kwargs)

Alias for JSON generation helpers.

Parameters:

  • prompt (str) –

    Input prompt string.

  • **kwargs (Any, default: {} ) –

    Optional keyword arguments.

Returns:

  • str

    JSON string or standardized soft error JSON string.

GeminiBackend

GeminiBackend(
    api_key,
    model_name="models/gemini-2.0-flash",
    temperature=0.0,
)

Bases: BaseLLMBackend

Google Gemini backend via google-generativeai.

Parameters:

  • api_key (str) –

    Gemini API key.

  • model_name (str, default: 'models/gemini-2.0-flash' ) –

    Gemini model identifier (e.g., "models/gemini-2.0-flash").

  • temperature (float, default: 0.0 ) –

    Sampling temperature.

Notes
  • In json_mode, response is requested with MIME type "application/json" and validated. Non-JSON output is converted to standardized soft error JSON.

Initialize the Gemini backend.

Parameters:

  • api_key (str) –

    Gemini API key.

  • model_name (str, default: 'models/gemini-2.0-flash' ) –

    Gemini model identifier.

  • temperature (float, default: 0.0 ) –

    Sampling temperature.

Raises:

  • ImportError

    If google-generativeai is not installed.

generate

generate(prompt, json_mode=False)

Generate a completion using Gemini.

Parameters:

  • prompt (str) –

    Input prompt string.

  • json_mode (bool, default: False ) –

    If True, attempts to enforce JSON output and validates with json.loads.

Returns:

  • str

    Free-form text (json_mode=False), or a JSON string / standardized soft error JSON (json_mode=True).

LocalLLMBackend

Bases: BaseLLMBackend

Local/offline backend stub.

This backend does not perform real generation. It exists to support offline workflows and testing paths.

Notes
  • In json_mode, returns a standardized soft error JSON payload.
  • In text mode, returns a human-readable placeholder string.

generate

generate(prompt, json_mode=False)

Return a placeholder response (local/offline stub).

Parameters:

  • prompt (str) –

    Input prompt string (ignored).

  • json_mode (bool, default: False ) –

    If True, returns standardized soft error JSON.

Returns:

  • str

    Placeholder text or standardized soft error JSON.

OllamaBackend

OllamaBackend(
    host=None,
    model_name=None,
    temperature=None,
    timeout=None,
)

Bases: BaseLLMBackend

Ollama backend using HTTP API (/api/generate) via urllib.

Parameters:

  • host (str | None, default: None ) –

    Ollama server base URL (e.g., "http://ollama:11434").

  • model_name (str | None, default: None ) –

    Ollama model name (e.g., "llama3.1:8b").

  • temperature (float | None, default: None ) –

    Sampling temperature.

  • timeout (float | None, default: None ) –

    Legacy single timeout (seconds) applied to both connect/read timeouts.

Notes
  • urllib accepts a single timeout value. This implementation stores both connect/read timeouts but uses read_timeout for urllib's timeout.
  • In json_mode, payload includes "format": "json" and output is validated. Non-JSON output is converted to standardized soft error JSON.

Initialize the Ollama backend.

Parameters:

  • host (str | None, default: None ) –

    Base URL for Ollama server. If None, falls back to env defaults.

  • model_name (str | None, default: None ) –

    Model name. If None, falls back to env defaults.

  • temperature (float | None, default: None ) –

    Sampling temperature. If None, falls back to env default.

  • timeout (float | None, default: None ) –

    Legacy single timeout applied to both connect/read.

Notes

Timeout resolution supports:
  • New envs: LPC_OLLAMA_CONNECT_TIMEOUT / LLMPATH_OLLAMA_CONNECT_TIMEOUT and LPC_OLLAMA_READ_TIMEOUT / LLMPATH_OLLAMA_READ_TIMEOUT
  • Legacy env: LPC_OLLAMA_TIMEOUT / LLMPATH_OLLAMA_TIMEOUT

generate

generate(prompt, json_mode=False)

Generate a completion using Ollama /api/generate.

Parameters:

  • prompt (str) –

    Input prompt string.

  • json_mode (bool, default: False ) –

    If True, requests JSON output and validates with json.loads.

Returns:

  • str

    Free-form text (json_mode=False), or a JSON string / standardized soft error JSON (json_mode=True).

Notes
  • Adaptive read-timeout escalation is applied on timeout errors: read_timeout *= factor up to a max, for a limited number of escalations.
  • connect_timeout is stored for metadata/documentation only and is not used by urllib (single-timeout limitation).
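The escalation rule (multiply the read timeout by a factor, cap it, and stop after a few escalations) can be sketched as a schedule of timeouts. The factor, cap, and escalation count below are hypothetical values chosen for illustration; the backend's actual settings are not documented here.

```python
from typing import List


def escalation_schedule(
    read_timeout: float,
    factor: float = 2.0,          # hypothetical value
    max_timeout: float = 600.0,   # hypothetical value
    max_escalations: int = 2,     # hypothetical value
) -> List[float]:
    """List the read timeouts tried: the initial value, then each escalation."""
    schedule = [read_timeout]
    current = read_timeout
    for _ in range(max_escalations):
        # Grow the timeout, but never beyond the cap.
        current = min(current * factor, max_timeout)
        schedule.append(current)
    return schedule
```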

OpenAIBackend

OpenAIBackend(
    api_key, model_name="gpt-4o", temperature=0.0, seed=42
)

Bases: BaseLLMBackend

OpenAI backend using the openai Python SDK (chat completions).

Parameters:

  • api_key (str) –

    OpenAI API key.

  • model_name (str, default: 'gpt-4o' ) –

    Model name (e.g., "gpt-4o").

  • temperature (float, default: 0.0 ) –

    Sampling temperature.

  • seed (int, default: 42 ) –

    Seed used when supported by the API/model. If seeding fails, a fallback call without seed is attempted.

Notes
  • In json_mode, response_format={"type": "json_object"} is used and output is validated. Non-JSON output is converted to standardized soft error JSON.

Initialize the OpenAI backend.

Parameters:

  • api_key (str) –

    OpenAI API key.

  • model_name (str, default: 'gpt-4o' ) –

    Model name.

  • temperature (float, default: 0.0 ) –

    Sampling temperature.

  • seed (int, default: 42 ) –

    Seed value for deterministic sampling when supported.

Raises:

  • ImportError

    If the openai package is not installed.

generate

generate(prompt, json_mode=False)

Generate a completion using OpenAI chat completions.

Parameters:

  • prompt (str) –

    Input prompt string.

  • json_mode (bool, default: False ) –

    If True, requests JSON object output and validates with json.loads.

Returns:

  • str

    Free-form text (json_mode=False), or a JSON string / standardized soft error JSON (json_mode=True).

Notes

If the seeded call fails, a second call without seed is attempted.

get_backend_from_env

get_backend_from_env(seed=None)

Create an LLM backend based on environment variables.

Parameters:

  • seed (int | None, default: None ) –

    Optional seed for backends that support seeded generation.

Returns:

  • BaseLLMBackend

    Backend instance for the backend selected via environment variables.

Raises:

  • KeyError

    If a required API key is missing for the selected backend.

  • ValueError

    If the backend name is unknown.

Notes

Backend selection envs (first non-empty wins):
  • LPC_BACKEND, BACKEND, LLMPATH_BACKEND

Supported backends:
  • "openai": uses OpenAI chat completions
  • "gemini": uses Google Generative AI
  • "ollama": uses Ollama HTTP API
  • "local" / "offline": stub backend (no real generation)

Compatibility:
  • Both "LLMPATH_" and "LPC_" prefixes are accepted for most settings.
  • For overlapping keys, LPC_ is preferred over vendor env, then LLMPATH_.
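The "first non-empty wins" rule for backend selection can be sketched as below. resolve_backend_name is a hypothetical helper; lower-casing the value and returning None when nothing is set are assumptions, since the library's default backend is not stated here.

```python
import os
from typing import Mapping, Optional

# Lookup order as documented: the first non-empty variable wins.
BACKEND_ENV_ORDER = ("LPC_BACKEND", "BACKEND", "LLMPATH_BACKEND")
SUPPORTED_BACKENDS = {"openai", "gemini", "ollama", "local", "offline"}


def resolve_backend_name(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the selected backend name, or None if no variable is set."""
    env = os.environ if env is None else env
    for key in BACKEND_ENV_ORDER:
        value = (env.get(key) or "").strip().lower()  # normalization is an assumption
        if value:
            if value not in SUPPORTED_BACKENDS:
                raise ValueError(f"unknown backend {value!r} (from {key})")
            return value
    return None
```

Passing a plain dict instead of os.environ keeps the lookup easy to test without mutating the process environment.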

retry_with_backoff

retry_with_backoff(retries=3, backoff_in_seconds=1.0)

Decorator factory for exponential backoff retries on backend calls.

Parameters:

  • retries (int, default: 3 ) –

    Maximum number of retry attempts (not counting the initial call).

  • backoff_in_seconds (float, default: 1.0 ) –

    Base backoff duration in seconds. Sleep time grows as: backoff_in_seconds * 2**attempt, with small jitter.

Returns:

  • callable

    A decorator that wraps a function and retries under certain conditions.

Retry conditions
  • Retryable exceptions inferred by message heuristics (status/keywords).
  • Legacy plain-text soft errors: "OpenAI Error: ...", "Gemini Error: ...", "Ollama Error: ..."
  • Standardized soft error JSON payloads: {"error": {"message": "...", "type": "...", "retryable": ...}}
  • When json_mode=True: invalid JSON outputs are treated as parse failures and retried at most once.
Notes

json_mode is inferred from kwargs (json_mode=) or from positional ABI: (self, prompt, json_mode=False) when present.
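The sleep schedule implied by backoff_in_seconds * 2**attempt can be made concrete with a short sketch. backoff_delays is a hypothetical helper; starting the attempt counter at 0 and modeling jitter as a uniform additive term are assumptions.

```python
import random
from typing import List, Optional


def backoff_delays(
    retries: int = 3,
    backoff_in_seconds: float = 1.0,
    jitter: float = 0.0,                  # assumed upper bound of additive jitter
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the sleep duration before each retry attempt."""
    rng = rng or random.Random()
    return [
        backoff_in_seconds * 2 ** attempt + rng.uniform(0.0, jitter)
        for attempt in range(retries)
    ]
```

With the defaults and no jitter this gives waits of 1, 2, and 4 seconds, so three retries add at most about 7 seconds of sleep on top of the call time.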


Adapters (Input → EvidenceTable)

Adapters normalize upstream enrichment outputs into the EvidenceTable contract. They are intentionally conservative: preserve evidence identity (term × genes), avoid destructive parsing, and keep TSV round-trips stable.

llm_pathway_curator.adapters.fgsea

FgseaAdapterConfig dataclass

FgseaAdapterConfig(
    source_name="fgsea",
    require_genes=True,
    keep_pval=True,
    term_id_mode="raw",
    drop_na_qval=True,
    sort_output=True,
)

Configuration for converting an fgsea result table to EvidenceTable.

Attributes:

  • source_name (str) –

    Value to populate the EvidenceTable source column.

  • require_genes (bool) –

    If True, raise an error when leadingEdge yields no genes.

  • keep_pval (bool) –

    If True and pval exists, store it separately (does not replace qval).

  • term_id_mode (str) –

    Term identifier policy.

    • "raw": term_id == pathway (recommended; paper-aligned)
    • "prefixed_hashed": term_id == "FGSEA:<slug>|<hash>" (legacy)
  • drop_na_qval (bool) –

    If True, drop rows where qval (padj) is missing.

  • sort_output (bool) –

    If True, sort output deterministically by qval asc then abs(stat) desc.

Notes

Defaults are chosen to match the paper-side EvidenceTable behavior: human-readable term IDs, stable ordering, and dropping NA q-values.

read_fgsea_table

read_fgsea_table(path)

Read an fgsea result table from disk.

Supports TSV by default and falls back to delimiter sniffing or whitespace parsing (best-effort).

Parameters:

  • path (str) –

    Path to an fgsea result file.

Returns:

  • DataFrame

    Parsed fgsea table.

fgsea_to_evidence_table

fgsea_to_evidence_table(fgsea_df, *, config=None)

Convert an fgsea result table to the EvidenceTable contract.

Parameters:

  • fgsea_df (DataFrame) –

    fgsea results table. Must contain (after aliasing) pathway and leadingEdge plus at least one statistic column among NES/ES.

  • config (FgseaAdapterConfig or None, default: None ) –

    Conversion configuration. If None, defaults are used.

Returns:

  • DataFrame

    EvidenceTable with core columns:

    • term_id : str
    • term_name : str
    • source : str
    • stat : float
    • qval : float or NA (from padj only)
    • direction : {"up", "down", "na"}
    • evidence_genes : list[str]

    Plus minimal provenance fields (e.g., pval, term_id_h).

Raises:

  • ValueError

    If required columns are missing, if no stat column is present, if pathway is empty, if the stat column is non-numeric, or if require_genes=True and evidence genes are empty.

Notes
  • Only padj is treated as q-value (FDR) and mapped to qval. pval is stored separately when present and enabled.
  • Output ordering can be stabilized via sort_output.
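The ordering and NA-dropping policy described for FgseaAdapterConfig (qval ascending, then abs(stat) descending) can be sketched on plain dictionaries without pandas. order_evidence_rows is a hypothetical helper, and the final tie-break on term_id is an assumption added to make the order fully deterministic.

```python
import math
from typing import Any, Dict, List


def _is_na(value: Any) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def order_evidence_rows(
    rows: List[Dict[str, Any]], drop_na_qval: bool = True
) -> List[Dict[str, Any]]:
    """Drop rows with missing qval (optional), then sort deterministically."""
    kept = [r for r in rows if not (drop_na_qval and _is_na(r["qval"]))]

    def sort_key(row: Dict[str, Any]):
        na = _is_na(row["qval"])
        # Missing q-values (if kept) sort last; term_id tie-break is an assumption.
        return (na, 0.0 if na else row["qval"], -abs(row["stat"]), row["term_id"])

    return sorted(kept, key=sort_key)
```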

convert_fgsea_table_to_evidence_tsv

convert_fgsea_table_to_evidence_tsv(
    in_path, out_path, *, config=None
)

Read an fgsea table, convert it, and write an EvidenceTable TSV.

This is a convenience wrapper around: read_fgsea_table -> fgsea_to_evidence_table -> TSV write.

Parameters:

  • in_path (str) –

    Path to the fgsea result file.

  • out_path (str) –

    Destination path for the EvidenceTable TSV.

  • config (FgseaAdapterConfig or None, default: None ) –

    Conversion configuration. If None, defaults are used.

Returns:

  • DataFrame

    EvidenceTable as written, with evidence_genes serialized for TSV.

Raises:

  • ValueError

    Propagated from fgsea_to_evidence_table on invalid inputs.

llm_pathway_curator.adapters.metascape

MetascapeAdapterConfig dataclass

MetascapeAdapterConfig(
    source_name="metascape",
    sheet_name="Enrichment",
    include_summary=False,
    prefer_symbols=True,
    strict_qval=False,
    drop_na_qval=True,
)

Configuration for converting Metascape exports to an EvidenceTable.

Attributes:

  • source_name (str) –

    Value to populate the EvidenceTable source column.

  • sheet_name (str) –

    Excel sheet to read when the input is .xlsx/.xls.

  • include_summary (bool) –

    Whether to include rows whose GroupID ends with "_Summary". The default is False to avoid summary rows being treated as evidence.

  • prefer_symbols (bool) –

    Prefer the Symbols column over Genes when both exist.

  • strict_qval (bool) –

    If True, raise an error when Log(q-value) is present but no valid q-values can be reconstructed.

  • drop_na_qval (bool) –

    If True, drop rows whose reconstructed q-value is missing.

read_metascape_table

read_metascape_table(path, *, sheet_name='Enrichment')

Read a Metascape export file into a DataFrame.

Supports Excel exports (.xlsx/.xls) and delimited text inputs. For Excel, the Enrichment sheet is the canonical input.

Parameters:

  • path (str) –

    Path to a Metascape export file.

  • sheet_name (str, default: 'Enrichment' ) –

    Sheet to read for Excel inputs. Default is "Enrichment".

Returns:

  • DataFrame

    Parsed Metascape table.

metascape_to_evidence_table

metascape_to_evidence_table(metascape_df, *, config=None)

Convert a Metascape Enrichment table to the EvidenceTable contract.

The resulting EvidenceTable is term-centric (one row per term) and carries evidence genes suitable for downstream factorization.

Parameters:

  • metascape_df (DataFrame) –

    Metascape "Enrichment" sheet as a DataFrame.

  • config (MetascapeAdapterConfig or None, default: None ) –

    Conversion configuration. If None, defaults are used.

Returns:

  • DataFrame

    EvidenceTable with (at minimum) these columns:

    • term_id : str
    • term_name : str
    • source : str
    • stat : float
    • qval : float
    • direction : str (Metascape ORA yields "na")
    • evidence_genes : list[str]

    Plus provenance/optional columns (e.g., group_id, is_summary).

Raises:

  • ValueError

    If required columns are missing, if evidence genes are empty for any row, if Term/Description are empty, or if statistic columns are non-numeric.

Notes
  • q-values are reconstructed from Log(q-value) using sign inference.
  • stat is made monotone-positive by taking abs(...) of the chosen log column, for ranking and paper-friendly plotting.
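One plausible reading of the two notes above is sketched here. It assumes the log column is base 10 and that a value may be stored either as log10(q) (negative) or as -log10(q) (positive), so the sign is discarded before reconstruction. Both function names are hypothetical, and the adapter's actual sign inference may differ.

```python
import math
from typing import Optional


def reconstruct_qval(log_q: Optional[float]) -> Optional[float]:
    """Map a Log(q-value) entry back to a q-value in (0, 1].

    Assumption: base-10 logs; the sign only reflects the export convention.
    """
    if log_q is None or not math.isfinite(log_q):
        return None
    return 10.0 ** (-abs(log_q))


def ranking_stat(log_value: float) -> float:
    """Monotone-positive statistic: larger means stronger enrichment."""
    return abs(log_value)
```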

convert_metascape_table_to_evidence_tsv

convert_metascape_table_to_evidence_tsv(
    in_path, out_path, *, config=None
)

Read a Metascape export, convert it, and write an EvidenceTable TSV.

This is a convenience wrapper around: read_metascape_table -> metascape_to_evidence_table -> TSV write.

Parameters:

  • in_path (str) –

    Path to the Metascape export (Excel or text).

  • out_path (str) –

    Destination path for the EvidenceTable TSV.

  • config (MetascapeAdapterConfig or None, default: None ) –

    Conversion configuration. If None, defaults are used.

Returns:

  • DataFrame

    EvidenceTable as written, with evidence_genes serialized for TSV.

Raises:

  • ValueError

    Propagated from metascape_to_evidence_table when inputs are invalid or evidence cannot be constructed.


Calibration (risk–coverage)

Utilities for selecting an operating point (e.g., τ) along the risk–coverage trade-off. This stage does not change evidence identity; it tunes conservativeness.

llm_pathway_curator.calibrate

CalibrationResult dataclass

CalibrationResult(method, params)

Calibration result object.

Attributes:

  • method ({'none', 'temperature', 'isotonic'}) –

    Calibration method identifier.

  • params (dict[str, Any]) –

    Method parameters:

    • temperature: {"T": float}
    • isotonic: {"model": fitted_model}
    • none: {}

Notes

This object is serializable only when params are JSON-safe. (isotonic model objects are not JSON-serializable by default.)

apply

apply(probs)

Apply the calibration mapping to probability-like scores.

Parameters:

  • probs (ndarray) –

    Probability array.

Returns:

  • ndarray

    Calibrated probabilities clipped to (0, 1).

Raises:

  • ValueError

    If method is unknown or required params are missing.

apply_isotonic

apply_isotonic(model, probs)

Apply a fitted isotonic regression model to probabilities.

Parameters:

  • model (Any) –

    Fitted isotonic regression model with predict.

  • probs (ndarray) –

    Probability array.

Returns:

  • ndarray

    Calibrated probabilities (float array).

apply_temperature_scaling

apply_temperature_scaling(probs, T)

Apply temperature scaling to probability-like scores in [0, 1].

Parameters:

  • probs (ndarray) –

    1D probability-like array.

  • T (float) –

    Temperature parameter (must be finite and > 0).

Returns:

  • ndarray

    Calibrated probabilities clipped to (0, 1).

Raises:

  • ValueError

    If T is invalid.

calibrate_probs

calibrate_probs(
    probs,
    y_true,
    *,
    method="temperature",
    allow_unlabeled=False,
)

Stage-2 calibration entry point.

Parameters:

  • probs (ndarray) –

    1D probability-like array in [0, 1].

  • y_true (ndarray or None) –

    Optional binary labels in {0, 1}.

  • method (('none', 'temperature', 'isotonic'), default: "temperature" ) –

    Calibration method. Default is "temperature".

  • allow_unlabeled (bool, default: False ) –

    If True and y_true is None, returns a no-op calibration ("none"). If False and y_true is None, refuses to fit.

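Returns:

  • CalibrationResult

    Fitted calibration (method and params), or a no-op "none" result when unlabeled input is allowed.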

Raises:

  • ValueError

    If inputs are invalid or fitting is requested without labels.

Notes

Design intent:
  • Keep dependencies optional (no scipy).
  • Temperature scaling uses deterministic grid search.

compute_counts

compute_counts(status)

Count PASS/FAIL/ABSTAIN/TOTAL from a status series (strict validation).

Parameters:

  • status (Series) –

    Status values. Must normalize into {"PASS", "ABSTAIN", "FAIL"}.

Returns:

  • dict[str, int]

    Counts with keys: {"PASS", "FAIL", "ABSTAIN", "TOTAL"}.

Raises:

  • ValueError

    If unknown status values are present (strict spec validation).

extract_probs_and_labels

extract_probs_and_labels(
    audit_log, *, prob_col, label_col=None
)

Extract probability-like scores and optional strict binary labels.

Parameters:

  • audit_log (DataFrame) –

    Audit log table.

  • prob_col (str) –

    Column name containing probabilities/scores.

  • label_col (str or None, default: None ) –

    Column name containing labels. Only exact {0,1} accepted.

Returns:

  • tuple[ndarray, ndarray or None]

    (probs, labels). Labels are returned as int array when provided.

Raises:

  • ValueError

    If columns are missing or values are non-numeric/non-finite, or labels are not exactly binary {0,1}.

fit_isotonic_regression

fit_isotonic_regression(probs, y_true)

Fit isotonic regression mapping probs -> calibrated probs.

Parameters:

  • probs (ndarray) –

    1D probability-like array in [0, 1].

  • y_true (ndarray) –

    1D binary labels in {0, 1}.

Returns:

  • Any

    Fitted isotonic regression model (scikit-learn object).

Raises:

  • ImportError

    If scikit-learn is not available.

  • ValueError

    If inputs are invalid.

fit_temperature_scaling

fit_temperature_scaling(
    probs, y_true, *, grid=(0.25, 10.0, 80)
)

Fit a single temperature T > 0 by minimizing NLL (binary labels).

Model

p' = sigmoid(logit(p) / T)

Parameters:

  • probs (ndarray) –

    1D probability-like array in [0, 1].

  • y_true (ndarray) –

    1D binary labels in {0, 1}.

  • grid (tuple[float, float, int], default: (0.25, 10.0, 80) ) –

    (t_min, t_max, n_grid). Search is performed in log-space.

Returns:

  • float

    Best temperature T, clipped to a conservative range [0.25, 10.0].

Raises:

  • ValueError

    If inputs are invalid or the grid is invalid.

Notes

No scipy dependency: uses deterministic grid search.
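
The grid search described above can be sketched in a few lines. This is an illustration under assumptions, not the library's code: `fit_temperature_sketch` and `_nll` are hypothetical names, the clipping constant `eps` is assumed, and plain lists stand in for ndarrays. Candidates are spaced evenly in log-space between `t_min` and `t_max`, and the one with the lowest mean negative log-likelihood wins.

```python
import math


def _nll(probs, y_true, T, eps=1e-6):
    """Mean binary negative log-likelihood after temperature scaling."""
    total = 0.0
    for p, y in zip(probs, y_true):
        p = min(max(p, eps), 1.0 - eps)
        z = math.log(p / (1.0 - p)) / T
        q = 1.0 / (1.0 + math.exp(-z))
        q = min(max(q, eps), 1.0 - eps)
        total -= y * math.log(q) + (1 - y) * math.log(1.0 - q)
    return total / len(probs)


def fit_temperature_sketch(probs, y_true, grid=(0.25, 10.0, 80)):
    """Pick the grid temperature that minimizes NLL (hypothetical helper)."""
    t_min, t_max, n_grid = grid
    if not (0 < t_min < t_max) or n_grid < 2:
        raise ValueError("invalid grid")
    lo, hi = math.log(t_min), math.log(t_max)
    # Evenly spaced in log-space, so small temperatures are sampled as densely as large ones.
    candidates = [math.exp(lo + (hi - lo) * i / (n_grid - 1)) for i in range(n_grid)]
    return min(candidates, key=lambda T: _nll(probs, y_true, T))


# Overconfident scores: 0.95 confidence but only 2 of 3 correct on each side.
probs = [0.95, 0.95, 0.95, 0.05, 0.05, 0.05]
labels = [1, 1, 0, 0, 0, 1]
best_T = fit_temperature_sketch(probs, labels)
```

Because the example scores are overconfident, the selected temperature is greater than 1, which softens them toward 0.5. The search is deterministic: the same inputs and grid always give the same T.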

risk_coverage_curve

risk_coverage_curve(
    df,
    *,
    score_col,
    status_col="status",
    decision_thresholds=None,
    pass_if_score_ge=True,
    promote_abstain=True,
    fail_on_degenerate=False,
    max_thresholds=200,
)

Build a Risk–Coverage curve by sweeping a PASS threshold.

Parameters:

  • df (DataFrame) –

    Input table containing score and status columns.

  • score_col (str) –

    Column name of probability-like or score values.

  • status_col (str, default: 'status' ) –

    Column name of base status. Default is "status".

  • decision_thresholds (list of float or None, default: None ) –

    Thresholds to sweep. If None, thresholds are derived from scores.

  • pass_if_score_ge (bool, default: True ) –

    If True, PASS when score >= threshold; else PASS when score <= threshold.

  • promote_abstain (bool, default: True ) –

    If True, among non-FAIL items reassign: PASS if threshold satisfied else ABSTAIN. If False, gate only existing PASS -> ABSTAIN below threshold.

  • fail_on_degenerate (bool, default: False ) –

    If True, raise on degenerate score distributions (<=1 unique value).

  • max_thresholds (int, default: 200 ) –

    Max thresholds when auto-deriving. Must be >= 10.

Returns:

  • DataFrame

    One row per threshold with risk/coverage metrics and metadata fields: threshold, score_col, status_col, pass_if_score_ge, promote_abstain.

Raises:

  • ValueError

    If required columns are missing, scores are invalid, statuses are invalid, or thresholds are empty/invalid.

Notes

Safety semantics:

  • FAIL is never changed.
  • ABSTAIN never enters the risk denominator.
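
To make the gating rules concrete, here is a minimal sketch of how statuses could be reassigned at a single threshold, following the pass_if_score_ge and promote_abstain descriptions above. `gate_statuses_sketch` is a hypothetical helper operating on plain lists; the real function works on a DataFrame and sweeps many thresholds.

```python
def gate_statuses_sketch(scores, statuses, threshold, *,
                         pass_if_score_ge=True, promote_abstain=True):
    """Reassign PASS/ABSTAIN at one threshold; FAIL is never changed."""
    out = []
    for score, status in zip(scores, statuses):
        ok = score >= threshold if pass_if_score_ge else score <= threshold
        if status == "FAIL":
            out.append("FAIL")                       # safety rule: FAIL stays FAIL
        elif promote_abstain:
            out.append("PASS" if ok else "ABSTAIN")  # any non-FAIL item is re-decided
        elif status == "PASS" and not ok:
            out.append("ABSTAIN")                    # only demote existing PASS
        else:
            out.append(status)
    return out


scores = [0.9, 0.4, 0.8, 0.95]
statuses = ["PASS", "PASS", "ABSTAIN", "FAIL"]
promoted = gate_statuses_sketch(scores, statuses, 0.5)
gated = gate_statuses_sketch(scores, statuses, 0.5, promote_abstain=False)
```

In this example the third item (ABSTAIN, score 0.8) becomes PASS only when promote_abstain is True, and the FAIL item keeps its status regardless of its high score.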

risk_coverage_from_status

risk_coverage_from_status(status)

Compute spec-safe Risk/Coverage metrics from a status series.

Parameters:

  • status (Series) –

    Status values in {"PASS", "ABSTAIN", "FAIL"}.

Returns:

  • dict[str, float]

    Metrics with explicit denominators:

    • coverage_pass_total PASS / TOTAL
    • coverage_decided_total (PASS + FAIL) / TOTAL
    • risk_fail_given_decided FAIL / (PASS + FAIL)
    • risk_fail_total FAIL / TOTAL
    • fail_rate_total Alias of FAIL / TOTAL (kept for backward compatibility)

    Also includes count fields as floats: n_pass, n_fail, n_abstain, n_decided, n_total
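
The denominators listed above can be checked with a small standalone sketch. `risk_coverage_sketch` is a hypothetical name, it takes a plain list instead of a Series, and returning NaN for a zero denominator is an assumption, since the library's behavior in that case is not documented here.

```python
def risk_coverage_sketch(statuses):
    """Compute the documented ratios from a list of PASS/ABSTAIN/FAIL strings."""
    n_pass = sum(s == "PASS" for s in statuses)
    n_fail = sum(s == "FAIL" for s in statuses)
    n_abstain = sum(s == "ABSTAIN" for s in statuses)
    n_total = len(statuses)
    if n_pass + n_fail + n_abstain != n_total:
        raise ValueError("unknown status value present")
    n_decided = n_pass + n_fail  # ABSTAIN is excluded from "decided"

    def ratio(num, den):
        # Assumed policy: NaN when the denominator is zero.
        return num / den if den else float("nan")

    return {
        "coverage_pass_total": ratio(n_pass, n_total),
        "coverage_decided_total": ratio(n_decided, n_total),
        "risk_fail_given_decided": ratio(n_fail, n_decided),
        "risk_fail_total": ratio(n_fail, n_total),
    }


metrics = risk_coverage_sketch(["PASS", "PASS", "FAIL", "ABSTAIN"])
```

For these four items, coverage_pass_total is 2/4, coverage_decided_total is 3/4, risk_fail_given_decided is 1/3, and risk_fail_total is 1/4.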

Notes

"decided" = PASS ∪ FAIL (ABSTAIN excluded). FAIL is a negative decision produced by mechanical audits.


Shared utilities (spec-level)

Spec-critical helpers for contract stability (NA handling, gene parsing/joining, stable hashes). If you need to compare outputs across versions, this is the layer that prevents drift.

llm_pathway_curator._shared

canonical_sorted_unique

canonical_sorted_unique(xs)

Canonicalize a list of values into sorted unique strings.

Parameters:

  • xs (list of object) –

    Input values.

Returns:

  • list of str

    Sorted unique tokens after trimming and NA filtering.

clean_gene_token

clean_gene_token(g)

Clean a single gene-like token conservatively.

Parameters:

  • g (object) –

    Gene-like token.

Returns:

  • str

    Cleaned token.

Notes
  • Trims whitespace and strips simple quote wrappers.
  • Removes common list/export wrappers (brackets, trailing separators).
  • Does NOT force uppercase (species/ID-system dependent).

dedup_preserve_order

dedup_preserve_order(items)

De-duplicate strings while preserving first occurrence order.

Parameters:

  • items (list of str) –

    Input tokens.

Returns:

  • list of str

    Deduplicated tokens in first-seen order.

Notes

Empty strings are ignored.

excel_force_text

excel_force_text(s)

Prefix a value with a single quote to force Excel to treat it as text.

Parameters:

  • s (object) –

    Input value.

Returns:

  • str

    Excel-safe text representation. Empty input returns "".

excel_safe_ids

excel_safe_ids(x, *, list_sep=ID_JOIN_DELIM)

Convert an ID field into an Excel-safe, TSV-friendly text string.

This helper accepts either scalar or list-like inputs, parses them via parse_id_list(), joins the IDs with list_sep, and prefixes a single quote to force Excel "Text" interpretation.

Parameters:

  • x (object) –

    Scalar or list-like ID field.

  • list_sep (str, default: ID_JOIN_DELIM ) –

    Join delimiter for the ID list. Default is ID_JOIN_DELIM.

Returns:

  • str

    Excel-safe text value. Returns "" if the input is NA-like or empty.

hash_gene_set_12hex

hash_gene_set_12hex(genes)

Compute a set-stable gene-set fingerprint (12-hex), preserving case.

Parameters:

  • genes (list of object) –

    Gene tokens.

Returns:

  • str

    12-character lowercase hex fingerprint.

Notes

Policy:

  • order-invariant (set-stable)
  • clean_gene_token() per token
  • no forced uppercasing (species/ID dependent)

hash_gene_set_12hex_upper

hash_gene_set_12hex_upper(genes)

Compute a legacy-compatible gene-set fingerprint (12-hex), uppercasing IDs.

Parameters:

  • genes (list of object) –

    Gene tokens.

Returns:

  • str

    12-character lowercase hex fingerprint.

Notes

Use only when you must match older outputs that case-folded gene IDs.

hash_set_12hex

hash_set_12hex(items)

Compute a generic set-stable fingerprint (12-hex) from a list of items.

Parameters:

  • items (list of object) –

    Input items.

Returns:

  • str

    12-character lowercase hex fingerprint.

Notes

Trims tokens, drops NA-like values, de-duplicates, sorts, then hashes.
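
The steps in the note (trim, drop NA-like, de-duplicate, sort, hash) can be sketched as follows. This is illustrative only: `hash_set_12hex_sketch` is a hypothetical name, and both the NA token set and the newline-joined payload format are assumptions, so the digests will not necessarily match the library's output.

```python
import hashlib

NA_TOKENS = {"", "na", "nan", "none"}  # assumed NA vocabulary, not the library's list


def hash_set_12hex_sketch(items):
    """Order-invariant 12-hex fingerprint of a collection of items."""
    tokens = {str(x).strip() for x in items if x is not None}       # trim + de-duplicate
    tokens = sorted(t for t in tokens if t.lower() not in NA_TOKENS)  # drop NA-like, sort
    payload = "\n".join(tokens)  # assumed payload format
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


a = hash_set_12hex_sketch(["TP53", " BRCA1 ", "TP53", None, "NA"])
b = hash_set_12hex_sketch(["BRCA1", "TP53"])
c = hash_set_12hex_sketch(["BRCA1", "tp53"])
```

Here `a` and `b` are equal because order, duplicates, whitespace, and NA-like entries do not affect the fingerprint, while `c` differs because case is preserved.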

is_na_scalar

is_na_scalar(x)

Determine whether a value should be treated as NA as a scalar.

This function avoids calling pandas.isna on list-like containers because it can return array-like results and break boolean contexts.

Parameters:

  • x (object) –

    Input value.

Returns:

  • bool

    True if x is a scalar NA value (or None). Containers return False.

Notes

Strings like "na"/"nan" are not treated as scalar NA here; use is_na_token() for token-level NA checks.

is_na_token

is_na_token(s)

Check whether a value represents an NA token (case-insensitive).

This is a spec-level helper used across parsing and TSV round-trips. The NA vocabulary is centralized to prevent contract drift.

Parameters:

  • s (object) –

    Input value.

Returns:

  • bool

    True if s is None or its trimmed lowercase string form is in the NA token set.

Notes

This function treats empty strings as NA.

join_genes_tsv

join_genes_tsv(genes)

Join gene tokens into a TSV-friendly string.

Parameters:

  • genes (list of object) –

    Gene tokens.

Returns:

  • str

    Genes joined by GENE_JOIN_DELIM.

Notes

Applies clean_gene_token() and drops empty/NA tokens. Does not sort; preserves input order.

join_id_list_tsv

join_id_list_tsv(ids, *, delim=ID_JOIN_DELIM)

Join generic identifiers into a TSV-friendly string.

The join is stable and order-preserving. This function is intentionally not gene-aware to avoid over-normalization at the spec boundary.

Parameters:

  • ids (list of object) –

    Identifiers to join. None/empty/NA-like tokens are dropped.

  • delim (str, default: ID_JOIN_DELIM ) –

    Delimiter for joining. Default is ID_JOIN_DELIM.

Returns:

  • str

    Joined identifier string.

Notes
  • Preserves input order (no sorting).
  • Does not apply clean_gene_token().

join_tags

join_tags(tags, *, delim=STRESS_TAG_DELIM)

Join tags into a canonical stress tag string.

Parameters:

  • tags (list of object) –

    Tag tokens.

  • delim (str, default: STRESS_TAG_DELIM ) –

    Join delimiter. Default is STRESS_TAG_DELIM (comma).

Returns:

  • str

    Canonical tag string.

Notes

Trims whitespace, drops empties, and de-duplicates in first-seen order.

looks_like_12hex

looks_like_12hex(x)

Check whether a value is exactly 12 lowercase hex characters.

Parameters:

  • x (object) –

    Input value.

Returns:

  • bool

    True if x matches the 12-hex pattern (lowercase).

make_term_uid

make_term_uid(source, term_id)

Construct a stable term_uid from (source, term_id).

Parameters:

  • source (object) –

    Term source (e.g., "fgsea", "metascape"). Empty maps to "unknown".

  • term_id (object) –

    Term identifier. Caller should ensure it is non-empty.

Returns:

  • str

    Term UID formatted as "<source>:<term_id>".

module_hash_content12

module_hash_content12(terms, genes)

Compute a module content hash binding both term set and gene set (12-hex).

Parameters:

  • terms (list of object) –

    Term identifiers.

  • genes (list of object) –

    Gene tokens.

Returns:

  • str

    12-character lowercase hex fingerprint.

Notes
  • Terms: canonical_sorted_unique() (no uppercasing)
  • Genes: clean_gene_token() + drop NA/empty + sort/dedup (no uppercasing)
  • Payload format is stable and explicit to prevent ambiguity.

norm_gene_id_upper

norm_gene_id_upper(g)

Normalize a gene token by applying conservative cleaning and uppercasing.

Parameters:

  • g (object) –

    Gene token.

Returns:

  • str

    Cleaned and uppercased token.

Notes

This is opt-in for legacy compatibility. The default spec policy in this module is to preserve case.

normalize_direction

normalize_direction(x)

Normalize direction vocabulary across schema/distill/audit/select.

Parameters:

  • x (object) –

    Input scalar.

Returns:

  • str

    One of {"up", "down", "na"}.

Notes

This is a lightweight normalizer. Unrecognized values map to "na".

normalize_gate_mode

normalize_gate_mode(x, *, default='note')

Normalize a gate mode to canonical vocabulary: {"off", "note", "hard"}.

Parameters:

  • x (object) –

    Input value (canonical, synonym, or legacy form).

  • default (str, default: 'note' ) –

    Default to use when x is empty. If invalid, falls back to "note".

Returns:

  • str

    Canonical gate mode: "off", "note", or "hard".

Notes

Accepted synonyms include:

  • off: off, none, disable, disabled
  • note: note, warn, warning, soft
  • hard: hard, strict, abstain, on, enable, enabled
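
A lookup table is one simple way to implement this kind of normalization. The sketch below uses the synonyms listed above; `normalize_gate_mode_sketch` is a hypothetical name, and two behaviors are assumptions: matching is case-insensitive, and unrecognized non-empty values fall back to the default.

```python
_SYNONYMS = {
    "off": ("off", "none", "disable", "disabled"),
    "note": ("note", "warn", "warning", "soft"),
    "hard": ("hard", "strict", "abstain", "on", "enable", "enabled"),
}
# Invert to synonym -> canonical mode for O(1) lookup.
_LOOKUP = {syn: canon for canon, syns in _SYNONYMS.items() for syn in syns}


def normalize_gate_mode_sketch(x, *, default="note"):
    """Map a gate mode (canonical or synonym) to 'off', 'note', or 'hard'."""
    # An invalid default falls back to "note", as documented.
    fallback = _LOOKUP.get(str(default).strip().lower(), "note")
    key = "" if x is None else str(x).strip().lower()
    if not key:
        return fallback
    # Assumption: unrecognized values also use the fallback.
    return _LOOKUP.get(key, fallback)


modes = [normalize_gate_mode_sketch(v) for v in ("Strict", "warn", "disabled", "")]
```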

normalize_status_series

normalize_status_series(s)

Normalize a pandas Series of statuses to uppercase strings.

Parameters:

  • s (Series) –

    Input series.

Returns:

  • Series

    Series with string dtype, trimmed and uppercased.

Notes

NA values may become strings (e.g., "nan") after astype(str). Always validate with validate_status_values() when needed.

normalize_status_str

normalize_status_str(x)

Normalize a status value into canonical uppercase text.

Parameters:

  • x (object) –

    Input scalar.

Returns:

  • str

    Uppercased, trimmed string.

Notes

This function does not validate membership in ALLOWED_STATUSES. Use validate_status_values() for strict checking.

parse_genes

parse_genes(x)

Parse evidence genes from messy inputs into a list of cleaned tokens.

Parameters:

  • x (object) –

    Scalar or list-like gene field.

Returns:

  • list of str

    Cleaned gene tokens, deduplicated in first-seen order.

Notes

Rules:

  • NA scalars -> []
  • list/tuple -> cleaned per-token
  • set -> sorted for determinism, then cleaned
  • string -> split conservatively via split_gene_string()

parse_id_list

parse_id_list(x)

Parse a generic ID field into a list of strings.

This is a tolerant parser for ID-like fields (term IDs, module IDs, gene IDs when treated as IDs, etc.). It is intentionally separate from parse_genes(), which is more gene-token-aware.

Parameters:

  • x (object) –

    Scalar or list-like input.

Returns:

  • list of str

    Parsed IDs in deterministic order.

Notes

Policy:

  • NA scalars -> []
  • list/tuple -> preserve order (dedup)
  • set -> sorted for determinism (dedup)
  • string -> split on strong delimiters first: ',', ';', '|'
  • whitespace split only if all tokens look identifier-like
  • drop NA tokens and empties
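
The policy above can be approximated with the following sketch. It is not the library's parser: `parse_id_list_sketch` is a hypothetical name, and the NA token set and the "identifier-like" regular expression are assumptions chosen for illustration.

```python
import re

NA_TOKENS = {"", "na", "nan", "none"}          # assumed NA vocabulary
ID_LIKE = re.compile(r"^[A-Za-z0-9_.:\-]+$")   # assumed "identifier-like" pattern


def _dedup(tokens):
    """Trim, drop NA/empty tokens, and de-duplicate in first-seen order."""
    seen, out = set(), []
    for t in tokens:
        t = str(t).strip()
        if t.lower() in NA_TOKENS or t in seen:
            continue
        seen.add(t)
        out.append(t)
    return out


def parse_id_list_sketch(x):
    if x is None or (isinstance(x, float) and x != x):  # None or NaN
        return []
    if isinstance(x, (list, tuple)):
        return _dedup(x)                                 # preserve order
    if isinstance(x, (set, frozenset)):
        return _dedup(sorted(str(v) for v in x))         # sort for determinism
    s = str(x).strip()
    if re.search(r"[,;|]", s):
        return _dedup(re.split(r"[,;|]", s))             # strong delimiters first
    parts = s.split()
    if len(parts) > 1 and all(ID_LIKE.match(p) for p in parts):
        return _dedup(parts)                             # safe whitespace split
    return _dedup([s])


ids = parse_id_list_sketch("GO:0001, GO:0002;GO:0001|NA")
```

The whitespace rule is the conservative part: a string such as "cell cycle (G1)" is kept as one ID because "(G1)" does not look identifier-like under the assumed pattern, whereas "T1 T2" is split into two IDs.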

seed_for_term

seed_for_term(seed, term_uid, term_row_id=None)

Create a deterministic per-term integer seed.

The seed is derived from (seed, term_uid, term_row_id) using a stable hash to keep RNG streams reproducible across platforms.

Parameters:

  • seed (int or None) –

    Optional base seed. None maps to 0.

  • term_uid (str) –

    Stable term identifier (e.g., "<source>:<term_id>").

  • term_row_id (int or None, default: None ) –

    Optional row identifier to avoid collisions for duplicate term_uids.

Returns:

  • int

    Deterministic unsigned integer seed.

Raises:

  • ValueError

    If term_row_id cannot be converted to int (when provided).

seed_int_from_payload

seed_int_from_payload(payload, *, mod=2 ** 31 - 1)

Derive a deterministic integer seed from an arbitrary payload.

Parameters:

  • payload (object) –

    Any JSON-serializable payload.

  • mod (int, default: 2 ** 31 - 1 ) –

    Modulus for the resulting seed. Default is 2**31 - 1.

Returns:

  • int

    Deterministic integer seed in [0, mod).

Notes

Uses sha256_short(..., n=12) to keep stability aligned with other IDs.

sha256_12hex

sha256_12hex(payload)

Compute a deterministic short SHA-256 hash (first 12 hex chars).

Parameters:

  • payload (str) –

    Stable string payload.

Returns:

  • str

    12-character lowercase hex digest.

sha256_short

sha256_short(obj, n=12)

Compute a deterministic SHA-256 short hash from an arbitrary payload.

Parameters:

  • obj (object) –

    Payload to hash. It is serialized via stable_json_dumps().

  • n (int, default: 12 ) –

    Number of hex characters to return. Default is 12.

Returns:

  • str

    Lowercase hex digest prefix.

Raises:

  • ValueError

    If n is not positive.

Notes
  • For n == 12, this matches the legacy behavior (sha256_12hex).
  • SHA-256 hex digests have length 64; if n > 64, the output length is effectively capped at 64 by Python slicing.
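
A minimal sketch of this kind of short hash, assuming the payload is serialized with the JSON settings listed under stable_json_dumps and encoded as UTF-8 before hashing (the encoding is an assumption). `sha256_short_sketch` is a hypothetical name, and its digests are not guaranteed to match the library's.

```python
import hashlib
import json


def sha256_short_sketch(obj, n=12):
    """Return the first n hex characters of a SHA-256 over deterministic JSON."""
    if n <= 0:
        raise ValueError("n must be positive")
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()  # 64 lowercase hex chars
    return digest[:n]  # slicing caps the length at 64 when n > 64


h1 = sha256_short_sketch({"term": "GO:0001", "genes": ["A", "B"]})
h2 = sha256_short_sketch({"genes": ["A", "B"], "term": "GO:0001"})
full = sha256_short_sketch({"x": 1}, n=100)
```

Because keys are sorted before hashing, `h1` and `h2` are identical even though the dictionaries were built in a different key order; `full` has length 64.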

split_gene_string

split_gene_string(s)

Split a gene string into candidate tokens using conservative rules.

Parameters:

  • s (str) –

    Input gene string.

Returns:

  • list of str

    Token candidates (not yet fully cleaned).

Notes

Supported formats:

  • Comma/semicolon/pipe separated: "A,B", "A;B", "A|B"
  • Bracketed lists: "['A','B']", '["A","B"]', "{A,B}"
  • Slash-separated as a last resort: "A/B/C"
  • Whitespace-separated only if all tokens look gene-like

split_tags

split_tags(s, *, delim=STRESS_TAG_DELIM)

Split a stress tag string into normalized tags.

Parameters:

  • s (object) –

    Input scalar tag string.

  • delim (str, default: STRESS_TAG_DELIM ) –

    Canonical delimiter. Default is STRESS_TAG_DELIM (comma).

Returns:

  • list of str

    Tags in first-seen order.

Notes
  • Canonical delimiter is comma.
  • Legacy '+' is tolerated as an additional delimiter.
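
The tag helpers form a round trip, which the sketch below illustrates. `split_tags_sketch` and `join_tags_sketch` are hypothetical stand-ins, a literal comma replaces STRESS_TAG_DELIM, and a single-character delimiter is assumed.

```python
import re


def _unique_trimmed(parts):
    """Trim, drop empties, and de-duplicate in first-seen order."""
    seen, out = set(), []
    for p in parts:
        tag = "" if p is None else str(p).strip()
        if tag and tag not in seen:
            seen.add(tag)
            out.append(tag)
    return out


def split_tags_sketch(s, *, delim=","):
    if s is None:
        return []
    # Split on the canonical delimiter and on the legacy '+'.
    pattern = "[" + re.escape(delim) + "+]"
    return _unique_trimmed(re.split(pattern, str(s)))


def join_tags_sketch(tags, *, delim=","):
    return delim.join(_unique_trimmed(tags))


tags = split_tags_sketch("dropout+contradictory, dropout")
canonical = join_tags_sketch(tags)
```

A legacy string such as "dropout+contradictory, dropout" normalizes to the canonical "dropout,contradictory", so old and new outputs compare equal after a split/join pass.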

stable_json_dumps

stable_json_dumps(obj)

Serialize an object to deterministic JSON for hashing/provenance.

Parameters:

  • obj (object) –

    JSON-serializable object.

Returns:

  • str

    Deterministic JSON string.

Notes

Uses:

  • sort_keys=True
  • separators=(",", ":")
  • ensure_ascii=False
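
With the standard library, the three settings in the note correspond to a single json.dumps call. The wrapper name `stable_json_dumps_sketch` is hypothetical; the real function may handle additional types or edge cases not shown here.

```python
import json


def stable_json_dumps_sketch(obj):
    """Deterministic JSON: sorted keys, no extra whitespace, non-ASCII kept as-is."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


a = stable_json_dumps_sketch({"b": 1, "a": [2, 3]})
b = stable_json_dumps_sketch({"a": [2, 3], "b": 1})
# Both calls return '{"a":[2,3],"b":1}'
```

Sorting keys and fixing separators removes the two usual sources of byte-level variation in JSON output, which is what makes the string safe to hash.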

strip_excel_text_prefix

strip_excel_text_prefix(s)

Strip the Excel "force text" prefix from a value.

Excel-safe exports sometimes prefix values with a single quote ('). This helper removes one leading quote to support downstream parsing.

Parameters:

  • s (object) –

    Input value.

Returns:

  • str

    Cleaned string without a single leading quote.
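
Together with excel_force_text, this helper gives a simple round trip for IDs that Excel would otherwise reinterpret (for example, by dropping leading zeros). The sketch below is an assumption-laden illustration with hypothetical names; how the library treats None or already-prefixed values beyond what is documented is not covered.

```python
def excel_force_text_sketch(s):
    """Prefix a single quote so Excel treats the value as text; empty stays empty."""
    text = "" if s is None else str(s)
    return "'" + text if text else ""


def strip_excel_text_prefix_sketch(s):
    """Remove exactly one leading single quote, if present."""
    text = "" if s is None else str(s)
    return text[1:] if text.startswith("'") else text


exported = excel_force_text_sketch("0012")
restored = strip_excel_text_prefix_sketch(exported)
```

Here `exported` is "'0012" and `restored` is "0012", so the original ID survives an export and re-import unchanged.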

validate_status_values

validate_status_values(s_norm)

Strictly validate a normalized status series: unknown status values are refused so that risk/coverage denominators stay auditable.


Noise modules (gene noise dictionaries)

Curated gene-noise patterns used by masking/evidence hygiene steps.

llm_pathway_curator.noise_lists

Noise module definitions (shared asset; conservative by default).

Rationale (paper-facing)

Marker rankings and enrichment evidence often contain ubiquitous programs (e.g., clonotypes, uninformative locus IDs) that can dominate prompts and confuse LLM interpretation. This module centralizes symbol-centric noise definitions that can be applied in prompt-facing layers while preserving evidence identity in PathwayCurator.

Policy (PathwayCurator)

LLM-PathwayCurator evaluates enrichment interpretations as audited decisions. Therefore, we do not pre-emptively remove broad biological programs (cell cycle, interferon, ribosome/mitochondria, HLA, Ig constants) from evidence by default, because they can be true biology and removing them can inflate ABSTAIN via missing/unstable evidence.

Reproducibility

Edit conservatively: changes may affect benchmark comparability. This file is dependency-free and safe to import.

```