API reference¶
This page documents the public surface of LLM-PathwayCurator.
Most users should start with the CLI (llm-pathway-curator run ...). The Python API exists for
integration and reproducible orchestration.
CLI¶
Primary entry point:
- llm-pathway-curator run ...
The CLI runs the end-to-end pipeline:
EvidenceTable → distill → modules → claims → audit → report
Pipeline¶
End-to-end orchestration (recommended integration point).
llm_pathway_curator.pipeline ¶
RunConfig
dataclass
¶
RunConfig(
evidence_table,
sample_card,
outdir,
force=False,
seed=None,
run_meta_name="run_meta.json",
tau=None,
k_claims=None,
stress_evidence_dropout_p=None,
stress_evidence_dropout_min_keep=None,
stress_contradictory_p=None,
stress_contradictory_max_extra=None,
)
Pipeline run configuration.
Parameters:
-
evidence_table(str) –Path to the input EvidenceTable TSV.
-
sample_card(str) –Path to the SampleCard JSON.
-
outdir(str) –Output directory path.
-
force(bool, default:False) –If True, allow writing into a non-empty outdir.
-
seed(int | None, default:None) –Random seed used for deterministic steps.
-
run_meta_name(str, default:'run_meta.json') –File name for run metadata JSON written under outdir.
-
tau(float | None, default:None) –Optional override for audit threshold tau. If None, uses card.audit_tau().
-
k_claims(int | None, default:None) –Optional override for number of claims to propose.
-
stress_evidence_dropout_p(float | None, default:None) –Probability for evidence gene dropout stress test.
-
stress_evidence_dropout_min_keep(int | None, default:None) –Minimum number of genes to keep per term under dropout stress.
-
stress_contradictory_p(float | None, default:None) –Probability to inject contradictory direction claims.
-
stress_contradictory_max_extra(int | None, default:None) –Cap for number of injected contradictory rows.
Notes
This config is designed to be JSON-serializable via dataclasses.asdict.
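The serialization note can be illustrated with a stand-in dataclass. The class below is a hypothetical mirror of a subset of the documented fields (RunConfigSketch is not part of the library); it only shows that a config made of plain strings, booleans, numbers, and None round-trips through dataclasses.asdict and json without a custom encoder.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RunConfigSketch:
    """Hypothetical stand-in mirroring a subset of the documented RunConfig fields."""

    evidence_table: str
    sample_card: str
    outdir: str
    force: bool = False
    seed: Optional[int] = None
    run_meta_name: str = "run_meta.json"
    tau: Optional[float] = None
    k_claims: Optional[int] = None


cfg = RunConfigSketch(
    evidence_table="evidence_table.tsv",
    sample_card="sample_card.json",
    outdir="out",
    seed=42,
)

# asdict() yields only str/bool/int/None values here, so json.dumps works directly.
payload = json.dumps(asdict(cfg), sort_keys=True)

# Round-trip: the JSON text is enough to rebuild an equal config object.
restored = RunConfigSketch(**json.loads(payload))
```

Storing the serialized config next to the run outputs is one way to make a run reproducible from its own metadata.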
run_pipeline ¶
run_pipeline(cfg, *, run_id=None)
Run the full LLM-PathwayCurator pipeline.
Parameters:
-
cfg(RunConfig) –Run configuration.
-
run_id(str | None, default:None) –Optional explicit run id. If None, a run id is generated.
Returns:
-
RunResult–Summary of the run, including artifact paths and meta_path.
Raises:
-
FileNotFoundError–If required input files are missing.
-
IsADirectoryError–If a required input path is a directory.
-
FileExistsError–If outdir is non-empty and cfg.force is False.
-
RuntimeError–If a required step produces zero rows.
-
Exception–Any exception raised by underlying steps is propagated after writing run_meta status="error".
Notes
Step order: distill -> modules -> select_claims -> context_review -> stress -> audit -> report -> report_jsonl.
Artifacts and run metadata are written into cfg.outdir. The run_meta.json is updated at each step to support reproducibility and debugging.
Environment variables
Many behaviors can be controlled via environment variables, including:
- Backend and modes: LLMPATH_BACKEND, LLMPATH_CLAIM_MODE
- Context: variables prefixed with LLMPATH_CONTEXT_ (gate/review/corpus/weights/rerank)
- Stress: variables prefixed with LLMPATH_STRESS_ (dropout/contradictory)
Contracts¶
EvidenceTable (TSV contract)¶
EvidenceTable is the normalized term × supporting-genes table used by all downstream stages.
It is the stability boundary: if the EvidenceTable is valid, distill/modules/select/audit/report
should not break.
llm_pathway_curator.schema ¶
EvidenceTable schema gate for LLM-PathwayCurator. This module defines the tool-facing EvidenceTable contract (v1) that preserves term×gene relationships across enrichment analysis tools (ORA, fgsea/GSEA, etc.). It provides robust IO, conservative column aliasing, spec-owned evidence parsing (delegated to _shared), and provenance metadata (df.attrs) for auditability.
EvidenceTable
dataclass
¶
EvidenceTable(df)
Tool-facing EvidenceTable wrapper.
This class normalizes heterogeneous enrichment analysis outputs into a stable, auditable internal representation that preserves term×gene relationships.
Notes
- The contract requires non-empty term_id, term_name, and evidence_genes.
- Parsing/normalization of gene tokens is spec-owned by llm_pathway_curator._shared (e.g., parse_genes, clean_gene_token), to avoid contract drift.
- Provenance and health summaries are recorded in df.attrs.
read_tsv
classmethod
¶
read_tsv(path, *, strict=False, drop_invalid=True)
Read and normalize an evidence table to the contract (v1).
This is the main schema gate that:
- aliases common column variants to contract names
- cleans required fields
- parses evidence_genes via _shared.parse_genes
- normalizes numeric fields (stat, qval, pval)
- validates the term×gene contract
- optionally computes q-values from p-values (BH) when q-values are missing
- records provenance and health metrics in df.attrs
Parameters:
-
path(str) –Input evidence table path.
-
strict(bool, default:False) –If True, the first invalid row raises ValueError. If False, invalid rows are marked (and optionally dropped). Default is False.
-
drop_invalid(bool, default:True) –If True, drop rows with is_valid=False. Default is True. If False, keep invalid rows and rely on is_valid downstream.
Returns:
-
EvidenceTable–Normalized evidence table wrapper.
Raises:
-
ValueError–If core required columns are missing after aliasing, or if strict=True and an invalid row is encountered.
Notes
Contract-required columns (core):
- term_id
- term_name
- stat
- evidence_genes
Output guarantees (post-normalization)
- evidence_genes is a list-like object per row (and evidence_genes_str is TSV-safe)
- direction is normalized (typically 'up', 'down', 'na')
- df.attrs contains: contract_version, read_mode, aliasing, health
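The optional q-value step refers to the Benjamini-Hochberg (BH) procedure. The following is a minimal pure-Python sketch of BH adjustment, included to make the step concrete; the function name bh_qvalues and the handling of missing p-values are assumptions, and the library's own implementation may differ in details.

```python
from typing import List, Optional, Sequence


def bh_qvalues(pvals: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Benjamini-Hochberg adjusted p-values; None entries stay None (assumption)."""
    indexed = [(p, i) for i, p in enumerate(pvals) if p is not None]
    m = len(indexed)
    out: List[Optional[float]] = [None] * len(pvals)
    if m == 0:
        return out

    indexed.sort()  # ascending by p-value
    running_min = 1.0
    # Walk from the largest p-value down, so each q-value is the minimum of
    # p * m / rank over all ranks at or above the current one (monotonicity).
    for rank in range(m, 0, -1):
        p, original_index = indexed[rank - 1]
        running_min = min(running_min, p * m / rank)
        out[original_index] = running_min
    return out


qvals = bh_qvalues([0.01, 0.04, 0.03, 0.005])
```

The q-values are returned in the input order, so they can be written back to the same rows that supplied the p-values.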
Examples:
>>> et = EvidenceTable.read_tsv("evidence_table.tsv")
>>> info = et.summarize()
>>> et.write_tsv("normalized_evidence_table.tsv")
summarize ¶
summarize()
Summarize the normalized EvidenceTable for logging and QA.
Returns:
-
dict[str, object]–Summary dictionary including:
- contract version
- number of terms and sources
- direction counts
- evidence genes per term quantiles
- q-value provenance counts
- df.attrs['health'] and df.attrs['aliasing'] (if present)
write_tsv ¶
write_tsv(path)
Write the normalized EvidenceTable to a TSV file.
This writer:
- applies a small Excel formula-injection defense for common text fields
- serializes evidence_genes as a TSV-friendly string column
- emits a stable column order for reproducibility
Parameters:
-
path(str) –Output TSV path.
Notes
- evidence_genes is written as a joined string under the column name evidence_genes (list form is dropped).
- Normalized contract columns are emitted first; remaining columns are sorted.
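The two writer behaviors above, formula-injection defense and gene-list serialization, can be sketched as follows. This is an illustration under stated assumptions: the trigger characters, the single-quote prefix, and the comma separator are common conventions, not confirmed details of write_tsv, and both helper names are hypothetical.

```python
from typing import Iterable

# Leading characters that spreadsheet applications may interpret as a formula.
FORMULA_PREFIXES = ("=", "+", "-", "@")


def defuse_formula(cell: str) -> str:
    """Prefix a single quote when a text cell starts with a formula trigger."""
    if cell.startswith(FORMULA_PREFIXES):
        return "'" + cell
    return cell


def join_genes(genes: Iterable[str], sep: str = ",") -> str:
    """Serialize a per-row gene list into one TSV-safe string (separator assumed)."""
    return sep.join(str(g) for g in genes)


safe_name = defuse_formula("=HYPERLINK(...)")
gene_cell = join_genes(["TP53", "BRCA1"])
```

A defense like this should only be applied to free-text columns; applying it to numeric columns would corrupt negative numbers.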
Sample Card (study context contract)¶
The Sample Card is a structured record of study intent/context (e.g., condition/tissue/perturbation/comparison), used by proposal steps and context validity gates.
llm_pathway_curator.sample_card ¶
SampleCard ¶
Bases: BaseModel
SampleCard: tool-facing context and knob container.
Attributes:
-
condition, tissue, perturbation, comparison(str) –Core context keys normalized into stable strings.
-
notes(str or None) –Optional free-form notes for humans.
-
context_tokens_text(str or None) –Optional free-form text used to derive deterministic context tokens.
-
context_tokens_policy(dict[str, Any]) –Tokenization policy for deterministic context tokens.
-
context_tokens_meta(dict[str, Any]) –Optional metadata for provenance logging.
-
k_claims_value(int) –Top-level k_claims value (stored under JSON key "k_claims").
-
extra(dict[str, Any]) –Tool knobs and future-compatible fields. Flattened + alias-canonicalized.
Notes
Contract:
- Core context keys are normalized strings; NA is represented by NA_TOKEN.
- The neutral disease-like key is "condition" (legacy keys accepted on input).
- k_claims is top-level only; it is not stored inside extra.
- extra keeps unknown keys for forward compatibility.
apply_patch ¶
apply_patch(patch)
Apply a patch dictionary and return a new SampleCard.
Parameters:
-
patch(dict[str, Any]) –Patch values. Core keys are applied at top-level. Other keys are merged into extra.
Returns:
-
SampleCard–New SampleCard instance with patch applied.
Notes
Contract enforcement:
- Never keeps k_claims or its aliases inside extra.
- Accepts legacy disease-like keys to fill "condition" when missing.
audit_min_gene_overlap ¶
audit_min_gene_overlap(default=1)
Get minimum gene overlap for evidence drift checks.
Parameters:
-
default(int, default:1) –Default value, by default 1.
Returns:
-
int–Minimum overlap threshold.
audit_tau ¶
audit_tau(default=0.8)
Get audit stability tau.
Parameters:
-
default(float, default:0.8) –Default value, by default 0.8.
Returns:
-
float–Tau value used by the audit layer.
claim_mode ¶
claim_mode(default='deterministic')
Get claim generation mode.
Parameters:
-
default(str, default:'deterministic') –Default mode, by default "deterministic".
Returns:
-
str–One of {"deterministic", "llm"}.
context_dict ¶
context_dict()
Return core context keys as a dictionary.
Returns:
-
dict[str, str]–Mapping from CORE_KEYS to their normalized values.
context_gate_mode ¶
context_gate_mode(default='hard')
Get context gate mode for audit integration.
Parameters:
-
default(str, default:'hard') –Default mode, by default "hard".
Returns:
-
str–Gate mode normalized to {"off", "note", "hard"}.
context_key ¶
context_key()
Build a stable composite context key string.
Returns:
-
str–"condition|tissue|perturbation|comparison" using normalized fields.
context_tokens ¶
context_tokens()
Compute deterministic context tokens used for anchoring.
Returns:
-
list[str]–Deterministic token list.
Notes
Priority:
1) context_tokens_text is tokenized via ctx_tokens_v1.
2) Fallback: core context fields are concatenated and tokenized.
context_tokens_effective ¶
context_tokens_effective()
Build a provenance payload for logging (pure function).
Returns:
-
dict[str, Any]–Dictionary containing: version, tokens, n, signature, policy.
context_tokens_signature ¶
context_tokens_signature()
Compute a stable short signature for current context tokens.
Returns:
-
str–12-hex sha256-based signature.
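The composite context key and the 12-hex signature can be sketched with the standard library. This is a minimal illustration, assuming the NA token is the string "NA" and that tokens are joined with newlines before hashing; the library's NA_TOKEN value and its exact hashing input are not documented here, so both function bodies are hypothetical.

```python
import hashlib
from typing import Dict, Sequence

CORE_KEYS = ("condition", "tissue", "perturbation", "comparison")


def context_key(card: Dict[str, str], na_token: str = "NA") -> str:
    """Join the four core context fields with '|' (na_token value is an assumption)."""
    return "|".join(str(card.get(key) or na_token) for key in CORE_KEYS)


def tokens_signature(tokens: Sequence[str]) -> str:
    """First 12 hex characters of a sha256 digest over the token list."""
    joined = "\n".join(tokens)  # join rule is an assumption
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:12]


key = context_key({"condition": "BRCA", "tissue": "breast"})
signature = tokens_signature(["brca", "breast"])
```

Because the signature depends only on the token list, two runs with the same tokens produce the same value, which makes it usable as a compact provenance field.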
context_tokens_version ¶
context_tokens_version()
Get effective context tokenization version.
Returns:
-
str–Policy version string (currently "ctx_tokens_v1").
enable_context_score_proxy ¶
enable_context_score_proxy(default=False)
Get whether proxy context scoring is enabled.
Parameters:
-
default(bool, default:False) –Default behavior, by default False.
Returns:
-
bool–True if proxy context scoring is enabled.
from_json
classmethod
¶
from_json(path)
Load a SampleCard from a JSON file (tool contract).
Parameters:
-
path(str or Path) –Path to a JSON file containing a SampleCard object.
Returns:
-
SampleCard–Parsed and normalized SampleCard instance.
Raises:
-
FileNotFoundError–If the file does not exist.
-
ValueError–If JSON is invalid or not a dict-like object.
Notes
Backward compatibility:
- Accepts legacy disease-like keys and hoists into "condition".
- Allows k_claims stored in extra or via aliases, but hoists to top-level.
- Removes k_claims and its aliases from extra on load.
hub_frac_thr ¶
hub_frac_thr(default=0.5)
Get hub fraction threshold for ABSTAIN_HUB_BRIDGE gating.
Parameters:
-
default(float, default:0.5) –Default threshold, by default 0.5.
Returns:
-
float–Fraction clamped into [0, 1].
hub_term_degree ¶
hub_term_degree(default=200)
Get hub gene degree threshold for hub-bridge gating.
Parameters:
-
default(int, default:200) –Default threshold, by default 200.
Returns:
-
int–Threshold (>= 1).
k_claims ¶
k_claims(default=3)
Get number of claims to generate.
Parameters:
-
default(int, default:3) –Default count, by default 3.
Returns:
-
int–Number of claims (>= 1).
Notes
Top-level k_claims_value has priority. Extra is fallback only.
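The priority rule in the note above can be sketched as a small resolver. This is an illustration only: resolve_k_claims is a hypothetical helper, and clamping to 1 is one possible way to satisfy the documented ">= 1" guarantee (the library might instead reject invalid values).

```python
from typing import Any, Dict, Optional


def resolve_k_claims(
    top_level: Optional[int], extra: Dict[str, Any], default: int = 3
) -> int:
    """Top-level value first, then the 'extra' fallback, then the default."""
    for candidate in (top_level, extra.get("k_claims")):
        if candidate is None:
            continue
        try:
            return max(1, int(candidate))  # clamp to >= 1 (assumption)
        except (TypeError, ValueError):
            continue  # unusable value: try the next source
    return max(1, int(default))


k_from_top = resolve_k_claims(5, {"k_claims": 9})
k_from_extra = resolve_k_claims(None, {"k_claims": 9})
```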
max_per_module ¶
max_per_module(default=1)
Get maximum claims per module (diversity control).
Parameters:
-
default(int, default:1) –Default value, by default 1.
Returns:
-
int–Maximum per module (>= 1).
min_union_genes ¶
min_union_genes(default=3)
Get minimum union evidence genes required for support.
Parameters:
-
default(int, default:3) –Default minimum, by default 3.
Returns:
-
int–Minimum union size (>= 1).
pass_notes ¶
pass_notes(default=True)
Decide whether to emit compact notes for PASS rows.
Parameters:
-
default(bool, default:True) –Default behavior, by default True.
Returns:
-
bool–True if PASS rows may receive a short note (e.g., "ok").
preselect_tau_gate ¶
preselect_tau_gate(default=False)
Get whether preselection should apply a tau gate.
Parameters:
-
default(bool, default:False) –Default behavior, by default False.
Returns:
-
bool–True if preselection tau gating is enabled.
stability_gate_mode ¶
stability_gate_mode(default='hard')
Get stability gate mode.
Parameters:
-
default(str, default:'hard') –Default mode, by default "hard".
Returns:
-
str–Gate mode normalized to {"off", "note", "hard"}.
stress_gate_mode ¶
stress_gate_mode(default='off')
Get stress gate mode for audit integration.
Parameters:
-
default(str, default:'off') –Default mode, by default "off".
Returns:
-
str–Gate mode normalized to {"off", "note", "hard"}.
strict_evidence_check ¶
strict_evidence_check(default=False)
Get strict evidence linkage policy.
Parameters:
-
default(bool, default:False) –Default behavior, by default False.
Returns:
-
bool–If True, missing evidence linkage becomes schema violation in audit.
to_json ¶
to_json(path, *, indent=2)
Serialize this SampleCard to JSON.
Parameters:
-
path(str or Path) –Output path.
-
indent(int, default:2) –JSON indentation level, by default 2.
Returns:
-
None–Writes the file.
Notes
Uses model_dump(by_alias=True) so the JSON key is "k_claims".
trust_input_survival ¶
trust_input_survival(default=False)
Decide whether to trust survival values provided in inputs.
Parameters:
-
default(bool, default:False) –Default behavior, by default False.
Returns:
-
bool–True if tool should trust input survival rather than recomputing.
Claim schema (typed JSON)¶
Claims are schema-bounded decision objects with resolvable evidence links (term/module identifiers + hashes). Free-text narratives are not treated as evidence.
llm_pathway_curator.claim_schema ¶
Typed, auditable claim schema for LLM-PathwayCurator.
This module defines strict Pydantic models for:
- Evidence references (term IDs, optional gene IDs, module ID)
- Typed claims (entity, direction, context keys)
- Audit decisions (PASS/ABSTAIN/FAIL + reason codes)
Design:
- Claim and evidence identifiers are tool-owned and deterministic.
- Free-text evidence is disallowed; evidence must be referenced by IDs.
- Optional context review fields are supported for audit gating.
Notes
- Status vocabulary is intentionally strict to keep denominators auditable.
- Gene ID casing is preserved for display; hashing follows tool-wide spec.
AuditedClaim ¶
Claim ¶
Bases: BaseModel
Typed claim with auditable evidence linkage.
Attributes:
-
claim_id(str) –Tool-owned stable identifier. If empty, it is filled deterministically.
-
entity(str) –Stable entity identifier (prefer IDs over free text).
-
direction({'up', 'down', 'na'}) –Canonical direction token.
-
context_keys(list of {"condition", "tissue", "perturbation", "comparison"}) –Keys the claim is conditioned on. Values live in SampleCard.
-
evidence_ref(EvidenceRef) –Evidence reference (IDs only; no free-text evidence).
Optional context review fields
-
context_evaluated(bool) –Whether context relevance review was executed.
-
context_method({"llm", "proxy", "none"}) –Method used for context review.
-
context_status({"PASS", "WARN", "FAIL"} or None) –Result of context review.
-
context_reason(str or None) –Short reason (length-limited).
-
context_notes(str or None) –Additional notes (length-limited).
Notes
Invariants enforced:
- If context_evaluated is False:
method="none" and status/reason/notes are cleared.
- If context_evaluated is True:
method must be "llm" or "proxy" and status must be provided.
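The two invariants above can be sketched as a plain normalization function. The real model enforces them through Pydantic validation; the function below is a hypothetical stand-in that uses only the field names and vocabularies documented in this section.

```python
from typing import Any, Dict, Optional


def normalize_context_review(
    evaluated: bool,
    method: str = "none",
    status: Optional[str] = None,
    reason: Optional[str] = None,
    notes: Optional[str] = None,
) -> Dict[str, Any]:
    """Apply the documented invariants to the optional context review fields."""
    if not evaluated:
        # Not evaluated: method is forced to "none" and the rest is cleared.
        return {
            "context_evaluated": False,
            "context_method": "none",
            "context_status": None,
            "context_reason": None,
            "context_notes": None,
        }
    if method not in ("llm", "proxy"):
        raise ValueError("context_method must be 'llm' or 'proxy' when evaluated")
    if status not in ("PASS", "WARN", "FAIL"):
        raise ValueError("context_status is required when evaluated")
    return {
        "context_evaluated": True,
        "context_method": method,
        "context_status": status,
        "context_reason": reason,
        "context_notes": notes,
    }


cleared = normalize_context_review(False, method="llm", status="PASS")
reviewed = normalize_context_review(True, method="proxy", status="WARN")
```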
Decision ¶
Bases: BaseModel
Mechanical audit decision for a claim.
Attributes:
-
status({'PASS', 'ABSTAIN', 'FAIL'}) –Final decision label.
-
reason(str) –Reason code. Must be "ok" or one of ALL_REASONS.
-
details(dict) –Optional structured metadata for debugging or reporting.
Raises:
-
ValueError–If reason is not in the allowed vocabulary.
EvidenceRef ¶
Bases: BaseModel
Evidence reference container (strict, tool-friendly).
Attributes:
-
term_ids(list of str) –Required. One or more term UID strings that define evidence.
-
gene_set_hash(str) –Optional input. If missing/invalid, it is deterministically filled from gene_ids when available, else from term_ids as a fallback.
-
gene_ids(list of str) –Optional. Evidence genes for display and hashing (tool spec).
-
module_id(str) –Optional. Module identifier for module-level evidence.
Notes
- gene_set_hash must be a 12-hex digest (sha256[:12]).
- Extra fields are allowed to support non-breaking provenance flags (e.g., gene_set_hash_source).
- Term IDs are not uppercased.
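The fill rule for gene_set_hash can be sketched as below. The fallback order (valid input, then gene_ids, then term_ids) and the sha256[:12] format come from this section; the canonicalization used before hashing (trimming, uppercasing genes, sorting, joining with commas) is an assumption standing in for the tool-wide spec, and fill_gene_set_hash is a hypothetical name.

```python
import hashlib
import re
from typing import Optional, Sequence, Tuple

HEX12 = re.compile(r"^[0-9a-f]{12}$")


def fill_gene_set_hash(
    term_ids: Sequence[str],
    gene_ids: Optional[Sequence[str]] = None,
    gene_set_hash: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (hash, source), where source records which input produced the hash."""
    if gene_set_hash and HEX12.match(gene_set_hash):
        return gene_set_hash, "input"

    # Canonical gene form for hashing is an assumption (trim + uppercase + dedupe).
    genes = sorted({g.strip().upper() for g in (gene_ids or []) if g.strip()})
    if genes:
        basis, source = genes, "gene_ids"
    else:
        # Term IDs keep their original casing, as stated in the notes.
        basis, source = sorted(set(term_ids)), "term_ids"

    digest = hashlib.sha256(",".join(basis).encode("utf-8")).hexdigest()[:12]
    return digest, source


hash_a, source_a = fill_gene_set_hash(["GO:0006915"], gene_ids=["TP53", "BRCA1"])
hash_b, _ = fill_gene_set_hash(["GO:0006915"], gene_ids=["brca1", "TP53"])
```

Under these assumptions the hash does not depend on gene order or casing, so two references to the same gene set resolve to the same identifier.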
Core stages (A → B → C)¶
A) Stability distillation (evidence hygiene)¶
Generates stability proxies from supporting-gene perturbations (e.g., LOO/jackknife-like survival scores). This stage does not decide PASS/ABSTAIN/FAIL.
llm_pathway_curator.distill ¶
distill_evidence ¶
distill_evidence(evidence, card, *, seed=None)
Distill evidence into stability/provenance features (A-stage; deterministic).
This function performs evidence hygiene and produces per-term stability proxies without re-running enrichment. Two modes are supported:
- evidence_perturb (default): perturb evidence genes deterministically and compute term survival as the fraction of perturbations that preserve evidence similarity.
- replicates_proxy: compute proxy survival from replicate-stacked evidence tables (requires replicate_id; this is not a true patient-level leave-one-out re-run of enrichment).
Parameters:
-
evidence(DataFrame) –Normalized EvidenceTable-like dataframe with required columns: term_id, term_name, source, stat, qval, direction, evidence_genes.
-
card(SampleCard) –Sample card controlling distill knobs under extra (namespaced as distill_*).
-
seed(int or None, default:None) –Global seed for deterministic per-term perturbations.
Returns:
-
DataFrame–Distilled table containing stable join keys (term_uid), TSV-friendly genes, survival fields, and knob provenance columns used by downstream modules/audit/report.
Raises:
-
ValueError–If required columns are missing, stat is non-numeric, evidence_genes is empty, or replicates_proxy is requested but replicate requirements are not met.
Notes
- This stage measures stability and records provenance; it does not decide PASS/ABSTAIN/FAIL.
- Contract-critical: term×gene must be preserved post-masking (≥1 evidence gene per term).
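The evidence_perturb idea, survival as the fraction of perturbations that preserve evidence similarity, can be sketched with seeded gene dropout and a Jaccard threshold. Every parameter name and default below (n_perturb, dropout_p, min_keep, sim_min) is hypothetical, and the library's perturbation scheme and similarity measure may differ.

```python
import random
from typing import Iterable, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def term_survival(
    genes: Iterable[str],
    *,
    n_perturb: int = 50,
    dropout_p: float = 0.2,
    min_keep: int = 1,
    sim_min: float = 0.7,
    seed: int = 0,
) -> float:
    """Fraction of seeded dropout perturbations whose gene set stays similar."""
    base = set(genes)
    if not base:
        raise ValueError("evidence_genes must not be empty")
    ordered = sorted(base)  # fixed order keeps the random draws reproducible
    rng = random.Random(seed)

    survived = 0
    for _ in range(n_perturb):
        kept = {g for g in ordered if rng.random() >= dropout_p}
        if len(kept) < min_keep:
            # Keep at least min_keep genes so the term never loses all evidence.
            kept = set(rng.sample(ordered, min(min_keep, len(ordered))))
        if jaccard(base, kept) >= sim_min:
            survived += 1
    return survived / n_perturb


genes = [f"G{i}" for i in range(10)]
survival = term_survival(genes, seed=1)
```

Using a local random.Random instance rather than the global generator keeps the result a function of the seed alone, which matches the deterministic intent of this stage.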
B) Evidence modules (term–gene factorization)¶
Constructs the term–gene bipartite graph and extracts evidence modules that preserve shared vs distinct support. This stage does not decide PASS/ABSTAIN/FAIL.
llm_pathway_curator.modules ¶
ModuleOutputs
dataclass
¶
ModuleOutputs(modules_df, term_modules_df, edges_df)
Container for module factorization outputs.
Attributes:
-
modules_df(DataFrame) –Per-module summary table. One row per module_id. Contains stable hashes (terms/genes/content) and representative genes, plus optional survival fields if computed upstream.
-
term_modules_df(DataFrame) –Term-to-module assignment table. Contract: one module_id per term_uid.
-
edges_df(DataFrame) –Filtered term-by-gene edge table used for module construction. Columns: term_uid, gene_id, weight. Additional debug/provenance lives in edges_df.attrs.
attach_module_drift_stress_tag ¶
attach_module_drift_stress_tag(
distilled_df,
drift_df,
*,
term_id_col="term_uid",
stress_col="stress_tag",
tag="module_drift",
)
Annotate terms with a stress tag when module assignment drifted.
Parameters:
-
distilled_df(DataFrame) –Distilled evidence table with term_id_col and an optional stress tag column.
-
drift_df(DataFrame) –Drift table containing term_id_col and module_drift (bool).
-
term_id_col(str, default:'term_uid') –Term identifier column name (default "term_uid").
-
stress_col(str, default:'stress_tag') –Column name used to store stress tags (default "stress_tag").
-
tag(str, default:'module_drift') –Tag value to append when drift is detected (default "module_drift").
Returns:
-
DataFrame–Copy of distilled_df with updated stress_col. Existing tags are preserved and the new tag is appended if missing.
Raises:
-
ValueError–If required columns are missing.
Notes
- Does not overwrite non-empty tags; it appends.
- Tag splitting/joining is delegated to _shared.split_tags and _shared.join_tags.
attach_module_ids ¶
attach_module_ids(
evidence_df,
term_modules_df,
*,
term_id_col="term_uid",
modules_df=None,
)
Attach module identifiers to an evidence table by term_uid.
Parameters:
-
evidence_df(DataFrame) –Evidence table that includes term_id_col (typically "term_uid").
-
term_modules_df(DataFrame) –Term-to-module table with columns term_id_col and module_id.
-
term_id_col(str, default:'term_uid') –Join key column name for term identifiers.
-
modules_df(DataFrame | None, default:None) –Optional per-module table. If provided, module-level survival fields are joined onto each term row.
Returns:
-
DataFrame–Copy of evidence_df with:
- module_id
- module_id_missing (bool)
and, optionally, module survival columns if modules_df was provided.
Raises:
-
ValueError–If required columns are missing.
build_term_gene_edges ¶
build_term_gene_edges(
evidence_df,
*,
term_id_col="term_uid",
genes_col="evidence_genes",
)
Build term-by-gene bipartite edges from an evidence table.
Parameters:
-
evidence_df(DataFrame) –Evidence table containing at least a term identifier column and a gene evidence column.
-
term_id_col(str, default:'term_uid') –Column name for the term identifier in evidence_df.
-
genes_col(str, default:'evidence_genes') –Column name for evidence genes in evidence_df. Values can be list-like (preferred) or legacy scalar strings.
Returns:
-
DataFrame–Edge table with columns:
- term_uid : str
- gene_id : str
- weight : float
The returned DataFrame also stores a small provenance dict under out.attrs["edges"].
Raises:
-
ValueError–If required columns are missing.
Notes
- Empty/invalid gene lists produce no edges and are dropped.
- List-like gene inputs are processed via vectorized explode.
- Scalar/string inputs are parsed via _shared.parse_genes.
- Duplicate (term_uid, gene_id) edges are summed into a single row with weight equal to the multiplicity.
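The edge rules in the notes above (empty gene lists yield no edges, duplicates collapse into one row whose weight is the multiplicity) can be sketched without pandas. The function below is a hypothetical stand-in that takes (term_uid, gene list) pairs and returns plain dict rows; the library operates on DataFrames and uses a vectorized explode instead.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Tuple, Union

EdgeRow = Dict[str, Union[str, float]]


def build_edges(rows: Iterable[Tuple[str, Sequence[str]]]) -> List[EdgeRow]:
    """Collapse (term_uid, genes) rows into weighted term-gene edges."""
    counts: Counter = Counter()
    for term_uid, genes in rows:
        for gene in genes:
            gene = str(gene).strip()
            if gene:  # blank tokens produce no edge
                counts[(term_uid, gene)] += 1
    # One row per unique (term_uid, gene_id); weight is the multiplicity.
    return [
        {"term_uid": term, "gene_id": gene, "weight": float(weight)}
        for (term, gene), weight in sorted(counts.items())
    ]


edges = build_edges(
    [
        ("GO:1", ["TP53", "TP53", "BRCA1"]),
        ("GO:2", []),  # empty gene list: contributes nothing
    ]
)
```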
compute_term_module_drift ¶
compute_term_module_drift(
baseline_term_modules_df,
stressed_term_modules_df,
*,
term_id_col="term_uid",
)
Compute per-term drift of module assignment under stress.
Parameters:
-
baseline_term_modules_df(DataFrame) –Baseline term-to-module assignments.
-
stressed_term_modules_df(DataFrame) –Stressed term-to-module assignments.
-
term_id_col(str, default:'term_uid') –Term identifier column name (default "term_uid").
Returns:
-
DataFrame–Drift table with columns:
- term_uid
- module_id_base
- module_id_stress
- module_drift (bool)
Raises:
-
ValueError–If inputs do not have required columns or violate the one-term-one-module contract.
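Per-term drift amounts to comparing two term-to-module mappings. The sketch below uses plain dicts, which encode the one-term-one-module contract by construction, and restricts the comparison to terms present in both mappings; how the library treats terms missing from one side is not documented here, so that restriction is an assumption. Both function names are hypothetical.

```python
from typing import Dict, List, Union

DriftRow = Dict[str, Union[str, bool]]


def term_module_drift(
    baseline: Dict[str, str], stressed: Dict[str, str]
) -> List[DriftRow]:
    """Compare module assignments for terms present in both mappings."""
    rows: List[DriftRow] = []
    for term_uid in sorted(set(baseline) & set(stressed)):
        base_module = baseline[term_uid]
        stress_module = stressed[term_uid]
        rows.append(
            {
                "term_uid": term_uid,
                "module_id_base": base_module,
                "module_id_stress": stress_module,
                "module_drift": base_module != stress_module,
            }
        )
    return rows


def term_drift_rate(rows: List[DriftRow]) -> float:
    """Share of compared terms whose module assignment changed."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row["module_drift"]) / len(rows)


drift = term_module_drift(
    {"T1": "M1", "T2": "M1", "T3": "M2"},
    {"T1": "M1", "T2": "M3", "T3": "M2"},
)
rate = term_drift_rate(drift)
```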
factorize_modules_connected_components ¶
factorize_modules_connected_components(
evidence_df,
*,
method="term_jaccard_cc",
module_prefix="M",
max_gene_term_degree=None,
max_term_degree=None,
hub_degree_quantile=0.995,
min_shared_genes=3,
jaccard_min=0.1,
term_id_col="term_uid",
genes_col="evidence_genes",
sparsity_mode="auto",
shared_pos_target=0.1,
sparse_relax_min_shared_genes=2,
sparse_relax_jaccard_min=0.02,
pair_sample_max=200000,
seed=42,
)
Factorize enrichment evidence into stable "evidence modules".
This constructs a term-by-gene bipartite graph from an evidence table and groups related terms into modules. Module identity is stable: module_id is derived from a content hash of (terms, genes).
Parameters:
-
evidence_df(DataFrame) –Evidence table containing term identifiers and evidence genes.
-
method(ModuleMethod, default:'term_jaccard_cc') –Module construction method.
- "term_jaccard_cc": connected components on a term-term graph derived from shared genes (recommended).
- "bipartite_cc": connected components on the bipartite graph (legacy).
-
module_prefix(str, default:'M') –Prefix prepended to the module_id (default "M").
-
max_gene_term_degree(int | None, default:None) –If set, removes genes whose term-degree is strictly greater than this threshold before module construction.
-
max_term_degree(int | None, default:None) –Deprecated alias for max_gene_term_degree.
-
hub_degree_quantile(float | None, default:0.995) –If not None and explicit thresholds are not given, infer the hub degree threshold from the specified quantile of gene term-degree.
-
min_shared_genes(int, default:3) –Minimum shared genes for term-term edges (term_jaccard_cc).
-
jaccard_min(float, default:0.1) –Minimum Jaccard similarity for term-term edges (term_jaccard_cc).
-
term_id_col(str, default:'term_uid') –Column name in evidence_df holding the term identifier. The pipeline convention is "term_uid".
-
genes_col(str, default:'evidence_genes') –Column name in evidence_df holding evidence genes.
-
sparsity_mode(Literal['auto', 'off'], default:'auto') –If "auto", relaxes thresholds for sparse graphs and may tighten thresholds to avoid giant-component collapse.
-
shared_pos_target(float, default:0.1) –Target lower bound for P(shared_genes > 0) under auto sparsity tuning.
-
sparse_relax_min_shared_genes(int, default:2) –Relaxed min_shared_genes used when sparsity is detected.
-
sparse_relax_jaccard_min(float, default:0.02) –Relaxed jaccard_min used when sparsity is detected.
-
pair_sample_max(int, default:200000) –Maximum number of term pairs sampled for sparsity diagnostics.
-
seed(int, default:42) –Random seed for sampling-based diagnostics.
Returns:
-
ModuleOutputs–Object containing:
- modules_df: per-module summary table
- term_modules_df: term_uid -> module_id assignments (one per term)
- edges_df: filtered edge table used to build modules
Raises:
-
ValueError–If an unknown method is requested, required columns are missing, or the term->module contract is violated.
Notes
- Hub filtering and sparsity/giant-component heuristics are recorded in edges_df.attrs["modules"] for reproducibility and debugging.
- module_id is stable and derived from module content, not from component numbering.
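The "term_jaccard_cc" idea can be sketched as connected components over a term-term graph, where two terms are linked when they share at least min_shared_genes genes and reach jaccard_min similarity. The sketch below uses a union-find structure and an all-pairs loop; it omits hub filtering and sparsity tuning, and the content-hash format in module_id (digest input and 12-character length) is an assumption, not the library's scheme.

```python
import hashlib
from itertools import combinations
from typing import Dict, Iterable, List, Set


def jaccard_components(
    term_genes: Dict[str, Set[str]],
    min_shared_genes: int = 3,
    jaccard_min: float = 0.1,
) -> List[List[str]]:
    """Group terms into connected components of the thresholded term-term graph."""
    terms = sorted(term_genes)
    parent = {t: t for t in terms}

    def find(t: str) -> str:
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for a, b in combinations(terms, 2):
        shared = len(term_genes[a] & term_genes[b])
        union = len(term_genes[a] | term_genes[b])
        if shared >= min_shared_genes and union and shared / union >= jaccard_min:
            parent[find(a)] = find(b)  # link the two components

    groups: Dict[str, List[str]] = {}
    for t in terms:
        groups.setdefault(find(t), []).append(t)
    return sorted(groups.values())


def module_id(terms: Iterable[str], genes: Iterable[str], prefix: str = "M") -> str:
    """Content-derived identifier: independent of input order and component numbering."""
    payload = "|".join(sorted(terms)) + "||" + "|".join(sorted(genes))
    return prefix + hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


components = jaccard_components(
    {
        "T1": {"A", "B", "C", "D"},
        "T2": {"A", "B", "C", "E"},
        "T3": {"X", "Y", "Z"},
    }
)
```

The all-pairs loop is quadratic in the number of terms, which is acceptable for a sketch but is one reason a production implementation would prune candidate pairs first.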
filter_hub_genes ¶
filter_hub_genes(
edges, *, max_gene_term_degree=200, max_term_degree=None
)
Remove hub genes that connect too many terms (high gene term-degree).
Parameters:
-
edges(DataFrame) –Edge table with columns term_uid and gene_id.
-
max_gene_term_degree(int | None, default:200) –Hub threshold. Genes with term-degree strictly greater than this value are removed. If None, no hub filtering is applied.
-
max_term_degree(int | None, default:None) –Deprecated alias for max_gene_term_degree. If provided and max_gene_term_degree is None, it is used as the threshold.
Returns:
-
DataFrame–Filtered edge table. Hub filter metadata is recorded in out.attrs["hub_filter"].
Raises:
-
ValueError–If edges does not have the required columns.
Notes
The filter uses a strict condition: degree > threshold (not >=).
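The strict comparison matters at the boundary: a gene whose term-degree equals the threshold is kept. The sketch below shows this on plain (term_uid, gene_id) tuples; filter_hub_genes_sketch is a hypothetical stand-in for the DataFrame-based function.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

Edge = Tuple[str, str]  # (term_uid, gene_id)


def filter_hub_genes_sketch(
    edges: List[Edge], max_gene_term_degree: Optional[int] = 200
) -> List[Edge]:
    """Drop edges of genes whose term-degree is strictly above the threshold."""
    if max_gene_term_degree is None:
        return list(edges)  # no hub filtering requested

    terms_by_gene: Dict[str, Set[str]] = defaultdict(set)
    for term_uid, gene_id in edges:
        terms_by_gene[gene_id].add(term_uid)

    # Strict condition: degree > threshold, so degree == threshold survives.
    hubs = {g for g, terms in terms_by_gene.items() if len(terms) > max_gene_term_degree}
    return [(t, g) for t, g in edges if g not in hubs]


example_edges = [
    ("T1", "HUB"), ("T2", "HUB"), ("T3", "HUB"),  # degree 3
    ("T1", "EDGE"), ("T2", "EDGE"),               # degree 2
]
filtered = filter_hub_genes_sketch(example_edges, max_gene_term_degree=2)
```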
summarize_module_drift ¶
summarize_module_drift(drift_df)
Summarize module drift statistics.
Parameters:
-
drift_df(DataFrame) –Output of compute_term_module_drift with required columns: term_uid, module_id_base, module_id_stress, module_drift.
Returns:
-
dict–Summary metrics including:
- n_terms_total, n_terms_drift, term_drift_rate
- n_modules_base, n_modules_stress, n_modules_shared
- module_churn_rate
C1) Proposal (deterministic baseline / LLM proposal-only)¶
Proposes typed, evidence-linked candidate claims from distilled evidence and modules. Final acceptance is not decided here.
llm_pathway_curator.select ¶
select_claims ¶
select_claims(
distilled,
card,
*,
k=50,
mode=None,
backend=None,
claim_backend=None,
review_backend=None,
context_gate_mode="soft",
context_review_mode="off",
seed=None,
outdir=None,
**kwargs,
)
C1: Propose schema-locked pathway claims from distilled evidence.
Parameters:
-
distilled(DataFrame) –Distilled evidence table (optionally with module_id and context fields).
-
card(SampleCard) –Sample card providing context and selection knobs.
-
k(int, default:50) –Number of claims to propose.
-
mode(str or None, default:None) –"deterministic" or "llm". If None, resolved from env/card.
-
backend(BaseLLMBackend or None, default:None) –Backend used for LLM claim proposal when mode="llm".
-
claim_backend(BaseLLMBackend or None, default:None) –Reserved for role-based backends (currently not required here).
-
review_backend(BaseLLMBackend or None, default:None) –Backend used for LLM context review (shortlist-only).
-
context_gate_mode(str, default:'soft') –Public API legacy default is "soft". Canonical gate modes are off/note/hard; "soft" is ignored to preserve old behavior.
-
context_review_mode(str, default:'off') –"off" or "llm". When "llm", fills pipeline-owned context fields before ranking / proposal.
-
seed(int or None, default:None) –Seed for deterministic tie-breaks and optional stress probes.
-
outdir(str or None, default:None) –Output directory for small caches and artifacts.
-
**kwargs(Any, default:{}) –Forward-compatible extra arguments (ignored here).
Returns:
-
DataFrame–Proposed claims table. Includes decision-grade claim_json that embeds EvidenceRef with gene_ids and gene_set_hash.
Notes
Selection-time context knobs (env):
- LLMPATH_SELECT_CONTEXT_MODE = off|proxy|review
- LLMPATH_SELECT_CONTEXT_GATE_MODE = off|note|hard
Pipeline-owned context review columns (if present) are never overwritten except when LLM review is requested and the existing method is not "llm".
llm_pathway_curator.llm_claims ¶
LLM-based claim proposal for LLM-PathwayCurator.
This module proposes structured Claim objects from distilled evidence using
an LLM backend. It is designed to be:
- contract-driven (stable IDs, deterministic evidence linking),
- robust across heterogeneous backends (OpenAI/Gemini/Ollama/local),
- audit-grade (persist prompt/candidates/raw/meta artifacts).
Key ideas
- Evidence identity is tool-owned (term_uid + gene_set_hash).
- Context VALUES are prompt-facing; context KEYS are contract-facing.
- FAIL decisions are never "promoted" by thresholding; gating affects non-FAIL.
Notes
This file contains many private helpers. Public entrypoints:
- propose_claims_llm
- claims_to_proposed_tsv
LLMClaimResult
dataclass
¶
LLMClaimResult(
claims, raw_text, used_fallback, notes, meta
)
Container for LLM claim proposal results.
Attributes:
-
claims(list[Claim]) –Validated and post-processed claims. Empty if failure/fallback.
-
raw_text(str) –Raw JSON text persisted for audit/debug.
-
used_fallback(bool) –True if LLM output was unusable or a soft-error occurred.
-
notes(str) –Compact status note (e.g., "ok", "post_validate_failed: ...").
-
meta(dict[str, Any]) –Metadata used for reproducibility (k, top_n, hashes, backend class, etc.).
build_claim_prompt ¶
build_claim_prompt(*, card, candidates, k)
Build a compact JSON-only prompt for proposing claims.
Parameters:
-
card(SampleCard) –Sample card providing context values and stable context keys.
-
candidates(DataFrame) –Candidate evidence rows (top_n pool) used as the ONLY selectable source. Expected columns include term_uid, term_id, term_name, direction, and optionally term_survival and gene_ids_suggest/evidence_genes.
-
k(int) –Target number of claims to request from the model.
Returns:
-
str–Prompt string instructing the model to return valid JSON only.
Notes
The prompt enforces copy-exact rules for:
- entity == term_id
- evidence_ref.term_ids == [term_uid]
Context values are prompt-facing only; identity uses context KEYS.
claims_to_proposed_tsv ¶
claims_to_proposed_tsv(
*, claims, distilled_with_modules, card
)
Convert proposed claims into a flat TSV-like DataFrame for export.
Parameters:
-
claims(list[Claim]) –Proposed claims (typically from propose_claims_llm).
-
distilled_with_modules(DataFrame) –Distilled evidence table used to enrich exported rows with term metadata.
-
card(SampleCard) –Sample card providing context values (export columns).
Returns:
-
DataFrame–Row-wise export with fields including: claim_id, entity, direction, context_keys, term_uid, module_id, gene_ids, term_ids, gene_set_hash, and serialized claim_json.
Notes
Context VALUES are exported as columns for convenience, but MUST NOT be baked into identity (claim_id / gene_set_hash).
propose_claims_llm ¶
propose_claims_llm(
*,
distilled_with_modules,
card,
backend,
k,
seed=None,
outdir=None,
artifact_tag=None,
)
Propose claims via an LLM and write audit-grade artifacts.
Parameters:
-
distilled_with_modules(DataFrame) –Distilled evidence table with module information (or sufficient columns to derive term_uid).
Must contain:
- term_uid OR (source, term_id)
- term_id, term_name, source
Optional:
- module_id, gene_set_hash
- evidence_genes / evidence_genes_str / gene_ids_suggest
- keep_term, term_survival, stat, context_score
-
card(SampleCard) –Sample card providing prompt context and contract keys.
-
backend(BaseLLMBackend) –LLM backend adapter.
-
k(int) –Target number of claims.
-
seed(int or None, default:None) –Optional seed (best-effort; may be ignored).
-
outdir(str or None, default:None) –Output directory for artifacts. If None, no artifacts are written.
-
artifact_tag(str or None, default:None) –Optional tag to avoid overwriting per-call artifacts.
Returns:
-
LLMClaimResult–Claims and metadata. On failure, claims may be empty and used_fallback is True.
Raises:
-
ValueError–If required columns are missing.
-
RuntimeError–If LLM is required by contract and call/validation fails.
Notes
Artifacts (when outdir is set):
- llm_claims.prompt.json
- llm_claims.candidates.json
- llm_claims.raw.json
- llm_claims.meta.json
Plus tagged variants when artifact_tag is provided.
C2) Mechanical audit (decider)¶
Assigns PASS/ABSTAIN/FAIL with precedence (FAIL > ABSTAIN > PASS) using predefined audit gates. Produces standardized reason codes and audit logs.
llm_pathway_curator.audit ¶
audit_claims ¶
audit_claims(claims, distilled, card, *, tau=None)
Mechanically audit claims against distilled evidence and sample context.
Parameters:
-
claims(DataFrame) –Claims table. Must include claim_json with Claim schema JSON.
-
distilled(DataFrame) –Distilled evidence table. Must provide term linkage via term_uid or (source, term_id). Evidence genes are read from evidence_genes or evidence_genes_str. Stability uses term_survival when available.
-
card(SampleCard) –Sample card providing audit knobs and gate modes.
-
tau(float or None, default:None) –Override stability tau. If None, uses card.audit_tau().
Returns:
-
DataFrame–Audited claims with status, reasons, and audit notes.
Raises:
-
ValueError–If distilled cannot provide term linkage (missing required columns).
Notes
Status priority is: FAIL > ABSTAIN > PASS.
Major checks:
- Linkage: term_id -> term_uid resolution; reject unknown/ambiguous terms.
- Evidence identity: gene_set_hash match against computed union evidence genes.
- Stability: term-level survival aggregation (min across referenced terms).
- Under-support: minimum union evidence genes.
- Hub-bridge: abstain when evidence is dominated by hub genes.
- Context gate: uses claim schema context review, with optional proxy fallback.
- Stress probes: optional internal dropout and contradiction probes and/or external stress columns; treated as ABSTAIN (inconclusive), not FAIL.
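The precedence rule can be illustrated with a minimal sketch. The helper name `resolve_status` and the idea of a flat list of per-gate outcomes are hypothetical and not part of the package; only the ordering FAIL > ABSTAIN > PASS comes from the notes above.

```python
# Hypothetical helper: combine per-gate outcomes under FAIL > ABSTAIN > PASS.
_PRIORITY = {"PASS": 0, "ABSTAIN": 1, "FAIL": 2}


def resolve_status(gate_statuses):
    """Return the highest-priority status among the gate outcomes.

    An empty list resolves to PASS (an assumption of this sketch).
    """
    final = "PASS"
    for status in gate_statuses:
        if status not in _PRIORITY:
            raise ValueError(f"unknown status: {status!r}")
        if _PRIORITY[status] > _PRIORITY[final]:
            final = status
    return final
```

Under this rule a single FAIL from any gate decides the claim, and an ABSTAIN can only be overridden by a FAIL.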
llm_pathway_curator.audit_reasons ¶
is_abstain_reason ¶
is_abstain_reason(code)
Check whether a reason code is an ABSTAIN reason.
Parameters:
-
code(str) –Reason code string.
Returns:
-
bool–True if code is in ABSTAIN_REASONS, otherwise False.
Notes
ABSTAIN_REASONS is part of the paper's reproducible output contract and
should remain stable.
is_decision_reason ¶
is_decision_reason(code)
Check whether a string is a valid decision reason code.
This includes the sentinel "ok" as well as all known FAIL/ABSTAIN reason codes.
Parameters:
-
code(str) –Decision reason code.
Returns:
-
bool–True if code is "ok" or is included in ALL_REASONS, otherwise False.
is_fail_reason ¶
is_fail_reason(code)
Check whether a reason code is a FAIL reason.
Parameters:
-
code(str) –Reason code string.
Returns:
-
bool–True if code is in FAIL_REASONS, otherwise False.
Notes
FAIL_REASONS is part of the paper's reproducible output contract and
should remain stable.
is_known_reason ¶
is_known_reason(code)
Check whether a reason code is known by this module.
Parameters:
-
code(str) –Reason code string.
Returns:
-
bool–True if code is in ALL_REASONS, otherwise False.
Notes
ALL_REASONS excludes "ok" by design. Use is_decision_reason() when
you want to accept the "ok" sentinel.
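The difference between the two predicates is easiest to see side by side. The sketch below uses placeholder reason codes (the real FAIL_REASONS and ABSTAIN_REASONS are not reproduced here) and assumes ALL_REASONS is the union of the two sets.

```python
# Placeholder codes for illustration only; not the package's real reason codes.
FAIL_REASONS = frozenset({"example_fail_code"})
ABSTAIN_REASONS = frozenset({"example_abstain_code"})
# Assumption: ALL_REASONS is the union of both sets and excludes "ok".
ALL_REASONS = FAIL_REASONS | ABSTAIN_REASONS


def is_known_reason(code):
    """True only for FAIL/ABSTAIN reason codes; "ok" is rejected."""
    return code in ALL_REASONS


def is_decision_reason(code):
    """True for the "ok" sentinel and for every known reason code."""
    return code == "ok" or code in ALL_REASONS
```

A validator for a reason column that may hold "ok" on PASS rows would therefore use is_decision_reason, while a check that a FAIL or ABSTAIN row carries a real reason would use is_known_reason.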
C3) Reporting (decision-grade outputs)¶
Writes decision objects (report.jsonl / report.md) and renders audit logs with provenance.
llm_pathway_curator.report ¶
write_report ¶
write_report(audit_log, distilled, card, outdir)
Write a human-facing markdown report and TSV artifacts.
Outputs
- out/report.md (human-facing summary)
- out/audit_log.tsv (canonicalized audit log)
- out/distilled.tsv (stringified distilled evidence table)
- out/risk_coverage.tsv (optional; when calibration functions exist)
Parameters:
-
audit_log(DataFrame) –Audit log DataFrame containing PASS/ABSTAIN/FAIL outcomes and supporting fields.
-
distilled(DataFrame) –Distilled evidence table DataFrame.
-
card(SampleCard) –SampleCard providing analysis context (condition/tissue/etc.).
-
outdir(str) –Output directory path.
Returns:
-
None–
Notes
- This function does NOT write report.jsonl. JSONL export is explicit via write_report_jsonl(...).
- Gene symbol mapping in this report is DISPLAY-ONLY: it does not affect auditing or evidence identity.
- The report remains best-effort and will fall back to a minimal report if required decision columns are missing.
write_report_jsonl ¶
write_report_jsonl(
audit_log,
card,
outdir,
*,
run_id,
method=None,
tau=None,
condition=None,
comparison=None,
cancer=None,
disease=None,
)
Write an audit-grade JSONL report artifact (out/report.jsonl).
This export is designed to be robust and reproducible:
- Accepts claim_json or common fallbacks as the payload source.
- If typed Claim validation fails, emits a minimal stub instead of
crashing.
- Missing metric columns do not crash the export (nulls are emitted).
Parameters:
-
audit_log(DataFrame) –Audit log DataFrame. Required columns:
- status
- claim JSON payload column (one of: claim_json, claim_json_str, claim_json_raw). If missing, the payload is synthesized from audit-log columns when possible.
-
card(SampleCard) –SampleCard used to supply context defaults and optional metadata.
-
outdir(str) –Output directory path.
-
run_id(str) –Run identifier string. If empty, a UTC timestamp is used.
-
method(str | None, default:None) –Method label. Default is "llm-pathway-curator".
-
tau(float | None, default:None) –Tau value to store in the JSONL. If None, resolves from card.
-
condition(str | None, default:None) –Optional override for the condition label stored in JSONL.
-
comparison(str | None, default:None) –Optional override for the comparison label stored in JSONL.
-
cancer(str | None, default:None) –Backward-compatible alias for condition (discouraged for new use).
-
disease(str | None, default:None) –Backward-compatible alias for condition (discouraged for new use).
Returns:
-
Path–Path to the written report.jsonl.
Raises:
-
ValueError–If required columns are missing and the claim payload cannot be synthesized.
Notes
- This function does not write report.md. Use write_report for the human-facing markdown report.
- Developer-only metadata can be enabled via LLMPATH_REPORT_INCLUDE_DEV_META.
Backends (proposal-only LLM)¶
LLM backends are used only for proposal steps (representative selection + typing) when enabled. Backends should support deterministic settings where possible and persist prompt/raw/meta artifacts.
llm_pathway_curator.backends ¶
BaseLLMBackend ¶
Bases: ABC
Backend-agnostic LLM interface.
This class defines a minimal contract for generating text or JSON strings.
Contract
Input prompt : str
Output
json_mode=False
Returns a single string (free-form). Implementations may return a
human-readable error string on failure.
json_mode=True
Must return either:
(a) a valid JSON string parseable by json.loads, or
(b) a standardized soft error JSON string:
{"error": {"message": "...", "type": "...", "retryable": true/false}}
Notes
Convenience aliases are provided (invoke, call, complete, chat, and
*_json helpers). Subclasses should implement generate.
call ¶
call(prompt, **kwargs)
Alias for invoke.
Parameters:
-
prompt(str) –Input prompt string.
-
**kwargs(Any, default:{}) –Optional keyword arguments.
Returns:
-
str–Model output string.
chat ¶
chat(messages, **kwargs)
Best-effort chat wrapper.
Parameters:
-
messages(Any) –Chat-like messages. Typically a list of dicts or strings. If a list is provided, the last element's "content" field (if dict) is used as prompt.
-
**kwargs(Any, default:{}) –Optional keyword arguments passed to invoke.
Returns:
-
str–Model output string.
Notes
This is intentionally lightweight and is not a full chat protocol
implementation. It extracts a prompt and delegates to invoke.
chat_json ¶
chat_json(prompt, **kwargs)
Generate JSON output from a prompt (chat-style helper).
Parameters:
-
prompt(str) –Input prompt string.
-
**kwargs(Any, default:{}) –Optional keyword arguments (ignored except for future compatibility).
Returns:
-
str–JSON string or standardized soft error JSON string.
complete ¶
complete(prompt, **kwargs)
Alias for invoke.
Parameters:
-
prompt(str) –Input prompt string.
-
**kwargs(Any, default:{}) –Optional keyword arguments.
Returns:
-
str–Model output string.
complete_json ¶
complete_json(prompt, **kwargs)
Generate JSON output from a prompt (completion-style helper).
Parameters:
-
prompt(str) –Input prompt string.
-
**kwargs(Any, default:{}) –Optional keyword arguments (ignored except for future compatibility).
Returns:
-
str–JSON string or standardized soft error JSON string.
generate
abstractmethod
¶
generate(prompt, json_mode=False)
Generate a completion for a given prompt.
Parameters:
-
prompt(str) –Input prompt string.
-
json_mode(bool, default:False) –If True, the backend must return a JSON string (or a standardized soft error JSON). If False, free-form text is allowed.
Returns:
-
str–Model output. See class-level contract for json_mode behavior.
Raises:
-
NotImplementedError–If the backend does not implement this method.
generate_json ¶
generate_json(prompt, **kwargs)
Generate JSON output from a prompt (explicit helper).
Parameters:
-
prompt(str) –Input prompt string.
-
**kwargs(Any, default:{}) –Optional keyword arguments (ignored except for future compatibility).
Returns:
-
str–JSON string or standardized soft error JSON string.
invoke ¶
invoke(prompt, **kwargs)
Invoke the backend with a prompt (alias for generate).
Parameters:
-
prompt(str) –Input prompt string.
-
**kwargs(Any, default:{}) –Optional keyword arguments. json_mode is recognized.
Returns:
-
str–Model output string.
json ¶
json(prompt, **kwargs)
Alias for JSON generation helpers.
Parameters:
-
prompt(str) –Input prompt string.
-
**kwargs(Any, default:{}) –Optional keyword arguments.
Returns:
-
str–JSON string or standardized soft error JSON string.
GeminiBackend ¶
GeminiBackend(
api_key,
model_name="models/gemini-2.0-flash",
temperature=0.0,
)
Bases: BaseLLMBackend
Google Gemini backend via google-generativeai.
Parameters:
-
api_key(str) –Gemini API key.
-
model_name(str, default:'models/gemini-2.0-flash') –Gemini model identifier (e.g., "models/gemini-2.0-flash").
-
temperature(float, default:0.0) –Sampling temperature.
Notes
- In json_mode, response is requested with MIME type "application/json" and validated. Non-JSON output is converted to standardized soft error JSON.
Initialize the Gemini backend.
Parameters:
-
api_key(str) –Gemini API key.
-
model_name(str, default:'models/gemini-2.0-flash') –Gemini model identifier.
-
temperature(float, default:0.0) –Sampling temperature.
Raises:
-
ImportError–If google-generativeai is not installed.
generate ¶
generate(prompt, json_mode=False)
Generate a completion using Gemini.
Parameters:
-
prompt(str) –Input prompt string.
-
json_mode(bool, default:False) –If True, attempts to enforce JSON output and validates with json.loads.
Returns:
-
str–Free-form text (json_mode=False), or a JSON string / standardized soft error JSON (json_mode=True).
LocalLLMBackend ¶
Bases: BaseLLMBackend
Local/offline backend stub.
This backend does not perform real generation. It exists to support offline workflows and testing paths.
Notes
- In json_mode, returns a standardized soft error JSON payload.
- In text mode, returns a human-readable placeholder string.
generate ¶
generate(prompt, json_mode=False)
Return a placeholder response (local/offline stub).
Parameters:
-
prompt(str) –Input prompt string (ignored).
-
json_mode(bool, default:False) –If True, returns standardized soft error JSON.
Returns:
-
str–Placeholder text or standardized soft error JSON.
OllamaBackend ¶
OllamaBackend(
host=None,
model_name=None,
temperature=None,
timeout=None,
)
Bases: BaseLLMBackend
Ollama backend using HTTP API (/api/generate) via urllib.
Parameters:
-
host(str | None, default:None) –Ollama server base URL (e.g., "http://ollama:11434").
-
model_name(str | None, default:None) –Ollama model name (e.g., "llama3.1:8b").
-
temperature(float | None, default:None) –Sampling temperature.
-
timeout(float | None, default:None) –Legacy single timeout (seconds) applied to both connect/read timeouts.
Notes
- urllib accepts a single timeout value. This implementation stores both connect/read timeouts but uses read_timeout for urllib's timeout.
- In json_mode, payload includes "format": "json" and output is validated. Non-JSON output is converted to standardized soft error JSON.
Initialize the Ollama backend.
Parameters:
-
host(str | None, default:None) –Base URL for Ollama server. If None, falls back to env defaults.
-
model_name(str | None, default:None) –Model name. If None, falls back to env defaults.
-
temperature(float | None, default:None) –Sampling temperature. If None, falls back to env default.
-
timeout(float | None, default:None) –Legacy single timeout applied to both connect/read.
Notes
Timeout resolution supports:
- New envs: LPC_OLLAMA_CONNECT_TIMEOUT / LLMPATH_OLLAMA_CONNECT_TIMEOUT, LPC_OLLAMA_READ_TIMEOUT / LLMPATH_OLLAMA_READ_TIMEOUT
- Legacy env: LPC_OLLAMA_TIMEOUT / LLMPATH_OLLAMA_TIMEOUT
generate ¶
generate(prompt, json_mode=False)
Generate a completion using Ollama /api/generate.
Parameters:
-
prompt(str) –Input prompt string.
-
json_mode(bool, default:False) –If True, requests JSON output and validates with json.loads.
Returns:
-
str–Free-form text (json_mode=False), or a JSON string / standardized soft error JSON (json_mode=True).
Notes
- Adaptive read-timeout escalation is applied on timeout errors: read_timeout *= factor up to a max, for a limited number of escalations.
- connect_timeout is stored for metadata/documentation only and is not used by urllib (single-timeout limitation).
OpenAIBackend ¶
OpenAIBackend(
api_key, model_name="gpt-4o", temperature=0.0, seed=42
)
Bases: BaseLLMBackend
OpenAI backend using the openai Python SDK (chat completions).
Parameters:
-
api_key(str) –OpenAI API key.
-
model_name(str, default:'gpt-4o') –Model name (e.g., "gpt-4o").
-
temperature(float, default:0.0) –Sampling temperature.
-
seed(int, default:42) –Seed used when supported by the API/model. If seeding fails, a fallback call without seed is attempted.
Notes
- In json_mode, response_format={"type": "json_object"} is used and output is validated. Non-JSON output is converted to standardized soft error JSON.
Initialize the OpenAI backend.
Parameters:
-
api_key(str) –OpenAI API key.
-
model_name(str, default:'gpt-4o') –Model name.
-
temperature(float, default:0.0) –Sampling temperature.
-
seed(int, default:42) –Seed value for deterministic sampling when supported.
Raises:
-
ImportError–If the openai package is not installed.
generate ¶
generate(prompt, json_mode=False)
Generate a completion using OpenAI chat completions.
Parameters:
-
prompt(str) –Input prompt string.
-
json_mode(bool, default:False) –If True, requests JSON object output and validates with json.loads.
Returns:
-
str–Free-form text (json_mode=False), or a JSON string / standardized soft error JSON (json_mode=True).
Notes
If the seeded call fails, a second call without seed is attempted.
get_backend_from_env ¶
get_backend_from_env(seed=None)
Create an LLM backend based on environment variables.
Parameters:
-
seed(int | None, default:None) –Optional seed for backends that support seeded generation.
Returns:
-
BaseLLMBackend–Instantiated backend.
Raises:
-
KeyError–If a required API key is missing for the selected backend.
-
ValueError–If the backend name is unknown.
Notes
Backend selection envs (first non-empty wins):
- LPC_BACKEND, BACKEND, LLMPATH_BACKEND
Supported backends:
- "openai": uses OpenAI chat completions
- "gemini": uses Google Generative AI
- "ollama": uses Ollama HTTP API
- "local" / "offline": stub backend (no real generation)
Compatibility:
- Both "LLMPATH_" and "LPC_" prefixes are accepted for most settings.
- For overlapping keys, LPC_ is preferred over vendor env, then LLMPATH_.
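The "first non-empty wins" rule for backend selection can be sketched as a small lookup. The helper `first_non_empty_env` is hypothetical, and treating whitespace-only values as empty is an assumption of this sketch; only the variable names and their order come from the notes above.

```python
import os

# Order documented above: first non-empty value wins.
BACKEND_ENV_ORDER = ("LPC_BACKEND", "BACKEND", "LLMPATH_BACKEND")


def first_non_empty_env(names, env=None, default=None):
    """Return the first non-empty value among `names`, else `default`.

    `env` defaults to os.environ; passing a dict makes the lookup testable.
    """
    env = os.environ if env is None else env
    for name in names:
        value = (env.get(name) or "").strip()  # assumption: blank counts as unset
        if value:
            return value
    return default
```

For example, with LPC_BACKEND unset, BACKEND=ollama and LLMPATH_BACKEND=openai, this lookup selects "ollama".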
retry_with_backoff ¶
retry_with_backoff(retries=3, backoff_in_seconds=1.0)
Decorator factory for exponential backoff retries on backend calls.
Parameters:
-
retries(int, default:3) –Maximum number of retry attempts (not counting the initial call).
-
backoff_in_seconds(float, default:1.0) –Base backoff duration in seconds. Sleep time grows as: backoff_in_seconds * 2**attempt, with small jitter.
Returns:
-
callable–A decorator that wraps a function and retries under certain conditions.
Retry conditions
- Retryable exceptions inferred by message heuristics (status/keywords).
- Legacy plain-text soft errors: "OpenAI Error: ...", "Gemini Error: ...", "Ollama Error: ..."
- Standardized soft error JSON payloads: {"error": {"message": "...", "type": "...", "retryable": ...}}
- When json_mode=True: invalid JSON outputs are treated as parse failures and retried at most once.
Notes
json_mode is inferred from kwargs (json_mode=) or from positional ABI:
(self, prompt, json_mode=False) when present.
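The backoff schedule can be sketched as a generic decorator. This is a simplified illustration, not the package's implementation: `retry_sketch` retries on every exception instead of using message heuristics or soft-error inspection, the 10% jitter bound is an assumption, and the injectable `sleep` parameter exists only to make the sketch testable.

```python
import functools
import random
import time


def retry_sketch(retries=3, backoff_in_seconds=1.0, sleep=time.sleep):
    """Retry a call, sleeping backoff_in_seconds * 2**attempt plus jitter."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # One initial call plus up to `retries` retries.
            for attempt in range(retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == retries:
                        raise  # retry budget exhausted: surface the last error
                    delay = backoff_in_seconds * 2**attempt
                    sleep(delay + random.uniform(0.0, 0.1 * delay))

        return wrapper

    return decorator
```

With the defaults, the waits before the three retries are roughly 1, 2, and 4 seconds.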
Adapters (Input → EvidenceTable)¶
Adapters normalize upstream enrichment outputs into the EvidenceTable contract. They are intentionally conservative: preserve evidence identity (term × genes), avoid destructive parsing, and keep TSV round-trips stable.
llm_pathway_curator.adapters.fgsea ¶
FgseaAdapterConfig
dataclass
¶
FgseaAdapterConfig(
source_name="fgsea",
require_genes=True,
keep_pval=True,
term_id_mode="raw",
drop_na_qval=True,
sort_output=True,
)
Configuration for converting an fgsea result table to EvidenceTable.
Attributes:
-
source_name(str) –Value to populate the EvidenceTable source column.
-
require_genes(bool) –If True, raise an error when leadingEdge yields no genes.
-
keep_pval(bool) –If True and pval exists, store it separately (does not replace qval).
-
term_id_mode(str) –Term identifier policy.
"raw": term_id == pathway (recommended; paper-aligned)
"prefixed_hashed": term_id == "FGSEA:<slug>|<hash>" (legacy)
-
drop_na_qval(bool) –If True, drop rows where qval (padj) is missing.
-
sort_output(bool) –If True, sort output deterministically by qval ascending, then abs(stat) descending.
Notes
Defaults are chosen to match the paper-side EvidenceTable behavior: human-readable term IDs, stable ordering, and dropping NA q-values.
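The ordering applied by sort_output (qval ascending, then abs(stat) descending) can be sketched on plain dictionaries. The function `sort_evidence_rows` and the sample rows are illustrative only; the adapter itself works on DataFrames, and how it breaks full ties is not documented here.

```python
def sort_evidence_rows(rows):
    """Order rows by qval ascending, then abs(stat) descending.

    Assumes every row has numeric "qval" and "stat" (NA q-values already
    dropped). Python's sort is stable, so full ties keep their input order.
    """
    return sorted(rows, key=lambda row: (row["qval"], -abs(row["stat"])))


# Made-up rows for illustration.
rows = [
    {"term_id": "A", "qval": 0.05, "stat": 1.2},
    {"term_id": "B", "qval": 0.01, "stat": -2.5},
    {"term_id": "C", "qval": 0.01, "stat": 3.0},
]
ordered_ids = [row["term_id"] for row in sort_evidence_rows(rows)]
```

Here C precedes B because both share the smallest q-value and C has the larger absolute statistic; A comes last.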
read_fgsea_table ¶
read_fgsea_table(path)
Read an fgsea result table from disk.
Supports TSV by default and falls back to delimiter sniffing or whitespace parsing (best-effort).
Parameters:
-
path(str) –Path to an fgsea result file.
Returns:
-
DataFrame–Parsed fgsea table.
fgsea_to_evidence_table ¶
fgsea_to_evidence_table(fgsea_df, *, config=None)
Convert an fgsea result table to the EvidenceTable contract.
Parameters:
-
fgsea_df(DataFrame) –fgsea results table. Must contain (after aliasing) pathway and leadingEdge plus at least one statistic column among NES/ES.
-
config(FgseaAdapterConfig or None, default:None) –Conversion configuration. If None, defaults are used.
Returns:
-
DataFrame–EvidenceTable with core columns:
term_id: str
term_name: str
source: str
stat: float
qval: float or NA (from padj only)
direction: {"up", "down", "na"}
evidence_genes: list[str]
Plus minimal provenance fields (e.g., pval, term_id_h).
Raises:
-
ValueError–If required columns are missing, if no stat column is present, if pathway is empty, if the stat column is non-numeric, or if require_genes=True and evidence genes are empty.
Notes
- Only padj is treated as q-value (FDR) and mapped to qval. pval is stored separately when present and enabled.
- Output ordering can be stabilized via sort_output.
convert_fgsea_table_to_evidence_tsv ¶
convert_fgsea_table_to_evidence_tsv(
in_path, out_path, *, config=None
)
Read an fgsea table, convert it, and write an EvidenceTable TSV.
This is a convenience wrapper around:
read_fgsea_table -> fgsea_to_evidence_table -> TSV write.
Parameters:
-
in_path(str) –Path to the fgsea result file.
-
out_path(str) –Destination path for the EvidenceTable TSV.
-
config(FgseaAdapterConfig or None, default:None) –Conversion configuration. If None, defaults are used.
Returns:
-
DataFrame–EvidenceTable as written, with evidence_genes serialized for TSV.
Raises:
-
ValueError–Propagated from fgsea_to_evidence_table on invalid inputs.
llm_pathway_curator.adapters.metascape ¶
MetascapeAdapterConfig
dataclass
¶
MetascapeAdapterConfig(
source_name="metascape",
sheet_name="Enrichment",
include_summary=False,
prefer_symbols=True,
strict_qval=False,
drop_na_qval=True,
)
Configuration for converting Metascape exports to an EvidenceTable.
Attributes:
-
source_name(str) –Value to populate the EvidenceTable source column.
-
sheet_name(str) –Excel sheet to read when the input is .xlsx/.xls.
-
include_summary(bool) –Whether to include rows whose GroupID ends with "_Summary". The default is False to avoid summary rows being treated as evidence.
-
prefer_symbols(bool) –Prefer the Symbols column over Genes when both exist.
-
strict_qval(bool) –If True, raise an error when Log(q-value) is present but no valid q-values can be reconstructed.
-
drop_na_qval(bool) –If True, drop rows whose reconstructed q-value is missing.
read_metascape_table ¶
read_metascape_table(path, *, sheet_name='Enrichment')
Read a Metascape export file into a DataFrame.
Supports Excel exports (.xlsx/.xls) and delimited text inputs.
For Excel, the Enrichment sheet is the canonical input.
Parameters:
-
path(str) –Path to a Metascape export file.
-
sheet_name(str, default:'Enrichment') –Sheet to read for Excel inputs. Default is "Enrichment".
Returns:
-
DataFrame–Parsed Metascape table.
metascape_to_evidence_table ¶
metascape_to_evidence_table(metascape_df, *, config=None)
Convert a Metascape Enrichment table to the EvidenceTable contract.
The resulting EvidenceTable is term-centric (one row per term) and carries evidence genes suitable for downstream factorization.
Parameters:
-
metascape_df(DataFrame) –Metascape "Enrichment" sheet as a DataFrame.
-
config(MetascapeAdapterConfig or None, default:None) –Conversion configuration. If None, defaults are used.
Returns:
-
DataFrame–EvidenceTable with (at minimum) these columns:
term_id: str
term_name: str
source: str
stat: float
qval: float
direction: str (Metascape ORA yields "na")
evidence_genes: list[str]
Plus provenance/optional columns (e.g., group_id, is_summary).
Raises:
-
ValueError–If required columns are missing, if evidence genes are empty for any row, if Term/Description are empty, or if statistic columns are non-numeric.
Notes
- q-values are reconstructed from Log(q-value) using sign inference.
- stat is made monotone-positive by taking abs(...) of the chosen log column, for ranking and paper-friendly plotting.
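One plausible reading of the q-value reconstruction is sketched below. It assumes the Log(q-value) column is a base-10 logarithm that some exports store as log10(q) (negative) and others as -log10(q) (positive), so the sign is discarded before exponentiating. The adapter's actual sign-inference logic is not documented here, and `qval_from_log_q` is a hypothetical name.

```python
import math


def qval_from_log_q(log_q):
    """Map a Log(q-value) entry to q in [0, 1], or None if unusable.

    Assumption: |log_q| equals -log10(q) regardless of the stored sign.
    """
    if log_q is None:
        return None
    try:
        x = float(log_q)
    except (TypeError, ValueError):
        return None
    if not math.isfinite(x):
        return None
    return 10.0 ** (-abs(x))
```

Under the same assumption, the monotone-positive stat described above would simply be abs(x), so a larger stat always corresponds to a smaller q-value.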
convert_metascape_table_to_evidence_tsv ¶
convert_metascape_table_to_evidence_tsv(
in_path, out_path, *, config=None
)
Read a Metascape export, convert it, and write an EvidenceTable TSV.
This is a convenience wrapper around:
read_metascape_table -> metascape_to_evidence_table -> TSV write.
Parameters:
-
in_path(str) –Path to the Metascape export (Excel or text).
-
out_path(str) –Destination path for the EvidenceTable TSV.
-
config(MetascapeAdapterConfig or None, default:None) –Conversion configuration. If None, defaults are used.
Returns:
-
DataFrame–EvidenceTable as written, with evidence_genes serialized for TSV.
Raises:
-
ValueError–Propagated from metascape_to_evidence_table when inputs are invalid or evidence cannot be constructed.
Calibration (risk–coverage)¶
Utilities for selecting an operating point (e.g., τ) along the risk–coverage trade-off. This stage does not change evidence identity; it tunes conservativeness.
llm_pathway_curator.calibrate ¶
CalibrationResult
dataclass
¶
CalibrationResult(method, params)
Calibration result object.
Attributes:
-
method({'none', 'temperature', 'isotonic'}) –Calibration method identifier.
-
params(dict[str, Any]) –Method parameters:
- temperature: {"T": float}
- isotonic: {"model": fitted_model}
- none: {}
Notes
This object is serializable only when params are JSON-safe. (isotonic model objects are not JSON-serializable by default.)
apply ¶
apply(probs)
Apply the calibration mapping to probability-like scores.
Parameters:
-
probs(ndarray) –Probability array.
Returns:
-
ndarray–Calibrated probabilities clipped to (0, 1).
Raises:
-
ValueError–If method is unknown or required params are missing.
apply_isotonic ¶
apply_isotonic(model, probs)
Apply a fitted isotonic regression model to probabilities.
Parameters:
-
model(Any) –Fitted isotonic regression model with predict.
-
probs(ndarray) –Probability array.
Returns:
-
ndarray–Calibrated probabilities (float array).
apply_temperature_scaling ¶
apply_temperature_scaling(probs, T)
Apply temperature scaling to probability-like scores in [0, 1].
Parameters:
-
probs(ndarray) –1D probability-like array.
-
T(float) –Temperature parameter (must be finite and > 0).
Returns:
-
ndarray–Calibrated probabilities clipped to (0, 1).
Raises:
-
ValueError–If T is invalid.
calibrate_probs ¶
calibrate_probs(
probs,
y_true,
*,
method="temperature",
allow_unlabeled=False,
)
Stage-2 calibration entry point.
Parameters:
-
probs(ndarray) –1D probability-like array in [0, 1].
-
y_true(ndarray or None) –Optional binary labels in {0, 1}.
-
method(('none', 'temperature', 'isotonic'), default:"temperature") –Calibration method. Default is "temperature".
-
allow_unlabeled(bool, default:False) –If True and y_true is None, returns a no-op calibration ("none"). If False and y_true is None, refuses to fit.
Returns:
-
CalibrationResult–Calibration mapping object.
Raises:
-
ValueError–If inputs are invalid or fitting is requested without labels.
Notes
Design intent:
- Keep dependencies optional (no scipy).
- Temperature scaling uses deterministic grid search.
compute_counts ¶
compute_counts(status)
Count PASS/FAIL/ABSTAIN/TOTAL from a status series (strict validation).
Parameters:
-
status(Series) –Status values. Must normalize into {"PASS", "ABSTAIN", "FAIL"}.
Returns:
-
dict[str, int]–Counts with keys: {"PASS", "FAIL", "ABSTAIN", "TOTAL"}.
Raises:
-
ValueError–If unknown status values are present (strict spec validation).
extract_probs_and_labels ¶
extract_probs_and_labels(
audit_log, *, prob_col, label_col=None
)
Extract probability-like scores and optional strict binary labels.
Parameters:
-
audit_log(DataFrame) –Audit log table.
-
prob_col(str) –Column name containing probabilities/scores.
-
label_col(str or None, default:None) –Column name containing labels. Only exact {0,1} accepted.
Returns:
-
tuple[ndarray, ndarray or None]–(probs, labels). Labels are returned as int array when provided.
Raises:
-
ValueError–If columns are missing or values are non-numeric/non-finite, or labels are not exactly binary {0,1}.
fit_isotonic_regression ¶
fit_isotonic_regression(probs, y_true)
Fit isotonic regression mapping probs -> calibrated probs.
Parameters:
-
probs(ndarray) –1D probability-like array in [0, 1].
-
y_true(ndarray) –1D binary labels in {0, 1}.
Returns:
-
Any–Fitted isotonic regression model (scikit-learn object).
Raises:
-
ImportError–If scikit-learn is not available.
-
ValueError–If inputs are invalid.
fit_temperature_scaling ¶
fit_temperature_scaling(
probs, y_true, *, grid=(0.25, 10.0, 80)
)
Fit a single temperature T > 0 by minimizing NLL (binary labels).
Model
p' = sigmoid(logit(p) / T)
Parameters:
-
probs(ndarray) –1D probability-like array in [0, 1].
-
y_true(ndarray) –1D binary labels in {0, 1}.
-
grid(tuple[float, float, int], default:(0.25, 10.0, 80)) –(t_min, t_max, n_grid). Search is performed in log-space.
Returns:
-
float–Best temperature T, clipped to a conservative range [0.25, 10.0].
Raises:
-
ValueError–If inputs are invalid or the grid is invalid.
Notes
No scipy dependency: uses deterministic grid search.
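The model and the log-space grid search can be sketched with the standard library alone. This is an illustration under assumptions, not the package's code: the function names, the clipping epsilon of 1e-6, and the use of mean NLL are choices made for this sketch.

```python
import math


def _clip(p, eps=1e-6):
    """Keep probabilities away from 0 and 1 so logit/log stay finite."""
    return min(max(p, eps), 1.0 - eps)


def apply_temperature(p, T):
    """p' = sigmoid(logit(p) / T)."""
    p = _clip(p)
    z = math.log(p / (1.0 - p)) / T
    return 1.0 / (1.0 + math.exp(-z))


def nll(probs, labels, T):
    """Mean binary negative log-likelihood after scaling by T."""
    total = 0.0
    for p, y in zip(probs, labels):
        q = _clip(apply_temperature(p, T))
        total -= y * math.log(q) + (1 - y) * math.log(1.0 - q)
    return total / len(probs)


def fit_temperature_grid(probs, labels, grid=(0.25, 10.0, 80)):
    """Pick the T on a log-spaced grid that minimizes NLL."""
    t_min, t_max, n_grid = grid
    if not (0 < t_min < t_max) or n_grid < 2:
        raise ValueError("invalid grid")
    step = (math.log(t_max) - math.log(t_min)) / (n_grid - 1)
    candidates = [math.exp(math.log(t_min) + i * step) for i in range(n_grid)]
    return min(candidates, key=lambda T: nll(probs, labels, T))
```

A fitted T above 1 softens overconfident scores toward 0.5, while T below 1 sharpens them. Because the grid is fixed, repeated runs on the same data return the same T.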
risk_coverage_curve ¶
risk_coverage_curve(
df,
*,
score_col,
status_col="status",
decision_thresholds=None,
pass_if_score_ge=True,
promote_abstain=True,
fail_on_degenerate=False,
max_thresholds=200,
)
Build a Risk–Coverage curve by sweeping a PASS threshold.
Parameters:
-
df(DataFrame) –Input table containing score and status columns.
-
score_col(str) –Column name of probability-like or score values.
-
status_col(str, default:'status') –Column name of base status. Default is "status".
-
decision_thresholds(list of float or None, default:None) –Thresholds to sweep. If None, thresholds are derived from scores.
-
pass_if_score_ge(bool, default:True) –If True, PASS when score >= threshold; else PASS when score <= threshold.
-
promote_abstain(bool, default:True) –If True, among non-FAIL items reassign: PASS if threshold satisfied else ABSTAIN. If False, gate only existing PASS -> ABSTAIN below threshold.
-
fail_on_degenerate(bool, default:False) –If True, raise on degenerate score distributions (<=1 unique value).
-
max_thresholds(int, default:200) –Max thresholds when auto-deriving. Must be >= 10.
Returns:
-
DataFrame–One row per threshold with risk/coverage metrics and metadata fields: threshold, score_col, status_col, pass_if_score_ge, promote_abstain.
Raises:
-
ValueError–If required columns are missing, scores are invalid, statuses are invalid, or thresholds are empty/invalid.
Notes
Safety semantics:
- FAIL is never changed.
- ABSTAIN never enters the risk denominator.
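The per-row reassignment at a single threshold, as described by pass_if_score_ge and promote_abstain, can be sketched as follows. `gate_status` is a hypothetical helper that operates on one row; the real function sweeps many thresholds over a DataFrame.

```python
def gate_status(status, score, threshold, *, pass_if_score_ge=True, promote_abstain=True):
    """Reassign one row's status at a given threshold.

    FAIL is never changed. With promote_abstain=True, any non-FAIL row
    becomes PASS when the threshold is satisfied and ABSTAIN otherwise.
    With promote_abstain=False, only an existing PASS can be demoted.
    """
    if status == "FAIL":
        return "FAIL"
    satisfied = score >= threshold if pass_if_score_ge else score <= threshold
    if promote_abstain:
        return "PASS" if satisfied else "ABSTAIN"
    if status == "PASS" and not satisfied:
        return "ABSTAIN"
    return status
```

Applying this to every row at each swept threshold and then computing risk and coverage on the resulting statuses yields one curve point per threshold.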
risk_coverage_from_status ¶
risk_coverage_from_status(status)
Compute spec-safe Risk/Coverage metrics from a status series.
Parameters:
-
status(Series) –Status values in {"PASS", "ABSTAIN", "FAIL"}.
Returns:
-
dict[str, float]–Metrics with explicit denominators:
- coverage_pass_total: PASS / TOTAL
- coverage_decided_total: (PASS + FAIL) / TOTAL
- risk_fail_given_decided: FAIL / (PASS + FAIL)
- risk_fail_total: FAIL / TOTAL
- fail_rate_total: alias of FAIL / TOTAL (kept for backward compatibility)
Also includes count fields as floats: n_pass, n_fail, n_abstain, n_decided, n_total
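The listed metrics follow directly from the three counts. The sketch below reproduces the formulas on a plain list of statuses; `risk_coverage_sketch` is a hypothetical name, count fields are omitted for brevity, and returning NaN for a zero denominator is an assumption of this sketch.

```python
from collections import Counter


def risk_coverage_sketch(statuses):
    """Compute the documented ratios from PASS/ABSTAIN/FAIL labels."""
    counts = Counter(statuses)
    unknown = set(counts) - {"PASS", "ABSTAIN", "FAIL"}
    if unknown:
        raise ValueError(f"unknown status values: {sorted(unknown)}")
    n_pass, n_fail, n_abstain = counts["PASS"], counts["FAIL"], counts["ABSTAIN"]
    n_total = n_pass + n_fail + n_abstain
    n_decided = n_pass + n_fail  # ABSTAIN is excluded from "decided"

    def ratio(num, den):
        return num / den if den else float("nan")  # assumption: NaN when undefined

    return {
        "coverage_pass_total": ratio(n_pass, n_total),
        "coverage_decided_total": ratio(n_decided, n_total),
        "risk_fail_given_decided": ratio(n_fail, n_decided),
        "risk_fail_total": ratio(n_fail, n_total),
        "fail_rate_total": ratio(n_fail, n_total),
    }
```

For 6 PASS, 2 FAIL, and 2 ABSTAIN rows, coverage_pass_total is 0.6 and risk_fail_given_decided is 0.25, since the two ABSTAIN rows are left out of the risk denominator.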
Notes
"decided" = PASS ∪ FAIL (ABSTAIN excluded). FAIL is a negative decision produced by mechanical audits.
Shared utilities (spec-level)¶
Spec-critical helpers for contract stability (NA handling, gene parsing/joining, stable hashes). If you need to compare outputs across versions, this is the layer that prevents drift.
llm_pathway_curator._shared ¶
canonical_sorted_unique ¶
canonical_sorted_unique(xs)
Canonicalize a list of values into sorted unique strings.
Parameters:
-
xs(list of object) –Input values.
Returns:
-
list of str–Sorted unique tokens after trimming and NA filtering.
clean_gene_token ¶
clean_gene_token(g)
Clean a single gene-like token conservatively.
Parameters:
-
g(object) –Gene-like token.
Returns:
-
str–Cleaned token.
Notes
- Trims whitespace and strips simple quote wrappers.
- Removes common list/export wrappers (brackets, trailing separators).
- Does NOT force uppercase (species/ID-system dependent).
dedup_preserve_order ¶
dedup_preserve_order(items)
De-duplicate strings while preserving first occurrence order.
Parameters:
-
items(list of str) –Input tokens.
Returns:
-
list of str–Deduplicated tokens in first-seen order.
Notes
Empty strings are ignored.
excel_force_text ¶
excel_force_text(s)
Prefix a value with a single quote to force Excel to treat it as text.
Parameters:
-
s(object) –Input value.
Returns:
-
str–Excel-safe text representation. Empty input returns "".
excel_safe_ids ¶
excel_safe_ids(x, *, list_sep=ID_JOIN_DELIM)
Convert an ID field into an Excel-safe, TSV-friendly text string.
This helper accepts either scalar or list-like inputs, parses them via
parse_id_list(), joins the IDs with list_sep, and prefixes a single
quote to force Excel "Text" interpretation.
Parameters:
-
x(object) –Scalar or list-like ID field.
-
list_sep(str, default:ID_JOIN_DELIM) –Join delimiter for the ID list. Default is ID_JOIN_DELIM.
Returns:
-
str–Excel-safe text value. Returns "" if the input is NA-like or empty.
hash_gene_set_12hex ¶
hash_gene_set_12hex(genes)
Compute a set-stable gene-set fingerprint (12-hex), preserving case.
Parameters:
-
genes(list of object) –Gene tokens.
Returns:
-
str–12-character lowercase hex fingerprint.
Notes
Policy:
- order-invariant (set-stable)
- clean_gene_token() per token
- no forced uppercasing (species/ID dependent)
hash_gene_set_12hex_upper ¶
hash_gene_set_12hex_upper(genes)
Compute a legacy-compatible gene-set fingerprint (12-hex), uppercasing IDs.
Parameters:
-
genes(list of object) –Gene tokens.
Returns:
-
str–12-character lowercase hex fingerprint.
Notes
Use only when you must match older outputs that case-folded gene IDs.
hash_set_12hex ¶
hash_set_12hex(items)
Compute a generic set-stable fingerprint (12-hex) from a list of items.
Parameters:
-
items(list of object) –Input items.
Returns:
-
str–12-character lowercase hex fingerprint.
Notes
Trims tokens, drops NA-like values, de-duplicates, sorts, then hashes.
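To illustrate what "set-stable" means in practice, here is a rough sketch of the trim, drop-NA, de-duplicate, sort, hash sequence. It is not the library code: the function name, the NA vocabulary, the newline-joined payload, and the UTF-8 encoding are all assumptions, so the digests it produces will not necessarily match those of `hash_set_12hex`.

```python
import hashlib

# Assumption: an illustrative NA vocabulary, not the library's central list.
NA_TOKENS = {"", "na", "nan", "none", "null"}


def set_fingerprint_sketch(items):
    """Order-invariant, duplicate-invariant 12-hex fingerprint of a list."""
    tokens = {str(x).strip() for x in items if x is not None}
    kept = sorted(t for t in tokens if t.lower() not in NA_TOKENS)
    # Assumption: the real payload format is not documented on this page.
    payload = "\n".join(kept)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


a = set_fingerprint_sketch(["TP53", " BRCA1 ", "TP53", None])
b = set_fingerprint_sketch(["BRCA1", "TP53"])
c = set_fingerprint_sketch(["brca1", "tp53"])

print(a == b)  # True: order, duplicates, whitespace, and NA do not matter
print(a == c)  # False: case is preserved, so these are different sets
```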
is_na_scalar ¶
is_na_scalar(x)
Determine whether a value should be treated as NA as a scalar.
This function avoids calling pandas.isna on list-like containers
because it can return array-like results and break boolean contexts.
Parameters:
-
x(object) –Input value.
Returns:
-
bool–True if x is a scalar NA value (or None). Containers return False.
Notes
Strings like "na"/"nan" are not treated as scalar NA here; use
is_na_token() for token-level NA checks.
is_na_token ¶
is_na_token(s)
Check whether a value represents an NA token (case-insensitive).
This is a spec-level helper used across parsing and TSV round-trips. The NA vocabulary is centralized to prevent contract drift.
Parameters:
-
s(object) –Input value.
Returns:
-
bool–True if s is None or its trimmed lowercase string form is in the NA token set.
Notes
This function treats empty strings as NA.
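The difference between the scalar-level and token-level checks is easiest to see side by side. The sketch below is an assumption-laden illustration, not the library code: both function names are hypothetical, and the NA token set shown is a guess at a plausible vocabulary (the actual centralized set is not listed on this page).

```python
import math

# Assumption: illustrative NA vocabulary; the library defines its own.
NA_TOKENS = {"", "na", "nan", "none", "null"}


def is_na_scalar_sketch(x):
    """True only for None or a float NaN; containers and strings are not NA."""
    if x is None:
        return True
    return isinstance(x, float) and math.isnan(x)


def is_na_token_sketch(s):
    """True if the trimmed, lowercased string form is in the NA vocabulary."""
    if s is None:
        return True
    return str(s).strip().lower() in NA_TOKENS


print(is_na_scalar_sketch("nan"))    # False: a string is not a scalar NA
print(is_na_token_sketch(" NaN "))   # True: token-level check is case-insensitive
print(is_na_scalar_sketch([None]))   # False: containers are never scalar NA
print(is_na_token_sketch(""))        # True: empty strings count as NA tokens
```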
join_genes_tsv ¶
join_genes_tsv(genes)
Join gene tokens into a TSV-friendly string.
Parameters:
-
genes(list of object) –Gene tokens.
Returns:
-
str–Genes joined by GENE_JOIN_DELIM.
Notes
Applies clean_gene_token() and drops empty/NA tokens. Does not sort;
preserves input order.
join_id_list_tsv ¶
join_id_list_tsv(ids, *, delim=ID_JOIN_DELIM)
Join generic identifiers into a TSV-friendly string.
The join is stable and order-preserving. This function is intentionally not gene-aware to avoid over-normalization at the spec boundary.
Parameters:
-
ids(list of object) –Identifiers to join. None/empty/NA-like tokens are dropped.
-
delim(str, default:ID_JOIN_DELIM) –Delimiter for joining. Default is ID_JOIN_DELIM.
Returns:
-
str–Joined identifier string.
Notes
- Preserves input order (no sorting).
- Does not apply clean_gene_token().
join_tags ¶
join_tags(tags, *, delim=STRESS_TAG_DELIM)
Join tags into a canonical stress tag string.
Parameters:
-
tags(list of object) –Tag tokens.
-
delim(str, default:STRESS_TAG_DELIM) –Join delimiter. Default is STRESS_TAG_DELIM (comma).
Returns:
-
str–Canonical tag string.
Notes
Trims whitespace, drops empties, and de-duplicates in first-seen order.
looks_like_12hex ¶
looks_like_12hex(x)
Check whether a value is exactly 12 lowercase hex characters.
Parameters:
-
x(object) –Input value.
Returns:
-
bool–True if x matches the 12-hex pattern (lowercase).
make_term_uid ¶
make_term_uid(source, term_id)
Construct a stable term_uid from (source, term_id).
Parameters:
-
source(object) –Term source (e.g., "fgsea", "metascape"). Empty maps to "unknown".
-
term_id(object) –Term identifier. Caller should ensure it is non-empty.
Returns:
-
str–Term UID formatted as "source:term_id".
module_hash_content12 ¶
module_hash_content12(terms, genes)
Compute a module content hash binding both term set and gene set (12-hex).
Parameters:
-
terms(list of object) –Term identifiers.
-
genes(list of object) –Gene tokens.
Returns:
-
str–12-character lowercase hex fingerprint.
Notes
- Terms: canonical_sorted_unique() (no uppercasing)
- Genes: clean_gene_token() + drop NA/empty + sort/dedup (no uppercasing)
- Payload format is stable and explicit to prevent ambiguity.
norm_gene_id_upper ¶
norm_gene_id_upper(g)
Normalize a gene token by applying conservative cleaning and uppercasing.
Parameters:
-
g(object) –Gene token.
Returns:
-
str–Cleaned and uppercased token.
Notes
This is opt-in for legacy compatibility. The default spec policy in this module is to preserve case.
normalize_direction ¶
normalize_direction(x)
Normalize direction vocabulary across schema/distill/audit/select.
Parameters:
-
x(object) –Input scalar.
Returns:
-
str–One of {"up", "down", "na"}.
Notes
This is a lightweight normalizer. Unrecognized values map to "na".
normalize_gate_mode ¶
normalize_gate_mode(x, *, default='note')
Normalize a gate mode to canonical vocabulary: {"off", "note", "hard"}.
Parameters:
-
x(object) –Input value (canonical, synonym, or legacy form).
-
default(str, default:'note') –Default to use when x is empty. If invalid, falls back to "note".
Returns:
-
str–Canonical gate mode: "off", "note", or "hard".
Notes
Accepted synonyms include:
- off: off, none, disable, disabled
- note: note, warn, warning, soft
- hard: hard, strict, abstain, on, enable, enabled
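The synonym table maps naturally onto a lookup dictionary. The following sketch shows one way to do that; it is not the library's implementation. The function name is hypothetical, the lookup is assumed to be case-insensitive, and unrecognized non-empty values are assumed to resolve to the default (this page only specifies the behavior for empty input). Legacy forms are not listed here, so they are omitted.

```python
_SYNONYMS = {
    "off": ("off", "none", "disable", "disabled"),
    "note": ("note", "warn", "warning", "soft"),
    "hard": ("hard", "strict", "abstain", "on", "enable", "enabled"),
}
# Invert the table: each synonym points at its canonical mode.
_LOOKUP = {syn: canon for canon, syns in _SYNONYMS.items() for syn in syns}


def normalize_gate_mode_sketch(x, *, default="note"):
    """Map a gate-mode synonym to one of "off", "note", or "hard"."""
    # An invalid default falls back to "note", as documented.
    fallback = _LOOKUP.get(str(default).strip().lower(), "note")
    token = "" if x is None else str(x).strip().lower()
    # Assumption: unknown non-empty values are treated like empty input.
    return _LOOKUP.get(token, fallback)


print(normalize_gate_mode_sketch("strict"))               # hard
print(normalize_gate_mode_sketch(" WARN "))               # note
print(normalize_gate_mode_sketch(None, default="off"))    # off
print(normalize_gate_mode_sketch("", default="bogus"))    # note
```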
normalize_status_series ¶
normalize_status_series(s)
Normalize a pandas Series of statuses to uppercase strings.
Parameters:
-
s(Series) –Input series.
Returns:
-
Series–Series with string dtype, trimmed and uppercased.
Notes
NA values may become strings (e.g., "nan") after astype(str).
Always validate with validate_status_values() when needed.
normalize_status_str ¶
normalize_status_str(x)
Normalize a status value into canonical uppercase text.
Parameters:
-
x(object) –Input scalar.
Returns:
-
str–Uppercased, trimmed string.
Notes
This function does not validate membership in ALLOWED_STATUSES.
Use validate_status_values() for strict checking.
parse_genes ¶
parse_genes(x)
Parse evidence genes from messy inputs into a list of cleaned tokens.
Parameters:
-
x(object) –Scalar or list-like gene field.
Returns:
-
list of str–Cleaned gene tokens, deduplicated in first-seen order.
Notes
Rules:
- NA scalars -> []
- list/tuple -> cleaned per-token
- set -> sorted for determinism, then cleaned
- string -> split conservatively via split_gene_string()
parse_id_list ¶
parse_id_list(x)
Parse a generic ID field into a list of strings.
This is a tolerant parser for ID-like fields (term IDs, module IDs,
gene IDs when treated as IDs, etc.). It is intentionally separate from
parse_genes(), which is more gene-token-aware.
Parameters:
-
x(object) –Scalar or list-like input.
Returns:
-
list of str–Parsed IDs in deterministic order.
Notes
Policy:
- NA scalars -> []
- list/tuple -> preserve order (dedup)
- set -> sorted for determinism (dedup)
- string -> split on strong delimiters first: ',', ';', '|'
- whitespace split only if all tokens look identifier-like
- drop NA tokens and empties
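The parsing policy can be sketched as a short cascade of type checks. This is an illustration under stated assumptions, not the library code: the function name, the NA vocabulary, and the regular expression that decides what "identifier-like" means are all hypothetical.

```python
import re

# Assumptions: illustrative NA vocabulary and "identifier-like" pattern.
NA_TOKENS = {"", "na", "nan", "none", "null"}
_ID_LIKE = re.compile(r"^[A-Za-z0-9_.:\-]+$")


def _clean_dedup(tokens):
    """Trim, drop NA/empty tokens, and de-duplicate in first-seen order."""
    seen, out = set(), []
    for tok in tokens:
        tok = str(tok).strip()
        if tok.lower() in NA_TOKENS or tok in seen:
            continue
        seen.add(tok)
        out.append(tok)
    return out


def parse_id_list_sketch(x):
    if x is None or (isinstance(x, float) and x != x):  # None or NaN
        return []
    if isinstance(x, (set, frozenset)):
        return _clean_dedup(sorted(str(v) for v in x))  # sorted for determinism
    if isinstance(x, (list, tuple)):
        return _clean_dedup(x)                          # order preserved
    text = str(x)
    if re.search(r"[,;|]", text):                       # strong delimiters first
        return _clean_dedup(re.split(r"[,;|]", text))
    parts = text.split()
    if len(parts) > 1 and all(_ID_LIKE.match(p) for p in parts):
        return _clean_dedup(parts)                      # cautious whitespace split
    return _clean_dedup([text])


print(parse_id_list_sketch("GO:0006915; GO:0008150|GO:0006915"))
# ['GO:0006915', 'GO:0008150']
print(parse_id_list_sketch("cell cycle (G1)"))
# ['cell cycle (G1)']
```

The second call shows why the whitespace split is guarded: a free-text label containing spaces stays intact because not every piece looks like an identifier.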
seed_for_term ¶
seed_for_term(seed, term_uid, term_row_id=None)
Create a deterministic per-term integer seed.
The seed is derived from (seed, term_uid, term_row_id) using a stable
hash to keep RNG streams reproducible across platforms.
Parameters:
-
seed(int or None) –Optional base seed. None maps to 0.
-
term_uid(str) –Stable term identifier (e.g., "source:term_id").
-
term_row_id(int or None, default:None) –Optional row identifier to avoid collisions for duplicate term_uids.
Returns:
-
int–Deterministic unsigned integer seed.
Raises:
-
ValueError–If term_row_id cannot be converted to int (when provided).
seed_int_from_payload ¶
seed_int_from_payload(payload, *, mod=2 ** 31 - 1)
Derive a deterministic integer seed from an arbitrary payload.
Parameters:
-
payload(object) –Any JSON-serializable payload.
-
mod(int, default:2 ** 31 - 1) –Modulus for the resulting seed. Default is 2**31 - 1.
Returns:
-
int–Deterministic integer seed in [0, mod).
Notes
Uses sha256_short(..., n=12) to keep stability aligned with other IDs.
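As a rough picture of how a payload can become a reproducible seed, the sketch below hashes a deterministic JSON rendering and reduces the result modulo `mod`. It is not the library's implementation: both function names are hypothetical, and the UTF-8 encoding and the base-16 interpretation of the 12-hex prefix are assumptions. The payload keys and the term UID in the example are made up for illustration.

```python
import hashlib
import json


def sha256_short_sketch(obj, n=12):
    """First n hex characters of SHA-256 over a deterministic JSON rendering."""
    if n <= 0:
        raise ValueError("n must be positive")
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    # Assumption: the JSON text is encoded as UTF-8 before hashing.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:n]


def seed_from_payload_sketch(payload, *, mod=2**31 - 1):
    """Deterministic integer in [0, mod) derived from the payload."""
    # Assumption: the hex prefix is read as a base-16 integer, then reduced.
    return int(sha256_short_sketch(payload, n=12), 16) % mod


# Hypothetical payloads: same content, different key order.
s1 = seed_from_payload_sketch({"seed": 0, "term_uid": "fgsea:EXAMPLE_TERM"})
s2 = seed_from_payload_sketch({"term_uid": "fgsea:EXAMPLE_TERM", "seed": 0})

print(s1 == s2)               # True: key order does not change the seed
print(0 <= s1 < 2**31 - 1)    # True: the seed stays inside [0, mod)
```

Because the seed depends only on the payload content, it does not vary with Python's per-process hash randomization, which is what makes RNG streams reproducible across runs and platforms.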
sha256_12hex ¶
sha256_12hex(payload)
Compute a deterministic short SHA-256 hash (first 12 hex chars).
Parameters:
-
payload(str) –Stable string payload.
Returns:
-
str–12-character lowercase hex digest.
sha256_short ¶
sha256_short(obj, n=12)
Compute a deterministic SHA-256 short hash from an arbitrary payload.
Parameters:
-
obj(object) –Payload to hash. It is serialized via stable_json_dumps().
-
n(int, default:12) –Number of hex characters to return. Default is 12.
Returns:
-
str–Lowercase hex digest prefix.
Raises:
-
ValueError–If n is not positive.
Notes
- For n == 12, this matches the legacy behavior (sha256_12hex).
- SHA-256 hex digests have length 64; if n > 64, the output length is effectively capped at 64 by Python slicing.
split_gene_string ¶
split_gene_string(s)
Split a gene string into candidate tokens using conservative rules.
Parameters:
-
s(str) –Input gene string.
Returns:
-
list of str–Token candidates (not yet fully cleaned).
Notes
Supported formats:
- Comma/semicolon/pipe separated: "A,B", "A;B", "A|B"
- Bracketed lists: "['A','B']", '["A","B"]', "{A,B}"
- Slash-separated as a last resort: "A/B/C"
- Whitespace-separated only if all tokens look gene-like
split_tags ¶
split_tags(s, *, delim=STRESS_TAG_DELIM)
Split a stress tag string into normalized tags.
Parameters:
-
s(object) –Input scalar tag string.
-
delim(str, default:STRESS_TAG_DELIM) –Canonical delimiter. Default is STRESS_TAG_DELIM (comma).
Returns:
-
list of str–Tags in first-seen order.
Notes
- Canonical delimiter is comma.
- Legacy '+' is tolerated as an additional delimiter.
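The tag helpers are meant to round-trip. The sketch below shows the idea for `join_tags` and `split_tags` together; it is not the library code. The function names are hypothetical, a single-character delimiter is assumed, de-duplication on split is an assumption inferred from "first-seen order", and the tag values are made up.

```python
import re


def _dedup_nonempty(tokens):
    """Trim, drop empties, and de-duplicate in first-seen order."""
    seen, out = set(), []
    for tok in tokens:
        tok = str(tok).strip()
        if tok and tok not in seen:
            seen.add(tok)
            out.append(tok)
    return out


def join_tags_sketch(tags, *, delim=","):
    return delim.join(_dedup_nonempty(tags))


def split_tags_sketch(s, *, delim=","):
    if s is None:
        return []
    # Split on the canonical delimiter or the legacy '+'.
    # Assumption: delim is a single character.
    pattern = "[" + re.escape(delim) + "+]"
    return _dedup_nonempty(re.split(pattern, str(s)))


# Hypothetical tag names, including a legacy '+'-joined string.
legacy = "dropout+contradictory, dropout"
tags = split_tags_sketch(legacy)
canonical = join_tags_sketch(tags)

print(tags)       # ['dropout', 'contradictory']
print(canonical)  # dropout,contradictory
```

Splitting the canonical string again returns the same list, so repeated TSV round-trips do not change the tag field.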
stable_json_dumps ¶
stable_json_dumps(obj)
Serialize an object to deterministic JSON for hashing/provenance.
Parameters:
-
obj(object) –JSON-serializable object.
Returns:
-
str–Deterministic JSON string.
Notes
Uses:
- sort_keys=True
- separators=(",", ":")
- ensure_ascii=False
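The effect of these three options can be checked directly with the standard library `json` module. The wrapper name below is hypothetical and the example dictionary is made up; any further arguments the library may pass are not shown.

```python
import json


def stable_json_dumps_sketch(obj):
    """json.dumps with the three documented options."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


a = stable_json_dumps_sketch({"tau": 0.5, "source": "fgsea", "label": "α"})
b = stable_json_dumps_sketch({"label": "α", "source": "fgsea", "tau": 0.5})

print(a)       # {"label":"α","source":"fgsea","tau":0.5}
print(a == b)  # True
```

Sorted keys remove dependence on insertion order, the compact separators remove whitespace variation, and `ensure_ascii=False` keeps non-ASCII characters as-is instead of `\u` escapes, so equal objects always serialize to the same string before hashing.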
strip_excel_text_prefix ¶
strip_excel_text_prefix(s)
Strip the Excel "force text" prefix from a value.
Excel-safe exports sometimes prefix values with a single quote ('). This helper removes one leading quote to support downstream parsing.
Parameters:
-
s(object) –Input value.
Returns:
-
str–Cleaned string without a single leading quote.
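The Excel helpers are intended to be used as a pair: one adds the quote on export, the other removes it on import. The sketch below illustrates that round trip with hypothetical function names; it is not the library code, and how the real helpers treat NA-like or already-prefixed input is not covered here.

```python
def excel_force_text_sketch(s):
    """Prefix a single quote so spreadsheet software keeps the value as text."""
    text = "" if s is None else str(s)
    return "'" + text if text else ""


def strip_excel_text_prefix_sketch(s):
    """Remove at most one leading single quote."""
    text = "" if s is None else str(s)
    return text[1:] if text.startswith("'") else text


# Gene symbols such as MARCH1 are a classic case of spreadsheet auto-conversion.
exported = excel_force_text_sketch("MARCH1")
restored = strip_excel_text_prefix_sketch(exported)

print(exported)  # 'MARCH1
print(restored)  # MARCH1
```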
validate_status_values ¶
validate_status_values(s_norm)
Strictly validate normalized status values, refusing any unknown value so that denominators remain auditable.
Noise modules (gene noise dictionaries)¶
Curated gene-noise patterns used by masking/evidence hygiene steps.
llm_pathway_curator.noise_lists ¶
Noise module definitions (shared asset; conservative by default).
Rationale (paper-facing)
Marker rankings and enrichment evidence often contain ubiquitous programs (e.g., clonotypes, uninformative locus IDs) that can dominate prompts and confuse LLM interpretation. This module centralizes symbol-centric noise definitions that can be applied in prompt-facing layers while preserving evidence identity in PathwayCurator.
Policy (PathwayCurator)
LLM-PathwayCurator evaluates enrichment interpretations as audited decisions. Therefore, we do not pre-emptively remove broad biological programs (cell cycle, interferon, ribosome/mitochondria, HLA, Ig constants) from evidence by default, because they can be true biology and removing them can inflate ABSTAIN via missing/unstable evidence.
Reproducibility
Edit conservatively: changes may affect benchmark comparability. This file is dependency-free and safe to import.
```