If you cannot reproduce, six months from now, why your AI told a customer what it told them, you do not have an AI capability. You have a liability. Provenance + observability is the pillar that turns that liability back into a capability.

What it actually means

Two related but distinct concerns:

  • Provenance — every answer can be traced back to the documents that grounded it, the model that wrote it, the prompt and context that framed it, and the user it was delivered to.
  • Observability — every call (retrieval hits, model invocations, tool calls, downstream integrations) is logged with enough metadata that an engineer or compliance officer can reconstruct the conversation later.
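Concretely, provenance means each answer carries a small record linking it back to everything that produced it. A minimal sketch, with illustrative field names (none of these are a fixed schema):

```python
import hashlib

# Hypothetical provenance record for one assistant answer. Every field
# corresponds to something the definition above says must be traceable:
# grounding documents, model, prompt, and recipient.
answer = {
    "answer_id": "ans-0193",
    "user_id": "agent-42",
    "model": "example-model-2024-08",                 # model + version string
    "system_prompt_sha256": hashlib.sha256(
        b"You are a policy assistant.").hexdigest(),  # prompt fingerprint
    "retrieval_ids": ["chunk-881", "chunk-904"],      # grounding chunks
    "response": "Flood damage is excluded under section 4.2.",
}

def grounding_for(record):
    """Trace an answer back to the retrieval chunks that grounded it."""
    return record["retrieval_ids"]

print(grounding_for(answer))  # ['chunk-881', 'chunk-904']
```

Observability is the same record, emitted for every call rather than reconstructed on demand.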

What we build

  • Per-call audit records: timestamp, user, role, retrieval IDs, model + version, system prompt fingerprint, response, token counts, latency, downstream tool calls.
  • Citation rendering in the UI: every assistant message surfaces the document IDs (or chunks) it was grounded on, with a visible "show sources" affordance.
  • Deterministic prompt + retrieval logging so the team can replay any past interaction byte-for-byte.
  • Dashboards and alerts: refusal rate, citation-coverage rate, model-failover events, abnormal latency, prompt-injection signal patterns.
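The per-call audit record above can be sketched as an append-only JSON-lines log. This is an assumed shape, not a prescribed one: function and field names are illustrative, and the prompt fingerprint here is simply a SHA-256 digest so that prompt changes are detectable without storing every prompt copy inline.

```python
import hashlib
import json
import time
import uuid

def fingerprint(text: str) -> str:
    """Stable fingerprint of a prompt, so two calls can be compared
    for byte-identical prompts without logging the full text twice."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

def audit_record(user, role, system_prompt, retrieval_ids, model,
                 response, token_counts, latency_ms, tool_calls=()):
    """One audit record per model call, mirroring the bullet list above."""
    return {
        "call_id": str(uuid.uuid4()),
        "ts": time.time(),
        "user": user,
        "role": role,
        "model": model,
        "system_prompt_fp": fingerprint(system_prompt),
        "retrieval_ids": list(retrieval_ids),
        "response": response,
        "tokens": token_counts,
        "latency_ms": latency_ms,
        "tool_calls": list(tool_calls),
    }

def write_audit(path, record):
    """Append one JSON line per call; the log is the replay source."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Deterministic replay follows from logging the exact retrieved chunks and prompt fingerprint: if any of them differ on replay, the fingerprints tell you which input changed.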

The failure mode this prevents

Compliance shows up and asks: "Show me why your AI told this policyholder that their auto coverage included flood." You need to answer in minutes, not weeks. You need to point to the document, the retrieval call, the model output, the timestamp. If you can't, the AI comes out of production.

Provenance + observability is also the pillar that makes the AI improvable: without traceable logs, you cannot tell whether last month's refactor made answers better or worse. With them, you can.

← Back to Foundations overview