Overview
Part 1 covered the loop, Part 2 covered tool calling, and Part 3 covered multi-step skill orchestration. This final part is about operating an agent in production, not just getting good outputs in a demo. The goal is simple: define guardrails before the agent can touch real systems, real data, and real users.
TLDR: Production agent work is operational design. Define what the agent can execute, require approval for high-impact actions, log decision paths, evaluate route quality over time, and make ownership explicit before launch.
Full Series
- Part 1, The Core Loop
- Part 2, Tool Calling and LLM Intent Routing
- Part 3, Multi-Step Skill Orchestration
- Part 4, Production Readiness and Operational Guardrails (you are here)
Note: This is not an exhaustive guide to distributed systems. It is a practical write-up based on personal tinkering with multi-step orchestration agents.
I have been exploring agents and LLMs since ChatGPT first launched, and I have spent a lot of time building small agents and multi-step orchestrations. There are excellent models available now, but moving an agent from a simple loop to something production-ready is a big jump. That jump is less about model quality and more about applying the same software engineering basics we already rely on in distributed systems.
Sections
- Why Production Readiness Is a Different Problem
- Workflow Orchestration vs Single-Step Skill Execution
- Safety Boundaries and Action Classification
- Security, Access Control, and PII Handling
- Human Approval Gates for High-Impact Actions
- Idempotency and Duplicate Action Prevention
- Observability Primitives You Need on Day One
- Evaluation Loops: Replay, Route Accuracy, and Drift
- Reliability Policies: Retries, Timeouts, and Rate Limits
- Operational Readiness: Runbooks, Alerts, and Rollback
- Ownership Model and Cross-Functional Rollout
- Production Readiness Checklist
- Final Takeaway
- Official References
Why Production Readiness Is a Different Problem
Prototype quality and production quality optimize for different outcomes.
- Prototypes optimize for speed and learning.
- Production systems optimize for safe, repeatable outcomes.
- In production, failure handling is part of the feature.
Example: a demo may show one successful tool call. Production requires retries, timeout limits, approvals, and rollback when that same call fails or duplicates.
A working demo is useful evidence, but it is not an operational design.
Workflow Orchestration vs Single-Step Skill Execution
Single-step skill execution handles one bounded task, for example, fetch one record and return it.
Workflow orchestration coordinates multiple dependent steps across systems. Each step can fail, retry, or require escalation.
The distinction matters because risk compounds across steps.
Example:
- Step 1 reads account status.
- Step 2 writes a billing adjustment.
- Step 3 sends an external email confirmation.
Each step is understandable alone. The full chain can create user impact, billing impact, and support impact.
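The chain above can be sketched as an ordered workflow that stops at the first failure, so a later side-effecting step never runs on top of a broken earlier one. The step names and lambdas here are hypothetical stand-ins for real skills:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]
    has_side_effects: bool

def run_workflow(steps: list, context: dict) -> dict:
    """Run dependent steps in order; halt on the first failure so
    downstream side-effecting steps never execute on a broken chain."""
    completed = []
    for step in steps:
        try:
            context = step.run(context)
            completed.append(step.name)
        except Exception as exc:
            return {"status": "failed", "at": step.name,
                    "completed": completed, "error": str(exc)}
    return {"status": "ok", "completed": completed, "context": context}

# Hypothetical billing chain mirroring the three steps above.
steps = [
    Step("read_account", lambda c: {**c, "account": "active"}, False),
    Step("write_adjustment", lambda c: {**c, "adjusted": True}, True),
    Step("send_confirmation", lambda c: {**c, "emailed": True}, True),
]
result = run_workflow(steps, {"account_id": "9182"})
```

A real orchestrator would also handle per-step retries and escalation, but the core invariant is the same: order matters, and partial completion must be visible.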
Safety Boundaries and Action Classification
Before launch, classify actions by impact.
- Read-only actions: low risk.
- Internal writes: medium risk.
- External writes (billing, identity, or irreversible actions): high risk.
This classification should drive policy, not model preference. The model can suggest an action. Policy decides whether execution is allowed.
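One minimal way to encode "policy decides, not the model" is a static risk table that maps each action to an execution decision, with unknown actions defaulting to the strictest class. The action names here are hypothetical:

```python
# Hypothetical action-to-risk mapping; in production this lives in config,
# not in the prompt.
RISK_BY_ACTION = {
    "fetch_record": "low",     # read-only
    "update_note": "medium",   # internal write
    "issue_refund": "high",    # external, irreversible
}

def execution_decision(action: str) -> str:
    """Policy decides whether execution is allowed, regardless of
    how confident the model is in its suggestion."""
    risk = RISK_BY_ACTION.get(action, "high")  # unknown actions: assume high
    if risk == "low":
        return "auto_execute"
    if risk == "medium":
        return "execute_with_audit"
    return "require_approval"
```

Defaulting unknown actions to high risk matters: a newly added tool should be blocked until someone classifies it, not silently auto-executed.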
Security, Access Control, and PII Handling
Security boundaries should be explicit before launch.
- Apply least privilege to tool access.
- Scope tool permissions by role, tenant, and environment.
- Enforce authorization checks in the runtime, not only in prompts.
- Deny by default, then allowlist required capabilities.
PII handling matters just as much as execution control.
- Redact or tokenize sensitive fields before logs are written.
- Keep replay datasets sanitized and access-controlled.
- Set retention rules for transcripts and execution traces.
- Avoid copying raw PII into prompts unless required for task completion.
Example: if a support transcript includes full card details and you replay it in a lower environment, you create a second security problem while debugging the first one.
Observability and replay need controls. Logging everything without controls creates new risk.
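A sketch of redaction before logging might look like the following. The regex patterns are illustrative assumptions; a real deployment would use a vetted PII-detection library rather than hand-rolled patterns:

```python
import re

# Hypothetical patterns for illustration only; real systems should use a
# vetted PII-detection library with broader coverage.
PATTERNS = {
    "card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str) -> str:
    """Redact sensitive fields before a transcript is logged,
    stored, or copied into a replay dataset."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}_REDACTED]", text)
    return text
```

The key design point is placement: redaction runs before the log sink, so raw PII never reaches storage in the first place.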
Human Approval Gates for High-Impact Actions
Human approval is a control for high-consequence operations.
Good approval gates include:
- Clear summary of intended action.
- Input values and affected targets.
- Reason the model selected this path.
- Explicit approve or reject outcome with audit trace.
Example: “Refund $2,400 to account 9182” should never execute from model confidence alone. It should require explicit approval with a recorded decision.
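An approval gate can be as simple as a pending record that carries the intended action, its inputs, and the model's stated reason, and that only transitions on an explicit human decision. This is a minimal sketch with hypothetical field names:

```python
from dataclasses import dataclass, field
import time

@dataclass
class ApprovalRequest:
    action: str
    inputs: dict
    reason: str              # why the model selected this path
    status: str = "pending"
    audit: list = field(default_factory=list)

    def decide(self, approver: str, approved: bool) -> str:
        """Record an explicit approve/reject decision with an audit entry."""
        self.status = "approved" if approved else "rejected"
        self.audit.append({"approver": approver,
                           "approved": approved,
                           "at": time.time()})
        return self.status

req = ApprovalRequest(
    action="issue_refund",
    inputs={"amount_usd": 2400, "account": "9182"},
    reason="User reported a duplicate charge in the transcript.",
)
```

Note that the gate stores the model's reasoning alongside the decision, so the audit trail answers both "what ran" and "why it was proposed".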
Idempotency and Duplicate Action Prevention
Agents run in distributed systems where retries or duplicate execution can and will happen.
Idempotency keys turn repeated requests into one logical operation.
Example: if a network timeout triggers a retry after a charge request, idempotency prevents a second charge.
Observability Primitives You Need on Day One
At minimum, each execution should emit:
- request ID and correlation ID
- selected skill or handler
- tool calls made
- latency
- token cost
- final status
These fields answer three operational questions quickly: what happened, where it failed, and who owns the fix.
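The minimal record above can be emitted as one structured log line per execution. This sketch prints JSON as a stand-in for whatever log or metrics sink you actually use:

```python
import json
import time
import uuid

def emit_execution_record(skill, tool_calls, status, started_at,
                          token_cost, correlation_id=None):
    """Build and emit the minimal per-execution record described above."""
    record = {
        "request_id": str(uuid.uuid4()),
        "correlation_id": correlation_id or str(uuid.uuid4()),
        "skill": skill,
        "tool_calls": tool_calls,
        "latency_ms": round((time.time() - started_at) * 1000, 1),
        "token_cost": token_cost,
        "status": status,
    }
    print(json.dumps(record))  # stand-in for a real log/metrics sink
    return record
```

Keeping the record structured (not free text) is what makes the three operational questions answerable with a query instead of a grep.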
Evaluation Loops: Replay, Route Accuracy, and Drift
Agent quality can degrade quietly if behavior is not evaluated over time.
A lightweight evaluation loop can start with:
- transcript replay tests,
- expected route assertions,
- periodic route accuracy tracking,
- review of false-positive and false-negative routing outcomes.
Example: if route accuracy drops after a prompt update, the issue might be policy wording, not model capability.
Evaluation is for model quality, policy quality, and workflow design quality.
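A replay-based route check can start as small as a list of transcript-to-expected-route pairs and an accuracy function. The cases and the keyword router below are hypothetical stand-ins; in practice the router is your real intent-routing call:

```python
# Hypothetical replay cases: sanitized transcript -> expected route.
REPLAY_CASES = [
    ("I was charged twice for my plan", "billing_refund"),
    ("How do I export my data?", "data_export"),
    ("Cancel my subscription today", "cancellation"),
]

def route(transcript: str) -> str:
    """Stand-in router; in production this calls the real intent router."""
    if "charged" in transcript:
        return "billing_refund"
    if "export" in transcript:
        return "data_export"
    return "cancellation"

def route_accuracy(cases) -> float:
    """Fraction of replay cases where the router picked the expected route."""
    hits = sum(1 for text, expected in cases if route(text) == expected)
    return hits / len(cases)
```

Running this on every prompt or policy change turns "did routing regress?" from a gut feeling into a number you can track over time.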
Reliability Policies: Retries, Timeouts, and Rate Limits
Reliability policies should be explicit and per step.
- Retry only transient failures.
- Apply timeout budgets per operation.
- Enforce rate limits at skill and system boundaries.
Example: without a timeout budget, one slow dependency can block worker capacity and cascade latency across unrelated requests.
These controls are baseline protection against runaway failure modes.
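The three policies can be combined in one wrapper: retry only exception types you classify as transient, and stop retrying once the per-operation timeout budget is spent. A minimal sketch, with the budget and backoff values as illustrative defaults:

```python
import time

# Exception types we classify as transient; anything else fails immediately.
TRANSIENT = (TimeoutError, ConnectionError)

def call_with_policy(fn, retries=2, timeout_budget_s=5.0, backoff_s=0.01):
    """Retry only transient failures, within a total timeout budget."""
    deadline = time.monotonic() + timeout_budget_s
    attempt = 0
    while True:
        try:
            return fn()
        except TRANSIENT:
            attempt += 1
            if attempt > retries or time.monotonic() + backoff_s > deadline:
                raise  # budget or retry limit exhausted: surface the failure
            time.sleep(backoff_s * attempt)  # linear backoff for the sketch
```

Production code would typically add jitter and exponential backoff, but the structure is the point: the budget is enforced in the runtime, per operation, not left to the dependency.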
Operational Readiness: Runbooks, Alerts, and Rollback
Production readiness requires operational expertise.
- Define alert thresholds for failure rate and latency.
- Maintain runbooks with triage and escalation paths.
- Predefine rollback strategy for bad prompts, bad routes, or bad policy changes.
Example: if a routing change sends 30 percent of requests to the wrong skill, rollback should be one documented action, not an incident-time debate.
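One way to make that a documented action rather than a debate is to encode thresholds and runbook links as data, so the incident decision is a lookup. The metric names, thresholds, and runbook paths below are hypothetical:

```python
# Hypothetical alert/rollback policy as data; thresholds and runbook
# paths are illustrative, not recommendations.
POLICY = {
    "routing_error_rate": {"alert_at": 0.05, "rollback_at": 0.20,
                           "runbook": "runbooks/routing.md"},
    "p95_latency_ms": {"alert_at": 2000, "rollback_at": 5000,
                       "runbook": "runbooks/latency.md"},
}

def decide(metric: str, value: float) -> str:
    """Map a current metric value to ok / alert / rollback."""
    thresholds = POLICY[metric]
    if value >= thresholds["rollback_at"]:
        return "rollback"
    if value >= thresholds["alert_at"]:
        return "alert"
    return "ok"
```

In the 30 percent misrouting scenario above, this lookup returns "rollback" immediately, and the runbook path tells the on-call engineer exactly where the documented rollback steps live.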
Ownership Model and Cross-Functional Rollout
Agent systems span product, engineering, operations, security, and support.
Set clear ownership boundaries:
- Who owns routing logic.
- Who owns policy changes.
- Who approves high-risk skill updates.
- Who handles incidents.
Clear ownership shortens incident response and reduces policy drift.
Production Readiness Checklist
- Workflow scope and action risk classes are documented.
- Human approval exists for high-impact actions.
- Tool access is scoped by role and least-privilege policy.
- Idempotency strategy is implemented for side-effecting operations.
- Observability fields are emitted for every execution.
- Logging and replay datasets are sanitized for PII.
- Replay-based evaluation runs on real transcripts.
- Retry, timeout, and rate-limit policy is defined per step.
- Alerting, runbooks, and rollback paths are tested.
- Ownership and escalation paths are clear.
Final Takeaway
Operating agents in production follows the same principles as any distributed system: observability, guardrails, reliability policy, rollback planning, and clear ownership.