Formuly × Kaiba

Agentic Architecture Patterns & Insights

Advisory session covering agent orchestration, governance patterns, knowledge architecture, and system design recommendations.

Martin Pratt & Ben Buckland

Topics: 3 domains · Patterns: 10+ insights · Date: 31 Mar 2026
01 — WHERE YOU WERE

The Previous Architecture

One-shot LLM calls with no iteration, no validation, and no ability to pause for missing information.

[Diagram: Excel model (source of truth) → metadata (digital twin) → LLM (one-shot call) → output → back to Excel]
Limitation

No iteration — single pass, no ability to correct or refine. If the LLM got it wrong, start over.

Limitation

No validation — output went straight to Excel with no battery of tests on the created rows.

Limitation

No pause/resume — if information was missing, the entire process had to restart from scratch.

Limitation

No planning — the LLM had to determine and execute everything in a single context window.

02 — WHERE YOU ARE NOW

Agent Orchestration Architecture

An API-exposed orchestrator managing a planner and builder agent loop, with MCP-exposed tools for deterministic model manipulation.

[Diagram: Excel (source of truth; metadata, business terms) → Orchestrator (API-exposed, while(!done) loop) → Planner (Opus, next 5-10 terms per wave) → Builder (creates & validates) → MCP tools: create_worksheet, create_term, validate_term, get_tax]
Stats: 4 MCP tools · 2 agent types · Opus planner model · deterministic validation algorithm
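The orchestrator's while(!done) loop can be sketched in Python. Everything here — RunState, plan_next_wave, the five-term model — is an illustrative stand-in for the LLM-backed planner and builder agents, not Formuly's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class RunState:
    wave: int = 0
    terms_built: list = field(default_factory=list)

# Hypothetical stand-ins: in the real system these are agent calls,
# and the builder manipulates the model through MCP tools.
ALL_TERMS = ["revenue", "cogs", "opex", "ebitda", "cfads"]
MAX_WAVES = 10  # safety constraint against runaway loops

def plan_next_wave(state):
    # Planner proposes the next batch of 5-10 terms.
    remaining = [t for t in ALL_TERMS if t not in state.terms_built]
    return remaining[:5]

def build_and_validate(state, wave_terms):
    # Builder creates and validates each term (via MCP tools in practice).
    state.terms_built.extend(wave_terms)

def run_orchestrator():
    state = RunState()
    # while(!done): plan -> build -> evaluate, bounded by max waves
    while "cfads" not in state.terms_built and state.wave < MAX_WAVES:
        wave_terms = plan_next_wave(state)
        build_and_validate(state, wave_terms)
        state.wave += 1
    return state

state = run_orchestrator()
```

The point is the shape: planning and building are separate calls inside one bounded loop, with the done-check owned by the orchestrator rather than either agent.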
03 — RUN LIFECYCLE

Waves, Iterations & Completion

A run consists of plan→build iterations that loop until hard completion metrics are achieved or safety constraints are hit.

[Diagram: Plan (next 5-10 terms per wave) → Build (create & validate) → Evaluate (CFADS? EBITDA?) → if not done, next iteration; else Done]
Completion Metrics

CFADS & EBITDA

Hard-defined targets. When the model reaches Cash Flow Available for Debt Service (CFADS) and EBITDA, the run is considered complete.

Safety Constraints

Max Waves & Max Iterations

Prevents runaway loops. If the model hasn't completed within constraints, the run halts for human review.
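The completion check and safety constraints combine into a single routing decision. A minimal sketch, with illustrative limits (the talk suggests these should eventually be injectable per model):

```python
def evaluate_run(terms_built, wave, iteration,
                 max_waves=10, max_iterations=50):
    """Decide the next step of a run.

    Completion is a hard-defined metric: the model must contain both
    CFADS and EBITDA. Safety constraints halt runaway loops and hand
    the run to a human for review.
    """
    if {"cfads", "ebitda"} <= set(terms_built):
        return "done"
    if wave >= max_waves or iteration >= max_iterations:
        return "halt_for_review"
    return "next_iteration"

# Usage: the orchestrator calls this after every build phase.
decision = evaluate_run(["revenue", "ebitda"], wave=2, iteration=7)
```

Making `max_waves`, `max_iterations`, and the completion set parameters rather than constants is what enables the "dynamic injection" direction below.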

Future Direction

Dynamic Injection

Completion metrics should become dynamically injectable — different models may have different definitions of “done.”

04 — HUMAN-IN-THE-LOOP

Analyst Approval & State Persistence

When information is missing, the run pauses, persists state, and waits for human input before resuming exactly where it left off.

[Diagram: Planner detects gap → pause run, persist to DB → analyst provides “Tax rate = 39%” → resume, create & continue]
Detect: Missing Information

The planner agent identifies that required data (e.g., tax rate, discount rate) is not available in the model metadata.

Pause: State Persisted

Run state is serialized to a database. The entire context — current wave, completed terms, pending work — is preserved.

Approve: Analyst Provides Data

Human analyst reviews the request, provides the missing data point, and hits enter. The system spools up again.

Resume: Continue Run

The data is created on the worksheet, the run resumes from exactly where it paused, and iterations continue.
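The pause/resume cycle reduces to serializing the run state and rehydrating it with the analyst's answer. A minimal sketch, assuming a key-value store stands in for the real database and `"tax_rate"` is the missing field:

```python
import json

def pause_run(state, awaiting_field, db):
    """Serialize the full run state -- wave, completed terms, pending
    work -- so nothing is lost while waiting for the analyst."""
    db["run:42"] = json.dumps({
        "wave": state["wave"],
        "completed_terms": state["completed_terms"],
        "pending": state["pending"],
        "awaiting": awaiting_field,   # e.g. "tax_rate"
    })

def resume_run(db, analyst_input):
    """Rehydrate the run and inject the analyst-provided value, then
    continue from exactly where it paused."""
    state = json.loads(db["run:42"])
    field = state.pop("awaiting")
    state.setdefault("metadata", {})[field] = analyst_input
    return state

db = {}  # stand-in for a real database table
pause_run({"wave": 2, "completed_terms": ["revenue"], "pending": ["tax"]},
          "tax_rate", db)
state = resume_run(db, analyst_input=0.39)
```

Because the state is fully externalized, the process that resumes the run does not need to be the process that paused it.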

05 — PATTERN: SEPARATION OF CONCERNS

Evaluation ≠ Execution

Never let the thing doing the work evaluate its own output. Agents will execute their primary function at all costs — the “paperclip parable.”

Anti-Pattern

Builder evaluates itself

It’s a hammer — everything looks like a nail. It will rationalize its own output to continue building, even when it should stop.

Anti-Pattern

Planner evaluates itself

Planners are single-purpose — they’re only out to create a plan. They will always find more to plan, even when the plan is sufficient.

Pattern

Cross-evaluation

Builder evaluates planner output, a separate evaluator evaluates builder output. Different roles with different objectives create natural tension.

Pattern

Separate governance layer

A dedicated governance component evaluates outputs and makes routing decisions. It has no incentive to build or plan — only to judge.

06 — PATTERN: GOVERNANCE

Evaluate → Govern → Route

Governance sits between agents and decides what happens next. It evaluates, applies rules, and routes — separating these concerns prevents agents from overriding critical decisions.

[Diagram: Agent output (plan or build result) → Evaluate (deterministic + non-deterministic) → Governance (rules + context; fast: hooks, bash) → routes to: Continue (proceed as-is), + Context (add info, retry), Re-route (back to planner), or Escalate (human review)]
Make them fight against each other

By separating governance from execution, you create natural opposition. The builder wants to build; the governor wants correctness. This tension is the feature.

Keep governance lightweight

Bash scripts, hooks, fast deterministic checks. Governance should be super fast. Complex evaluation can run via SDK on hooks — pre-build and post-build.
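The governance layer's routing decision is deliberately dumb and fast. A sketch of the four routes from the diagram, assuming the (separate) evaluator hands governance a small result dict:

```python
def govern(evaluation):
    """Map an evaluation result to one of the four routes.

    `evaluation` is a hypothetical dict produced by the separate
    evaluator, e.g. {"passed": bool, "missing_context": [...],
    "needs_replan": bool}. Governance only judges and routes; it has
    no incentive to build or plan.
    """
    if evaluation.get("passed"):
        return "continue"                 # proceed as-is
    if evaluation.get("missing_context"):
        return "add_context_and_retry"    # inject info, retry builder
    if evaluation.get("needs_replan"):
        return "reroute_to_planner"       # back to the planner
    return "escalate_to_human"            # human review

route = govern({"passed": False, "missing_context": ["tax_rate"]})
```

Keeping this as plain conditionals (or a bash hook) means governance runs in microseconds; LLM-based evaluation only feeds the dict, it never owns the routing.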

07 — PATTERN: RESILIENCE VS RELIABILITY

A Per-Subsystem Decision

This isn't binary. Each subsystem sits somewhere on a spectrum between reliability (consistent but brittle) and resilience (adaptive but less predictable). Break the system down and decide per component.

|                 | Reliability                                        | Resilience                                   |
| Behavior        | Consistent, predictable output every time          | Finds ways around novel problems             |
| Failure Mode    | Brittle — breaks under novel conditions            | May produce unexpected but valid results     |
| Best For        | Validation, ledger operations, deterministic tools | Planning, reasoning about unknown domains    |
| Formuly Example | validate_term — deterministic battery of tests     | Planner — reasoning about what to build next |
[Diagram: reliable → resilient spectrum: validate_term, create_term, Governance, Planner]
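On the reliable end of the spectrum, a validator is just a deterministic battery of checks. A sketch of what a `validate_term`-style function might look like (the specific checks are illustrative, not Formuly's actual battery):

```python
def validate_term(term, known_terms):
    """Deterministic battery of checks on a created term row.

    Same input always yields the same verdict: reliable, but brittle
    against term shapes the checks were never written for.
    Returns a list of failure messages; empty list means valid.
    """
    failures = []
    if not term.get("name"):
        failures.append("missing name")
    if not term.get("formula"):
        failures.append("missing formula")
    unknown = sorted(set(term.get("depends_on", [])) - set(known_terms))
    if unknown:
        failures.append("unknown dependencies: " + ", ".join(unknown))
    return failures
```

Contrast with the planner: no equivalent function exists for "what should we build next", which is exactly why that subsystem sits at the resilient end.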
08 — PATTERN: PROMPT ENGINEERING

What / When / Why / How

A purpose-driven content theory for prompting. The “Why” is the most underutilized component but the most impactful for resilience.

What
The Task

What action to perform, what output to produce. The core instruction the agent needs to execute.

When
The Context

Temporal context and conditions. When to act, under what circumstances, what state triggers this behavior.

Why
The Reasoning

The most critical component. Explains the reasoning behind the action. Lets the LLM reason about edge cases and make good judgment calls when conditions are ambiguous.

How
The Method

Specific instructions, constraints, format requirements. The mechanical details of execution.

Why “Why” matters most

When the LLM understands why it's doing something, it can reason about novel situations it hasn't been explicitly programmed for. This is the difference between a brittle rule-follower and a resilient reasoning agent.
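The four components can be made mechanical with a small template helper. The section names and the example content below are illustrative, not a prescribed format:

```python
def build_prompt(what, when, why, how):
    """Assemble a purpose-driven prompt. The 'why' gives the model the
    reasoning it needs to handle edge cases the 'how' never anticipated."""
    return "\n\n".join([
        f"# Task\n{what}",
        f"# When\n{when}",
        f"# Why\n{why}",
        f"# How\n{how}",
    ])

prompt = build_prompt(
    what="Create the EBITDA term on the active worksheet.",
    when="Run this once revenue and operating-cost terms exist.",
    why=("EBITDA is a completion metric for the run; downstream debt "
         "sizing depends on it, so prefer conservative assumptions "
         "when inputs are ambiguous."),
    how="Use create_term, then validate_term before reporting success.",
)
```

Forcing every prompt through a function like this makes a missing "why" impossible to ship silently — which is the component most often left out.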

09 — PATTERN: ROUTING & CONTEXT

The Depth Problem

Routing tables decide when to load additional context. But deeply nested routing can degrade reliability to a coin flip.

The Problem

Nested routing tables → 50/50 reliability

Multi-layer routing structures compound reliability losses. Each layer that must correctly route reduces the probability of the right context being loaded. At depth, you’re essentially flipping a coin.

Anti-Sycophancy

Agents will claim they’ve loaded a skill when they haven’t. You need deterministic verification — a bash script that checks and writes the actual usage to a table. Trust, but verify.

The Solution: Flatten

Massive flattening effort

Keep routing as shallow as possible. Eliminate unnecessary nesting. One level of routing that directly loads the right skill is vastly more reliable than three levels of indirection.

Skill Validation

Deterministic checks that the correct skills are loaded before the agent proceeds. If the skill isn't there, the agent cannot start work. This is a hard gate, not a suggestion.
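The hard gate is simple to express: record skill loads deterministically, then refuse to start work unless the record confirms every required skill. A sketch (the registry and function names are hypothetical):

```python
LOADED_SKILLS = set()  # written by a deterministic check, not by the agent

def load_skill(name):
    # Stand-in for loading the skill's context and logging the fact.
    LOADED_SKILLS.add(name)

def require_skills(*needed):
    """Hard gate: refuse to start work unless every required skill is
    verifiably loaded. Don't trust the agent's claim -- check the
    written record (anti-sycophancy)."""
    missing = set(needed) - LOADED_SKILLS
    if missing:
        raise RuntimeError(
            f"cannot start: skills not loaded: {sorted(missing)}")

load_skill("financial-modeling")
require_skills("financial-modeling")   # passes silently

try:
    require_skills("ux-research")      # hard stop, not a suggestion
    gate_held = False
except RuntimeError:
    gate_held = True
```

The essential property is that `LOADED_SKILLS` is populated by a deterministic side effect (e.g. a bash hook writing to a table), never by the agent's own self-report.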

10 — PATTERN: ECOSYSTEM ENGINEERING

Courses, Skills & Oppositional Forces

Build a balanced ecosystem of agents and skills. Every agent has a tendency — create oppositional forces that naturally check each other.

Campus System

Agents Take Courses

~9 top-level courses, ~30 sections. UX research, front-end, back-end, business process, architecture, and more. Loading a course = loading the required context.

Skill Scope

Very Specific, Very Narrow

Skills should be the smallest set that covers everything that needs to be done. Focus the agent’s attention. Narrow scope = better quality and consistency.

Cross-Sectional

Multi-Domain Skills

Some skills span domains — research + technical analysis, for example. These cross-sectional skills bridge the gaps between specialist agents.

Confidence evaluations across the ecosystem

Combine confidence evaluations to improve quality when you have many skills. Use the same eval primitives across the whole system. Engineers feed knowledge back into courses. 80% → 99% through incremental refinement.

11 — PATTERN: EVALUATION TIERS

Deterministic → Non-Deterministic Spectrum

Match the intelligence of your evaluation to the complexity of the judgment. Don’t use Opus where a bash script will do.

1
Fastest

Deterministic Scripts

Bash scripts, hooks, hard rules. Completely deterministic. Run in milliseconds. Use for structural validation, format checks, constraint enforcement.

2
Fast

Haiku-Level LLM

Lightweight LLM checks with clear non-deterministic rules. Good for semantic similarity, basic reasoning, format validation beyond regex.

3
Smart

Sonnet-Level Reasoning

Extended thinking. Reasoning required. Can inject chain-of-thought for complex evaluations that need to weigh multiple factors.

4
Deep

Opus-Level Qualitative Analysis

Full qualitative judgment. Reserve for complex evaluation where the quality of reasoning directly impacts downstream correctness. Slowest but most capable.
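The tiering reduces to a dispatch: route each check to the cheapest tier that can decide it. A sketch where tier 1 is a real deterministic check and the LLM tiers are stubbed, since only the dispatch pattern matters here (check names are illustrative):

```python
import re

def call_llm(model, payload):
    # Stand-in for a real model call at the named tier.
    return f"{model}:{payload}"

def evaluate(check, payload):
    """Route each check to the cheapest evaluation tier that can decide it."""
    if check == "format":                 # tier 1: deterministic, ms-fast
        return bool(re.fullmatch(r"[a-z_]+", payload))
    if check == "semantic_match":         # tier 2: Haiku-level check
        return call_llm("haiku", payload)
    if check == "multi_factor":           # tier 3: Sonnet-level reasoning
        return call_llm("sonnet", payload)
    return call_llm("opus", payload)      # tier 4: deep qualitative judgment
```

In practice the dispatch table grows with the system, but the invariant stays: a check only escalates a tier when the cheaper tier cannot express it.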

12 — PROPOSED: KNOWLEDGE ARCHITECTURE

Entity Relationships & Knowledge Graphs

Create a structured domain model using an Entity Relationship Diagram. Knowledge graphs store the data points and relationships, queryable via graph traversal for consistent, fast access.

[Diagram: Entity Relationship Diagram with entities Term, Worksheet, Document, Dependency]
Why ERD?

Deterministic structure for data. If it’s different every time, results will be different every time. Consistency requires structure.

Build On-the-Fly

Term dependencies from the planner can auto-generate the customer ERD. Standard template + custom per-customer graph generated dynamically.

Graph Traversal + Cypher

Very fast relational queries. AI can translate natural language to Cypher. “Where did that number come from?” → trace to source document.
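The “where did that number come from?” query is a graph traversal from a term down to its source documents. A toy in-memory version (in production this would be a graph database queried with Cypher, e.g. something like `MATCH (t:Term {name: $name})-[:DERIVED_FROM*]->(d:Document) RETURN d` — illustrative, not a tested query):

```python
# Adjacency map: term -> the terms or source documents it was derived
# from. Entity names and documents here are invented for illustration.
PROVENANCE = {
    "cfads":   ["ebitda", "tax", "capex"],
    "ebitda":  ["revenue", "opex"],
    "revenue": ["offtake_agreement.pdf"],
    "opex":    ["om_contract.pdf"],
    "tax":     ["analyst_input"],
    "capex":   ["epc_contract.pdf"],
}

def trace_sources(term):
    """Walk the dependency graph down to its leaves -- the source
    documents or human inputs a number ultimately came from."""
    deps = PROVENANCE.get(term)
    if deps is None:
        return {term}          # leaf: a document or analyst input
    sources = set()
    for dep in deps:
        sources |= trace_sources(dep)
    return sources
```

Because the traversal touches only the relevant subgraph, the query stays fast no matter how large the overall model grows.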

13 — PROPOSED: RAG STRATEGY

Vector vs Graph — Know When to Use What

As the financial model grows, naive RAG will slow down. Graph-based retrieval with targeted vector subsets is more efficient for structured, relationship-heavy data.

|             | Vector RAG                                                | Graph-Based                                                          |
| Best For    | Large documents, books, 100+ pages of unstructured text   | Structured terms, relationships, dependencies between entities       |
| Performance | Degrades as corpus grows — queries all vectors every time | Graph traversal is fast; queries only relevant subgraph              |
| Updates     | Requires re-vectorization on change                       | Nodes and edges update instantly                                     |
| For Formuly | Overkill for 10-page docs — just load the whole thing     | Term dependencies + descriptions stored as graph nodes with metadata |
Hybrid approach: vector within graph

Graph nodes can point to vector subsets. Query the graph to find relevant nodes, then run vector search only within those document chunks. Fast graph traversal narrows the search space before any vector similarity computation.
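The hybrid can be sketched with toy 2-D embeddings: traverse the graph first, then run cosine similarity only over the chunks attached to reachable nodes. Node names and vectors are invented for illustration:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Each graph node points at its own small vector subset (chunk embeddings).
GRAPH = {
    "revenue": {"neighbors": ["offtake"], "chunks": {}},
    "offtake": {"neighbors": [], "chunks": {
        "clause_4": [0.9, 0.1],
        "clause_7": [0.2, 0.8],
    }},
}

def hybrid_search(start_node, query_vec):
    """Graph traversal narrows the search space; vector similarity
    only runs over chunks reachable from the start node."""
    seen, stack, hits = set(), [start_node], []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for chunk_id, vec in GRAPH[node]["chunks"].items():
            hits.append((cosine(query_vec, vec), chunk_id))
        stack.extend(GRAPH[node]["neighbors"])
    return [chunk for _, chunk in sorted(hits, reverse=True)]
```

The expensive similarity computation never sees chunks outside the relevant subgraph, which is the whole point of the hybrid.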

Practical advice: skip vector for small docs

“If it’s a 10-page doc, pull the whole thing. Who cares.” Vector adds complexity without value for documents under ~100 pages. Reserve it for genuinely large corpuses.

14 — PROPOSED: SYSTEM DESIGN

System of Record vs System of Action

Formuly is aiming to become “financial modeling intelligence” — that’s both a system of record and a system of action. Each has different primary infrastructure.

System of Record

Databases are the primary citizen

Recording what has happened. Historical data, audit trails, traceability. “Why is that number that number?”

SQL databases
Knowledge graphs
Object storage for large artifacts
Workflows are secondary
System of Action

Workflows are the primary citizen

Enabling humans and agents to do stuff. “Do this now, then do this, then compare.” Long-lived orchestration.

Temporal / state machines
Workflows live outside of code
Long-lived, durable execution
Databases are secondary
Formuly needs both

If you’re both, databases and workflows are equal citizens. Consider Temporal for workflow orchestration — “one of the best in the last 20 years.” Workflows should be long-lived and managed outside of application code.

15 — RECOMMENDED NEXT STEPS

Where to Go from Here

Four high-impact actions that build on the current architecture and set the foundation for scale.

1
Implement Observability (OpenTelemetry)

Ensure all system outputs — agent calls, tool invocations, evaluation results, governance decisions — adhere to the OTel standard. This is the foundation for debugging, improving evals, and understanding system behavior at scale.

2
Store Term Dependencies

Create services to persist the relationships between financial terms. “Revenue depends on volume and price. COGS depends on volume and unit cost.” This enables automated testing and model evolution tracking.
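A dependency store like this can start as a very small service. A sketch using the revenue/COGS example from the text, with an impact query showing why it enables automated testing (in-memory dict stands in for the real persistence layer):

```python
DEPENDENCIES = {}  # term -> set of terms it depends on

def record_dependency(term, depends_on):
    """Persist 'term depends on these inputs'."""
    DEPENDENCIES.setdefault(term, set()).update(depends_on)

def downstream_of(changed_term):
    """Everything that must be re-tested when an input changes --
    the transitive closure of dependents."""
    affected, frontier = set(), {changed_term}
    while frontier:
        frontier = {t for t, deps in DEPENDENCIES.items()
                    if deps & frontier and t not in affected}
        affected |= frontier
    return affected

# The example from the text: revenue and COGS share the volume driver.
record_dependency("revenue", ["volume", "price"])
record_dependency("cogs", ["volume", "unit_cost"])
record_dependency("gross_margin", ["revenue", "cogs"])
```

With this in place, "volume changed" mechanically answers which terms need revalidation — and the same edges later become the knowledge-graph relationships of step 4.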

3
Separate Governance from Execution

Pull evaluation and routing into a dedicated governance layer. Start with deterministic hooks and bash scripts. Add LLM-based evaluation only where deterministic checks aren’t sufficient.

4
Build the Domain Model

Define Formuly’s own ERD (what is a worksheet, term, dependency, model). Create a generator for customer-specific ERDs built dynamically from planner output. This is the foundation for the knowledge graph.

The agent loop works.
Now build the infrastructure around it.

Martin Pratt & Ben Buckland • 31 March 2026

KAIBA