ARCHITECTURE REVIEW & ADVISORY
Agentic Architecture Patterns & Insights
Advisory session covering agent orchestration, governance patterns, knowledge architecture, and system design recommendations.
Martin Pratt & Ben Buckland
One-shot LLM calls with no iteration, no validation, and no ability to pause for missing information.
No iteration — single pass, no ability to correct or refine. If the LLM got it wrong, start over.
No validation — output went straight to Excel with no battery of tests on the created rows.
No pause/resume — if information was missing, the entire process had to restart from scratch.
No planning — the LLM had to determine and execute everything in a single context window.
An API-exposed orchestrator managing a planner and builder agent loop, with MCP-exposed tools for deterministic model manipulation.
A run consists of plan→build iterations that loop until hard completion metrics are achieved or safety constraints are hit.
CFADS & EBITDA
Hard-defined targets. When the model produces Cash Flow Available for Debt Service (CFADS) and EBITDA, the run is considered complete.
Max Waves & Max Iterations
Prevents runaway loops. If the model hasn't completed within constraints, the run halts for human review.
Dynamic Injection
Completion metrics should become dynamically injectable — different models may have different definitions of “done.”
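The orchestrator loop described above can be sketched as follows. The `plan`, `build`, and `is_complete` callables and the limit values are illustrative stand-ins, with the completion predicate injected per run so that different models can carry different definitions of "done":

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RunLimits:
    max_waves: int = 5         # safety constraint: halt for human review past this
    max_iterations: int = 50

def run(plan: Callable[[dict], list],
        build: Callable[[object, dict], None],
        is_complete: Callable[[dict], bool],   # injectable definition of "done"
        model: dict,
        limits: RunLimits = RunLimits()) -> str:
    """Plan→build waves until completion metrics are met or safety limits hit."""
    iterations = 0
    for wave in range(limits.max_waves):
        for task in plan(model):               # planner proposes work for this wave
            if iterations >= limits.max_iterations:
                return "halted: max iterations, needs human review"
            build(task, model)                 # builder executes one task
            iterations += 1
        if is_complete(model):                 # e.g. CFADS and EBITDA both present
            return "complete"
    return "halted: max waves, needs human review"
```

The safety limits are checked before every build step, so a runaway planner can never exceed the iteration budget before the run halts for review.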
When information is missing, the run pauses, persists state, and waits for human input before resuming exactly where it left off.
The planner agent identifies that required data (e.g., tax rate, discount rate) is not available in the model metadata.
Run state is serialized to a database. The entire context — current wave, completed terms, pending work — is preserved.
Human analyst reviews the request, provides the missing data point, and hits enter. The system spools up again.
The data is created on the worksheet, the run resumes from exactly where it paused, and iterations continue.
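A minimal sketch of the pause/resume mechanic, assuming state is serialized as JSON into a relational table (the schema and field names are hypothetical):

```python
import json
import sqlite3

# Run state is serialized when data is missing, then restored verbatim once
# the analyst supplies the value. An in-memory DB stands in for persistence.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE runs (id TEXT PRIMARY KEY, state TEXT)")

def pause(run_id: str, state: dict, missing: str) -> None:
    state["awaiting"] = missing               # e.g. "tax_rate"
    db.execute("INSERT OR REPLACE INTO runs VALUES (?, ?)",
               (run_id, json.dumps(state)))

def resume(run_id: str, value: float) -> dict:
    row = db.execute("SELECT state FROM runs WHERE id = ?", (run_id,)).fetchone()
    state = json.loads(row[0])
    state[state.pop("awaiting")] = value      # analyst-provided data point
    return state                              # run continues from this exact point
```

Because the entire context round-trips through the database, the resumed run sees exactly the wave, completed terms, and pending work it had when it paused.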
Never let the thing doing the work evaluate its own output. Agents will execute their primary function at all costs — the “paperclip parable.”
Builder evaluates itself
It’s a hammer — everything looks like a nail. It will rationalize its own output to continue building, even when it should stop.
Planner evaluates itself
A planner's only job is to plan. It will always find more to plan, even when the plan is already sufficient.
Cross-evaluation
Builder evaluates planner output, a separate evaluator evaluates builder output. Different roles with different objectives create natural tension.
Separate governance layer
A dedicated governance component evaluates outputs and makes routing decisions. It has no incentive to build or plan — only to judge.
Governance sits between agents and decides what happens next. It evaluates, applies rules, and routes — separating these concerns prevents agents from overriding critical decisions.
By separating governance from execution, you create natural opposition. The builder wants to build; the governor wants correctness. This tension is the feature.
Bash scripts, hooks, fast deterministic checks. Governance should be super fast. Complex evaluation can run via SDK on hooks — pre-build and post-build.
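One way to sketch this governance layer is as ordered lists of pre-build and post-build checks, each a fast deterministic function. The hook names and payload shape here are illustrative, not a prescribed API:

```python
from typing import Callable, Optional

# Governance as fast, deterministic pre/post hooks. Each check returns an
# error string or None; any error blocks the transition.
PRE_BUILD: list[Callable[[dict], Optional[str]]] = []
POST_BUILD: list[Callable[[dict], Optional[str]]] = []

def govern(hooks, payload: dict) -> list[str]:
    """Run every check; the caller routes onward only if the list is empty."""
    return [err for check in hooks if (err := check(payload)) is not None]

# Example checks: millisecond structural validation, no LLM involved.
PRE_BUILD.append(lambda p: None if p.get("plan") else "empty plan")
POST_BUILD.append(lambda p: None if "output" in p else "builder produced no output")
```

Because the governor only judges and routes, it has no incentive to let a builder rationalize its way past a failed check.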
This isn't binary. Each subsystem sits somewhere on a spectrum between reliability (consistent but brittle) and resilience (adaptive but less predictable). Break the system down and decide per component.
| | Reliability | Resilience |
|---|---|---|
| Behavior | Consistent, predictable output every time | Finds ways around novel problems |
| Failure Mode | Brittle — breaks under novel conditions | May produce unexpected but valid results |
| Best For | Validation, ledger operations, deterministic tools | Planning, reasoning about unknown domains |
| Formuly Example | validate_term — deterministic battery of tests | Planner — reasoning about what to build next |
A purpose-driven content theory for prompting. The “Why” is the most underutilized component but the most impactful for resilience.
What action to perform, what output to produce. The core instruction the agent needs to execute.
Temporal context and conditions. When to act, under what circumstances, what state triggers this behavior.
The most critical component. Explains the reasoning behind the action. Lets the LLM reason about edge cases and make good judgment calls when conditions are ambiguous.
Specific instructions, constraints, format requirements. The mechanical details of execution.
When the LLM understands why it's doing something, it can reason about novel situations it hasn't been explicitly programmed for. This is the difference between a brittle rule-follower and a resilient reasoning agent.
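The four components can be assembled mechanically into a prompt. This template and its section labels are one possible rendering of the theory, not a fixed format:

```python
# Prompt built from the four components. The Why section is the one that
# lets the agent reason through ambiguous edge cases.
def build_prompt(what: str, when: str, why: str, how: str) -> str:
    return "\n\n".join([
        f"## What\n{what}",   # the action and the output to produce
        f"## When\n{when}",   # triggering conditions and state
        f"## Why\n{why}",     # the reasoning behind the action
        f"## How\n{how}",     # constraints and format requirements
    ])

prompt = build_prompt(
    what="Create the missing term rows on the worksheet.",
    when="After the planner flags a term whose inputs are all available.",
    why="Each term must trace to its inputs so analysts can audit any number.",
    how="Emit one row per term; never overwrite existing cells.",
)
```

Omitting the Why section degrades the agent to a brittle rule-follower; keeping it lets the same prompt survive situations the author never enumerated.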
Routing tables decide when to load additional context. But deeply nested routing can degrade reliability to a coin flip.
Nested routing tables → 50/50 reliability
Multi-layer routing structures compound reliability losses. Each layer that must correctly route reduces the probability of the right context being loaded. At depth, you’re essentially flipping a coin.
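The compounding is simple multiplication, which is why depth is so punishing. A quick illustration with assumed per-layer accuracies:

```python
# Per-layer routing accuracy compounds multiplicatively across nesting depth.
def end_to_end_reliability(per_layer: float, depth: int) -> float:
    return per_layer ** depth

# Three layers at 80% each already approach a coin flip; one shallow hop
# at 95% keeps its 95%.
deep = round(end_to_end_reliability(0.80, 3), 3)     # 0.512
shallow = round(end_to_end_reliability(0.95, 1), 3)  # 0.95
```

The exact per-layer numbers are illustrative, but the shape of the curve is not: every added routing layer multiplies in another failure probability.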
Agents will claim they’ve loaded a skill when they haven’t. You need deterministic verification — a bash script that checks and writes the actual usage to a table. Trust, but verify.
Massive flattening effort
Keep routing as shallow as possible. Eliminate unnecessary nesting. One level of routing that directly loads the right skill is vastly more reliable than three levels of indirection.
Deterministic checks that the correct skills are loaded before the agent proceeds. If the skill isn't there, the agent cannot start work. This is a hard gate, not a suggestion.
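A sketch of that hard gate, assuming a registry of actually-loaded skills and a usage table (both names hypothetical):

```python
# Hard gate: verify the skill is actually loaded rather than trusting the
# agent's claim, and record real usage before any work starts.
loaded_skills: set[str] = set()
usage_log: list[str] = []

def require_skill(skill: str) -> None:
    if skill not in loaded_skills:            # deterministic check, not self-report
        raise RuntimeError(f"skill '{skill}' not loaded; work cannot start")
    usage_log.append(skill)                   # write actual usage to the table
```

The check raises rather than warns: if the skill is not there, the agent cannot start work, which is what makes this a gate instead of a suggestion.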
Build a balanced ecosystem of agents and skills. Every agent has a tendency — create oppositional forces that naturally check each other.
Agents Take Courses
~9 top-level courses, ~30 sections. UX research, front-end, back-end, business process, architecture, and more. Loading a course = loading the required context.
Very Specific, Very Narrow
Skills should be the smallest set that covers everything the system needs to do. Focus the agent’s attention. Narrow scope = better quality and consistency.
Multi-Domain Skills
Some skills span domains — research + technical analysis, for example. These cross-sectional skills bridge the gaps between specialist agents.
Combine confidence evaluations to improve quality when you have many skills. Use the same eval primitives across the whole system. Engineers feed knowledge back into courses. 80% → 99% through incremental refinement.
Match the intelligence of your evaluation to the complexity of the judgment. Don’t use Opus where a bash script will do.
Deterministic Scripts
Bash scripts, hooks, hard rules. Completely deterministic. Run in milliseconds. Use for structural validation, format checks, constraint enforcement.
Haiku-Level LLM
Lightweight LLM checks with clear non-deterministic rules. Good for semantic similarity, basic reasoning, format validation beyond regex.
Sonnet-Level Reasoning
Extended thinking. Reasoning required. Can inject chain-of-thought for complex evaluations that need to weigh multiple factors.
Opus-Level Qualitative Analysis
Full qualitative judgment. Reserve for complex evaluation where the quality of reasoning directly impacts downstream correctness. Slowest but most capable.
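The ladder above can be expressed as a simple routing function. The tier names mirror the four levels; the mapping rules themselves are illustrative:

```python
# Route each evaluation to the cheapest tier that can make the judgment.
def pick_tier(check: dict) -> str:
    if check.get("deterministic"):
        return "script"               # bash/hooks: structural checks, milliseconds
    if check.get("complexity") == "semantic":
        return "haiku"                # light semantic checks beyond regex
    if check.get("complexity") == "reasoning":
        return "sonnet"               # multi-factor, chain-of-thought evaluation
    return "opus"                     # full qualitative judgment, slowest tier
```

The ordering matters: every check falls through to the cheapest tier that can handle it, so Opus-level evaluation is reached only when nothing simpler suffices.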
Create a structured domain model using an Entity Relationship Diagram. Knowledge graphs store the data points and relationships, queryable via graph traversal for consistent, fast access.
Deterministic structure for data. If it’s different every time, results will be different every time. Consistency requires structure.
Term dependencies from the planner can auto-generate the customer ERD. Standard template + custom per-customer graph generated dynamically.
Very fast relational queries. AI can translate natural language to Cypher. “Where did that number come from?” → trace to source document.
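A provenance query of that shape might look like the following. The graph schema here (`Term` nodes, `DEPENDS_ON` and `SOURCED_FROM` relationships, a `Document` node with a `path` property) is a hypothetical model, not Formuly's actual schema:

```python
# "Where did that number come from?" as a parameterized Cypher query against
# a hypothetical term graph.
def provenance_query(term: str) -> tuple:
    cypher = (
        "MATCH (t:Term {name: $term})-[:DEPENDS_ON*0..]->(d:Term) "
        "MATCH (d)-[:SOURCED_FROM]->(doc:Document) "
        "RETURN d.name AS input, doc.path AS source"
    )
    return cypher, {"term": term}
```

The variable-length `DEPENDS_ON*0..` pattern walks the full dependency chain, so the result traces a term all the way back to its source documents in one traversal.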
As the financial model grows, naive RAG will slow down. Graph-based retrieval with targeted vector subsets is more efficient for structured, relationship-heavy data.
| | Vector RAG | Graph-Based |
|---|---|---|
| Best For | Large documents, books, 100+ pages of unstructured text | Structured terms, relationships, dependencies between entities |
| Performance | Degrades as corpus grows — queries all vectors every time | Graph traversal is fast; queries only relevant subgraph |
| Updates | Requires re-vectorization on change | Nodes and edges update instantly |
| For Formuly | Overkill for 10-page docs — just load the whole thing | Term dependencies + descriptions stored as graph nodes with metadata |
Graph nodes can point to vector subsets. Query the graph to find relevant nodes, then run vector search only within those document chunks. Fast graph traversal narrows the search space before any vector similarity computation.
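A minimal sketch of that hybrid pattern, where the graph, chunk store, and embeddings are in-memory stand-ins for real components:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def hybrid_search(graph: dict, chunks: dict, node: str, query_vec, top_k: int = 2):
    # 1. Fast graph traversal narrows the search space to related nodes.
    candidates = [node] + graph.get(node, [])
    # 2. Vector similarity runs only within the chunks those nodes reference.
    pool = [(cid, vec) for n in candidates for cid, vec in chunks.get(n, [])]
    pool.sort(key=lambda item: cosine(item[1], query_vec), reverse=True)
    return [cid for cid, _ in pool[:top_k]]
```

The expensive similarity computation never touches the full corpus; the graph has already discarded everything unrelated to the queried node.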
“If it’s a 10-page doc, pull the whole thing. Who cares.” Vector adds complexity without value for documents under ~100 pages. Reserve it for genuinely large corpuses.
Formuly is aiming to become “financial modeling intelligence” — that’s both a system of record and a system of action. Each has different primary infrastructure.
Databases are the primary citizen
Recording what has happened. Historical data, audit trails, traceability. “Why is that number that number?”
Workflows are the primary citizen
Enabling humans and agents to do stuff. “Do this now, then do this, then compare.” Long-lived orchestration.
If you’re both, databases and workflows are equal citizens. Consider Temporal for workflow orchestration — “one of the best in the last 20 years.” Workflows should be long-lived and managed outside of application code.
Four high-impact actions that build on the current architecture and set the foundation for scale.
Ensure all system outputs — agent calls, tool invocations, evaluation results, governance decisions — adhere to the OTel standard. This is the foundation for debugging, improving evals, and understanding system behavior at scale.
Create services to persist the relationships between financial terms. “Revenue depends on volume and price. COGS depends on volume and unit cost.” This enables automated testing and model evolution tracking.
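A minimal in-memory version of such a service, using the Revenue/COGS examples above (a graph or relational store would back this in practice):

```python
from collections import defaultdict

# Term -> set of direct inputs.
deps: dict[str, set] = defaultdict(set)

def declare(term: str, *inputs: str) -> None:
    deps[term].update(inputs)

def upstream(term: str) -> set:
    """Everything a term transitively depends on; enables automated testing."""
    seen, stack = set(), list(deps[term])
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(deps[t])
    return seen

declare("Revenue", "volume", "price")
declare("COGS", "volume", "unit_cost")
```

Once dependencies are persisted this way, tests can assert that a term's inputs exist before it is built, and model evolution shows up as diffs on the dependency set.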
Pull evaluation and routing into a dedicated governance layer. Start with deterministic hooks and bash scripts. Add LLM-based evaluation only where deterministic checks aren’t sufficient.
Define Formuly’s own ERD (what is a worksheet, term, dependency, model). Create a generator for customer-specific ERDs built dynamically from planner output. This is the foundation for the knowledge graph.
The agent loop works.
Now build the infrastructure around it.
Martin Pratt & Ben Buckland • 31 March 2026