Testing Non-Deterministic Dependencies: Deterministic LLM Stubs That Preserve the Production Path

There is a class of test suite that is worse than having no tests at all. It passes most of the time. It fails occasionally with no clear reason. It passes again after a retry. You trust it when it passes. You ignore it when it fails. And at some point, it misses a real regression because you've trained yourself to dismiss its failures as noise.

A test suite that calls GPT-4o directly is that class of test suite.

Not because language models are bad — they're not. But because a test that depends on a remote API with non-deterministic output, variable latency, and token-metered calls cannot be the thing you block deploys on. It will flake. It will cost money in CI. It will be wrong about the right things and right about the wrong things.

TaxLens has a solution to this problem that I think is worth writing about in detail: a stub transport that runs through the exact same circuit breaker, audit writer, Zod schema validation, and retry logic as the real OpenAI transport. Tests that use the stub are not testing a simplified mock. They're testing production code paths against controlled, steerable inputs.

Let's get cracking.


Where Non-Determinism Actually Lives

This is the first question worth asking precisely. The entire TaxLens backend is not non-deterministic — only one seam is.

The seam is openaiInvoke in openai-client.ts:

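// The only non-deterministic call in the pipeline: one request to the OpenAI Responses API.
// model, schema and schemaName are in scope here; where they come from is not shown in this excerpt.
const openaiInvoke: Transport = async (params) => {
  const client = getClient();
  const result = await client.responses.parse({
    model,
    input: [...],
    // Ask the API for structured output that matches the Zod schema.
    text: { format: zodTextFormat(schema, schemaName) },
  });

  // Normalize the SDK response into the transport-agnostic result shape.
  return {
    parsed: result.output_parsed,
    responseId: result.id,
    inputTokens: result.usage?.input_tokens ?? 0,
    outputTokens: result.usage?.output_tokens ?? 0,
  };
};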

Everything before and after this function is deterministic:

  • The circuit breaker state machine (closed/open/half_open transitions on consecutive failures)
  • The Zod schema validation of the model's output
  • The repair-retry logic when output fails validation
  • The llm_audit writes (token counts, latency, circuit state)
  • The taxProcessRepository state transitions (validating → analyzing → ready/failed/needs_review)
  • The compareRegimes tax engine (pure TypeScript, no I/O)

That means the vast majority of the system is fully testable without any model involvement. The non-determinism is localized to a single function that can be swapped at the boundary.


The Transport Interface

The key architectural decision that makes all of this work is defining Transport as an explicit type:

export type Transport = <T>(params: StructuredCallParams<T>) => Promise<TransportResult>;

One line. Both openaiInvoke and stubInvoke implement this type. The shared shell — runStructured — receives a Transport and calls it. It never knows which one it has.

The selection happens once, at module initialization:

const transport: Transport = env.LLM_MODE === 'stub' ? stubInvoke : openaiInvoke;

LLM_MODE=stub in test and CI environments. LLM_MODE=live in production. Everything else runs unchanged.

This is the seam. And the invariant the seam must preserve is: the stub must be indistinguishable from the real transport to the code that uses it. Same return type. Same error types. Same audit writes. The shared shell runs whether the transport is real or stubbed — which means the circuit breaker, the schema validation, and the retry logic are all exercised by tests that never call OpenAI.
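To make the seam concrete, here is a minimal sketch of a shared shell that accepts any transport. It is not the actual runStructured: the simplified CallParams type, the Validator type (a hand-rolled stand-in for the Zod schema), the AuditEntry shape, the in-memory auditLog, and the names runShell, stubTransport and isGateVerdict are all assumptions made for illustration.

```typescript
// Hypothetical, simplified shapes. The real StructuredCallParams is not shown in this article.
type TransportResult = {
  parsed: unknown;
  responseId: string;
  inputTokens: number;
  outputTokens: number;
};
type CallParams = { tier: string; prompt: string };
type Transport = (params: CallParams) => Promise<TransportResult>;

// Stand-in for a Zod schema: returns the typed value, or null when the shape is wrong.
type Validator<T> = (value: unknown) => T | null;

type AuditEntry = { tier: string; ok: boolean; inputTokens: number; outputTokens: number };
const auditLog: AuditEntry[] = [];

// The shell never inspects which transport it was given.
async function runShell<T>(
  transport: Transport,
  params: CallParams,
  validate: Validator<T>,
): Promise<T> {
  let result: TransportResult;
  try {
    result = await transport(params);
  } catch (err) {
    // Transport failures are audited too, then rethrown unchanged.
    auditLog.push({ tier: params.tier, ok: false, inputTokens: 0, outputTokens: 0 });
    throw err;
  }
  const value = validate(result.parsed);
  auditLog.push({
    tier: params.tier,
    ok: value !== null,
    inputTokens: result.inputTokens,
    outputTokens: result.outputTokens,
  });
  if (value === null) throw new Error("contract violation: unexpected output shape");
  return value;
}

// A stub transport with a fixed, well-formed response.
const stubTransport: Transport = async () => ({
  parsed: { verdict: "valid" },
  responseId: "stub-1",
  inputTokens: 0,
  outputTokens: 0,
});

// Accepts any object that carries a string `verdict` field.
const isGateVerdict: Validator<{ verdict: string }> = (value) => {
  if (typeof value !== "object" || value === null) return null;
  const verdict = (value as { verdict?: unknown }).verdict;
  return typeof verdict === "string" ? { verdict } : null;
};
```

Calling runShell(stubTransport, { tier: "gate", prompt: "..." }, isGateVerdict) exercises the validation and the audit write exactly as a live transport would. Swapping the first argument is the only difference between the two modes.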


The Stub Transport: Steerable, Not Hardcoded

stub-transport.ts is not a single fixed response. It's a state machine with five steering mechanisms, checked in priority order:

1. Simulating a missing API key

if (env.LLM_STUB_UNCONFIGURED) {
  return Promise.reject(new UpstreamUnavailableError('AI is not configured (stub: unconfigured)'));
}

Every call throws. This simulates the path where OPENAI_API_KEY is absent. The real getClient() function throws the same UpstreamUnavailableError for the same reason. Tests that set LLM_STUB_UNCONFIGURED=true exercise the no-key path without requiring a missing key in the test environment.

2. Forced failure sequences for circuit breaker testing

let forcedFailuresRemaining = env.LLM_STUB_FAIL_TIMES;

if (forcedFailuresRemaining > 0) {
  forcedFailuresRemaining -= 1;
  return Promise.reject(new UpstreamUnavailableError('stub: forced upstream failure'));
}

LLM_STUB_FAIL_TIMES=3 forces three consecutive UpstreamUnavailableErrors across the next three calls, then reverts to the happy path. This drives the closed → open transition in the circuit breaker without any network involvement.

The module-level counter (forcedFailuresRemaining) persists across calls within a test run, which is exactly what you need to test consecutive-failure transitions. Tests that need to reset it call __resetStub(n):

export const __resetStub = (failTimes = env.LLM_STUB_FAIL_TIMES): void => {
  forcedFailuresRemaining = failTimes;
};

This is an explicitly test-only export. The naming convention (double underscore prefix) signals that it should never be called from production code.
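Here is a rough sketch of how a counting stub drives a breaker from closed to open. The breaker below is not the TaxLens implementation: SketchBreaker, makeFailingStub and driveBreaker are hypothetical names, the failure threshold of 3 is an assumption, and the half_open state is left out to keep the example short.

```typescript
class UpstreamUnavailableError extends Error {}

// Hypothetical minimal breaker: opens after `threshold` consecutive upstream failures.
class SketchBreaker {
  state: "closed" | "open" = "closed";
  private consecutiveFailures = 0;

  constructor(private readonly threshold: number) {}

  async run<T>(fn: () => Promise<T>): Promise<T> {
    // While open, fail fast without touching the transport.
    if (this.state === "open") throw new UpstreamUnavailableError("circuit open");
    try {
      const value = await fn();
      this.consecutiveFailures = 0; // any success resets the streak
      return value;
    } catch (err) {
      if (err instanceof UpstreamUnavailableError) {
        this.consecutiveFailures += 1;
        if (this.consecutiveFailures >= this.threshold) this.state = "open";
      }
      throw err;
    }
  }
}

// Counting stub in the spirit of LLM_STUB_FAIL_TIMES: fails n times, then succeeds.
const makeFailingStub = (failTimes: number) => {
  let remaining = failTimes;
  return async (): Promise<string> => {
    if (remaining > 0) {
      remaining -= 1;
      throw new UpstreamUnavailableError("stub: forced upstream failure");
    }
    return "ok";
  };
};

// Drives a fresh breaker `calls` times and reports the resulting state.
async function driveBreaker(failTimes: number, calls: number): Promise<"closed" | "open"> {
  const breaker = new SketchBreaker(3); // threshold of 3 is an assumption
  const stub = makeFailingStub(failTimes);
  for (let i = 0; i < calls; i += 1) {
    // Swallow errors; only the breaker state matters here.
    await breaker.run(stub).catch(() => undefined);
  }
  return breaker.state;
}
```

With this sketch, three forced failures in three calls leave the breaker open, while two failures followed by one success leave it closed because the success resets the streak.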

3. Filename-based document routing

const filename = pdf?.filename ?? '';
if (filename === 'fail.pdf') {
  return Promise.reject(new UpstreamUnavailableError('stub: forced upstream failure (fail.pdf)'));
}
if (tier === 'gate') {
  const verdict = filename === 'reject.pdf' ? GATE_REJECT : GATE_HAPPY;
  return Promise.resolve(result(verdict, tier, code));
}

A PDF named reject.pdf triggers an invalid document verdict. A PDF named fail.pdf throws. Any other name gets the happy-path response. This makes test intent visible in the test file itself — the name of the PDF being uploaded describes the scenario being tested.

4. The Kuda bug fixture

This is the one I'm most proud of, because it came from a real production observation.

Kuda MFB bank statements produce credits with narrations like "Stac Intercontinental Ltd transfer" and "Abolarinwa Babafemi transfer." Both are income — regular client payments. But the word "transfer" appears in both, and the analysis model, if it applies the classification guidance too conservatively, tags both as transfer rather than business. Result: grossAnnualKobo: 0 despite inflows summing to over ₦1.6M.

The stub captures this as a permanent fixture:

const ANALYSIS_ALL_TRANSFER = {
  inflows: [
    {
      date: '2026-04-24',
      description: 'Stac Intercontinental Ltd transfer',
      amountKobo: 100_000_000,
      classification: 'transfer',
    },
    {
      date: '2026-05-04',
      description: 'Abolarinwa Babafemi transfer',
      amountKobo: 60_000_000,
      classification: 'transfer',
    },
  ],
  grossAnnualKobo: 0,
};

Setting LLM_STUB_ANALYSIS=all_transfer in the test environment makes the stub return this fixture for every analysis call. The pipeline then hits the A2 guard:

const inflowsSumKobo = inflows.reduce((s, f) => s + f.amountKobo, 0);
const needsReview = grossAnnualKobo === 0 && inflowsSumKobo > 0;

And routes to needs_review. This edge case is now tested on every commit without ever needing the actual Kuda statement format or a live model call.

This is the real value of a steerable stub: production observations become permanent regression tests. The first time you see an edge case in prod, you add it to the stub. It never surprises you again.
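The guard is small enough to test as a plain function. The sketch below restates it with the fixture from above. The Inflow and Analysis types, the nextStatus helper, and the correctlyClassified counter-example (including its gross figure) are assumptions for illustration, not code from the pipeline.

```typescript
type Inflow = { date: string; description: string; amountKobo: number; classification: string };
type Analysis = { inflows: Inflow[]; grossAnnualKobo: number };

// Mirrors the guard quoted above: zero gross alongside non-zero inflows is suspicious.
function needsReview(analysis: Analysis): boolean {
  const inflowsSumKobo = analysis.inflows.reduce((sum, f) => sum + f.amountKobo, 0);
  return analysis.grossAnnualKobo === 0 && inflowsSumKobo > 0;
}

// Hypothetical routing helper; the status names come from the article.
function nextStatus(analysis: Analysis): "ready" | "needs_review" {
  return needsReview(analysis) ? "needs_review" : "ready";
}

// The Kuda fixture: 100,000,000 + 60,000,000 kobo of inflows, yet zero gross income.
const allTransfer: Analysis = {
  inflows: [
    {
      date: "2026-04-24",
      description: "Stac Intercontinental Ltd transfer",
      amountKobo: 100_000_000,
      classification: "transfer",
    },
    {
      date: "2026-05-04",
      description: "Abolarinwa Babafemi transfer",
      amountKobo: 60_000_000,
      classification: "transfer",
    },
  ],
  grossAnnualKobo: 0,
};

// Same inflows classified as business income; the gross figure is illustrative only.
const correctlyClassified: Analysis = {
  inflows: allTransfer.inflows.map((f) => ({ ...f, classification: "business" })),
  grossAnnualKobo: 160_000_000,
};
```

Here nextStatus(allTransfer) returns "needs_review" and nextStatus(correctlyClassified) returns "ready", which is the distinction the pipeline test asserts end to end.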

5. Non-conforming output for repair-retry testing

if (env.LLM_STUB_CHAT === 'nonconforming') {
  return Promise.resolve(result({ wrong: 'shape', answer: '' }, tier, code));
}

This returns an object that will fail the Zod schema validation in runStructured. The test exercises the repair-retry path:

  1. First attempt: stub returns { wrong: 'shape', answer: '' }
  2. schema.safeParse fails → LlmContractError
  3. runStructured logs a warning and retries
  4. Second attempt: same stub response (env hasn't changed) → same failure
  5. ProcessingError thrown: "The AI could not produce a grounded answer"

The test asserts that a ProcessingError is thrown and that the circuit breaker state did not change. An LlmContractError is not an outage, so it should not count against the breaker. Both of these assertions validate production behaviour without calling OpenAI.
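The five steps above can be sketched as a small retry wrapper. This is an illustration under assumptions, not the real runStructured: withRepairRetry, parseOrThrow, makeProducer and isChatAnswer are hypothetical names, and the validator (which requires a non-empty answer string) is a guess at why the nonconforming payload fails the real schema.

```typescript
class LlmContractError extends Error {}
class ProcessingError extends Error {}

// Stand-in for a Zod schema: returns the typed value, or null when the shape is wrong.
type Validator<T> = (value: unknown) => T | null;

function parseOrThrow<T>(validate: Validator<T>, raw: unknown): T {
  const value = validate(raw);
  if (value === null) throw new LlmContractError("output failed schema validation");
  return value;
}

// One initial attempt plus one repair retry. Only contract errors are retried.
async function withRepairRetry<T>(
  produce: () => Promise<unknown>,
  validate: Validator<T>,
  maxAttempts = 2,
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      return parseOrThrow(validate, await produce());
    } catch (err) {
      // Outages and other errors propagate untouched; a real shell would log a warning here.
      if (!(err instanceof LlmContractError)) throw err;
    }
  }
  throw new ProcessingError("The AI could not produce a grounded answer");
}

// Test double: replays the given responses in order, repeating the last one.
const makeProducer = (responses: unknown[]) => {
  let calls = 0;
  return {
    produce: async (): Promise<unknown> => {
      const response = responses[Math.min(calls, responses.length - 1)];
      calls += 1;
      return response;
    },
    callCount: (): number => calls,
  };
};

// Assumed contract: an object with a non-empty string `answer`.
const isChatAnswer: Validator<{ answer: string }> = (value) => {
  if (typeof value !== "object" || value === null) return null;
  const answer = (value as { answer?: unknown }).answer;
  return typeof answer === "string" && answer.length > 0 ? { answer } : null;
};
```

A producer that always returns { wrong: 'shape', answer: '' } is called exactly twice before ProcessingError surfaces. A producer whose second response conforms succeeds on the retry.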


What the Shared Shell Actually Tests

The key insight is that both transports run through runStructured. A test using the stub is not testing a simplified version of the system — it's running the full production logic path.

Here's the complete set of production components that are exercised by stub-mode tests:

Circuit breaker: breaker.run(() => transport(params)) runs for every call, real or stub. Failure sequences correctly drive state transitions.

Audit writes: llmAuditRepository.record(...) is called after every transport invocation. Tests can assert that audit records were created with the expected tier, model, circuitState, and error fields.

Schema validation: schema.safeParse(result.parsed) runs on every transport response. A stub that returns a wrong shape triggers the same LlmContractError path as a misbehaving real model.

Repair retry: LlmContractError triggers one retry attempt. Tests with LLM_STUB_CHAT=nonconforming drive this path to completion and assert on the final error type.

PII guarantee: the audit record stores promptHash, not the raw system or user prompt. This is verifiable in tests: assert that llm_audit records contain no bank statement content.
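The PII assertion can be sketched as follows. Only the promptHash field name comes from the text above. The trimmed AuditRecord shape, the choice of SHA-256 with a newline separator, and the helper names hashPrompt, buildAuditRecord and leaksNothing are assumptions for illustration.

```typescript
import { createHash } from "node:crypto";

// Hypothetical, trimmed audit record shape.
type AuditRecord = {
  tier: string;
  model: string;
  promptHash: string;
  inputTokens: number;
  outputTokens: number;
};

// SHA-256 over both prompts. The algorithm and separator are assumptions.
const hashPrompt = (systemPrompt: string, userPrompt: string): string =>
  createHash("sha256").update(systemPrompt).update("\n").update(userPrompt).digest("hex");

// Builds the record from the prompts without ever storing them.
function buildAuditRecord(
  tier: string,
  model: string,
  systemPrompt: string,
  userPrompt: string,
  usage: { inputTokens: number; outputTokens: number },
): AuditRecord {
  return { tier, model, promptHash: hashPrompt(systemPrompt, userPrompt), ...usage };
}

// Test helper: true when no sensitive string appears anywhere in the serialized record.
const leaksNothing = (record: AuditRecord, sensitive: string[]): boolean => {
  const serialized = JSON.stringify(record);
  return sensitive.every((text) => !serialized.includes(text));
};
```

A test would build a record from a prompt that contains statement narrations, then assert leaksNothing(record, [...narrations]). Because the hash is deterministic, the same prompt pair always maps to the same promptHash, so records remain correlatable without retaining the text.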


Testing the Tax Engine in Isolation

The tax engine (compareRegimes, computeFromGross) consists of pure TypeScript functions with no I/O. They take income figures and profile parameters and return a full computation. They have no retries, no timeouts, no external calls.

Pure functions should be tested exhaustively with explicit input/output pairs. No stubs needed, no mocks, no setup:

// These are representative — a real suite would cover every NTA 2025 band boundary

test('salary earner at ₦2.4M annual gross computes correct liability', () => {
  const result = computeFromGross('salary_earner', 240_000_000); // 240M kobo = ₦2.4M
  expect(result.newRegime.taxPayableKobo).toBe(/* NTA 2025 Fourth Schedule calculation */);
  expect(result.recommendation).toBe('new_regime');
});

test('gross of zero produces zero liability', () => {
  const result = computeFromGross('salary_earner', 0);
  expect(result.newRegime.taxPayableKobo).toBe(0);
  expect(result.oldRegime.taxPayableKobo).toBe(0);
});

These tests run in milliseconds and, once they cover every band boundary, give you high confidence in the computation logic. They are the tests that matter most, because if the tax engine is wrong, users file incorrect returns. The LLM stub tests confirm the pipeline plumbing. The tax engine tests confirm the law is correctly implemented.
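Band-boundary coverage lends itself to a table of input/output pairs. The sketch below shows the pattern against a generic progressive calculator. The bands are invented for illustration and are NOT the NTA 2025 rates, and progressiveTaxKobo is a hypothetical stand-in, not computeFromGross.

```typescript
// HYPOTHETICAL bands for illustration only. These are NOT the NTA 2025 rates.
type Band = { upToKobo: number; ratePct: number };
const SKETCH_BANDS: Band[] = [
  { upToKobo: 100_000_000, ratePct: 0 },
  { upToKobo: 300_000_000, ratePct: 10 },
  { upToKobo: Number.POSITIVE_INFINITY, ratePct: 20 },
];

// Progressive tax: each slice of income is taxed at its own band's rate.
function progressiveTaxKobo(grossKobo: number, bands: Band[] = SKETCH_BANDS): number {
  let tax = 0;
  let lower = 0;
  for (const band of bands) {
    if (grossKobo <= lower) break;
    const slice = Math.min(grossKobo, band.upToKobo) - lower;
    tax += Math.floor((slice * band.ratePct) / 100);
    lower = band.upToKobo;
  }
  return tax;
}

// Boundary table: the top of each band, and 100 kobo past it.
const boundaryCases: Array<[grossKobo: number, expectedKobo: number]> = [
  [0, 0],
  [100_000_000, 0], // top of the 0% band
  [100_000_100, 10], // 100 kobo into the 10% band
  [300_000_000, 20_000_000], // top of the 10% band
  [300_000_100, 20_000_020], // 100 kobo into the 20% band
];

// Empty when every boundary case matches.
const boundaryFailures = boundaryCases.filter(
  ([gross, expected]) => progressiveTaxKobo(gross) !== expected,
);
```

In a real suite each row would become its own named test case, so a failure points at the exact boundary that broke. Keeping amounts in integer kobo avoids floating-point drift in the expected values.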


Testing the State Machine

The taxProcessRepository persists pipeline state transitions. The sequence is:

created → validating → analyzing → ready
                    ↘           ↘ needs_review
                     failed

These transitions are driven by taxProcessRepository.advance(code, newState, data). With an in-memory MongoDB instance (or a real test database), the state machine can be driven entirely through the stub transport:

  • reject.pdf → gate fails → advance to failed
  • fail.pdf → gate throws → advance to failed with unavailability reason
  • LLM_STUB_ANALYSIS=all_transfer → analysis produces zero gross → advance to needs_review
  • Default happy path → advance to ready with computed tax figures

No OpenAI calls. Full state machine coverage. Each scenario maps directly to a stub steering configuration.
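One way to pin these paths down is a transition table plus a scenario table. The sketch below is an assumption-laden illustration: the allowed transitions are read off the diagram above (whether analyzing may move to failed is my assumption), and ALLOWED, advanceState, EXPECTED_PATH and walk are hypothetical names rather than the repository's API.

```typescript
type State = "created" | "validating" | "analyzing" | "ready" | "needs_review" | "failed";

// Allowed transitions, read off the diagram. analyzing -> failed is an assumption.
const ALLOWED: Record<State, readonly State[]> = {
  created: ["validating"],
  validating: ["analyzing", "failed"],
  analyzing: ["ready", "needs_review", "failed"],
  ready: [],
  needs_review: [],
  failed: [],
};

// Rejects any transition the table does not list.
function advanceState(current: State, next: State): State {
  if (!ALLOWED[current].includes(next)) {
    throw new Error(`illegal transition: ${current} -> ${next}`);
  }
  return next;
}

// Each stub steering configuration maps to the path it is expected to produce.
type Scenario = "happy" | "reject.pdf" | "fail.pdf" | "all_transfer";
const EXPECTED_PATH: Record<Scenario, State[]> = {
  happy: ["validating", "analyzing", "ready"],
  "reject.pdf": ["validating", "failed"],
  "fail.pdf": ["validating", "failed"],
  all_transfer: ["validating", "analyzing", "needs_review"],
};

// Replays a path from `created` and returns the final state.
function walk(path: State[]): State {
  return path.reduce<State>((state, next) => advanceState(state, next), "created");
}
```

A pipeline test can then compare the states it observed under each stub configuration with EXPECTED_PATH, and walk guards the table itself against paths that the transition rules would never allow.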


Trade-offs

The stub must be maintained. When the real model's output schema changes, the stub must be updated to match. A stub that diverges from the real transport's contract gives you green tests against a broken production path — which is the original problem in a different form.

The mitigation: the Transport type acts as a contract. The stub implements Transport. If the TransportResult type changes, TypeScript will catch stub divergence at compile time for structural changes. Semantic drift (the model now returns a different classification vocabulary, for example) requires human attention — it's the kind of thing that should be caught by integration tests that run against real models periodically, not on every commit.

The fixture set is a maintenance surface. ANALYSIS_HAPPY, ANALYSIS_ALL_TRANSFER, GATE_HAPPY, GATE_REJECT, CHAT_ANSWER, CHAT_REFUSE — each fixture is a claim about what the real model would return. As the model's system prompts evolve, the fixtures can fall out of sync.

The practice I follow: when any system prompt changes, run the real model against a sample document and compare the output structure to the existing fixture. If the structure matches, the fixture is still valid. If not, update it. This is one manual step per prompt change, not per commit.
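The structural comparison can be partly automated. The sketch below reduces two JSON values to a shape signature and compares them. shapeOf and sameShape are hypothetical helpers, and inspecting only the first element of each array is a simplifying assumption.

```typescript
// Reduces a JSON value to a structural signature: keys and primitive type names, values dropped.
// Arrays are represented by the shape of their first element (a simplifying assumption).
function shapeOf(value: unknown): unknown {
  if (Array.isArray(value)) return value.length > 0 ? [shapeOf(value[0])] : [];
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => a.localeCompare(b)) // key order must not affect the result
      .map(([key, v]) => [key, shapeOf(v)] as const);
    return Object.fromEntries(entries);
  }
  return value === null ? "null" : typeof value;
}

// True when both values have the same keys and the same primitive types throughout.
const sameShape = (a: unknown, b: unknown): boolean =>
  JSON.stringify(shapeOf(a)) === JSON.stringify(shapeOf(b));

// Illustrative values: a fixture and a live response that differ only in content.
const fixtureSample = {
  inflows: [{ amountKobo: 1, classification: "transfer" }],
  grossAnnualKobo: 0,
};
const liveSample = {
  grossAnnualKobo: 5,
  inflows: [{ classification: "business", amountKobo: 2 }],
};
const fixtureStillValid = sameShape(fixtureSample, liveSample);
```

This only detects structural drift, such as a renamed key or a number that became a string. An empty array will not match a populated one, so the comparison needs a sample document that fills every list. A changed classification vocabulary still needs a human to read the output.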

The stub cannot test model quality. The stub tests that the pipeline handles the model's output correctly. It says nothing about whether the model produces good output. Classification accuracy, income estimation quality, and chat answer relevance are not tested by stubs — they require human evaluation or a separate LLM-as-judge harness that runs against the real model on a schedule, not on every commit.


The Rule

Isolate non-determinism at the transport boundary. Define a Transport interface. Make the stub implement it exactly. Steer the stub via environment variables, not hardcoded single-case fixtures. The stub should be able to reproduce every production edge case you've ever seen — including the Kuda mis-classification, the forged document, the 503 during peak traffic, and the model that returns empty output on an otherwise valid prompt.

When a new edge case appears in production, the first commit is always the same two things: the fix, and the stub configuration that reproduces the scenario it came from.

The stub is not a test convenience. It's a production incident archive.