Two-Tier LLM Pipelines: Cost Firewalls for Production AI
The first time you check your OpenAI bill after a real traffic spike, something changes in you permanently. It's not the number itself; it's the realisation that every engineering decision you made in development, every "just call the API" shortcut, every missing cache, is now a line item that scales with your users.
I've shipped two AI-backed products: TaxLens, which analyses bank statements to estimate Nigerian income tax, and TrustRail, which underwrites BNPL applications from the same kind of documents. Both run GPT-4o in production. Both have cost architectures that are deliberately designed, not discovered after the fact. This is what those architectures look like and why they're shaped the way they are.
Let's dig in.
The Economics of Naive AI Pipelines
A naive AI pipeline has one design pattern: receive request, call the best model, return result. This works fine in development, where you're the only user and you're not watching the bill.
In production, the problems compound:
- Every user action maps to at least one expensive model call
- Abuse or unusual usage patterns map to an unusual bill
- OpenAI outages become your outages, with no graceful path
- A spike in traffic produces a proportional spike in cost, with no ceiling and no buffer
The cost model is fully linear and directly coupled to user behaviour. That's fine if your product has healthy unit economics on the AI spend. Most early-stage products don't.
The alternative isn't to avoid LLMs; it's to design the pipeline so that expensive calls are gated, deferred, and fallback-protected. That's what "cost firewalls" means in practice.
The Gate Pattern: Pay for Validation, Not for Analysis
The most impactful single change in both TaxLens and TrustRail was introducing a cheap validation call before the expensive extraction call.
In TaxLens, the pipeline is two sequential model calls:
Tier 1: Gate (OPENAI_GATE_MODEL, a fast, cheap model):
const gate = await llmClient.structured({
  tier: 'gate',
  code,
  model: env.OPENAI_GATE_MODEL,
  system: GATE_SYSTEM,
  user: 'Validate this bank statement.',
  pdf: { filename, base64: pdfBase64 },
  schema: GateVerdictSchema,
  schemaName: 'gate_verdict',
});
if (!gate.data.valid) {
  emit(await taxProcessRepository.advance(code, 'failed', {
    failureReason: gate.data.reason || 'Not a usable Nigerian bank statement',
    gateResponseId: gate.responseId,
  }));
  return; // analysis call never fires
}
Tier 2: Analysis (OPENAI_ANALYSIS_MODEL, a more capable model):
const analysis = await llmClient.structured({
  tier: 'analysis',
  code,
  model: env.OPENAI_ANALYSIS_MODEL,
  system: ANALYSIS_SYSTEM,
  user: 'Extract and classify the inflows, then annualise income.',
  pdf: { filename, base64: pdfBase64 },
  schema: AnalysisSchema,
  previousResponseId: gate.responseId,
  schemaName: 'statement_analysis',
});
The gate model answers a boolean question: is this document a real, legible Nigerian bank statement? If no, the pipeline terminates without ever touching the analysis model.
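The article does not show GateVerdictSchema, so what follows is only a guess at its shape, inferred from the valid and reason fields that the gate snippet reads. The GateVerdict type and both helper functions are hypothetical names used for illustration, and the real project may well use a schema library instead of a hand-written check.

```typescript
// Hypothetical shape of the gate verdict, inferred from `gate.data.valid`
// and `gate.data.reason` in the article's snippet. Not the real schema.
interface GateVerdict {
  valid: boolean;
  reason?: string;
}

// Narrow an untrusted model response to a GateVerdict, or throw.
function parseGateVerdict(raw: unknown): GateVerdict {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("gate verdict must be an object");
  }
  const candidate = raw as Record<string, unknown>;
  const valid = candidate["valid"];
  const reason = candidate["reason"];
  if (typeof valid !== "boolean") {
    throw new Error("gate verdict needs a boolean `valid` field");
  }
  if (reason !== undefined && typeof reason !== "string") {
    throw new Error("`reason`, when present, must be a string");
  }
  return reason === undefined ? { valid } : { valid, reason };
}

// The expensive tier runs only when the gate says the document is usable.
function shouldRunAnalysis(verdict: GateVerdict): boolean {
  return verdict.valid;
}
```

Keeping the verdict this small is part of the point: a boolean plus an optional reason keeps the gate's output tokens, and therefore its price, close to the minimum.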
The Cost Math
Approximate costs (gpt-4o-mini for gate, gpt-4o for analysis, PDF inputs):
- Gate call: ~$0.002 per document
- Analysis call: ~$0.06–0.08 per document
At a 15% invalid document rejection rate (photos of receipts, foreign bank statements, blank pages, users testing with wrong files), the cost per 1,000 uploads:
- Without gate: 1,000 × $0.07 = $70
- With gate: (1,000 × $0.002) + (850 × $0.07) = $61.50
That's a 12% saving on every batch of uploads. But the more important number is what the gate saves on abuse: an attacker or a confused user uploading 100 non-bank-statement PDFs costs $0.20 with a gate, not $7.00 without one. The gate is a rate firewall as much as a cost firewall.
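The arithmetic above can be reproduced with a small sketch. The prices are the approximate figures quoted in this section rather than published rates, and the TierPrices type and function names are made up for illustration.

```typescript
// Approximate per-document prices from the article; treat them as assumptions.
interface TierPrices {
  gateUsd: number; // cost of one gate call
  analysisUsd: number; // cost of one analysis call
}

// Without a gate, every upload pays for analysis.
function costWithoutGate(uploads: number, prices: TierPrices): number {
  return uploads * prices.analysisUsd;
}

// With a gate, every upload pays for the gate,
// and only accepted uploads go on to pay for analysis.
function costWithGate(
  uploads: number,
  rejectionRate: number,
  prices: TierPrices,
): number {
  const accepted = uploads * (1 - rejectionRate);
  return uploads * prices.gateUsd + accepted * prices.analysisUsd;
}

const prices: TierPrices = { gateUsd: 0.002, analysisUsd: 0.07 };

// 1,000 uploads with 15% rejected by the gate.
const baseline = costWithoutGate(1000, prices);
const gated = costWithGate(1000, 0.15, prices);
const savingPct = ((baseline - gated) / baseline) * 100;
console.log(baseline.toFixed(2), gated.toFixed(2), savingPct.toFixed(1));

// Abuse case: 100 uploads, all rejected by the gate.
const abuseGated = costWithGate(100, 1, prices);
const abuseUngated = costWithoutGate(100, prices);
console.log(abuseGated.toFixed(2), abuseUngated.toFixed(2));
```

Plugging in your own prices and rejection rate is the quickest way to see whether the 12% figure carries over to your traffic.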
Condition: When the Gate Saves vs. Costs
The gate adds latency: it is an extra call that must finish before the analysis call can start, so it cannot run in parallel. On fast infrastructure with a cheap gate model, this adds ~300–600ms. If your rejection rate is under 5%, the gate may cost more in cumulative latency than it saves in analysis calls. The threshold depends on:
- Gate model price vs. analysis model price (higher ratio → lower rejection rate needed to break even)
- Whether gate and analysis can share context via previousResponseId (TaxLens does this: the analysis continues the conversation from the gate response, avoiding re-sending the PDF)
- Your users' accuracy in uploading the right document type
At a 10%+ rejection rate, the gate is unambiguously worth it. Below 5%, measure before committing.
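On price alone, the break-even point can be worked out directly: the gate pays for itself once the analysis spend it avoids (rejection rate × analysis price) exceeds the gate price. The sketch below uses the approximate prices from this article and hypothetical function names. It deliberately ignores latency, which is presumably why the guidance above is more cautious than the pure price threshold.

```typescript
// Rejection rate at which the gate's cost equals the analysis spend it avoids.
// Derivation: gate + (1 - r) * analysis < analysis  =>  r > gate / analysis.
function breakEvenRejectionRate(gateUsd: number, analysisUsd: number): number {
  return gateUsd / analysisUsd;
}

// Net saving per upload at a given rejection rate (negative means the gate loses money).
function netSavingPerUpload(
  rejectionRate: number,
  gateUsd: number,
  analysisUsd: number,
): number {
  return rejectionRate * analysisUsd - gateUsd;
}

const gateUsd = 0.002; // assumption: the article's approximate gate price
const analysisUsd = 0.07; // assumption: midpoint of the article's analysis range

const threshold = breakEvenRejectionRate(gateUsd, analysisUsd);
console.log(`break-even rejection rate: ${(threshold * 100).toFixed(1)}%`);

for (const rate of [0.02, 0.05, 0.1, 0.15]) {
  const saving = netSavingPerUpload(rate, gateUsd, analysisUsd);
  console.log(`${rate * 100}% rejected: ${saving.toFixed(4)} USD saved per upload`);
}
```

With these numbers the price-only threshold sits a little under 3%, so a rejection rate between that and 5% saves money but may not justify the extra round-trip for an interactive product.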
The Queue as a Throughput Firewall
TrustRail's cost architecture is different from TaxLens because the use case is different. TaxLens is interactive: the user uploads and waits for a result in the same session. TrustRail is asynchronous: a business submits an application, and the analysis runs in a background job.
The statementAnalysisJob runs every 60 seconds. It fetches a maximum of 10 pending applications (FIFO, oldest first) and processes them sequentially:
const pendingApplications = await Application.find({
  status: 'PENDING_ANALYSIS',
})
  .sort({ submittedAt: 1 })
  .limit(10);
This creates a hard throughput ceiling: at most 10 GPT-4o calls per minute, regardless of how many applications are submitted. A burst of 50 simultaneous submissions doesn't produce a burst of 50 simultaneous API calls; it produces a queue that drains at a controlled rate over 5 minutes.
The cost implication: instead of "cost = f(submissions per second)", you get "cost = f(time)". The spend rate is predictable and bounded, independent of user behaviour spikes.
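To make the ceiling concrete, here is a small model of how a backlog drains and what the worst-case hourly spend looks like. It assumes each run finishes its batch before the next one starts and that no new submissions arrive while the backlog drains. The function names and the $0.08 per-call figure (the top of the range quoted in this article) are illustrative assumptions.

```typescript
// Number of model calls made on each job run while a backlog drains.
function drainSchedule(backlog: number, batchSize: number): number[] {
  if (batchSize <= 0) throw new Error("batchSize must be positive");
  const runs: number[] = [];
  let remaining = backlog;
  while (remaining > 0) {
    const processed = Math.min(batchSize, remaining);
    runs.push(processed);
    remaining -= processed;
  }
  return runs;
}

// 1-based run on which the application at `position` (1 = oldest) is picked up.
function runForPosition(position: number, batchSize: number): number {
  return Math.ceil(position / batchSize);
}

// Upper bound on hourly spend implied by the throughput ceiling.
function maxHourlySpendUsd(
  batchSize: number,
  intervalSeconds: number,
  costPerCallUsd: number,
): number {
  const runsPerHour = 3600 / intervalSeconds;
  return batchSize * runsPerHour * costPerCallUsd;
}

// A burst of 50 submissions with the article's settings: 10 per run, one run per minute.
console.log(drainSchedule(50, 10)); // five runs of 10 calls each
console.log(runForPosition(50, 10)); // the last application is picked up on run 5
console.log(maxHourlySpendUsd(10, 60, 0.08).toFixed(2)); // worst-case USD per hour
```

The third function is the useful one for budgeting: the ceiling depends only on the batch size, the interval, and the per-call price, not on how many applications arrive.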
The Queue vs. Direct Call Trade-off
The queue imposes a latency penalty. With 10 applications processed per 60-second run, the wait grows with the number of applications ahead in the queue: one that arrives behind roughly 90 others might wait up to 10 minutes for its first analysis attempt. This is acceptable for TrustRail because the user experience is "we'll notify you when your application is processed"; there's no interactive wait. It is not acceptable for TaxLens because the user is sitting on a loading screen.
The right pattern depends on whether your use case is request/response or fire-and-forget. If users must wait for the LLM response to continue, the queue is the wrong shape. If they submit and check back, the queue is exactly right.
The JS Fallback: The Zero-Cost Floor
TrustRail has a pure TypeScript underwriting engine (trustEngineService.ts) that can analyse a CSV bank statement without calling any external model. It's the fallback path when OpenAI is unavailable.
The job checks which path to use:
if (application.openai?.fileId) {
  // Primary path: GPT-4o
  try {
    trustEngineOutput = await analyzeFileWithOpenAI(...);
  } catch (error) {
    // Fallback path: JS engine
    if (application.bankStatementCsvData) {
      trustEngineOutput = await analyzeApplication(application.applicationId);
    } else {
      throw error;
    }
  }
} else {
  // Legacy path: JS engine only
  trustEngineOutput = await analyzeApplication(application.applicationId);
}
The fallback doesn't just exist for availability; it's the cost floor. During an OpenAI outage, the system keeps processing every application that has CSV data at $0 per analysis call. The JS engine is less capable (it can't read scanned PDFs, and it relies on regex-based transaction classification rather than semantic understanding), but it produces a valid decision. Something beats nothing.
This creates a two-tier cost model:
- Normal operation: GPT-4o primary, ~$0.06–0.08 per application
- Degraded operation: JS engine only, $0.00 per application
The system never stops working. The cost never exceeds the primary path ceiling.
The Circuit Breaker: Protecting Against Cascade Cost
TaxLens has a three-state circuit breaker wrapping every OpenAI call:
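closed → open (failureThreshold consecutive failures)
open → half_open (after cooldownMs)
half_open → closed (probe succeeds) or half_open → open (probe fails)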
The state machine: accumulate consecutive failures while closed. At the failureThreshold (configurable), transition to open. While open, every call fast-fails with CircuitOpenError and no API call is made. After cooldownMs, transition to half_open. Let one probe through. Success → closed. Failure → back to open.
async run<T>(fn: () => Promise<T>): Promise<{ result: T; stateAtCall: CircuitState }> {
  const state = this.getState();
  // Open: fail fast locally, no API call is made.
  if (state === 'open') {
    throw new CircuitOpenError(this.remainingCooldownMs());
  }
  // Half-open: let exactly one probe through and reject the rest.
  if (state === 'half_open') {
    if (this.halfOpenInFlight) throw new CircuitOpenError(this.remainingCooldownMs());
    this.halfOpenInFlight = true;
  }
  try {
    const result = await fn();
    this.onSuccess();
    return { result, stateAtCall: state };
  } catch (err) {
    this.onFailure();
    throw err;
  }
}
The cost implication: a spike of failed requests during an OpenAI degradation event doesn't generate a proportional number of slow, timing-out API calls (which can still consume tokens when they fail partway). With a failureThreshold of 3, once three consecutive calls have failed, the remaining requests in that burst fast-fail locally in microseconds without touching the API.
At 100 uploads arriving during an outage, without a circuit breaker, you might generate 100 failing API calls with partial token consumption. With a circuit breaker, you generate roughly 3 failing calls, then 97 local fast-fails (any calls already in flight when the breaker opens still run to completion). The cost difference at scale isn't trivial.
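The run method shown earlier leans on getState, onSuccess, and onFailure, which the article does not show. The sketch below is one plausible way to implement that bookkeeping, not the TaxLens code: the BreakerState class name and the injectable clock are assumptions, and the single-probe halfOpenInFlight flag is left out to keep it short.

```typescript
type CircuitState = "closed" | "open" | "half_open";

// Hypothetical state bookkeeping for a three-state circuit breaker.
class BreakerState {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold: number,
    cooldownMs: number,
    now: () => number = () => Date.now(),
  ) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // The state is derived from the clock, so no timer is needed to leave "open".
  getState(): CircuitState {
    if (this.openedAt === null) return "closed";
    return this.now() - this.openedAt >= this.cooldownMs ? "half_open" : "open";
  }

  remainingCooldownMs(): number {
    if (this.openedAt === null) return 0;
    return Math.max(0, this.cooldownMs - (this.now() - this.openedAt));
  }

  // Any success closes the breaker and clears the failure streak.
  onSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  onFailure(): void {
    const state = this.getState();
    if (state === "half_open") {
      this.openedAt = this.now(); // failed probe: reopen and restart the cooldown
      return;
    }
    if (state === "open") return; // late failure from a call that started before opening
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.openedAt = this.now();
    }
  }
}
```

Passing the clock in as a function makes the transitions easy to test without waiting for a real cooldown to elapse.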
What the Circuit Breaker Does Not Protect Against
The current implementation is per-process, in-memory. On a single Node.js instance, it works correctly. On two instances behind a load balancer, each maintains independent state: one may be open while the other is closed. A 50/50 load split means roughly half of requests still hit the API despite the circuit being "open" in aggregate terms.
This is documented in the design as a v2 concern. Moving consecutiveFailures, state, and openedAt to a shared store (Redis or MongoDB) would let every instance see the same breaker state. For a product running on a single instance, the in-memory version is the right starting point: no Redis dependency, no distributed locking, minimal latency overhead.
The Audit Repository: Observability as a Cost Instrument
Neither cost architecture works without visibility. TaxLens records every LLM call to llm_audit:
export interface LlmAuditDoc {
  code: string;
  tier: LlmTier; // 'gate' | 'analysis' | 'chat'
  model: string;
  requestId: string;
  promptHash: string; // SHA-256 of system + user, never raw text
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  circuitState: CircuitState;
  error?: string;
  createdAt: Date;
}
This is not just observability; it's a cost ledger. Queries against this collection answer:
- Which tier is consuming the most tokens? (Gate calling gpt-4o by mistake would be immediately visible.)
- What's the median latency by model? (Useful for deciding which gate model to use.)
- Are any code values accumulating unusually many chat tier calls? (A user asking 40 follow-up questions in one session is a unit economics anomaly worth catching.)
- What fraction of calls have circuitState: 'open'? (If this is non-zero during business hours, the circuit threshold may need tuning.)
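Two of the questions above can be sketched as plain functions over audit rows. This is an in-memory illustration only: in practice the same grouping would normally run as a database query, which the article does not show. AuditRow is a trimmed-down, hypothetical subset of LlmAuditDoc, and the function names are made up.

```typescript
type LlmTier = "gate" | "analysis" | "chat";

// Only the fields this sketch needs from the article's LlmAuditDoc.
interface AuditRow {
  code: string;
  tier: LlmTier;
  inputTokens: number;
  outputTokens: number;
}

// Which tier is consuming the most tokens?
function tokensByTier(rows: AuditRow[]): Record<LlmTier, number> {
  const totals: Record<LlmTier, number> = { gate: 0, analysis: 0, chat: 0 };
  for (const row of rows) {
    totals[row.tier] += row.inputTokens + row.outputTokens;
  }
  return totals;
}

// Which codes have made more chat calls than the allowed maximum?
function chatHeavyCodes(rows: AuditRow[], maxChatCalls: number): string[] {
  const counts = new Map<string, number>();
  for (const row of rows) {
    if (row.tier !== "chat") continue;
    counts.set(row.code, (counts.get(row.code) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .filter(([, calls]) => calls > maxChatCalls)
    .map(([code]) => code);
}

// Made-up sample rows, purely to show the call shape.
const sampleRows: AuditRow[] = [
  { code: "A1", tier: "gate", inputTokens: 900, outputTokens: 20 },
  { code: "A1", tier: "analysis", inputTokens: 9000, outputTokens: 1200 },
  { code: "A1", tier: "chat", inputTokens: 400, outputTokens: 150 },
  { code: "A1", tier: "chat", inputTokens: 420, outputTokens: 130 },
  { code: "B2", tier: "chat", inputTokens: 380, outputTokens: 90 },
];
console.log(tokensByTier(sampleRows));
console.log(chatHeavyCodes(sampleRows, 1));
```

The threshold passed to chatHeavyCodes is a product decision rather than a technical one; the audit collection only makes the anomaly visible.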
The promptHash field is deliberate: it's a SHA-256 of `${system}\n${user}`, not the raw statement content. The audit record proves the model was called and what it cost. It does not store PII. A regulator can verify the audit trail. A GDPR delete request doesn't require touching the audit collection.
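A hash like this takes a few lines with Node's built-in crypto module. The sketch below is an assumption about how promptHash might be computed: the function name and the hex encoding are illustrative, and only the SHA-256 over the system and user text joined by a newline comes from the article.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: hash the prompt text so the audit row can prove
// which prompt was used without storing the prompt itself.
function promptHash(system: string, user: string): string {
  return createHash("sha256")
    .update(`${system}\n${user}`, "utf8")
    .digest("hex");
}

const hashA = promptHash("You validate bank statements.", "Validate this bank statement.");
const hashB = promptHash("You validate bank statements.", "Validate this bank statement.");
console.log(hashA === hashB); // identical prompts always hash to the same value
console.log(hashA.length); // SHA-256 as hex is 64 characters long
```

Because identical prompts produce identical hashes, grouping audit rows by promptHash also shows how many distinct prompt variants are live in production.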
The Chat Tier: Controlling Interactive Costs
Both TaxLens and TrustRail have a chat feature: users can ask follow-up questions about their results. This is the highest-risk tier for cost, because a single user could send 50 questions and each question is a model call.
TaxLens's aiService.ask uses conversation threading via previousResponseId:
const result = await llmClient.structured({
  tier: 'chat',
  code,
  model: env.OPENAI_CHAT_MODEL,
  system: SYSTEM,
  user: `${context}\n\nQUESTION: ${question}`,
  schema: AnswerSchema,
  schemaName: 'grounded_answer',
  ...(process.analysisResponseId !== undefined
    ? { previousResponseId: process.analysisResponseId }
    : {}),
});
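The previousResponseId chains the chat turn to the prior analysis response. This means the client doesn't need to re-send the full bank statement and analysis context on every question; the model picks up where the previous turn left off. That shrinks the request payload, but it does not by itself guarantee a smaller bill: OpenAI's documentation for the Responses API notes that earlier turns in a chain are still billed as input tokens, so any token saving on subsequent turns comes from the platform's prompt caching rather than from chaining alone.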
The chat system prompt has one additional hard constraint:
You may ONLY explain the computed numbers provided to you in the CONTEXT below. NEVER produce, estimate, or invent a tax figure that is not already in that context.
This is both a grounding rule (from the previous article) and a cost rule. An LLM that freely computes new figures on demand will generate longer, more token-heavy responses. An LLM constrained to explain existing figures produces concise, bounded answers.
Trade-offs: What This Architecture Gives Up
Latency in the interactive case. The gate-then-analysis sequential structure adds at least one extra model round-trip before the user sees results. On a slow connection with a large PDF, the gate call alone can take 2–4 seconds. The total pipeline latency is gate + analysis + tax engine, not just analysis + tax engine.
Complexity. A single-model pipeline is easier to reason about, debug, and test. Two models with different responsibilities, a circuit breaker, a queue, a fallback JS engine, and an audit trail is more surface area. Every component is simple in isolation, but the interactions between them require careful thought.
Queue depth under load. The 10-per-minute processing ceiling in TrustRail means that during a large simultaneous submission event, applications wait. This is acceptable for asynchronous workflows and unacceptable for interactive ones. If TrustRail ever moves to synchronous approval, the queue architecture needs to be replaced or augmented.
Per-process circuit breakers break under horizontal scale. Documented above. Not a problem for single-instance deployments. A real problem when you scale out.
Evaluation
The gate pattern, combined with the circuit breaker and queue, produces three measurable properties:
Cost predictability: spend is bounded by the queue throughput ceiling and the gate rejection rate. Neither varies wildly with unexpected user behaviour.
Graceful degradation: OpenAI unavailability does not produce user-visible errors in TrustRail (JS fallback) or runaway retry costs (circuit breaker). TaxLens users see a clear error message after the circuit opens, which is the right failure mode for an interactive product.
Audit fidelity: every model call is logged with token counts and latency before the response is used. Cost visibility is a first-class product feature, not something reconstructed from OpenAI's dashboard after the fact.
The Mental Model
Treat LLM calls the way you treat database writes: never make one per user action if you can gate, batch, cache, or defer.
The gate is a pre-write validation. The queue is a write buffer. The circuit breaker is a connection pool limit. The fallback engine is a read replica. These aren't novel AI infrastructure concepts; they're standard distributed systems patterns applied to a dependency that happens to charge per request and has non-deterministic latency.
The bill is a feedback signal. If it's surprising, the architecture has a gap. Build the gap closed before the traffic arrives.