Verification: Property Testing, Invariants, and the Bridge to AI Evals (Part 12 of the Functional Programming Series)

Part 12 and the keystone of the Functional Programming in Practice series. Throughout the series I kept saying "we will verify this with property tests later." This is later, and it turns out the technique is the direct ancestor of the AI evals the whole industry is reinventing right now.

Hey there, Coding Chefs! 👨‍💻

I have been writing a lot lately about how I use AI agents to build and test software: automated QA flows, agentic test harnesses, evals that judge whether a model did its job. And the funny thing is, every time I design an eval, I get the strong feeling I have done this before. Describe what should be true, throw a pile of adversarial inputs at the system, and automatically flag anything that violates the rules.

That feeling is not déjà vu. The functional programming community has been doing exactly this since QuickCheck appeared for Haskell around 2000, under the name property-based testing. The eval techniques modern AI teams are scrambling to invent are, structurally, the same thing FP shipped decades ago. This final part closes the loop on the verification I kept promising, and opens the door to where my work is heading. By the end you will have a template for verifying any feature, AI or not. Let's get cracking.

The Problem With Example-Based Tests

Most of us test by example. You pick a few inputs, write down the expected outputs, and assert. add(2, 3) should be 5. cleanDescription(" hi ") should be "Hi". This is fine, and you should keep doing it. But it has a blind spot: you only test the cases you thought of. The bug lives in the case you did not think of, the empty input, the negative number, the unicode name, the value right at the boundary.

Property-based testing flips the approach. Instead of asserting specific input-output pairs, you assert a property that must hold for all inputs, then let a library generate hundreds or thousands of random inputs, including nasty edge cases you would never hand-pick, and check the property against every one. If it finds a violation, it even shrinks the failing input down to the smallest example that breaks it. In TypeScript the tool is fast-check.
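To make the blind spot and the shrinking concrete, here is a minimal sketch. buggyMax is a hypothetical function I made up for this example. It passes a reasonable hand-picked test, and the property still catches it. I use fc.check instead of fc.assert so the run returns a report instead of throwing.

```typescript
import * as fc from "fast-check";

// Hypothetical buggy function: meant to return the largest element,
// but it starts the fold at 0, so it is wrong when every element is negative.
function buggyMax(xs: number[]): number {
  return xs.reduce((acc, x) => (x > acc ? x : acc), 0);
}

// The example-based test passes, because I only thought of positive numbers.
console.log(buggyMax([1, 5, 3]) === 5);

// Property: for any non-empty array, the maximum is one of its elements.
const maxIsAnElement = fc.property(
  fc.array(fc.integer(), { minLength: 1 }),
  (xs) => xs.includes(buggyMax(xs))
);

// fc.check runs the property and returns a report instead of throwing.
const report = fc.check(maxIsAnElement, { numRuns: 1000 });

console.log(report.failed);         // whether a counterexample was found
console.log(report.counterexample); // the failing input after shrinking
```

The counterexample in the report is the shrunk one, so you debug a tiny array instead of whatever large random input first tripped the property.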

This is exactly how you verify the laws I kept hand-waving past throughout the series. Remember the Functor laws, or Monoid associativity, the ones I said "TypeScript cannot enforce, so you verify them"? Property tests are how you verify them.

Case Study 1: Verifying a Pure Function and Its Laws

Take a pricing function, the kind of pure calculation that should sit in your functional core. Example tests check a few prices. Property tests check that the rules always hold.

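import * as fc from "fast-check";

// Assumed implementation so the snippet runs on its own; swap in your real pricing function.
const applyDiscount = (price: number, discount: number): number => price * (1 - discount);

// Property: for any non-negative price and any discount between 0% and 50%,
// the discounted price is never negative and never exceeds the original price.
fc.assert(
  fc.property(
    // noNaN matters: fast-check's floating-point arbitraries can generate NaN
    // unless told not to, and NaN fails every comparison.
    fc.double({ min: 0, max: 1_000_000, noNaN: true }),
    fc.double({ min: 0, max: 0.5, noNaN: true }),
    (price, discount) => {
      const final = applyDiscount(price, discount);
      return final >= 0 && final <= price; // an invariant that must ALWAYS hold
    }
  )
);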

You are not checking "price 1000 with 10% off is 900." You are checking "for every price and every valid discount, the result stays non-negative and never exceeds the original." fast-check throws generated pairs at it (100 runs by default, more if you raise numRuns), including 0, tiny values, and boundary cases, and if any single one breaks the invariant, you hear about it with a shrunk, usually minimal, failing example. This is how you verify a Monoid is really associative or a Functor really obeys its laws: state the law as a property, generate inputs, let the machine hunt for a counterexample.
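Here is what "state the law as a property" can look like for the laws themselves. This is a sketch under assumptions: the Monoid interface and stringMonoid are illustrative shapes I define here, not necessarily the ones from earlier parts, and I use the built-in Array as the Functor.

```typescript
import * as fc from "fast-check";

// A tiny Monoid shape, assumed for this example.
interface Monoid<A> {
  empty: A;
  concat: (x: A, y: A) => A;
}

const stringMonoid: Monoid<string> = { empty: "", concat: (x, y) => x + y };
const { empty, concat } = stringMonoid;

// Monoid law 1: associativity, (a + b) + c === a + (b + c).
fc.assert(
  fc.property(fc.string(), fc.string(), fc.string(), (a, b, c) =>
    concat(concat(a, b), c) === concat(a, concat(b, c))
  )
);

// Monoid law 2: empty is a left and right identity.
fc.assert(
  fc.property(fc.string(), (a) => concat(empty, a) === a && concat(a, empty) === a)
);

// Functor laws for Array, checked with two fixed sample functions.
const f = (n: number) => n + 1;
const g = (n: number) => n * 2;
const sameArray = (xs: number[], ys: number[]) =>
  JSON.stringify(xs) === JSON.stringify(ys);

fc.assert(
  fc.property(fc.array(fc.integer()), (xs) => {
    const identityLaw = sameArray(xs.map((x) => x), xs);
    const compositionLaw = sameArray(xs.map((x) => g(f(x))), xs.map(f).map(g));
    return identityLaw && compositionLaw;
  })
);
```

Passing runs do not prove the law, they only fail to find a counterexample. That is still far more evidence than three hand-picked cases.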

Case Study 2: Verifying Stateful Behavior With Invariants

Pure functions are the easy case. The harder case is stateful systems: a queue, a cache, a state machine. Here you use stateful property testing, which fast-check and most other libraries call model-based testing: you describe the invariants that must hold no matter what sequence of operations runs, then generate random sequences of operations and check the invariants after each operation.

Say you have a queue consumer that must process messages in order and never process the same message twice (ordering and idempotency, two classic invariants). You generate random interleavings of enqueue, process, retry, and crash-recover operations, and after each sequence you assert: every processed message appeared exactly once, and in order. A hand-written test would check one or two sequences. Model-based testing checks thousands of adversarial orderings, which is exactly where the nasty concurrency bugs hide, the duplicate-delivery and out-of-order bugs that only show up under a specific unlucky sequence.

The shape is always the same: describe the properties that must stay true, generate adversarial inputs, detect violations automatically. Hold onto that shape, because it is about to reappear somewhere you might not expect.
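The queue example can be sketched in a few dozen lines. Everything here is an assumption for illustration: the Consumer class is a toy in-memory stand-in, crashRecover simulates a crash by rolling the committed offset back by one, and I generate plain lists of operation names rather than using fast-check's dedicated fc.commands and fc.modelRun helpers, to keep the sketch small.

```typescript
import * as fc from "fast-check";

const OPS = ["enqueue", "process", "retry", "crashRecover"] as const;
type Op = (typeof OPS)[number];

// Toy consumer: a durable log, a committed offset, and a dedupe set.
class Consumer {
  private log: number[] = [];         // message ids in enqueue order
  private committed = 0;              // how many messages are committed
  private seen = new Set<number>();   // dedupe set that makes handling idempotent
  private nextId = 0;
  readonly processed: number[] = [];  // side effects, in the order they happened

  enqueue(): void {
    this.log.push(this.nextId++);
  }

  process(): void {
    if (this.committed >= this.log.length) return; // nothing pending
    this.handle(this.log[this.committed]);
    this.committed++;
  }

  retry(): void {
    // At-least-once delivery: the last committed message is delivered again.
    if (this.committed > 0) this.handle(this.log[this.committed - 1]);
  }

  crashRecover(): void {
    // Simulated crash before the last commit was persisted.
    if (this.committed > 0) this.committed--;
  }

  private handle(id: number): void {
    if (this.seen.has(id)) return; // duplicate delivery, ignore
    this.seen.add(id);
    this.processed.push(id);
  }
}

// Ids are assigned 0, 1, 2, ... so "exactly once, in order" means processed[i] === i.
const invariantsHold = (processed: readonly number[]): boolean =>
  processed.every((id, i) => id === i);

fc.assert(
  fc.property(fc.array(fc.constantFrom(...OPS), { maxLength: 50 }), (sequence: Op[]) => {
    const consumer = new Consumer();
    for (const op of sequence) {
      consumer[op](); // operation names match method names
      if (!invariantsHold(consumer.processed)) return false;
    }
    return true;
  })
);
```

Delete the dedupe check in handle and the property should fail quickly, typically shrinking to a short sequence along the lines of enqueue, process, retry.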

Case Study 3: Verifying an LLM, Which Is Just Evals

Now the bridge. Suppose you have an LLM-powered classifier, say it reads a bank transaction description and tags it as salary, loan repayment, or transfer. How do you test a thing that is non-deterministic and has no single "correct" output you can hard-code?

You reach for the exact same shape. Watch the parallel:

  • Golden sets are example-based tests. A curated set of inputs with known-correct labels, asserted directly. The few cases you are sure about.
  • Invariant checks are property-based tests for the model. "For any input containing a known salary employer, the output must never be transfer." "The confidence score must always be between 0 and 1." "The same input run twice should give the same label more than 95% of the time." These are properties that must hold across all inputs, checked against generated adversarial ones, identical in structure to the Functor-law and queue-invariant checks above.
  • LLM-as-judge is an assertion with a learned oracle. When there is no mechanical correct answer, you use another model, calibrated against human judgment, to score whether an output satisfies the property. It is the same "describe what must be true, check it automatically" move, with a learned judge standing in for a hard-coded oracle. (Note that "model" here means a machine-learning model, not the simplified reference model of model-based testing in Case Study 2.)

That is an eval. An AI eval is property-based testing with a fuzzier oracle. The salary classifier and the queue consumer and the pricing function are all verified by the same three-part skeleton: describe the properties, generate adversarial inputs, detect violations automatically. The FP community formalized that skeleton about twenty-five years ago. The AI community is rediscovering it under a new name because the problem, trusting a system whose full input space you cannot enumerate, is the same problem.
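The golden-set and invariant-check bullets above can be sketched like this. Big assumption up front: classify is a deterministic stub standing in for the real model call, and the labels, employer names, and 95% threshold wiring are all hypothetical. A real classifier would be async and non-deterministic, so you would use fc.asyncProperty and the consistency check would actually earn its keep.

```typescript
import * as fc from "fast-check";

type Label = "salary" | "loan_repayment" | "transfer";
interface Classification {
  label: Label;
  confidence: number;
}

// Hypothetical stand-in for the LLM call.
function classify(description: string): Classification {
  const text = description.toLowerCase();
  if (text.includes("payroll") || text.includes("acme corp")) {
    return { label: "salary", confidence: 0.9 };
  }
  if (text.includes("loan")) return { label: "loan_repayment", confidence: 0.8 };
  return { label: "transfer", confidence: 0.6 };
}

const KNOWN_SALARY_EMPLOYERS = ["ACME CORP", "GLOBEX PAYROLL"]; // made-up names

// Golden set: example-based checks for the cases we are sure about.
const goldenSet: Array<[string, Label]> = [
  ["ACME CORP SALARY JAN", "salary"],
  ["LOAN REPAYMENT 0042", "loan_repayment"],
];
for (const [input, expected] of goldenSet) {
  if (classify(input).label !== expected) throw new Error(`golden case failed: ${input}`);
}

// Invariant 1: confidence is always within [0, 1], whatever the input.
fc.assert(
  fc.property(fc.string(), (description) => {
    const { confidence } = classify(description);
    return confidence >= 0 && confidence <= 1;
  })
);

// Invariant 2: a known salary employer, surrounded by arbitrary noise, is never a transfer.
fc.assert(
  fc.property(
    fc.string(),
    fc.constantFrom(...KNOWN_SALARY_EMPLOYERS),
    fc.string(),
    (before, employer, after) =>
      classify(`${before} ${employer} ${after}`).label !== "transfer"
  )
);

// Invariant 3: repeated runs agree with the first run more than 95% of the time.
function agreementRate(description: string, runs: number): number {
  const first = classify(description).label;
  let same = 0;
  for (let i = 0; i < runs; i++) {
    if (classify(description).label === first) same++;
  }
  return same / runs;
}
fc.assert(fc.property(fc.string(), (d) => agreementRate(d, 20) > 0.95), { numRuns: 20 });
```

Notice that nothing in the three invariants knows the correct label for any generated input. That is what makes them properties rather than examples.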

The Thesis: Evals Will Mature by Absorbing FP's Playbook

Here is the prediction I will put my name on. Over the next couple of years, AI evals are going to mature largely by absorbing what functional programming already shipped: generative input fuzzing, shrinking failing cases to minimal reproductions, stateful model-based testing for multi-turn agents, and invariant specification as the primary unit of a test rather than example pairs. The vocabulary will be new (evals, rubrics, judges, traces) but the bones are property-based testing, and the teams that already think in invariants will have a head start, because they are not learning a new idea, they are applying an old one to a new kind of system.

The reason this matters for you, the engineer reading a functional programming series, is that the discipline transfers. The same instinct that makes you ask "what property must hold for every input to this pure function?" is the instinct that makes you write a good eval for an agent. You have been training the right muscle this whole series.

A Starter Template for Verifying Anything

Here is the template, distilled. It works for a pure function, a stateful system, or an AI feature, because the skeleton is shared.

  1. State the invariants. Before writing a single test, write down what must always be true. Not example outputs, universal rules. ("Result is never negative." "Order is preserved." "A flagged transaction is never auto-approved.")
  2. Pin a golden set. A handful of inputs with known-correct outputs, asserted directly. Your anchor of certainty.
  3. Generate adversarial inputs. Use a generator (fast-check for code, a curated adversarial set or a generator model for AI) to produce the cases you would never hand-pick, especially boundaries and weird combinations.
  4. Check invariants automatically against every generated input. Let the machine hunt for the counterexample.
  5. Shrink and reproduce. When something fails, reduce it to the smallest input that still breaks the rule, and freeze that as a permanent regression test.
  6. For fuzzy oracles, add a calibrated judge. When there is no mechanical correct answer, use a judge (human-calibrated, possibly an LLM) to score whether the property held.

Run that loop and you have eval-driven development, whether the thing under test is a pricing function, a queue, or an agent.
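Steps 1 through 5 can be sketched as one small test file. The applyDiscount function and every value below are hypothetical, chosen just to show the shape. The examples option of fc.assert is how I freeze previously found inputs so they are always run in addition to the generated ones.

```typescript
import * as fc from "fast-check";

// Hypothetical function under test.
const applyDiscount = (price: number, discount: number): number =>
  price * (1 - discount);

// Step 1: state the invariant as a predicate over inputs and output.
const invariant = (price: number, discount: number): boolean => {
  const final = applyDiscount(price, discount);
  return final >= 0 && final <= price;
};

// Step 2: pin a golden set of known-correct pairs.
const goldenSet = [
  { price: 1000, discount: 0.1, expected: 900 },
  { price: 0, discount: 0.5, expected: 0 },
];
for (const { price, discount, expected } of goldenSet) {
  if (Math.abs(applyDiscount(price, discount) - expected) > 1e-9) {
    throw new Error(`golden case failed for price ${price}`);
  }
}

// Steps 3 and 4: generate inputs and check the invariant on every one.
// Step 5: inputs that broke the rule in the past are frozen under `examples`
// (these two are placeholders, not real past failures).
fc.assert(
  fc.property(
    fc.double({ min: 0, max: 1_000_000, noNaN: true }),
    fc.double({ min: 0, max: 0.5, noNaN: true }),
    invariant
  ),
  {
    numRuns: 200,
    examples: [
      [0, 0.5],
      [1_000_000, 0],
    ],
  }
);
```

Step 6 has no place in this sketch because a pricing function has a mechanical oracle. You only reach for a judge when it does not.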

But Everything Has a Cost

The honest caveats.

Property tests need good properties, and good properties are hard. The whole technique lives or dies on stating the right invariants. A weak property ("the function returns a number") catches nothing. Finding the invariant that actually pins down correctness takes real thought, and that thought is the work. The tool is easy, the thinking is not, which is fitting, because the thinking was the hard part all along.

Generation is not free. Thousands of generated cases cost time in CI, and for LLM evals every generated case is an API call that costs money. You bound the generation count, you cache, you sample, the same cost-discipline that shows up everywhere in production systems.
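A minimal sketch of that cost-discipline, with a hypothetical expensiveClassify standing in for a paid API call. numRuns caps how many cases are generated, a fixed seed makes the run repeatable, and a Map caches results so repeated inputs are not paid for twice.

```typescript
import * as fc from "fast-check";

let modelCalls = 0;

// Hypothetical stand-in for a paid model call.
function expensiveClassify(description: string): string {
  modelCalls++;
  return description.toLowerCase().includes("payroll") ? "salary" : "transfer";
}

// Cache by input so a repeated description costs nothing.
const cache = new Map<string, string>();
function cachedClassify(description: string): string {
  const hit = cache.get(description);
  if (hit !== undefined) return hit;
  const label = expensiveClassify(description);
  cache.set(description, label);
  return label;
}

fc.assert(
  fc.property(
    // Sample from a small curated pool instead of an unbounded generator.
    fc.constantFrom("PAYROLL JAN", "payroll feb", "rent", "gift"),
    (description) => {
      const label = cachedClassify(description);
      return label === "salary" || label === "transfer";
    }
  ),
  { numRuns: 25, seed: 42 } // bounded and repeatable
);

console.log(modelCalls); // at most 4, one per distinct input, despite 25 runs
```

One caveat on the design: a cache hides non-determinism by construction, so a "same input, same label" consistency check has to bypass it.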

LLM-as-judge needs calibration or it lies confidently. A judge model that has not been checked against human ratings will happily approve bad outputs. The judge itself needs a golden set. Verification all the way down.
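Calibrating the judge can start as simply as measuring how often it agrees with human ratings on a small labeled set. In this sketch the judge function is a trivial hypothetical stub (a real one would call a model), and the five samples and the 0.9 threshold are made up for illustration.

```typescript
interface RatedOutput {
  output: string;
  humanPass: boolean; // what a human reviewer decided
}

// Hypothetical judge stub; a real judge would prompt an LLM with a rubric.
function judge(output: string): boolean {
  return output.trim().length > 0 && !output.includes("I cannot");
}

// Fraction of samples where the judge's verdict matches the human's.
function agreement(
  samples: RatedOutput[],
  judgeFn: (output: string) => boolean
): number {
  if (samples.length === 0) throw new Error("calibration set is empty");
  const matches = samples.filter((s) => judgeFn(s.output) === s.humanPass).length;
  return matches / samples.length;
}

// The judge's own golden set: outputs with human verdicts.
const calibrationSet: RatedOutput[] = [
  { output: "Tagged as salary: ACME payroll", humanPass: true },
  { output: "", humanPass: false },
  { output: "I cannot classify this", humanPass: false },
  { output: "Tagged as transfer", humanPass: true },
  { output: "Tagged as salary: rent payment", humanPass: false }, // judge wrongly passes this
];

const MIN_AGREEMENT = 0.9; // hypothetical bar; pick yours deliberately
const rate = agreement(calibrationSet, judge);
const judgeIsTrusted = rate >= MIN_AGREEMENT;

console.log(rate);           // 0.8
console.log(judgeIsTrusted); // false
```

Here the stub agrees with humans on four of five samples, which is below the bar, so its verdicts should not gate anything until the rubric improves. Raw agreement is the bluntest possible metric, but it is enough to catch a judge that approves everything.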

The Honest Conclusion

The verification I kept deferring all series is the same verification the AI world is reinventing right now. Property-based testing (describe what must be true for all inputs, generate adversarial cases, automatically catch violations) is how you check the Functor laws, the Monoid laws, a stateful queue, and an LLM classifier, because underneath they are the same problem: trusting a system whose entire input space you cannot list by hand.

That shared skeleton is the bridge from this series to where software is heading. The functional programming community built the playbook decades ago. AI evals are that playbook with a fuzzier oracle, and they will mature by absorbing it. If you have followed this series and learned to think in invariants and laws, you are already equipped for the next thing. You just have to point the same instinct at a new kind of system.

The techniques mature FP teams used to trust pure functions are the same ones that will make us trust models. Learn to think in invariants, and you are ready for whatever you have to verify next.

That is the series. Twelve parts, from "a type is a set of values" to verifying an AI. If you take one thing from all of it, take this: functional programming is a set of constraints you accept on purpose, and every constraint, purity, immutability, total functions, honest error types, buys you the same currency. Trust. Code you can read once, test without ceremony, compose without fear, and run without surprises. That is worth the week of feeling limited. Go build something with it.

Thanks for cooking with me, Coding Chefs. 👨‍💻