How I Use AI to Code Effectively, Part 2

Hi and welcome. This is a continuation of Part 1 of my AI coding series; if you're new here, please read Part 1 first.

My testing stack looks normal at the bottom: Vitest for units, Jest where the team's already standardised on it, Cypress and Playwright for end-to-end, Testcontainers for integration against real Postgres, k6 for load. I click through the UI myself when something feels off. I write tests like everyone else.

What's different is what sits on top of all of that: a layer of AI QA agents that act like a small swarm of senior QA engineers. One drives a real Chromium browser via a CLI called agent-browser. One hits live APIs and the database directly. One runs my existing regression suites against the latest build, marks which tests need re-running after a fix, and writes new test scripts that get committed back into the suite. Another reviews the design system implementations against the source HTML. The Demo Director persona produces launch films from real product DOM.

The agents don't replace the testing infrastructure. They sit above it and exercise it intelligently, generating edge cases I wouldn't have thought of, load-testing patterns I wouldn't have bothered scripting, source audits I wouldn't have had time to run by hand.

None of the infrastructure underneath is novel. The interesting part is the layer of discipline that makes the AI agents above it produce QA passes I actually trust.

This is Part 2 of a series on how I use AI to code effectively. Part 1 covered spec-driven development, the persona/skill/codebase model, context management with Opus and Sonnet, and code review at scale. This part covers the agents that go beyond writing code: testing, design, orchestration, and the things I've built when no existing tool fit.

Everything I'll reference is in the open-source repo at github.com/spiderocious/agentic-workflow. Open it in a tab and follow along.

Let's get cracking.

Why Not Just Playwright?

The obvious question: I have Playwright. I have Vitest. I have Cypress for some flows. Why bother adding an AI QA agent on top?

The short answer: because the framework runs the cases I already thought to write. The agent generates the cases I didn't.

A Playwright suite knows exactly what I told it. If I wrote 40 tests for the checkout flow, it runs 40 tests. It doesn't ask "what about the case where the user opens two tabs and submits both?" It doesn't notice that the empty state has no aria-label. It doesn't try the "rapid-fire click the submit button twice" race condition I forgot to spec. It doesn't decide to throw 50 concurrent requests at the idempotency endpoint just to see what happens.

The AI QA agent does all of that. It thinks the way a senior QA engineer thinks ("what could break this?") and then it goes and tries. It generates edge cases as a function of the feature shape, not as a function of the test list I happened to write three months ago.

Five things the agent does that the framework alone doesn't:

  1. Generates edge cases dynamically. Given a feature spec, the agent enumerates: happy path, empty state, boundary values, concurrent submissions, expired tokens, wrong roles, malformed input, network failures, race conditions, idempotency mismatches. The framework runs the cases the agent invents and the cases I previously wrote.
  2. Runs my existing regression suites and triages. When the agent kicks off a QA pass, it runs the full Vitest + Playwright suite first, parses the output, classifies failures into "actually broken" / "flaky" / "blocked by something earlier," and writes a triage report. The framework is the execution engine; the agent is the engineer reading the output.
  3. Writes new test scripts that get committed back. Edge cases the agent discovers become permanent. The agent generates a *.test.mjs or a Playwright spec that captures the failure, the fix lands, the test joins the regression suite. The agent feeds my pyramid.
  4. Drives interactive load tests. "Hit this endpoint 50 times in parallel with different idempotency keys and tell me what the DB looks like at the end" is one prompt. Setting that up in k6 is a script I'd skip writing for a one-off check. The agent does it in 20 seconds.
  5. Runs source audits in parallel with execution. Before the browser even opens, the agent greps the codebase for known anti-patterns (raw && in JSX, missing onError on mutations, useEffect + fetch races) and files them as bugs alongside the runtime tests. The two streams of findings, static and dynamic, land in the same report.

The relationship: the framework is the floor; the agent is the senior engineer running circles on top of it. Vitest doesn't get replaced. Playwright doesn't get replaced. They get used harder, and by something that knows what to look for beyond the cases I had time to write.

The Web QA Agent — 8 Phases

The web QA agent runs on agent-browser, a CLI tool that exposes a persistent Chromium daemon as bash commands. It complements Playwright rather than replacing it: Playwright owns the regression suite that runs on every commit; agent-browser is what the AI agent reaches for when it's exploring, generating edge cases on the fly, or driving the browser to reproduce a one-off scenario before deciding whether it deserves a permanent Playwright spec.

The persona at personas/qa-frontend.md loads two skills:

The agent follows an 8-phase loop on every QA pass:

Phase 1: Pre-flight

Confirm backend is healthy, confirm frontend is up, seed test data via API (never via UI):

curl http://localhost:3000/api/v1/health
curl http://localhost:5173 | head -3
TOKEN=$(curl -s -X POST http://localhost:3000/api/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"[email protected]","password":"Pass123!"}' | jq -r '.data.tokens.accessToken')

If the backend isn't responding, the agent stops. It doesn't try to test against a dead server. It says "the backend is down" and waits.

Phase 2: Source audit before opening the browser

Grep the codebase for known anti-patterns before touching the UI. File findings as CC-## (cross-cutting) entries in the test plan with P1/P2/P3 severity:

# Raw && in JSX (must be <Show when={...}>)
grep -rn "{.*&&" src/features/ --include="*.tsx" | grep -v "//\|test"

# Missing onError on mutations (silent failure bug)
grep -rn "useMutation\|mutationFn" src/features/ --include="*.ts" -l \
  | xargs grep -L "onError" 2>/dev/null

Every grep hit is a candidate bug. The agent files it before execution, then either confirms or refutes it in the browser.

Phase 3: Write the test plan

A markdown table written before execution. Columns: ID | Test | Expected | How to verify. This is the contract: what the agent will check and what passing means.
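To make the contract concrete, here is a minimal sketch of a plan held as data and rendered into the four columns named above. The case IDs, the field names, and the renderPlan helper are illustrative assumptions, not taken from the repo.

```javascript
// Hypothetical test-plan entries; the IDs and wording are made up for illustration.
const plan = [
  { id: 'LOGIN-01', test: 'Valid credentials', expected: 'Redirect to /dashboard', verify: 'wait on the dashboard URL' },
  { id: 'LOGIN-02', test: 'Wrong password', expected: 'Inline error, no redirect', verify: 'wait on the error text' },
];

// Render the plan as a pipe-delimited table with the columns from the text.
function renderPlan(cases) {
  const header = ['ID | Test | Expected | How to verify', '--- | --- | --- | ---'];
  const rows = cases.map((c) => [c.id, c.test, c.expected, c.verify].join(' | '));
  return header.concat(rows).join('\n');
}

console.log(renderPlan(plan));
```

Keeping the plan as data first means the same entries can later be matched against PASS / FAIL / SKIP / BLOCKED results by ID.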

Phase 4: Open the browser session

agent-browser close --all
agent-browser open http://localhost:5173
agent-browser snapshot -i
agent-browser fill @e4 "[email protected]"
agent-browser fill @e8 "Pass123!"
agent-browser click @e6
agent-browser wait --url "**/dashboard"

The @e4, @e8, @e6 are accessibility-tree references from the snapshot. They reset on every snapshot, which is one of the agent's most common foot-guns (see the gotchas section).

Phase 5: Execute per case

For every test case: navigate → screenshot the initial state → action → wait for completion → screenshot the final state → record PASS / FAIL / SKIP / BLOCKED.

Critical discipline: never sleep. Always wait on an explicit signal:

agent-browser wait --load networkidle
agent-browser wait --text "Dashboard"
agent-browser wait --url "**/h/**"
agent-browser wait "#spinner" --state hidden

sleep 2 is the QA agent equivalent of try { ... } catch (e) { /* ignore */ }. It hides flakiness instead of fixing it.
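The same principle applies outside the browser, for example in a script waiting on a seeded record or a background job. The following is a minimal sketch, not agent-browser's implementation: waitFor and its option names are hypothetical. It polls a condition and exits on the signal itself, or fails loudly on a timeout.

```javascript
// Poll `check` until it returns a truthy value or the timeout elapses.
// The short interval is only the gap between checks; the exit condition
// is always the signal itself (or an explicit timeout error).
async function waitFor(check, { timeoutMs = 5000, intervalMs = 50 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const value = await check();
    if (value) return value;
    if (Date.now() >= deadline) {
      throw new Error(`waitFor: condition not met within ${timeoutMs}ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Usage: a stand-in signal that becomes true on the third check.
let checks = 0;
const ready = waitFor(() => ++checks >= 3, { timeoutMs: 1000, intervalMs: 10 });
ready.then(() => console.log(`signal seen after ${checks} checks`));
```

Unlike a fixed sleep, a timeout here is a reported failure rather than a silent pass, so flakiness shows up in the run instead of hiding behind a delay.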

Phase 6: Verify persistence after every mutation

After any state change, reload and re-check. A "success toast" that fires but doesn't actually persist is a real bug, and it's the kind that source review can't catch.

agent-browser click @e12   # Save button
agent-browser wait --load networkidle
agent-browser eval "document.body.innerText" | grep -E "success|saved"
agent-browser reload
agent-browser wait --load networkidle
agent-browser eval "document.body.innerText"   # Verify the change persisted

Phase 7: Test all four React Query states

Loading, success, error, empty. The agent tests all four for every data-fetching screen:

# loading — screenshot immediately on open
agent-browser navigate http://localhost:5173/feature
agent-browser screenshot /path/loading.png

# success — wait for data
agent-browser wait --text "Expected Content"
agent-browser screenshot /path/success.png

# error — mock the API
agent-browser network route "*/api/v1/feature*" \
  --body '{"error":{"code":"internal","message":"Service unavailable"}}'
agent-browser reload
agent-browser screenshot /path/error.png
agent-browser network unroute

# empty — fresh account or filter to nothing
agent-browser screenshot /path/empty.png

This catches the bugs nobody tests: the loading state that flickers, the error state that crashes, the empty state that shows "0" because of the && bug.

Phase 8: Write the execution report

The report is written only after all tests complete, never mid-run. It includes a summary table, per-screen results, and any new bugs found at the bottom.

DOM Patterns That Don't Lie

The agent has three patterns for interacting with the DOM that are non-obvious and worth showing.

React controlled-input fill

agent-browser type doesn't work on React controlled inputs: typing into an <input> whose value comes from useState doesn't fire React's onChange. The workaround:

agent-browser eval "
  const input = document.querySelector('input[name=email]');
  const setter = Object.getOwnPropertyDescriptor(window.HTMLInputElement.prototype, 'value').set;
  setter.call(input, '[email protected]');
  input.dispatchEvent(new Event('input', {bubbles: true}));
"

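This calls the native value setter from HTMLInputElement.prototype directly, bypassing the value tracking React attaches to the input, so the dispatched input event registers as a real change and onChange fires. Or just use agent-browser fill @ref "text", which handles this internally.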

Modal detection by counting body children

agent-browser eval "document.body.childElementCount"
# Returns 2 = no modal
# Returns 3 = modal open

agent-browser eval "document.body.children[2].innerText"
agent-browser eval "document.body.children[2].querySelectorAll('button')[2].click()"

Most modal libraries portal into document.body. Counting body children is a cheap, reliable way to detect modal presence without selectors that change between renders.

Verify the API was actually called

The agent can record HAR (HTTP Archive) traces and inspect them:

agent-browser network har start
agent-browser eval "document.querySelectorAll('button')[3].click()"
agent-browser wait --load networkidle
agent-browser network requests --method POST --filter /result
# If empty: button did NOT call the result endpoint
# If present: it did

This is how the agent catches "save button fires the wrong mutation" bugs that look fine in source review.


Never Stop Investigating on the First Error

The single most important QA agent anti-pattern is stopping at the first error. The agent-browser-qa-guide.md skill spells out the chain to follow instead:

Toast didn't appear → check if the API call was made at all → API call not made → check if the button click fired → button click fired → check if the mutation was set up correctly → mutation wrong → check what endpoint it's calling.

When something fails, dig one level deeper. The default "this test failed" output is useless. The "this test failed because the button click didn't fire the mutation because the mutation was bound to a different selector because the component re-rendered and lost its handler" output is actionable.

The agent's other hard rules (verbatim from the skill):

  • Never report PASS if you didn't verify it.
  • Never mark a test PASS based on source code alone.
  • Never use sub-agents for testing: agent-browser is operated directly via Bash. Sub-agents cannot see your browser session.

The browser is the source of truth. The agent's report is not.


Running the Existing Regression Suite, Triaging, and Feeding It Back

The QA agent doesn't just run its own ad-hoc tests. The first thing it does on any non-trivial pass is execute the project's existing Playwright + Vitest + Cypress suites against the current build, parse the output, and write a triage report.

pnpm test --run --reporter=json > /tmp/vitest-results.json
pnpm playwright test --reporter=json > /tmp/playwright-results.json

Then the agent reads the JSON output and classifies every failure:

  • Actually broken: assertion mismatched the implementation. File as a bug.
  • Flaky: passed on retry, or a timing-related failure pattern. File as a flake and add to the "stabilise" backlog.
  • Blocked by an earlier failure: a downstream test that depends on a fixture an earlier test was supposed to create. Mark BLOCKED, don't mark FAIL.
  • Out of date: assertion is checking for behaviour that was intentionally changed. The test needs updating, not the code. Flag for human review.
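The four buckets above can be sketched as a small classifier. This is an illustration under stated assumptions: the failure record shape (dependsOn, passedOnRetry, specChanged) is hypothetical, and real Vitest or Playwright JSON reports would need an adapter to produce it.

```javascript
// Hypothetical normalised failure records; real reporter output differs.
function classifyFailure(failure, failedNames) {
  // A test whose prerequisite also failed is BLOCKED, not FAIL.
  if (failure.dependsOn.some((dep) => failedNames.has(dep))) return 'blocked';
  // Passed on retry, or the error looks timing-related: treat as flaky.
  if (failure.passedOnRetry || /timeout|timed out/i.test(failure.error)) return 'flaky';
  // Behaviour was changed on purpose: the test is stale, flag for human review.
  if (failure.specChanged) return 'out-of-date';
  // Otherwise the assertion genuinely disagrees with the implementation.
  return 'broken';
}

function triage(failures) {
  const failedNames = new Set(failures.map((f) => f.name));
  const report = { broken: [], flaky: [], blocked: [], 'out-of-date': [] };
  for (const f of failures) report[classifyFailure(f, failedNames)].push(f.name);
  return report;
}

const report = triage([
  { name: 'creates invoice', dependsOn: [], passedOnRetry: false, specChanged: false, error: 'expected 201, got 500' },
  { name: 'lists invoices', dependsOn: ['creates invoice'], passedOnRetry: false, specChanged: false, error: 'expected 1 row, got 0' },
  { name: 'shows toast', dependsOn: [], passedOnRetry: true, specChanged: false, error: 'locator timed out' },
  { name: 'legacy label', dependsOn: [], passedOnRetry: false, specChanged: true, error: 'expected "Save", got "Save changes"' },
]);
console.log(report);
```

The order of the checks matters: blocked is tested first so that a downstream failure is never counted as a second bug.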

The triage report lands at the top of the QA pass output. By the time I read it, I already know which failures are real, which are noise, and which need a test update versus a code fix. This is the work that used to consume the first 30 minutes of every QA review.

Writing test scripts that get committed back

The most valuable thing the agent does, in my opinion: when it discovers an edge case during exploratory testing, it writes a permanent test for it.

The pattern: agent generates a "what about..." case → executes it against the live app → confirms it's a bug (or a missing test) → writes a *.test.mjs script (for API) or a Playwright spec (for UI) → adds it to the appropriate suite → opens a PR with the test.

Example: the agent is testing the bulk-reclassify endpoint. It tries 1, 10, 100, 1000 transactions. All pass. It tries 10,000. The endpoint times out at 30 seconds. The agent doesn't just report the bug; it writes a test:

// docs/qas/backend/scripts/bulk-reclassify-limits.test.mjs
await test('BR-LIM-01', 'Bulk reclassify of 10,000 transactions completes under 30s', async () => {
  const txns = await createTransactions(10_000);
  const start = Date.now();
  const res = await post('/statements/bulk-reclassify', { transactionIds: txns });
  const elapsed = Date.now() - start;
  assertStatus(res, 200);
  assert(elapsed < 30_000, `took ${elapsed}ms, expected < 30000ms`);
});

The script gets committed alongside the fix. The regression suite is now one edge case smarter. The next time someone touches the bulk-reclassify code, the suite catches the regression before it ships.

This is the loop that makes the testing tier compound. Each QA pass leaves the suite stronger than it found it. The agent isn't replacing the test framework; it's feeding it.

Marking what must be re-run after a fix

When a bug is fixed and the agent re-runs verification, it doesn't run the entire suite from scratch (slow). It identifies the minimum set of tests affected by the fix:

git diff HEAD~1 --name-only | xargs pnpm vitest related --run
pnpm playwright test --grep "@touched-by-fix"

Then it runs those, plus the test that originally reproduced the bug, plus any test the fix's diff touches. The agent writes the re-run list explicitly in the report: "I ran these 14 tests after the fix, here's why these and not the other 480." When the report lands, I can see exactly what was verified and what wasn't.

For high-risk fixes (security, financial, auth), the agent re-runs the full suite. For surgical fixes (typo in a string), it runs the targeted set. The classification is part of the bug entry itself.
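A minimal sketch of that decision, assuming a hypothetical bug-entry shape with a category and the name of the test that reproduced the bug. The category names mirror the ones in the text; everything else is illustrative.

```javascript
// Categories that always trigger a full-suite re-run, per the text above.
const HIGH_RISK = new Set(['security', 'financial', 'auth']);

function rerunPlan(bug, relatedTests) {
  if (HIGH_RISK.has(bug.category)) {
    return { scope: 'full', tests: 'all', reason: `${bug.category} fix: re-run everything` };
  }
  // Targeted: tests related to the diff plus the original reproduction, de-duplicated.
  const tests = [...new Set([bug.reproTest, ...relatedTests])];
  return { scope: 'targeted', tests, reason: `surgical fix: ${tests.length} affected tests` };
}

console.log(rerunPlan({ category: 'auth', reproTest: 'login.spec.ts' }, ['session.test.ts']));
console.log(rerunPlan({ category: 'copy', reproTest: 'banner.test.ts' }, ['banner.test.ts', 'home.test.ts']));
```

Returning the reason alongside the list is what lets the report explain why these tests were run and the rest were not.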


The API QA Agent — Same Shape, Different Hands

The API QA agent at personas/qa-backend.md follows the same 8-phase shape, but with different tools: curl, jq, psql, mongosh, redis-cli. No browser.

The first thing the agent does, before any test, is confirm the URL mount points:

grep -n "app\.use" src/index.ts

This sounds trivial. It isn't. The single most common QA agent mistake is assuming admin routes mount at /api/v1/admin when they actually mount at /admin. The agent always confirms before testing.

The per-feature reading order

For every feature the agent tests, it reads the source in this order:

# 1. Schema/validation — ground truth for field names and enums
cat src/features/hospitals/hospital.schema.ts
# 2. Service — what it returns
cat src/features/hospitals/hospital.service.ts
# 3. Repo — what DB fields are selected/excluded
cat src/features/hospitals/hospital.repo.ts
# 4. Controller — what response shape is built
cat src/features/hospitals/hospital.controller.ts
# 5. Routes — paths, methods, middleware
cat src/features/hospitals/hospital.routes.ts

This is the order that catches drift. The docs lie. The code is truth. The schema is the most reliable starting point because field names there are checked at runtime by Zod, not just at compile time.

Plain Node fetch, no test framework

The test script is plain Node ESM with fetch, no test framework. From backend-qa-agent.md:

const BASE = 'http://localhost:8085/api/v1';

async function request(base, path, { method = 'GET', body, token } = {}) {
  const headers = { 'Content-Type': 'application/json' };
  if (token) headers['Authorization'] = `Bearer ${token}`;
  const res = await fetch(`${base}${path}`, {
    method, headers,
    body: body ? JSON.stringify(body) : undefined,
  });
  let data;
  try { data = await res.json(); } catch { data = null; }
  return { status: res.status, data };
}

let passed = 0, failed = 0, blocked = 0, skipped = 0;
const failures = [];

function pass(id, label)         { console.log(`  ${id}: ${label}`); passed++; }
function fail(id, label, reason) { console.log(`  ${id}: ${label}\n    -> ${reason}`); failed++; failures.push({id,label,reason}); }
function block(id, label, reason){ console.log(`  ${id}: ${label} [BLOCKED: ${reason}]`); blocked++; }
function skip(id, label, reason) { console.log(`  ${id}: ${label} [SKIP: ${reason}]`); skipped++; }

The four-state framework (PASS / FAIL / SKIP / BLOCKED) is the same as the web QA agent's. BLOCKED means "prerequisite broken." It is never reported as PASS.

A real test:

await test('A-HP-01', 'Register new user returns 201 with user+tokens', async () => {
  const res = await post('/auth/register', {
    email: '[email protected]',
    name: 'Test A-HP-01',
    password: 'Pass123!',
  });
  assertStatus(res, 201);
  const d = res.data.data;
  assert(d.user?.id, 'user.id present');
  assert(d.tokens?.accessToken, 'accessToken present');
  assert(d.tokens?.refreshToken, 'refreshToken present');
});

await test('A-EG-01', 'Register duplicate email returns 409', async () => {
  const res = await post('/auth/register', { /* existing email */ });
  assertStatus(res, 409);
  assertEqual(res.data.error?.code, 'conflict');
});

The rules for writing these tests (verbatim from the skill):

  • Always use fresh tokens. Never hardcode a token. Login at bootstrap time.
  • Always propagate IDs. Create a resource, capture its ID, use it in dependent tests. If creation fails, block() all dependents explicitly.
  • Never swallow 204 body parsing.
  • Quote the actual response in failures. Don't just say "got 400"; say got 400: {"error":{"code":"validation_error","message":"..."}}.
  • Use Date.now() for unique slugs. Hardcoded unique values get 409 conflicts on the second run.

The full how-to is in docs/how-to-use-qa-agents.md.


The Security Agent

The security agent is the most senior of the QA personas. Its identity is "you are a security engineer auditing this codebase for the kinds of bugs that show up in postmortems six months from now." It loads a dedicated security skill (skills/security-review.md) plus the universal hard-lessons.md, and it runs against whatever surface I point it at: a feature branch, a specific file, a full module, the entire backend.

It works the way the other QA agents do: source audit first, then live execution, then a structured report. The differences are in what it audits, what it executes, and how it grades severity.

What it audits (statically)

A series of grep + reading passes against the source, looking for known security anti-patterns:

Password and credential handling.

  • bcrypt usage flagged: the workspace default is Argon2id (memoryCost: 64MB, timeCost: 3, parallelism: 1, tuned to ~200ms per hash on production hardware).
  • Any plaintext password storage or logging.
  • Any password comparison that isn't constant-time.
  • API keys, JWT secrets, or webhook secrets stored anywhere except environment variables.

Token and session handling.

  • Access tokens longer than 15 minutes flagged for review.
  • Refresh tokens stored without server-side sha256(token) indirection.
  • Refresh token rotation missing: every /auth/refresh must invalidate the old token and issue a new one.
  • Refresh token reuse detection missing: a revoked refresh token presented again must revoke all sessions for that user.
  • Sensitive actions (change email/phone/password, delete account, withdraw above threshold) without a fresh OTP gate.

HTTP and authz surface.

  • Async route handlers without asyncHandler (unhandled rejection risk).
  • Routes missing auth middleware (compared against the project's public-route allowlist).
  • Routes missing role-check or ownership-check middleware.
  • Route registration order: specific paths must precede parameterised paths (catches the /me vs /:userId shadowing class of bug).
  • Any service that accepts req as a parameter (HTTP leaking into business logic; obscures authz reasoning).
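The route-ordering check in the list above can be illustrated on an ordered list of registered routes. This is a sketch, not the agent's actual audit (which greps router files): the route-record shape is hypothetical, and it simplifies by assuming an earlier matching handler never calls next().

```javascript
// Does `pattern` (which may contain :params) match `path` segment by segment?
function matches(pattern, path) {
  const p = pattern.split('/').filter(Boolean);
  const q = path.split('/').filter(Boolean);
  if (p.length !== q.length) return false;
  return p.every((seg, i) => seg.startsWith(':') || seg === q[i]);
}

// Routes are listed in registration order. A route is shadowed when an
// earlier route with the same method already matches its path.
function findShadowed(routes) {
  const findings = [];
  routes.forEach((route, i) => {
    for (const earlier of routes.slice(0, i)) {
      const sameMethod = earlier.method === route.method;
      if (sameMethod && earlier.path !== route.path && matches(earlier.path, route.path)) {
        findings.push({ shadowed: route.path, by: earlier.path });
        break;
      }
    }
  });
  return findings;
}

const findings = findShadowed([
  { method: 'GET', path: '/users/:userId' },
  { method: 'GET', path: '/users/me' },
  { method: 'POST', path: '/users/me' },
]);
console.log(findings);
```

Here only GET /users/me is reported, because the earlier GET /users/:userId would capture it with userId set to "me"; the POST route is untouched since no earlier POST route matches.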

Validation and input handling.

  • z.any() in Zod schemas (bypasses validation entirely).
  • Missing validation middleware on POST/PUT/PATCH routes.
  • SQL string concatenation in repositories (parameterised queries only).
  • File upload endpoints without size limits, MIME type checks, or content-type validation.
  • SSRF risk: any handler that fetches arbitrary user-provided URLs.

Rate limiting and abuse vectors.

  • Auth endpoints (/login, /register, /forgot-password) without per-IP and per-identity rate limits.
  • 429 responses without Retry-After headers.
  • Forgot-password endpoint that returns different responses for existing vs non-existing emails (enumeration leak).
  • Login lockout missing after N failures.

Webhook and signature verification.

  • Webhook handlers without HMAC signature verification.
  • HMAC comparison with === instead of crypto.timingSafeEqual (timing attack risk).
  • Missing replay-attack protection on webhooks (no event_id UNIQUE constraint).
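For the signature checks above, a minimal sketch of constant-time verification with Node's built-in crypto module. The hex encoding, the secret, and the verifySignature name are assumptions for illustration; providers differ in header name and encoding. CommonJS require is used so the snippet runs standalone; an ESM script would import from 'node:crypto' instead.

```javascript
const { createHmac, timingSafeEqual } = require('node:crypto');

// Compare the received signature with the expected one in constant time.
function verifySignature(rawBody, signatureHex, secret) {
  const expected = createHmac('sha256', secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, 'hex');
  // timingSafeEqual throws on a length mismatch, so check lengths first.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}

// Usage with a hypothetical secret and payload.
const secret = 'whsec_example';
const body = '{"event_id":"evt_1"}';
const signature = createHmac('sha256', secret).update(body).digest('hex');
console.log(verifySignature(body, signature, secret));
console.log(verifySignature(body + ' ', signature, secret));
```

The HMAC must be computed over the raw request body, before any JSON parsing, since re-serialising the parsed body can change the bytes.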

Financial and money handling.

  • Money fields stored as number / float / DECIMAL instead of bigint kobo/cents.
  • Floating-point arithmetic on monetary values.
  • Wallet ledger missing append-only enforcement (UPDATE/DELETE on wallet_entries).
  • Missing reconciliation check between cached balance and ledger sum.
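Why the float rule exists, as a short sketch: binary floating point cannot represent most decimal fractions exactly, while integer minor units in a bigint stay exact. The toKobo and formatKobo helpers are hypothetical and handle non-negative amounts with at most two decimal places only.

```javascript
// Floating point cannot represent most decimal fractions exactly.
const floatSum = 0.1 + 0.2;

// Parse a decimal string such as "1234.50" into integer kobo as a bigint.
function toKobo(amount) {
  const match = /^(\d+)(?:\.(\d{1,2}))?$/.exec(amount);
  if (!match) throw new Error(`invalid amount: ${amount}`);
  const [, whole, fraction = ''] = match;
  return BigInt(whole) * 100n + BigInt(fraction.padEnd(2, '0'));
}

// Format integer kobo back into a two-decimal string for display.
function formatKobo(kobo) {
  const whole = kobo / 100n;
  const fraction = String(kobo % 100n).padStart(2, '0');
  return `${whole}.${fraction}`;
}

const total = toKobo('0.10') + toKobo('0.20');
console.log(floatSum === 0.3);   // false
console.log(total === 30n);      // true
console.log(formatKobo(total));  // "0.30"
```

Parsing from a string rather than a number matters: once an amount has passed through a float, the rounding error is already baked in.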

Logging and PII.

  • Loggers without redaction config: must redact req.body.password, req.body.otp, *.password_hash, *.refresh_token, authorization, BVN/SSN/national-ID fields, full PAN.
  • Error stack traces returned to the client in production.
  • Console logs left in production code.
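What the redaction requirement means in practice can be sketched as a recursive mask over a log payload. The key list here is illustrative, not the repo's config, and a production logger's built-in redaction option would normally do this job.

```javascript
// Keys to mask wherever they appear; illustrative, not the repo's actual list.
const SENSITIVE = new Set(['password', 'otp', 'password_hash', 'refresh_token', 'authorization']);

// Return a copy of `value` with sensitive keys masked at any depth.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, v]) =>
        SENSITIVE.has(key.toLowerCase()) ? [key, '[REDACTED]'] : [key, redact(v)]
      )
    );
  }
  return value;
}

const sample = {
  user: 'ada',
  body: { password: 'Pass123!', otp: '123456' },
  headers: { Authorization: 'Bearer abc' },
};
console.log(JSON.stringify(redact(sample)));
```

The function returns a copy and leaves the original object untouched, so the request handler still sees the real values while the log line does not.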

Storage and client-side concerns.

  • Authentication tokens in localStorage (must be HttpOnly cookies).
  • Any sensitive data in localStorage or sessionStorage unencrypted.
  • Missing X-Frame-Options, Content-Security-Policy, or Strict-Transport-Security headers (Helmet config audit).

Supply chain.

  • .npmrc missing minimum-release-age=10080 (the 7-day release-age guard against day-zero supply chain attacks).
  • npm audit output parsed; high/critical findings filed as bugs.
  • Direct git dependencies ("package": "github:user/repo") flagged for review.

What it executes (dynamically)

After the static audit, the agent runs targeted live tests against the running server:

Auth matrix. For every protected endpoint: no token → 401; expired token → 401 with code: token_expired; valid token wrong role → 403; refresh token reuse → 401 and all sessions revoked; token after account disable → 401 (tests tokenVersion invalidation).

Authz fuzzer (in-progress, see "what's next"). For every endpoint that takes a resource ID, the agent attempts the request with a valid token belonging to a different user. Expected: 403 or 404 (whichever the project's convention says). Any 200 is an IDOR finding filed at P0.

Rate limit storm. For each rate-limited route, the agent fires N+10 requests in a tight loop. Expected: 429 after N requests, with Retry-After header and X-RateLimit-Remaining: 0. Missing headers or unbounded responses are filed as findings.

Idempotency triple test. For each endpoint that accepts Idempotency-Key:

  • First call with key K → creates resource
  • Second call with key K, same body → returns identical response, no duplicate DB record
  • Third call with key K, different body → 422 idempotency_mismatch

The agent verifies all three states in the database with psql / mongosh and checks Redis for the cached idempotency key.
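The three expected states can be sketched against an in-memory stand-in for the server side. Everything here is an assumption for illustration: the Map replaces Redis, the array replaces the database, and the 201 status for creation is a guess at the project's convention.

```javascript
// In-memory stand-ins: `store` for the idempotency cache, `created` for the DB table.
const store = new Map();
const created = [];

function handleCreate(key, body) {
  // Note: JSON.stringify is order-sensitive; a real fingerprint would canonicalise first.
  const fingerprint = JSON.stringify(body);
  const cached = store.get(key);
  if (cached) {
    // Same key, different payload: reject rather than replay.
    if (cached.fingerprint !== fingerprint) {
      return { status: 422, data: { error: { code: 'idempotency_mismatch' } } };
    }
    // Same key, same payload: replay the stored response, create nothing.
    return cached.response;
  }
  const resource = { id: created.length + 1, ...body };
  created.push(resource);
  const response = { status: 201, data: resource };
  store.set(key, { fingerprint, response });
  return response;
}

const first = handleCreate('K', { amount: 100 });
const second = handleCreate('K', { amount: 100 });
const third = handleCreate('K', { amount: 999 });
console.log(first.status, second === first, third.status, created.length);
```

The last value printed is the one the agent cares about most: after three calls with the same key, exactly one record exists.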

Webhook replay. For each webhook endpoint, the agent re-fires a previously-processed event. Expected: 200 with no side-effect re-execution (idempotent), and the event_id UNIQUE constraint catches the duplicate.

Financial ledger reconciliation. For any wallet/ledger operations, the agent runs the reconciliation query:

SELECT
  (SELECT balance_kobo FROM wallet_balances WHERE wallet_id = $1) AS cached,
  (SELECT SUM(amount_kobo) FROM wallet_entries WHERE wallet_id = $1) AS ledger_sum;

Cached and ledger_sum must match. Any divergence is filed at P0.

Enumeration check. For /forgot-password and similar endpoints, the agent submits both a known-existing and a known-non-existing email and compares the responses. Must be identical (same status, same body, same timing within ~50ms).
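The comparison itself is small enough to sketch. The response-record shape and the sample numbers below are made up for illustration; a single timing sample is noisy, so a real check would likely compare several runs.

```javascript
// Two responses are indistinguishable if status, body and timing all line up.
function indistinguishable(a, b, toleranceMs = 50) {
  return (
    a.status === b.status &&
    JSON.stringify(a.body) === JSON.stringify(b.body) &&
    Math.abs(a.elapsedMs - b.elapsedMs) <= toleranceMs
  );
}

// Hypothetical measurements: identical status and body, but the existing
// account takes far longer (for example, because only that path sends mail).
const message = { message: 'If that email exists, a reset link was sent.' };
const existing = { status: 200, body: message, elapsedMs: 212 };
const missing = { status: 200, body: message, elapsedMs: 31 };
console.log(indistinguishable(existing, missing));
```

This example fails the check on timing alone, which is the leak that a status-and-body comparison would miss.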

What it produces

A structured report with findings tagged by severity:

Severity | Meaning
--- | ---
P0 | Active vulnerability. Credentials exposed, authz bypassed, IDOR confirmed, money math broken, secrets leaked. Block the deploy.
P1 | Latent vulnerability. Missing rate limit on auth endpoint, refresh token rotation broken, webhook signatures not timing-safe, PII in logs. Fix before next release.
P2 | Hardening gap. Missing CSP header, missing security headers, weak password parameters. Fix this sprint.
P3 | Code quality with security implication. any types in auth code, missing input validation that the schema layer is currently catching but shouldn't have to. Backlog.

Every finding includes: file + line, observed behaviour, expected behaviour, root-cause hypothesis, and a suggested fix specific enough that the dev agent can act on it without asking questions.

What it has actually caught

Sanitised postmortems from real audits:

CI credential leakage: a webhook test passed locally with real production credentials and failed in CI with the network blocked. The credentials had been in CI environment variables for three weeks. Rule installed: external service credentials are never in CI env vars. All third-party services are stubbed in test environments.

HTTP status info leak: /auth/verify-otp was returning 404 for "OTP not found" and 410 for "OTP expired". The different responses leaked whether the OTP was ever generated. Fix: collapse to a single 410 otp_invalid regardless of cause.

Service/HTTP coupling enabling an authz hole: a service method was accepting req to read user_id. A test was passing a forged req to the service in unit tests, which masked the fact that the production controller wasn't actually enforcing the auth check the test was relying on. Rule: services must never accept req. Use requestContext.getStore().

Route shadowing: /api/v1/me returning 404 because /api/v1/:userId was registered first. The auth middleware was on the parameterized route, not on /me, so /me was accidentally public. Fix: register specific routes before parameterised ones; the audit now greps every router file for ordering.

Float-precision money bug: a calculation produced 99.99999999998 instead of 100. The bug surfaced in a balance reconciliation diff. Rule: all monetary values are bigint kobo or cents, branded with a Kobo type that prevents accidental conversion to number.

Webhook signature timing attack vector: webhook signature comparison was using === instead of crypto.timingSafeEqual. Theoretical risk, but the agent catches it as a P1 regardless. Fixed across every webhook handler.

How it composes with the other QA agents

The security agent doesn't replace the API QA agent: it runs alongside it. The API QA agent verifies functional correctness. The security agent verifies that the same endpoints can't be abused. Both file findings into the same report format with the same severity scheme.

When both agents are run on a feature branch, the merged output is the security posture of the change. The dev agent doesn't merge until both reports are clean (or the findings are explicitly accepted with a comment in the rules-lessons doc explaining why).

The full spec for what the security agent loads and runs is at skills/security-review.md. It's the most opinionated skill in the repo because security is the area where being right matters most.


The Two-Agent Design System

I ship design systems for the products I build. About 28 components each, full token systems, real scenes, the works. The output usually looks like the kind of thing a small in-house design team would produce after two months, except I do it solo, in a couple of days, with an AI pipeline.

The pipeline is two slash commands with non-overlapping jobs:

Agent | Slash command | Role | What it writes
--- | --- | --- | ---
Designer | /design-system-agent | Picks a stance, runs discovery, builds an HTML spec with real scenes | design-system/projects/<slug>/: HTML scenes + _foundation.css + variations + thumb
Shipper | /ship-design-system | Translates that finished spec into a real React component library inside a target repo | Target-repo src/.../ui/<component>/<component>.tsx, extends globals.css + tailwind.config.ts, plus a migration doc

Before getting into how they work, the question that comes up first.

Why an Agent and Not Just Stitch (or v0, Lovable, Bolt)?

I use Stitch. I covered it in Part 1. It's excellent at producing a single screen from a detailed brief. v0, Lovable, Bolt, and similar tools are all good at the same thing: take a prompt, produce a beautiful-looking screen, ship.

That's not what a design system is.

A design system is one stance applied consistently across 28 components and 5+ real surfaces. A button that matches the input that matches the card that matches the table that matches the empty state. The same accent color, the same border radius logic, the same shadow scale, the same typography ramp, composed deliberately so that any combination of components looks like it belongs in the same product.

Stitch and the rest can produce a beautiful login screen. Ask them to produce a consistent login screen + dashboard + settings page + onboarding flow + empty states + the critical "delete account" modal, all in the same visual language, and you'll spend the rest of the week reconciling differences between outputs.

The design system agent solves a different problem: commit to one visual stance and apply it ruthlessly across an entire system. It picks one stance (from a catalogue of 25), runs a structured discovery to understand the product, and then builds every component, every state, every scene against that one stance. The output isn't a single mockup. It's a complete system that survives composition.

The other difference: the agent's output is built to be shipped, not just looked at. Stitch produces an image or a code snippet for one screen. The design agent produces a folder of HTML scenes (with foundation CSS, design tokens, all states, all variants) that the shipper agent then translates into a real React component library inside a real codebase. The chain ends with merged PRs, not screenshots.

The Designer's Discipline

The designer agent operates under a small set of rules that are non-negotiable. These exist because every one of them is something I learned by getting it wrong first.

One stance, never blended. The first version of the agent let me say "modern flat with a touch of editorial and some Bauhaus accents." The output was generic. Indistinguishable from the default AI aesthetic. The fix was a hard constraint in the system prompt:

If you find yourself reaching for "modern flat with a touch of editorial and some Bauhaus accents," stop. Pick one stance. Ship one stance. The user will reward conviction.

The catalogue has 25 stances: Bauhaus, Editorial Broadsheet, Brutalist Dossier, Japanese Minimalism, Soviet Poster, Surgical Paper, Risograph Print, Glassmorphic Studio, Industrial Telemetry, Cartographer Atlas, and others. Each one is a complete worldview: typography choices, color logic, geometric language, motion language. The agent picks one and commits. Half-baked blends produce half-baked output.

Scenes, not catalogues. Most design system docs show components in isolation: here's a button on a white background, here's an input in the default state, here's a card with placeholder content. The designer agent rejects that. State variants are rendered inside real situations: the loading state of a transaction confirmation, the error state of a payment retry, the empty state of a new account's transactions tab. You can't judge a button until you see it inside the screen where it'll actually appear.

No Bootstrap defaults. If the accent is Tailwind blue or the neutral is Slate, the agent has failed the brief. The constraint is in the prompt because it's the easiest place to drift. The default Tailwind palette is fine. It's also what every AI-generated UI defaults to, which is why all AI-generated UI looks the same. The agent must pick palettes that belong to this product, not to "modern SaaS aesthetic."
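A constraint like this can also be checked mechanically after the fact. The sketch below is not from the repo: the helper name `findDefaultPaletteUse` and the token map shape are hypothetical, and the lookup table only covers a few blue and slate shades from Tailwind v3's default palette.

```typescript
// Partial list of Tailwind v3 default palette values (a few blue and slate shades).
// Assumption: this table and helper are illustrative, not the agent's real check.
const TAILWIND_DEFAULTS: Record<string, string> = {
  "#3b82f6": "blue-500",
  "#2563eb": "blue-600",
  "#1d4ed8": "blue-700",
  "#64748b": "slate-500",
  "#475569": "slate-600",
  "#334155": "slate-700",
  "#0f172a": "slate-900",
};

interface PaletteFinding {
  token: string;
  value: string;
  matches: string;
}

// Returns every design token whose value is a known Tailwind default.
function findDefaultPaletteUse(tokens: Record<string, string>): PaletteFinding[] {
  const findings: PaletteFinding[] = [];
  for (const token of Object.keys(tokens)) {
    const value = tokens[token];
    const match = TAILWIND_DEFAULTS[value.trim().toLowerCase()];
    if (match !== undefined) {
      findings.push({ token, value, matches: match });
    }
  }
  return findings;
}
```

Run against the foundation tokens, a non-empty result would mean the brief has drifted back to the default palette.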

The CRITICAL modal is mandatory. Every system has at least one irreversible action: delete account, cancel subscription, transfer funds, archive case file. Most design systems skip designing the modal for it, because the happy path is more fun to draw. The agent doesn't get to skip it. The CRITICAL modal is designed as part of every system.

Speak plainly. The user (me, or whoever invokes the agent) is treated as non-technical. No "design tokens," "atomic components," "semantic colour pairs" without translation. The agent's job is to produce design judgment, not to lecture about design vocabulary.

The Five-Act Flow

The designer agent runs a structured flow, not a freeform chat:

Act I: Discovery. 8-10 questions about the product. Skip anything the brief already answered. Output: a short discovery doc captured in design-system/notes/<slug>/discovery-<date>.md.

Act II: Style proposal. Three stances from the catalogue, each with one paragraph explaining why this stance fits this product. Not "Bauhaus is a clean minimal style" — "Bauhaus fits because your product is about precision in financial decisions, and Bauhaus's geometric clarity reinforces that posture."

Act III: Variation pick. The most important act. The agent builds a single _variations.html file showing A/B/C visual variations side-by-side for a handful of key surfaces (the dashboard, the form, the list, the empty state). I pick one. Or I say "A for everything except the list row, give me three alternatives just for that." The agent iterates until the foundation is locked.

Act IV: Build. In order: foundation CSS first (tokens, type scale, geometry, motion), then primitives (buttons, inputs, badges, selects), then data display (tables, cards, charts), then 3+ named surface scenes, then overlays (modals, toasts, tooltips, the CRITICAL modal).

Act V: Register. The agent appends the project to projects.json, mirrors to the inline window.__PROJECTS__ for the Studio gallery, and builds a thumb.html so the project shows up in the index.

The flow exists because design systems built in freeform chat drift. The five acts give the human (me) discrete review checkpoints: discovery, stance pick, variation pick, build progress, registration. I can intervene at any one. I don't have to discover three days later that the agent chose a stance I never approved.

Why the Designer Output Looks Like a Designer Made It

This is the part most AI design tools get wrong, and the part the agent gets right:

Real content, not placeholder. Every scene uses plausible names, varied entities, believable values, earned non-round numbers. Actual names that vary in length, actual transactions with values like ₦48,250 and not ₦1,000,000.00, actual dates that are recent and sensibly spaced. The visual rhythm of the design depends on real content, because real content is what the design actually has to hold.

Type does work. The agent picks a typography system where serif does the thinking ("Your monthly summary"), humanist sans does the chrome ("Settings · Notifications"), and mono does the record numbers (account IDs, transaction references). Three faces, three jobs, clear hierarchy. Not one face doing everything.

Numbers shout. When the most important thing on a screen is a number, the agent makes that number visibly the loudest object. Bigger than the heading. Sometimes by a lot. The eye should land on it first without thinking about it.

Red is reserved. Critical red is for irreversible actions and life-threatening conditions only. Amber for everything else that's "bad but recoverable." This single rule prevents the design from looking like a constant emergency.

Hairlines, not shadows. When in doubt, draw a line. Shadows have become AI design's default decoration; they look fine at first and tired by week two. Hairlines age better.

These rules live in the agent's persistent memory (design-system/notes/preferences.md), which the agent reads at every session start. They were learned by doing. The first attempt at the medcord design system was rejected by me with the words "AI garbage, most AI shit ui design I've ever seen." The second attempt, after the stance discipline was added, got "shit this is goooood." That delta is the value of the rules.

The Shipper's Discipline

The shipper has one hard rule, stated in four separate places in its system prompt because it's the rule the agent most wants to violate:

Never invent design, only translate.

The shipper takes the HTML the designer built and produces a real React component library. It does not improve, embellish, or "modernise" what the HTML shows. If the HTML has six button variants, the React lib has six button variants. If the HTML doesn't show a hover state on the cards, the React lib doesn't add one. If a prop has multiple reasonable shapes (controlled vs uncontrolled, portal vs inline), the shipper surfaces the question, doesn't guess.

The other rule: don't fight the repo. The shipper calibrates against the target codebase before writing anything. It detects the framework (React/Vue/Solid/Svelte), the version, the styling system, the path aliases, the file layout, whether components use named or default exports, where cn is imported from. If the existing convention contradicts the shipper's preference, the repo's convention wins. The new components must look like they belong in the repo, not like they were dropped in from a different project.
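To make the calibration step concrete, here is a minimal sketch of one slice of it: inferring the framework, its version, and whether Tailwind is present from a parsed package.json. This is an assumption about how such detection could work, not the shipper's actual logic; `calibrate` and the `Calibration` shape are hypothetical names, and the first matching framework wins.

```typescript
type Framework = "react" | "vue" | "solid" | "svelte" | "unknown";

interface PackageJson {
  dependencies?: Record<string, string>;
  devDependencies?: Record<string, string>;
}

interface Calibration {
  framework: Framework;
  version: string | null;
  usesTailwind: boolean;
}

// npm package name that signals each framework, checked in this order.
const FRAMEWORK_PACKAGES: Array<[Framework, string]> = [
  ["react", "react"],
  ["vue", "vue"],
  ["solid", "solid-js"],
  ["svelte", "svelte"],
];

// Read-only: inspects the manifest and reports, never writes anything.
function calibrate(pkg: PackageJson): Calibration {
  const deps: Record<string, string> = { ...pkg.devDependencies, ...pkg.dependencies };
  const usesTailwind = "tailwindcss" in deps;
  for (const [framework, packageName] of FRAMEWORK_PACKAGES) {
    if (packageName in deps) {
      return { framework, version: deps[packageName], usesTailwind };
    }
  }
  return { framework: "unknown", version: null, usesTailwind };
}
```

The result is a report for the human to confirm or correct, which is why the function returns findings instead of acting on them.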

This sounds obvious. It's the easiest thing for an AI to violate. Without the discipline, the shipper writes "improved" code that nobody asked for and nobody can review against the spec.

The Six-Act Ship Flow

The shipper runs its own structured flow, with checkpoints that prevent it from running away:

Act 0: Silent reads. Reads the Studio project's foundation CSS, the key HTML scenes, the migration guide. No output.

Act I: Calibrate (read-only). Detects everything about the target repo: framework, version, styling, conventions. Reports findings. Stops. Waits for me to confirm or correct the detected conventions before doing anything else.

Act II: Discovery. About 8 questions, each with a calibration-inferred default. Press enter to accept. Questions cover: component directory, naming, prop conventions, controlled vs uncontrolled defaults, portal preferences.
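The "press enter to accept" behaviour is simple to express in code. The sketch below assumes each question carries a default inferred during calibration; the `Question` shape and `resolveAnswers` helper are hypothetical, and the question keys are made up for illustration.

```typescript
interface Question {
  key: string;
  prompt: string;
  inferredDefault: string; // filled in from calibration findings
}

// An empty or whitespace-only answer means "press enter": keep the inferred default.
function resolveAnswers(
  questions: Question[],
  typed: Record<string, string>
): Record<string, string> {
  const resolved: Record<string, string> = {};
  for (const q of questions) {
    const answer = (typed[q.key] || "").trim();
    resolved[q.key] = answer === "" ? q.inferredDefault : answer;
  }
  return resolved;
}

// Hypothetical questions for illustration only.
const questions: Question[] = [
  { key: "componentDir", prompt: "Component directory?", inferredDefault: "src/components/ui" },
  { key: "exports", prompt: "Named or default exports?", inferredDefault: "named" },
];
```

Calling `resolveAnswers(questions, { exports: "default" })` keeps the inferred directory and overrides only the export style.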

Act III: Component plan. Writes a structured plan listing all ~28 components grouped by category. Stops. Waits for me to say "go" before writing any code.

Act IV: Generate. In order: tokens (extend globals.css), Tailwind config extension, cn util if missing, then components in checklist order with a checkpoint every 5 components. After each component is generated, it is immediately added to the preview/viewer page with all its props, variants, and states shown.

Act V: Wire up. Asks before any pnpm add or npm install. Writes a migration doc at <target>/docs/<slug>-MIGRATION.md explaining what was added and any follow-up steps.

Act VI: Notes & report. Appends notes/<target-repo-name>/shipped-<date>.md capturing what shipped, what was deferred, and any lessons learned.

The act structure is the safety mechanism. Every act ends with either a checkpoint (waiting for me) or a permission gate (asking before mutating). The shipper cannot run away: there are at least three places where the human is the one who unblocks the next step.

The Incremental Preview Lesson

The most important operational rule in the shipper, learned the hard way:

When building a component library, after building each single component the agent must immediately add it to the preview/viewer page, with all its props, variants, and states shown as samples, then move to the next component.

Not "build all 28 components, then wire all the previews at the end." Per-component preview, sequentially.

The reason: the preview is where I review. A component that isn't in the preview yet is invisible to me. Batching previews to the end leaves a long blind stretch where work piles up unreviewed, and by the time the previews land, I'm reviewing 28 components at once with no ability to course-correct any single one.
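The ordering is easier to see as a loop. This is a sketch under assumptions: in the real shipper these steps are agent actions, not function calls, so `ShipSteps` and `shipSequentially` are hypothetical names. The checkpoint interval of 5 comes from Act IV.

```typescript
// Hypothetical callbacks standing in for the agent's actions.
interface ShipSteps {
  generate(component: string): void;
  addToPreview(component: string): void;
  checkpoint(done: string[]): void;
}

// One component at a time: generate, then preview, then move on.
function shipSequentially(
  components: string[],
  steps: ShipSteps,
  checkpointEvery: number = 5
): void {
  const done: string[] = [];
  for (const component of components) {
    steps.generate(component);
    steps.addToPreview(component); // preview lands before the next component starts
    done.push(component);
    if (done.length % checkpointEvery === 0) {
      steps.checkpoint(done.slice()); // pause for human review
    }
  }
}
```

The design choice is that there is never more than one component generated but not yet visible in the preview, so every step can be reviewed as it lands.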

The rule also carries a "no sub-agents" qualifier, which follows from the same logic. Parallelising component generation across agents would re-introduce the blind stretch (I can't watch two threads at once), plus it removes my ability to scroll the main-thread transcript and audit step by step. The constraint is on observability, not on speed. The shipper could go faster with parallel agents. It would also become impossible to trust.

Permission-Gated Mutations

The shipper has explicit guardrails on what it can change without asking:

  • Never write a file before I say "go" in Act III. Calibration is read-only.
  • Never overwrite an existing file without showing the diff first. This applies to tailwind.config.*, globals.css, package.json, main.tsx. Diff, then ask.
  • Never run shell side effects (install, format, test) without asking. Show the exact command, wait for confirmation.
  • Never skip git hooks or signing. Even if pre-commit fails, fix the cause; don't bypass.
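The four guardrails above can be read as a single decision function. The sketch below is an illustration, not the shipper's implementation: the guardrails live in a system prompt, and `gate`, `Mutation`, and `GateState` are hypothetical names. The only real-world specifics are git's `--no-verify` and `--no-gpg-sign` flags, which are the usual ways to bypass hooks and signing.

```typescript
type Mutation =
  | { kind: "write"; path: string; exists: boolean; diff?: string }
  | { kind: "shell"; command: string };

type Decision = { allowed: true } | { allowed: false; reason: string };

interface GateState {
  planApproved: boolean; // becomes true once the human says "go"
}

function gate(
  m: Mutation,
  state: GateState,
  confirm: (prompt: string) => boolean
): Decision {
  if (m.kind === "write") {
    if (!state.planApproved) {
      return { allowed: false, reason: "plan not approved yet" };
    }
    if (m.exists) {
      // Overwrites need a diff shown first, then an explicit yes.
      if (m.diff === undefined) {
        return { allowed: false, reason: "existing file needs a diff first" };
      }
      if (!confirm("Overwrite " + m.path + "?\n" + m.diff)) {
        return { allowed: false, reason: "user declined overwrite" };
      }
    }
    return { allowed: true };
  }
  // Shell side effects: never bypass hooks or signing, always show the command.
  if (/--no-verify|--no-gpg-sign/.test(m.command)) {
    return { allowed: false, reason: "bypassing git hooks or signing is not allowed" };
  }
  return confirm("Run: " + m.command)
    ? { allowed: true }
    : { allowed: false, reason: "user declined command" };
}
```

Note that the bypass check runs before the confirmation prompt, so even a "yes" from the human cannot approve a skipped hook.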

These exist because the shipper is operating inside my real codebase. A mistake at this layer isn't a bad Figma export; it's a corrupted Tailwind config or a botched component file that breaks the app.

What This Produces

A complete design system shipped to a real React codebase in about two days, end to end:

  • ~28 components, all in the same visual stance
  • Design tokens wired into the target's Tailwind config
  • A preview/viewer page showing every component with every variant and state
  • A migration doc explaining how to consume it
  • Persistent memory entries in notes/<target-repo-name>/shipped-<date>.md capturing what was built and what was deferred

The components look like a designer made them, because the design discipline lives in the designer agent, the implementation discipline lives in the shipper agent, and neither one is asked to do the other's job.

The full how-to: docs/how-to-use-design-system.md. Both system prompts and the act-by-act flow are open source.


Multi-Agent Orchestration: When to Fork, When Not To

I do not treat sub-agents as the default. The pattern across my work is single thread plus persona swap; parallel agent swarms come in only when the work demands them.

Multi-agent / persona handoff is allowed when:

  • Work crosses a discipline boundary with a hard artifact at the seam: backend → backend QA → frontend → frontend QA. Each persona produces a structured handoff doc. The handoff is the shared context.
  • Long-running unattended creative work (e.g. a batch of demo films). Pre-decided fallback ladders so the agent has authority to keep moving without me.

The nuance: sub-agents are fine when the user is absent and the artifact-at-handoff is well-defined. Sub-agents are forbidden when the user is watching or when shared mutable state (browser session, preview page, live-edited repo) can't survive a fork.

How agents hand off

Three layers, in priority order:

Layer 1: Handoff documents (canonical). Every persona has a templated handoff format. The fullstack persona ships two distinct templates (Frontend QA + Backend QA) plus a Contract Drift Checklist that runs at the seam: Zod schema field names match frontend TS type field names exactly, nullable fields match, pagination shape matches, money fields are integers, dates are ISO 8601 strings, empty arrays are [] not null, error handlers check error.code not error.message.
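Several of the checklist items are mechanical enough to script. The sketch below covers a subset (field names, nullability, integer money, ISO 8601 dates, [] instead of null) using plain objects so it runs without Zod. The `Contract` shape, both helper names, and the sample field names used in the usage note are hypothetical. The date pattern only accepts full date-times with a timezone, which is an assumption about what "ISO 8601 strings" means at this seam.

```typescript
type Kind = "string" | "money" | "date" | "array";

interface FieldSpec {
  kind: Kind;
  nullable: boolean;
}

type Contract = Record<string, FieldSpec>;

// Assumption: dates cross the seam as full date-times with a timezone.
const ISO_8601 = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$/;

function isInteger(v: unknown): boolean {
  return typeof v === "number" && isFinite(v) && Math.floor(v) === v;
}

// Seam check 1: both sides declare the same fields with the same nullability.
function diffContracts(backend: Contract, frontend: Contract): string[] {
  const problems: string[] = [];
  for (const name of Object.keys(backend)) {
    const f = frontend[name];
    if (f === undefined) problems.push(name + ": missing on frontend");
    else if (f.nullable !== backend[name].nullable) problems.push(name + ": nullable mismatch");
  }
  for (const name of Object.keys(frontend)) {
    if (backend[name] === undefined) problems.push(name + ": missing on backend");
  }
  return problems;
}

// Seam check 2: a real response payload obeys the value rules.
function checkPayload(contract: Contract, payload: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const name of Object.keys(contract)) {
    const spec = contract[name];
    const value = payload[name];
    if (value === null || value === undefined) {
      if (spec.kind === "array") problems.push(name + ": must be [] not null");
      else if (!spec.nullable) problems.push(name + ": null but not nullable");
      continue;
    }
    if (spec.kind === "money" && !isInteger(value)) {
      problems.push(name + ": money must be an integer");
    }
    if (spec.kind === "date" && !(typeof value === "string" && ISO_8601.test(value))) {
      problems.push(name + ": date must be ISO 8601");
    }
    if (spec.kind === "array" && !Array.isArray(value)) {
      problems.push(name + ": expected an array");
    }
  }
  return problems;
}
```

A backend field `createdAt` paired with a frontend field `created_at` shows up as two findings from `diffContracts`, one per side, which is exactly the kind of drift the checklist exists to catch.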

Layer 2: Persistent memory graph. In Claude Code, ~/.claude/projects/.../memory/ stores cross-linked memory files. Memories are named for what triggered them (feedback-ship-preview-incremental, agent-browser-demo-gotchas). I expect the AI to "traverse memory graph-style, not read each file in isolation."
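"Graph-style" traversal can be pictured as a breadth-first walk over linked files. The sketch below is an illustration only: the `MemoryFile` shape, the way links are represented, and the specific links between memories are all assumptions, since the on-disk format isn't described here.

```typescript
// Hypothetical shape: each memory file lists the names of memories it links to.
interface MemoryFile {
  name: string;
  links: string[];
}

// Breadth-first walk from one memory, returning names in the order visited.
// Cycles and dangling links are skipped instead of failing.
function traverse(memories: Record<string, MemoryFile>, start: string): string[] {
  const visited: string[] = [];
  const queue: string[] = [start];
  while (queue.length > 0) {
    const name = queue.shift() as string;
    if (visited.indexOf(name) !== -1 || memories[name] === undefined) continue;
    visited.push(name);
    for (const link of memories[name].links) queue.push(link);
  }
  return visited;
}

// The links below are invented for the example; only the first two names
// appear in the article.
const memories: Record<string, MemoryFile> = {
  "feedback-ship-preview-incremental": {
    name: "feedback-ship-preview-incremental",
    links: ["agent-browser-demo-gotchas", "example-missing-memory"],
  },
  "agent-browser-demo-gotchas": {
    name: "agent-browser-demo-gotchas",
    links: ["feedback-ship-preview-incremental"],
  },
};
```

Starting from one triggering memory, the walk pulls in everything reachable from it once, which is the difference between following links and reading each file in isolation.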

Layer 3: Project guides as a read-order. Project-level agent-handoff.md files are addressed "to the next AI agent... read this once and operate at the same quality bar from the first message" and prescribe a numbered must-read order through the project's docs. Handoff is not a chat, it's a curriculum.

Anti-pattern: plain-text "here's what I was doing" handoffs. Everything is structured. Always.

The browser is the source of truth

The orchestration safety net:

  • QA frontend persona never reports PASS without browser verification.
  • AI tools lie about success. The agent-browser memory explicitly says: "reports success but the file lands elsewhere." Mitigation: verify state, don't trust default success signals.
  • The reload-to-verify-persistence step in the standard QA loop is the circuit breaker. After every mutation, reload and re-read the DOM.

When the AI's report disagrees with the browser, the browser wins.
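The reload-to-verify step can be sketched as a small function. This is a minimal illustration under assumptions: `BrowserSession` is a hypothetical stand-in, not agent-browser's actual interface, and `verifyPersistence` is a made-up helper name.

```typescript
// Hypothetical minimal browser surface. agent-browser's real commands are not modelled here.
interface BrowserSession {
  reload(): void;
  readText(selector: string): string | null;
}

type Verdict = { status: "PASS" } | { status: "FAIL"; reason: string };

function verifyPersistence(
  browser: BrowserSession,
  mutate: () => boolean, // returns what the tool claims: true means "success"
  selector: string,
  expected: string
): Verdict {
  const claimedSuccess = mutate();
  browser.reload(); // discard optimistic client-side state
  const actual = browser.readText(selector);
  if (actual === expected) return { status: "PASS" };
  return {
    status: "FAIL",
    reason:
      "tool claimed " + (claimedSuccess ? "success" : "failure") +
      ", DOM shows " + JSON.stringify(actual),
  };
}
```

The verdict never depends on `claimedSuccess`. The claim is only recorded in the failure reason, so a tool that reports success while the DOM disagrees still fails.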


The Demo Director: AI for Things I'm Bad At

A meta-pattern: I build agents to fix my own weaknesses.

The longest persona in the repo is demo-director.md at ~27KB. It's the persona for the work I don't know how to do: marketing demos and launch films. The persona file is rich because the gap between my native ability and the required output is wide. Personas grow proportional to the gap.

The Demo Director's identity (verbatim):

You are the Demo Director. Your job is to make a product look as good as it actually is, and, when the moment calls for it, to make it look like a launch film. You are a product-marketing-minded frontend engineer with a cinematographer's eye and a motion designer's hands. You do not build features — you reveal them.

Your taste is clean, cinematic, art-directed, honest, and luxurious.

The hard-won rule

The naive approach (drive the live app and have agent-browser record it) produces a dashcam, not a film: it captures every dead moment while the agent thinks, and has no authored pacing, no animations between states, no pointer, no zoom, no callouts, no audio. Do not make films this way. It was tried and it sucked.

Instead, films are rendered deterministically. The renderer is Remotion (React → MP4).

The architecture

agent-browser =  RECON: scrape the real app's markup + computed styles + screenshots + real data

React (Remotion) =  REBUILD the UI faithfully as self-contained components (no cross-tree imports)
                     ANIMATE with Remotion interpolate/spring (+ Framer Motion where its ergonomics help)

Director Kit = reusable primitives: <SceneCard> <Caption> <Pointer> <Spotlight> <ZoomTo> <Reveal>

script.ts = the screenplay: ordered scenes, durations, copy, pointer paths, highlight targets, audio cues

Remotion render = frame-perfect MP4; audio muxed in; then ffmpeg for MP4 → GIF if needed
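To give a feel for the screenplay layer, here is a minimal sketch of what a script.ts could contain. The real file's shape isn't shown in this article, so the `Scene` fields, the helper names, the 30 fps figure, and the sample captions are all assumptions. The `from` and `durationInFrames` names are chosen to mirror the props Remotion's <Sequence> takes.

```typescript
interface Scene {
  id: string;
  durationInFrames: number;
  caption: string;
  highlightTarget?: string; // hypothetical: what <Spotlight> should focus on
  audioCue?: string;
}

interface ScheduledScene extends Scene {
  from: number; // first frame of the scene
}

// Lays scenes end to end so each scene's start frame is fixed before rendering.
function schedule(scenes: Scene[]): ScheduledScene[] {
  let cursor = 0;
  return scenes.map((scene) => {
    const scheduled: ScheduledScene = { ...scene, from: cursor };
    cursor += scene.durationInFrames;
    return scheduled;
  });
}

function totalFrames(scenes: Scene[]): number {
  return scenes.reduce((sum, scene) => sum + scene.durationInFrames, 0);
}

const FPS = 30; // assumption for the example

// Invented scenes for illustration.
const script: Scene[] = [
  { id: "open", durationInFrames: 90, caption: "Your month, at a glance" },
  { id: "drill", durationInFrames: 150, caption: "Every transaction, one tap away", highlightTarget: "transactions-row" },
  { id: "close", durationInFrames: 120, caption: "Ship it", audioCue: "outro" },
];
```

Because every start frame is computed up front, pacing is authored in the script and never depends on how long an agent takes to think.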

The three data strategies (always present all, recommend one, ask before building):

  1. Use the product's real/built-in data. Most honest path.
  2. Mock at the network layer: agent-browser network route "**/api/..." --body '{...}'.
  3. Seed real data via the app's own flows/seed scripts, then drive the real UI.

The guardrails:

  • Films are rendered, never screen-recorded.
  • Honest, always. Dramatization and set dressing allowed; fabricating features is not.
  • Recreate from the real thing, not from memory.
  • Retina or it didn't happen. Stills at scale 2. Films at 2x density / 1080p+.
  • No "test test" data, ever.

Deeper dive in a future article. The point for now: the persona that codifies the work I'm bad at is the most elaborate persona in the repo. That's the meta-pattern: personas grow proportional to the gap between your ability and the work.


Self-Built Skills: agent-browser, Director Kit

Two skills I built when no existing tool fit.

agent-browser

A CLI tool that exposes a persistent Chromium daemon as bash commands. It's the foundation of the QA frontend agent and the Demo Director's recon step.

Why I rely on a tool of this shape: Playwright, Cypress, and Puppeteer are built for humans writing test files in their DSL. That's the right shape for a regression suite. It's the wrong shape for an AI agent doing exploratory testing in the middle of a conversation.

agent-browser lets the AI agent drive the browser in plain Bash. The agent thinks in shell commands, not in test framework idioms. That made the QA agent's prompts dramatically shorter and meaningfully more reliable — and it doesn't conflict with Playwright at all. The agent uses Playwright for the regression suite it's running, and agent-browser for the exploratory work it's doing on top.

The full reference is in skills/agent-browser.md. The field guide for QA usage is in skills/agent-browser-qa-guide.md.

The Director Kit

A set of React primitives the Demo Director composes into scenes: <SceneCard>, <Caption>, <Pointer> (bezier paths + click pulses), <Spotlight> (dim to focus), <ZoomTo> (camera moves), <Shape>, <AudioCue>, <Reveal>, <BrowserFrame>.

Each primitive composes deterministically against Remotion's frame clock. The Demo Director persona doesn't have to think about animation math; it composes primitives, and the math is correct by construction.
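"Correct by construction" here means each primitive is a pure function of the frame number. The Director Kit's source isn't public, so the sketch below is a guess at the idea and not its implementation: a pointer position computed from a quadratic bezier path with smoothstep easing. `pointerAt` and its parameters are hypothetical.

```typescript
interface Point {
  x: number;
  y: number;
}

function clamp01(t: number): number {
  return Math.min(1, Math.max(0, t));
}

// Smoothstep easing: slow start, slow stop.
function smoothstep(t: number): number {
  return t * t * (3 - 2 * t);
}

// Pointer position on a quadratic bezier, as a pure function of the frame.
// The same frame always yields the same point, so renders are repeatable.
function pointerAt(
  frame: number,
  startFrame: number,
  durationInFrames: number,
  from: Point,
  control: Point,
  to: Point
): Point {
  const t = smoothstep(clamp01((frame - startFrame) / durationInFrames));
  const u = 1 - t;
  return {
    x: u * u * from.x + 2 * u * t * control.x + t * t * to.x,
    y: u * u * from.y + 2 * u * t * control.y + t * t * to.y,
  };
}
```

Nothing in the function reads a clock or keeps state, which is what lets a renderer evaluate frames in any order and still get the same film.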

The Director Kit isn't open-sourced separately yet. The patterns it encodes are in the Demo Director persona.


What I'd Build Next

Honest gaps in the current system, in order of impact:

  1. A landing-page persona. The chain has a hole between Demo Director output and the actual landing page repo. Right now I build landing pages ad hoc. Codifying the pattern (which Director assets go where, what copy structure converts, how the AI-discoverability layer plugs in) would close it.
  2. A "fresh repo onboarding" agent persona. The question "I'm a new agent in a Feranmi project, what do I read and in what order?" is currently answered per project via agent-handoff.md files. A universal version would reduce session-start friction.

The Honest Conclusion

The "AI as force multiplier" narrative is real. But it works for me because I've been bitten enough times to know what to encode. What matters is the practice (bug → rule → spec → next agent inherits it). The files are just where the practice lives.

The files are at github.com/spiderocious/agentic-workflow. Fork it. Edit it. Ignore what doesn't apply. The whole point of the modular structure is to make that easy.

What I hope this two-part series did is show one specific way to use AI seriously in production, not as a toy, not as a magic box, but as a contractor with a reference library of your team's lessons. The reference library is the work. The AI is the means.

The AI doesn't make you a better engineer. It executes the engineer you already are, at higher throughput. Encode the engineer you want to be.

Bye.