AI agent audit: ten checks before we write a line of code

An operations lead emails us asking for an AI agent that handles support mail. We don't reply with a quote. We reply with a ten-question audit.

Jacob Molkenboer
Founder · A Brand New Company
Published: 13 May 2026 · Reading time: 8 min read · Category: Strategy

The operations lead at a Rotterdam logistics firm emailed on a Monday: "We need an AI agent to handle our support inbox. Can you scope it?" Two paragraphs, no attachments, no example tickets. We did not reply with a quote. We replied with a 40-minute call and a shared doc with ten questions. By Friday we had a written yes or no, with a reason. That doc is the audit. It is the most valuable hour of work in the whole engagement.

Most agent projects fail at this step, not at the build. The build is the easy part. The deciding-what-to-build is where the budget gets burned, and where the agent eventually does something embarrassing in production six months later. So we wrote down what we look at, every time, before any agent ships.

What the audit is, what it isn't

It is not a discovery workshop with sticky notes and printed personas. It is a ten-question checklist that we run before any agent build, regardless of size. The point is to find the project that should not exist before we charge anyone for it. Three out of ten audits end with "don't build this, fix the underlying process first." That is the audit doing its job, not failing it.

The questions below are boring. That is also the point. There has been a flood of AI agent content this year, and a developer wrote on r/webdev this week that they mass-unsubscribed from every AI newsletter and their brain finally worked again. The audit is built for the readers in that thread: people who already know the demos and just want to know whether a build is worth the money.

Volume and repeatability

The first cut is the cheapest: how many times per week does the work happen, and how similar are the instances?

Under 20 instances per week, with significant variation between them, the answer is almost always no. Not because an agent cannot do it, but because the cost of building, monitoring, and maintaining the agent will outrun the hours saved for years. We have watched founders spend €15k automating a task their office manager does in 90 minutes a week. The math never recovers.
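To make that arithmetic concrete, here is a minimal break-even sketch. The €15k build cost and the 90 minutes a week come from the example above; the €40 hourly cost and the function name `breakeven_weeks` are assumptions for illustration, so plug in your own figures.

```python
def breakeven_weeks(build_cost_eur: float,
                    hours_saved_per_week: float,
                    hourly_cost_eur: float,
                    weekly_run_cost_eur: float = 0.0) -> float:
    """Weeks until cumulative savings cover the one-off build cost."""
    weekly_saving = hours_saved_per_week * hourly_cost_eur - weekly_run_cost_eur
    if weekly_saving <= 0:
        # Monitoring and maintenance eat the whole saving: it never pays back.
        return float("inf")
    return build_cost_eur / weekly_saving


# EUR 15k build, 1.5 hours saved per week, ASSUMED EUR 40/hour staff cost.
weeks = breakeven_weeks(15_000, 1.5, 40.0)
print(f"{weeks:.0f} weeks, about {weeks / 52:.1f} years")
# prints "250 weeks, about 4.8 years"
```

That is before any monitoring or maintenance cost is counted. Pass a non-zero `weekly_run_cost_eur` and the payback period only gets longer.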

We hand clients a one-line check they can run on a mail archive export:

find ./inbox -name "*.eml" -print0 \
  | xargs -0 grep -h -i -m1 "^Subject:" \
  | sed -E 's/^Subject:[[:space:]]*//I; s/^((re|fwd?):[[:space:]]*)+//I' \
  | sort | uniq -c | sort -rn \
  | head -50

We want to see the long tail collapse. If 40% of inbound mail is one of five subject patterns, we have an agent project. If the top 50 patterns each represent 2% of volume, we don't, and we say so out loud.
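The 40% test can be scripted too. The sketch below is one rough way to do it, assuming an exact match on the normalised subject line is a good enough proxy for a "pattern" (real inboxes usually need fuzzier grouping). The names `normalise` and `top_k_share` are illustrative.

```python
import re
from collections import Counter

# Leading reply/forward prefixes such as "Re:", "FW:", "Fwd: Re:".
PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)


def normalise(subject: str) -> str:
    """Strip reply/forward prefixes and fold case so duplicates collapse."""
    return PREFIX.sub("", subject).strip().lower()


def top_k_share(subjects: list[str], k: int = 5) -> float:
    """Fraction of all mail covered by the k most common subject lines."""
    counts = Counter(normalise(s) for s in subjects)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total


subjects = ["Re: Where is my order", "where is my order",
            "FWD: Invoice", "Invoice", "Other"]
share = top_k_share(subjects, k=2)
print(f"top-2 share: {share:.0%}", "-> candidate" if share >= 0.40 else "-> long tail")
```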

The data access reality

Half of agent projects die here. The client wants an agent that reads from System A, writes to System B, and cross-references CRM C. We ask for API keys. System A has no API. System B requires a SOAP login that hasn't been used since 2019. The CRM is a Notion workspace where one person has admin and that person is on parental leave.

So the audit forces us to write down, for every system the agent will touch, a five-line spec: read or write, transport (API, IMAP, scraping, database, file drop), rate limit, auth mechanism, and who controls credential rotation. If three of those answers come back as "we'd have to scrape it," the project is a back-end rebuild first, then maybe an agent. We tell the client. Sometimes they thank us. Sometimes they hire someone else who will scrape, then call us six months later when the scraper breaks.
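The five-line spec fits in a small record. This is a sketch only: the class name, field names, and example values are assumptions. It reads "three of those answers" as three systems whose transport is scraping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemAccess:
    name: str
    mode: str            # "read" or "write"
    transport: str       # "api", "imap", "scraping", "database", "file_drop"
    rate_limit: str      # free text, e.g. "60 requests/minute" or "unknown"
    auth: str            # e.g. "oauth2", "api_key", "soap_login"
    rotation_owner: str  # the person who controls credential rotation


def needs_backend_rebuild(systems: list[SystemAccess], scrape_limit: int = 3) -> bool:
    """True when at least scrape_limit systems can only be reached by scraping."""
    scraped = sum(1 for s in systems if s.transport == "scraping")
    return scraped >= scrape_limit


# Hypothetical inventory for illustration only.
inventory = [
    SystemAccess("system_a", "read", "scraping", "unknown", "session_cookie", "nobody"),
    SystemAccess("system_b", "write", "api", "unknown", "soap_login", "it_lead"),
    SystemAccess("crm_c", "read", "scraping", "unknown", "personal_login", "workspace_admin"),
]
print(needs_backend_rebuild(inventory))  # prints False: two scraped systems, not three
```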

The decision boundary

This is where most agent demos quietly cheat. The demo agent in the YouTube video makes decisions because the consequence of being wrong is "the demo looks bad." Your production agent makes decisions where being wrong sends an invoice to the wrong client, replies to a journalist with a hallucinated quote, or refunds €4,000 to a customer who hasn't asked for a refund.

So we draw a line. On one side, things the agent does unsupervised. On the other, things it drafts for a human to approve. The line is written down, signed off, and reproduced verbatim in the agent's own system prompt.

Warning

If the cost of being wrong on a given action is higher than the cost of one human approval click, that action belongs behind an approval gate. No exceptions, including for clients who insist they trust the model.
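Written as code, the rule is a single comparison. The sketch assumes you can put a rough euro figure on both sides. The `Action` type, the `route` function, and the €2 cost per approval click are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    cost_if_wrong_eur: float  # rough estimate of the damage from one wrong call


def route(action: Action, approval_click_cost_eur: float) -> str:
    """Send an action behind the approval gate when being wrong costs more than a click."""
    if action.cost_if_wrong_eur > approval_click_cost_eur:
        return "draft_for_approval"
    return "execute"


# ASSUMED: one human approval click costs about EUR 2 of staff time.
print(route(Action("refund_customer", 4000.0), 2.0))  # prints draft_for_approval
print(route(Action("add_internal_tag", 0.0), 2.0))    # prints execute
```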

Source of truth and the write path

Where is the canonical answer for each piece of data the agent touches? If a customer's address lives in three systems and they disagree, the agent will pick one and be wrong against the other two. We need to know which one wins, and we need to write that down before any code is written.
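One way to write "which one wins" down is as a precedence list per field. The sketch below also reports whether the systems disagreed, so a conflict can go to a human rather than pass silently. The system names and the `resolve` function are hypothetical.

```python
from typing import Optional


def resolve(values: dict[str, Optional[str]],
            precedence: list[str]) -> tuple[Optional[str], bool]:
    """Return (winning value, conflict flag) for one field.

    values maps system name to that system's value (None if missing).
    precedence lists systems from most to least authoritative.
    """
    present = {system: v for system, v in values.items() if v is not None}
    conflict = len(set(present.values())) > 1
    for system in precedence:
        if system in present:
            return present[system], conflict
    return None, conflict


# Hypothetical: three systems hold an address and one disagrees.
address, conflict = resolve(
    {"crm": "Coolsingel 1", "erp": "Coolsingel 10", "billing": "Coolsingel 1"},
    precedence=["erp", "crm", "billing"],
)
print(address, conflict)  # prints "Coolsingel 10 True"
```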

Write paths get extra scrutiny. The audit asks one question: if the agent writes to this system, can a human reverse the write within five minutes? If no, the write goes through a draft-and-approve gate. We have not regretted that rule once.

Observability before it ships, not after

Every agent we ship runs with structured logging on every tool call, every prompt, every response, every decision. Not because we plan to look at the logs, but because the day will come when we have to look at the logs, and that day will be a Tuesday, and the founder will be on a call with their biggest customer.

A reasonable log line looks like this:

{
  "ts": "2026-05-13T09:14:22Z",
  "agent": "invoice-chase",
  "run_id": "rn_01HXY8R3K2",
  "tool": "send_email",
  "input_hash": "sha256:9f2b8c…",
  "decision": "draft_for_approval",
  "approver": "anna@client.nl",
  "latency_ms": 1840,
  "tokens_in": 3204,
  "tokens_out": 412
}

The Voker launch that hit Hacker News this week is about exactly this: agent-specific analytics is becoming its own product category, because logs from a chat model are not logs from a Rails app. You need to be able to replay the prompt, see the tool calls in order, and link each decision to a specific commit and a specific approver. If your agent vendor cannot show you that view, that is a red flag the audit catches before procurement does.
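A minimal sketch of emitting a line in that shape, using only the Python standard library. The function name `log_line` is an assumption. So is the `commit` field: it is not in the sample line and is added here only because each decision should link to a specific commit.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def log_line(*, agent: str, run_id: str, tool: str, tool_input: str,
             decision: str, approver: Optional[str], latency_ms: int,
             tokens_in: int, tokens_out: int,
             commit: Optional[str] = None) -> str:
    """Build one JSON log line for a single tool call."""
    record = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "agent": agent,
        "run_id": run_id,
        "tool": tool,
        # Hash the input so the line can be matched to a stored prompt
        # without copying customer data into the log stream.
        "input_hash": "sha256:" + hashlib.sha256(tool_input.encode("utf-8")).hexdigest(),
        "decision": decision,
        "approver": approver,
        "latency_ms": latency_ms,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "commit": commit,  # ASSUMED extra field, not in the sample line
    }
    return json.dumps(record, ensure_ascii=False)


print(log_line(agent="invoice-chase", run_id="rn_example", tool="send_email",
               tool_input="draft reminder for an overdue invoice",
               decision="draft_for_approval", approver="anna@client.nl",
               latency_ms=1840, tokens_in=3204, tokens_out=412))
```

Hashing the input keeps the log line small, but replay still needs the full prompt stored somewhere the hash can point to.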

Security and credentials

This is the section nobody wants to read. Read it anyway.

Open the CISA Known Exploited Vulnerabilities catalog and you will see new CVEs land every week, many of them in the CMSes, plugins, and ERPs that small businesses sit on top of. An agent with credentials to a vulnerable system is a credential-leaking accelerator, not a magic shield. The audit checks where the agent's secrets live, who can rotate them, and what happens if the agent's host is compromised at 3am on a Sunday.

We use scoped service accounts with the minimum permissions to do the job. No shared admin keys. No long-lived OAuth tokens parked in a .env file in a private GitHub repo. (We have seen that. The repo was not as private as the founder thought.) The OWASP Top 10 for LLM Applications covers most of the categories worth checking; we treat it as a baseline, not a ceiling.
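The minimum-permissions check reduces to a set difference between what a service account has been granted and what the job needs. A minimal sketch, with made-up scope names; how you list granted scopes depends on the provider and is not shown.

```python
def excess_scopes(granted: set[str], required: set[str]) -> set[str]:
    """Permissions the service account holds but the job does not need."""
    return granted - required


def missing_scopes(granted: set[str], required: set[str]) -> set[str]:
    """Permissions the job needs but the service account lacks."""
    return required - granted


# Hypothetical scope names for illustration only.
granted = {"mail.read", "mail.send", "directory.admin"}
required = {"mail.read", "mail.send"}
print(sorted(excess_scopes(granted, required)))   # prints ['directory.admin']
print(sorted(missing_scopes(granted, required)))  # prints []
```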

Handoff and the kill switch

Every agent we build has an escape hatch. Three things must be true. The agent can recognise that it is outside its competence and stop. The handoff has a named human owner, not a shared mailbox. The handoff carries enough context that the human does not have to redo the work from scratch.

If you cannot name the human, you do not have an agent project. You have a science fair demo.
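The three conditions can be checked on the handoff record itself. In this sketch the `Handoff` class and the list of shared-mailbox names are assumptions; a real check would look the owner up in a staff directory.

```python
from dataclasses import dataclass, field

# ASSUMED list of local parts that usually mean a shared mailbox.
SHARED_LOCAL_PARTS = {"support", "info", "team", "helpdesk", "sales"}


@dataclass
class Handoff:
    owner_email: str                             # the named human taking over
    reason: str                                  # why the agent stopped
    context: dict = field(default_factory=dict)  # the work done so far

    def problems(self) -> list[str]:
        """Return the reasons this handoff is not acceptable (empty if fine)."""
        issues = []
        local_part = self.owner_email.split("@", 1)[0].lower()
        if not self.owner_email or local_part in SHARED_LOCAL_PARTS:
            issues.append("owner must be a named person, not a shared mailbox")
        if not self.reason:
            issues.append("agent must state why it stopped")
        if not self.context:
            issues.append("handoff must carry the work done so far")
        return issues


good = Handoff("anna@client.nl", "refund above approved limit", {"ticket": "example-1"})
bad = Handoff("support@client.nl", "", {})
print(good.problems())      # prints []
print(len(bad.problems()))  # prints 3
```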

The kill switch is the same idea, scaled up: one command, one place, one person can execute it. Not a multi-step PagerDuty page. Not "ask the developer who built it." A single line in a runbook that any operations lead can run at 11pm on a Saturday without paging the studio.
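One simple way to get a single-command kill switch is a flag file the agent checks before every step. This is a sketch under that assumption, not the only design. The path and function names are hypothetical, and the matching runbook line would be `touch /var/run/agent/STOP`.

```python
from pathlib import Path
from typing import Callable

KILL_FILE = Path("/var/run/agent/STOP")  # hypothetical location


def agent_enabled(kill_file: Path = KILL_FILE) -> bool:
    """The agent may act only while the kill file is absent."""
    return not kill_file.exists()


def run_step(step: Callable[[], str], kill_file: Path = KILL_FILE) -> str:
    """Check the switch before every step, not just once at startup."""
    if not agent_enabled(kill_file):
        return "halted"
    return step()


print(run_step(lambda: "sent draft to review queue"))
```

Checking before each step means that a run already in progress stops at its next action. Without that, the switch only stops new runs.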

The two-week shadow run

The last question before we sign: can we run the agent in shadow mode for two weeks? Same inputs as production, but outputs go to a review queue instead of to customers. We compare the queue against what the humans actually did. If shadow disagrees with humans more than 15% of the time on routine cases, we don't ship. We tune. Sometimes we tune for another two weeks. Occasionally we conclude that the workflow is harder than the audit suggested, and we renegotiate scope.
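The shadow comparison is a disagreement rate over paired outcomes. The sketch assumes each routine case reduces to one comparable label for the agent and one for the human (for example, the chosen reply template). The function names are illustrative.

```python
def disagreement_rate(pairs: list[tuple[str, str]]) -> float:
    """Share of cases where the agent's outcome differs from the human's."""
    if not pairs:
        raise ValueError("no shadow cases to compare")
    disagreements = sum(1 for agent_out, human_out in pairs if agent_out != human_out)
    return disagreements / len(pairs)


def ready_to_ship(pairs: list[tuple[str, str]], threshold: float = 0.15) -> bool:
    """Ship only if disagreement does not exceed the threshold."""
    return disagreement_rate(pairs) <= threshold


# 20 routine cases, 4 of which the agent handled differently from the human.
cases = [("refund", "refund")] * 16 + [("refund", "escalate")] * 4
print(f"{disagreement_rate(cases):.0%}", ready_to_ship(cases))  # prints "20% False"
```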

The r/artificial thread this week titled "an enormous crash just waiting to happen" is mostly about valuations and capex, but the same logic applies one floor down at the agent level. Most production failures we have seen come from agents shipped without a shadow run, sold on the strength of a demo. The audit is the cheap insurance against that pattern.

The smallest thing you can do today

Pick one workflow on your team that someone has complained about three times this quarter. Open a blank doc. Answer the ten questions above for that workflow: volume, repeatability, data access, decision boundary, source of truth, write path, observability, security, handoff, shadow run. If you cannot finish in under an hour, you do not have an agent project yet. You have a process documentation project, which is cheaper and a prerequisite anyway.

When we built the inbox triage agent for that Rotterdam logistics firm, the thing we ran into was that two of their three systems of record disagreed on customer phone numbers in 8% of cases. We solved it by making the agent surface the conflict to a human instead of guessing, then logging the resolution back to a single canonical record. The printable version of this checklist is the same one we use on every AI agent engagement. Email us and we'll send it over.

Frequently asked

How long does the audit take?
Forty minutes on a call plus a shared doc the client fills in over a week. Total elapsed time is usually under ten working days from first email to a written go or no-go.
What if I fail the audit?
Most failures are not really failures. They are a sign the underlying process needs documentation or data cleanup before an agent can sit on top of it. That work is cheaper than the build would have been.
Do you charge for the audit?
The first 40-minute call is free. If the audit turns into a full scoping doc with diagrams and access spec, we quote a fixed fee that we credit back if you proceed with the build.
Why a 15% disagreement threshold in shadow mode?
It is the level where a human reviewer can spot-check the agent's output without redoing the work. Above that, the agent costs more time than it saves. The exact number is workflow-specific; 15% is a starting line.

Want to build something similar?

Send us one paragraph about the process that eats the most of your week. We'll reply with an honest plan, within four hours on weekdays.