Skip to content

Email agent case study: 6 hours of invoice chasing to 20 minutes

A 14-person agency was spending six hours a week chasing overdue invoices. One email agent, three weeks of work, and the ritual now takes twenty minutes.

Jacob Molkenboer
Jacob Molkenboer
Founder · A Brand New Company
Published
22 April 2026
Reading time
9 min read
Category
Email automation
Unopened cream envelope on forest leather blotter, green silk ribbon across it, brass paperclip and linen receipt beside.

Friday, 9am. The operations lead at a Rotterdam agency opens Xero, exports the aged receivables report to CSV, cross-references it against a Google Sheet of client notes ("don't chase before the 15th", "only email Marco, never Sophie"), then starts writing emails. Polite on the first chase. Firmer on the second. By the third chase she's copying the founder and rewriting the same sentence for the eleventh time that quarter. The whole ritual takes her six hours. Every week.

We replaced it with an email agent. The ritual now takes twenty minutes, and those twenty minutes are spent reading drafts and clicking approve. Here is how we built it, what broke, and what we would do differently.

The client and the actual problem

The agency runs about €3.2M in annual revenue across roughly 40 active clients on retainer, plus project work. Average invoice is €4,800. Average days-sales-outstanding when we started: 47 days. Their target was 30.

The operations lead, Eline, wasn't the bottleneck because she was slow. She was the bottleneck because the work was irreducibly human: every client has a different tone, a different contact, a different payment quirk. One client pays on the 28th of every month regardless of invoice date. Another needs a PDF resent because their finance tool strips attachments. A third is a friend of the founder and needs to be chased by the founder, not by ops.

The brief from the founder was blunt: "Don't build me a robot that sends cold emails. I'll lose clients. Build me something that does the typing and leaves the thinking to us."

The architecture, in one picture

We settled on a human-in-the-loop agent. It reads, drafts, waits. It never sends without approval.

Diagram of the invoice-chase agent: Xero webhook to n8n workflow to LLM draft to Gmail drafts folder to human approval to send
Xero triggers the workflow. The agent drafts into Gmail. A human approves. Only then does the email go out.

The stack, deliberately boring:

  • Xero API for the invoice state, polled every morning at 07:30 Europe/Amsterdam.
  • n8n (self-hosted, on a €12/month Hetzner box) as the orchestration layer.
  • Claude Sonnet via API for drafting. We tested GPT-4.1 and Sonnet side by side on 60 historical chase emails; Sonnet matched Eline's voice more often on the blind test, so Sonnet it was.
  • Gmail API to create drafts (not to send).
  • Postgres for the memory layer: which clients have been chased, when, and what they said back.

No vector store. No RAG. The "knowledge base" is a 340-line YAML file of per-client rules that the operations lead maintains herself. That matters — we'll come back to it.

The agent loop

Every morning the workflow pulls overdue invoices from Xero, joins them against the client rules file, and hands each overdue invoice to the drafting step as a structured prompt.

// n8n Function node: build the drafting payload
const clientRules = $json.rules[$json.clientId] || {};

return {
  json: {
    invoice: {
      number: $json.invoiceNumber,
      amount: $json.amountDue,
      currency: $json.currency,
      daysOverdue: $json.daysOverdue,
      issuedOn: $json.date,
    },
    client: {
      name: $json.contactName,
      preferredContact: clientRules.contact || $json.defaultEmail,
      tone: clientRules.tone || 'warm-professional',
      doNotChaseBefore: clientRules.doNotChaseBefore || null,
      lastContact: $json.lastChaseAt,
      chaseCount: $json.chaseCount,
    },
    history: $json.previousThread || [],
  },
};

The system prompt is where most of the work lives. It's about 900 words. It includes three real examples of Eline's actual past chase emails (anonymised), a strict rule that the agent must refuse to draft if doNotChaseBefore is in the future, and a tone ladder: chase 1 is apologetic and soft, chase 2 is direct, chase 3 escalates to the founder and the agent flags it for his review instead of Eline's.

The agent outputs JSON, not prose:

{
  "shouldDraft": true,
  "recipient": "marco@clientdomain.nl",
  "cc": [],
  "subject": "Factuur 2026-0412 — kleine herinnering",
  "body": "Hoi Marco,\n\nKleine ping...",
  "confidence": 0.88,
  "flagForHuman": false,
  "reasoning": "Second chase, warm tone per client rules. No recent reply on thread."
}

If confidence is under 0.7, or flagForHuman is true, the draft lands in a separate Gmail label called Review-first. Everything else lands in Ready-to-send. Eline clears both in the morning.

What broke on the way

The draft-vs-send problem

Our first prototype sent emails directly. We caught it in staging when the agent, working from stale Xero data, drafted a chase email to a client who had paid the day before. That was the moment the founder's "don't build me a robot" line became load-bearing architecture. We ripped out the send step and replaced it with Gmail's drafts API. Drafts sit in the inbox. A human clicks send.

Warning

If your agent touches money, email, or customers, default to drafts. The hour you save by auto-sending is not worth the one email that goes to the wrong client on the wrong day.

The tone-drift problem

By week two, Eline noticed the drafts were getting subtly worse — more generic, slightly more formal, losing the "hoi" openers she actually uses. The cause was boring: we were summarising past email threads into the prompt to save tokens, and the summaries were stripping the voice. We stopped summarising and started including the last three full emails verbatim. Drafts got good again. Token cost went up about 30%. Still under €40 a month.

The prompt-injection problem

One client replies to invoices with automated bounce-back text from their ticketing system. That text included instructions like "Please forward this to your accounts department". Our agent, helpfully, started drafting forwards to a nonexistent accounts@ address. This is the dull, everyday version of the attack surface documented in the OWASP Top 10 for LLM Applications under prompt injection — your agent doesn't need to be jailbroken by a malicious actor, it just needs to read an auto-responder and take it too literally. We added a sanitisation step that strips quoted reply bodies before they hit the prompt, and a rule that the agent can only draft to the recipient defined in the client rules file, never to an address it invents.

The numbers, honestly

Three months in:

  • Eline's weekly chase time: 6h → 22 minutes average over 12 weeks.
  • DSO: 47 days → 31 days.
  • Drafts approved without edit: 71%. Edited before sending: 24%. Deleted: 5%.
  • Agent API cost: €38/month. n8n host: €12/month. Total run cost: €50/month.
  • Build cost: three weeks of our time.

The 5% deletion rate is the number we watch most carefully. Those are the cases where the agent shouldn't have drafted at all — a client in dispute, a credit note in flight, a conversation we didn't know about. Each one goes back into the client rules YAML.

Takeaway

The agent didn't replace the operations lead. It moved her from typing to judging. That's the whole shape of useful agent work right now.

What we would do differently

We would skip n8n and write the orchestration in plain TypeScript on a cron. n8n was fast to prototype in and painful to maintain once we passed about twelve nodes. The debugging story is worse than a twenty-line script. If you already run infrastructure, don't reach for a low-code tool just because the demo is pretty.

We would also build the rules editor sooner. For the first month, Eline edited YAML directly. She is technical enough to not mind, but a simple web form would have cut the feedback loop in half. We shipped that in month two.

When we built this for the agency, the hardest part wasn't the AI agent itself — it was deciding where the human belongs in the loop. We ended up putting the human at the send button and giving the agent everything up to it. Three months later, nobody at the agency wants to go back to the spreadsheet.

If you have a weekly ritual that looks like this — export, cross-reference, type, send, repeat — spend twenty minutes today writing down every decision you make during it. That list is the spec for your agent.

Frequently asked

Why drafts instead of auto-send?+
Because one wrong chase email to a client who already paid costs you more trust than a year of saved hours earns. Drafts keep a human at the send button where it matters.
Did you use RAG or a vector database?+
No. A 340-line YAML file of per-client rules, plus the last three emails verbatim in the prompt, was enough. RAG is overkill when your knowledge base fits in a file the operations lead can edit herself.
How much did it cost to run?+
About €50 a month: €38 in Claude API usage and €12 for a small Hetzner box running n8n. Build cost was three weeks of our time.
What happened to the operations lead's job?+
She still owns the process. She now spends the saved hours on client onboarding and month-end reporting. The agent took the typing, not the judgement.
Would this work for a larger company?+
The architecture scales fine, but the rules file doesn't. Past roughly 150 active clients you need a proper rules UI and probably per-account-manager routing. Below that, YAML is fine.

Want to build something similar?

Send us one paragraph about the process that eats the most of your week. We'll reply with an honest plan — within 4h on weekdays.