Lead scraping: a small-team playbook for spam filters and the law

It is eleven on a Tuesday and your sales lead just dropped a spreadsheet on Slack. Two thousand and forty rows. Decision-makers at Dutch trade-construction companies between fifty and three hundred employees. The plan is to push it into a sequencer at nine the next morning and let the responses roll in.

There are two ways this ends. Either Google quietly buries your sending reputation by Thursday, or a complaint to the Autoriteit Persoonsgegevens lands the company a polite but expensive letter eight months later. Often both. We have cleaned up enough of these to write down what actually works.

The legal frame in plain English

The most useful thing to know about cold B2B outreach in the EU is that it is not, by default, illegal. The bad-faith version (scraping LinkedIn at scale, buying lists from a Telegram channel, mailing personal Gmail addresses from a leaked breach) gives the whole category a bad name. Real cold outreach to a named business contact, for a relevant purpose, can be lawful under Article 6(1)(f) of the GDPR, the "legitimate interests" basis.

The catch is that legitimate interest is not a free pass. It is a balancing test. You need a clear purpose, a reasonable expectation on the recipient's side, and a documented assessment that their rights do not override yours. The Dutch DPA's guidance on direct marketing is the clearest summary we have found in any member state. Read it once a year and keep a one-page Legitimate Interest Assessment per campaign in your repo, next to the campaign config.
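
One way to keep that assessment next to the campaign config is as a small structured record. The sketch below is only an illustration: the class name, the field names and the JSON layout are our own assumptions, not a format prescribed by the GDPR or by any regulator.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class LegitimateInterestAssessment:
    """One-page LIA for a single campaign. Field names are illustrative."""
    campaign: str
    purpose: str
    recipient_expectation: str
    balancing_outcome: str
    data_fields: list[str]
    assessed_on: date
    assessed_by: str

    def missing(self) -> list[str]:
        """Return the names of fields that are still blank."""
        blanks = []
        for name, value in asdict(self).items():
            if isinstance(value, str) and not value.strip():
                blanks.append(name)
            elif isinstance(value, list) and not value:
                blanks.append(name)
        return blanks

    def to_json(self) -> str:
        """Serialise for committing next to the campaign config."""
        return json.dumps(asdict(self), default=str, indent=2)
```

A pre-send check can then refuse to queue a campaign while `missing()` returns anything.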

Two layers sit on top of GDPR. The ePrivacy Directive (article 13) and the Dutch Telecommunicatiewet article 11.7 govern unsolicited electronic communication specifically. For B2B mail to a business address (info@, sales@, or a clearly business mailbox like name@companydomain.nl) the Dutch rule is opt-out: you may send, but you must offer a frictionless unsubscribe and honour it permanently. For personal addresses (Gmail, Hotmail, Outlook) it is hard opt-in. No exceptions, no clever workarounds.
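
The business-versus-personal split above can be applied as a first-pass filter before a list reaches the sequencer. This is a rough sketch: the free-mail domain set is a short illustrative sample that you would need to extend, and the function name is our own. It sorts addresses by domain only and is not a legal determination.

```python
# Illustrative sample of consumer mailbox domains; extend for your market.
FREE_MAIL_DOMAINS = {
    "gmail.com", "googlemail.com", "hotmail.com", "outlook.com",
    "live.nl", "yahoo.com", "icloud.com",
}


def consent_regime(address: str) -> str:
    """Return "opt-in" for consumer mailboxes, "opt-out" for business domains."""
    local, sep, domain = address.strip().lower().rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    return "opt-in" if domain in FREE_MAIL_DOMAINS else "opt-out"
```

Anything that comes back as "opt-in" without a recorded consent should be dropped from the cold list rather than sent.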

Public does not mean free

"It is on their website" is not a legal basis. The CJEU ruling in Ryanair v PR Aviation (C-30/14) made it clear that a website's terms of use can bind a scraper even where database-right protection does not apply. Most sites you want to scrape have terms that forbid automated collection. That does not automatically make your scraper criminal, but it changes the conversation when something goes wrong.

The practical filter we use is this. Public registries (KVK in the Netherlands, Companies House in the UK, the Belgian KBO/BCE) exist to be queried. Trade-association directories with a robots.txt-allowed listing page are usually fine for small-volume use. Personal social profiles and gated B2B platforms are not. If a site requires you to be logged in to see the data, you are no longer scraping public data. You are circumventing access control, and the conversation moves from GDPR into computervredebreuk territory under Article 138ab of the Dutch Criminal Code (Wetboek van Strafrecht). Stop there.

A scraping budget you can defend

Most scrapers get blocked because they look like scrapers. They send one request every 200 milliseconds, use the default python-requests user agent, ignore robots.txt, and fetch the same URL pattern in a flat sequence. Even modest WAFs catch this in the first hundred requests.

The frame we hand new operators is a "scraping budget". You decide, upfront, the maximum hourly footprint you will allow yourself on any single domain. For most lead work, fifteen pages per hour per domain is plenty, and looks like a curious researcher rather than a bot. Anything more aggressive needs a written justification, a contact email in the user agent, and ideally an out-of-band note to the site owner explaining what you are doing.

import asyncio
import functools
import random
import urllib.robotparser

from playwright.async_api import async_playwright

UA = "ABN-Research/1.0 (+https://abn.company/contact; research@abn.company)"
BASE = "https://www.example.nl"

@functools.lru_cache(maxsize=1)
def robots() -> urllib.robotparser.RobotFileParser:
    # Fetch and parse robots.txt once per run instead of once per URL.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{BASE}/robots.txt")
    rp.read()
    return rp

def allowed(url: str) -> bool:
    return robots().can_fetch(UA, url)

async def polite_fetch(page, url: str):
    if not allowed(url):
        return None
    # 240-360 seconds between requests, jittered: about 12 fetches/hour on
    # average and never more than the 15/hour budget.
    await asyncio.sleep(random.uniform(240, 360))
    resp = await page.goto(url, wait_until="domcontentloaded")
    return await page.content() if resp and resp.ok else None

async def crawl(urls: list[str]):
    # Async generator: consume with `async for url, html in crawl(urls)`.
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            ctx = await browser.new_context(user_agent=UA)
            page = await ctx.new_page()
            for url in urls:
                html = await polite_fetch(page, url)
                if html:
                    yield url, html
        finally:
            # Close the browser even if the consumer stops early or a fetch fails.
            await browser.close()

Three details matter more than they look. The user agent identifies you and gives a real contact address that a curious sysadmin can actually reach. The robots.txt check is not optional. It is your written evidence of good faith if anyone asks later. The sleep is long. Fifteen pages an hour feels painful on day one and obvious on day thirty.
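
A sleep keeps one crawl loop slow, but the budget itself is a per-domain ceiling that should hold across every job touching that domain. A minimal way to enforce it, assuming a single process and an in-memory sliding window (the class and method names here are hypothetical), might look like this:

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit


class ScrapingBudget:
    """Sliding-window cap on fetches per host. In-memory, single process only."""

    def __init__(self, max_per_hour: int = 15, window_s: float = 3600.0,
                 clock=time.monotonic):
        self.max_per_hour = max_per_hour
        self.window_s = window_s
        self.clock = clock  # injectable so the window can be tested
        self._hits = defaultdict(deque)

    def try_spend(self, url: str) -> bool:
        """Record one fetch and return True, or return False if the host is over budget."""
        host = urlsplit(url).hostname or ""
        now = self.clock()
        hits = self._hits[host]
        # Drop fetches that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.max_per_hour:
            return False
        hits.append(now)
        return True
```

Calling `try_spend(url)` before each fetch, and skipping or deferring the URL when it returns False, keeps the footprint defensible even when the URL list mixes several domains.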

Enrich from public registries, not from people

The most defensible enrichment source for Dutch leads is the KVK Open Data API. Every Dutch company has a public KVK number, a registered address, an SBI sector code, and (for many entries) the directors' names and a registered contact address. You can match a website to a KVK record, pull the official contact details, and skip the personal-profile scraping entirely. The same is true of Companies House in the UK and the BCE/KBO in Belgium. None of these need scraping. They have endpoints designed to be queried.

A working pattern: scrape the listing page once to get a company name and URL. Resolve the company to a registry ID. Enrich everything else from the API. You end up with a single scraping touch per company, a complete audit trail of where each field came from, and a list of records that map back to a public document a regulator can also pull up.
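
The audit trail in that pattern is easiest to keep if every field carries its own source. The sketch below assumes nothing about any registry's API. The class names, the source labels and the placeholder registry ID are all illustrative.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class FieldSource:
    value: str
    source: str          # our own label, e.g. "listing-page" or "registry-api"
    reference: str       # URL or registry ID the value traces back to
    fetched_at: datetime


@dataclass
class LeadRecord:
    registry_id: str
    fields: dict[str, FieldSource] = field(default_factory=dict)

    def record(self, name, value, source, reference, fetched_at=None):
        """Store a field together with where and when it was obtained."""
        when = fetched_at or datetime.now(timezone.utc)
        self.fields[name] = FieldSource(value, source, reference, when)

    def audit_trail(self) -> list[str]:
        """One human-readable line per field, sorted by field name."""
        return [
            f"{name}: {fs.value!r} from {fs.source} ({fs.reference}) "
            f"at {fs.fetched_at.isoformat()}"
            for name, fs in sorted(self.fields.items())
        ]
```

With this shape, the single scraping touch shows up as one "listing-page" entry per company and everything else points at a registry reference.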

The deliverability layer most teams skip

Since February 2024, Google's and Yahoo's bulk sender rules have made deliverability almost a separate problem from the legal layer. If you send 5,000 or more messages per day to Gmail recipients and your domain does not pass SPF, DKIM and DMARC alignment, you are silently being filed under "promotions" at best and "spam" at worst. Google's bulk sender requirements are now the floor, not the ceiling.

The minimum we set up before any list goes out:

A dedicated sending subdomain (mail.companydomain.nl), never the root domain.
An SPF record listing only the sending IPs you actually use.
DKIM with a 2048-bit key, rotated yearly.
A DMARC policy at p=quarantine minimum, with rua reports going to a real mailbox someone reads weekly.
A one-click List-Unsubscribe header following RFC 8058, not just a footer link.
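
The DMARC item in the checklist above can be checked mechanically once you have the TXT record in hand (published at `_dmarc.` plus the sending domain). The sketch below only parses a record string you paste in. It does no DNS lookup, and the function names are our own.

```python
def parse_dmarc(record: str) -> dict[str, str]:
    """Split a DMARC TXT record into its tag=value pairs."""
    tags = {}
    for part in record.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:
            tags[key.strip().lower()] = value.strip()
    return tags


def meets_minimum(record: str) -> bool:
    """True if the policy is quarantine or stricter and reports go to a mailbox."""
    tags = parse_dmarc(record)
    return (
        tags.get("v") == "DMARC1"
        and tags.get("p", "").lower() in {"quarantine", "reject"}
        and tags.get("rua", "").lower().startswith("mailto:")
    )
```

For example, `meets_minimum("v=DMARC1; p=none; rua=mailto:dmarc@mail.companydomain.nl")` returns False because the policy is weaker than quarantine.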

The List-Unsubscribe header is the one most teams forget. Without it, the "report spam" button becomes the only feedback signal Gmail has, and that signal is fatal. We run the suppression list and the send queue as a durable workflow on Postgres so an unsubscribe propagates in seconds, not in the next daily sync.

Warmup and batch sizes that match reality

A fresh subdomain cannot send two thousand messages on day one. It cannot send two hundred. The warmup curve we use, based on what survives Gmail's reputation grader without tripping it, looks roughly like this: week one, twenty messages per day to engaged recipients only. Week two, fifty per day. Week three, one hundred and fifty. Week four, three hundred. Hold there for two weeks, then ramp by fifty per week.
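
The curve above is simple enough to encode as a daily cap that the send queue checks. This sketch reads "hold there for two weeks" as two further weeks at three hundred (weeks five and six) before the ramp resumes in week seven. That reading, and the function name, are our assumptions.

```python
def daily_cap(week: int) -> int:
    """Maximum messages per day for a given warmup week (week 1 is the first)."""
    if week < 1:
        raise ValueError("week numbers start at 1")
    ramp = {1: 20, 2: 50, 3: 150, 4: 300}
    if week in ramp:
        return ramp[week]
    if week <= 6:
        # Assumed hold period: two more weeks at the week-four level.
        return 300
    # After the hold, add fifty per day for each further week.
    return 300 + 50 * (week - 6)
```

Under that reading, week seven allows 350 per day and week eight 400.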

"Engaged recipients only" during warmup means seed accounts you control plus a small number of existing customers who have opted in to occasional product updates. Mixing in cold leads during the first ten days is the fastest way to torch a domain. Two weeks of patience here saves a month of remediation later.

Takeaway

A list of two thousand cold contacts is a six-week sending plan, not a next-morning blast. Anyone selling you otherwise is selling you the next domain.

Three rules the message itself must follow

None of these is about clever copy. First, under GDPR Article 14, you have to tell the recipient how you obtained their details and what you intend to do with them. Two clear sentences in the footer are enough, but they have to be there. Second, provide a working unsubscribe that processes immediately, not "within thirty days". Third, keep a single suppression list across all campaigns, not one per campaign. Re-mailing someone who opted out is the most common reason a Dutch DPA complaint becomes an actual fine.
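
The single suppression list is the rule that most often breaks in practice, usually because each campaign keeps its own copy. A minimal in-memory sketch of the idea follows. The class is hypothetical, and a real one would sit on shared storage so every campaign reads the same list.

```python
class SuppressionList:
    """One global opt-out list shared by every campaign."""

    def __init__(self) -> None:
        self._blocked: set[str] = set()

    @staticmethod
    def _normalise(address: str) -> str:
        # Lower-casing may over-match on case-sensitive local parts,
        # which is the safe direction for a suppression list.
        return address.strip().lower()

    def add(self, address: str) -> None:
        self._blocked.add(self._normalise(address))

    def is_suppressed(self, address: str) -> bool:
        return self._normalise(address) in self._blocked

    def filter(self, recipients: list[str]) -> list[str]:
        """Return only the recipients who have not opted out."""
        return [r for r in recipients if not self.is_suppressed(r)]
```

Every send, in every campaign, passes its recipient list through `filter()` immediately before queuing rather than at list-build time.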

Personalisation matters for response rate, but it matters more for the legal position. A message that references the recipient's role at their actual company, with one sentence about why you reached out to them specifically, looks like considered outreach to both a human and a spam classifier. A mail-merge to "Hi {first_name}, I noticed your company..." looks like exactly what it is.

Replying to provenance requests in one paragraph

It will happen, maybe three times per thousand sent. Somebody will reply and ask, politely or not, where you got their address. The reply you give in the first hour is the difference between a closed loop and a formal Article 15 access request that costs you a day of legal time. Have the answer pre-written:

"We found your contact details on your company's public website on 14 March 2026. We sent one outreach about [topic] because [reason in one sentence]. We have removed your address from all our lists. If you want to see what we still hold or have it deleted, reply to this address and we will action it within 72 hours."

Keep that template in your shared docs and link to your retention policy. The cost of writing it once is fifteen minutes. The cost of not having it is a half-day scramble per request, plus the awkward email to legal that nobody enjoys sending.
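
If the provenance fields are already stored per contact, the reply can be filled in rather than retyped. The sketch below mirrors the wording of the template above. The function name and its parameters are illustrative, and month names are hard-coded so the output does not depend on the machine's locale.

```python
from datetime import date

MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def provenance_reply(found_on: date, source: str, topic: str, reason: str) -> str:
    """Fill the provenance reply from stored contact records."""
    when = f"{found_on.day} {MONTHS[found_on.month - 1]} {found_on.year}"
    return (
        f"We found your contact details on {source} on {when}. "
        f"We sent one outreach about {topic} because {reason}. "
        "We have removed your address from all our lists. "
        "If you want to see what we still hold or have it deleted, "
        "reply to this address and we will action it within 72 hours."
    )
```

If any of the four arguments cannot be filled from your records, that gap is the real finding.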

The five-minute audit you can do today

Open your last cold campaign and check three things. First, run your sending domain through MXToolbox and confirm SPF, DKIM and DMARC all pass alignment. Second, open the raw source of one sent message and search for "List-Unsubscribe-Post". If the header is missing, fix that this week, not next quarter. Third, pick one recipient at random and try to reconstruct, from your records alone, where their address came from, on what date, and why you decided they were a reasonable target. If you cannot, you do not have a defensible list. You have a liability that happens to bring in some replies.
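
The second check can be scripted against the raw source of a sent message. This is a small sketch using the standard library parser. The function name and the keys of the result are our own, and it only inspects headers, not whether the unsubscribe endpoint actually works.

```python
from email import message_from_string


def audit_unsubscribe_headers(raw: str) -> dict[str, bool]:
    """Check a raw message for the headers one-click unsubscribe relies on."""
    msg = message_from_string(raw)
    list_unsub = msg.get("List-Unsubscribe", "")
    post = msg.get("List-Unsubscribe-Post", "")
    return {
        "has_list_unsubscribe": bool(list_unsub.strip()),
        "has_https_target": "<https://" in list_unsub.lower(),
        "has_one_click_post": post.strip().lower() == "list-unsubscribe=one-click",
    }
```

Paste the raw source of one message per campaign into it. Any False in the result is this week's fix.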

When we built the lead pipeline for a Rotterdam staffing client last quarter, this third check was the one that took the longest to fix. The deliverability work was straightforward. The hard part was rebuilding the provenance trail for thirty thousand existing contacts, which is the kind of cleanup we do under our process automation work. Better to start with the trail in place than to reconstruct it under deadline.

Frequently asked

Is cold B2B email legal in the Netherlands?

Yes, to business addresses under GDPR legitimate interest, with a working opt-out and clear provenance. To personal addresses like Gmail or Hotmail it is strict opt-in, no exceptions.

Can I scrape LinkedIn for B2B leads?

No. LinkedIn sits behind a login wall, which moves the question from GDPR consent into access-control circumvention under EU and Dutch computer-misuse law. Use public registries instead.

What warmup schedule should a new sending domain use?

Start at twenty emails per day to engaged recipients only. Step up weekly (fifty, one hundred and fifty, three hundred per day), hold at three hundred for two weeks, then ramp by fifty per week. Cold leads only enter after week two.

What is the minimum DMARC policy for cold outreach?

p=quarantine on a dedicated sending subdomain, with SPF, 2048-bit DKIM, and a one-click List-Unsubscribe header per RFC 8058. Google's bulk-sender rules accept p=none as the formal minimum, but we treat p=quarantine as the working floor for cold outreach.

How do I prove legitimate interest if a regulator asks?

Keep a one-page Legitimate Interest Assessment per campaign documenting purpose, balancing test against recipient rights, and data minimisation. Store it next to the campaign config, not on a personal drive.
