go · web-scraping · architecture · firecrawl

Why I rewrote Firecrawl in Go

I got tired of deploying 8 services to self-host a web scraper. So I built a single Go binary that does HTTP-first scraping with Chrome fallback — and it's 32x faster on static pages.

Eason Liu · March 3, 2026 · 8 min read

Last year I was building an AI agent that needed to read web pages. The agent would take a URL, scrape the content, feed it to an LLM, and return a summary. Standard stuff.

I picked Firecrawl because it was the most popular option. The hosted API worked great. Then the bill hit $300/month for what was essentially a side project, so I decided to self-host.

That's when things went sideways.

The self-hosting nightmare

Firecrawl's self-hosted setup requires Redis, PostgreSQL, a Playwright service, multiple Node.js workers, and a handful of other containers. My docker-compose.yml had 8 services. On a $10/month VPS, the whole stack barely fit in memory — Redis alone wanted 200MB, and each Playwright browser instance could spike to 500MB+.

I spent a weekend debugging why jobs were silently failing. It turned out the Redis connection pool was being exhausted under concurrent requests, and the worker process was crashing without useful error messages. The community had noticed this too — someone even forked Firecrawl-Simple specifically because self-hosting was so painful.

I don't mean to bash Firecrawl. It's a great product with features I'll never match — recursive crawling, sitemap discovery, webhooks, async job queues. But for my use case (scrape one URL, get clean Markdown), it was wildly over-engineered.

So I wrote my own, in Go. I call it Purify.

Why Go?

Three reasons.

Single binary. go build produces one executable. No runtime, no package manager, no virtual environment. Download it, chmod +x, run it. Cross-compile for Linux, macOS, and Windows from my MacBook. The binary is about 15MB. Compare that to Firecrawl's Docker images totaling roughly 2GB.

Low memory. Purify idles at ~30MB of RAM. That's not a typo. A $5 VPS with 512MB RAM runs it comfortably alongside other services. Go's goroutines handle concurrency without spawning heavy OS threads, so 10 concurrent scraping requests might use 80MB total.

No dependency chain. No node_modules, no pip install, no C library linking. The Go standard library handles HTTP, HTML parsing, and JSON encoding. I pulled in chromedp for headless Chrome support and html-to-markdown for Markdown conversion. That's it.

If you've ever SSH'd into a $5 Hetzner box and wished your scraper didn't need 2GB of Docker images, this is the stack for you.

The HTTP-first architecture

Here's what made the biggest difference: most web pages don't need a headless browser.

Firecrawl and Crawl4AI spin up a Chromium instance for every request. That's the safe choice — JavaScript rendering handles everything. But it's also expensive. A headless Chrome instance takes 1-3 seconds to boot, consumes 200-500MB of RAM, and blocks on network requests for ads, tracking scripts, and third-party widgets that you're going to strip out anyway.

I checked my access logs after running for a month. About 80% of URLs my agents scraped were documentation pages, blog posts, news articles, and README files. Static content. These pages render their main content in the initial HTML response. JavaScript might add a cookie banner or load comments, but the article text is right there in the <body>.

So Purify tries HTTP first:

Request → HTTP GET (100ms)
  ├─ Got content? → Extract, return Markdown
  └─ Looks empty? → Launch Chrome, render JS (3s), extract, return Markdown

I call this progressive scraping. The HTTP path returns in about 100ms. The Chrome fallback takes 2-4 seconds. The average across a mixed workload is around 200ms because most requests never touch Chrome.

In my benchmarks against Firecrawl's self-hosted instance on the same hardware:

| Page | Purify (HTTP-first) | Firecrawl (always Chrome) |
|---|---|---|
| BBC News article | 89ms | 3,241ms |
| GitHub README | 62ms | 2,870ms |
| Hacker News | 71ms | 3,102ms |
| React SPA (JS-rendered) | 2,890ms | 3,340ms |

Static pages are 35-45x faster. JavaScript-heavy SPAs are roughly equivalent because both tools fall back to Chrome. The difference is that Purify only pays the Chrome tax when it actually needs to.

Caveat: this approach has a failure mode. Some sites serve partial content in the initial HTML and fill it in with JavaScript. Purify's heuristics catch most of these cases and fall back to Chrome, but occasionally a page slips through and you get incomplete output. I'm still improving the detection logic.

Token savings

This is the part that gets AI developers' attention. I measured output token counts with tiktoken on the same 5 URLs:

| Website | Raw HTML (tokens) | After Purify (tokens) | Savings |
|---------|----------|-------------|---------|
| GitHub README | 14,847 | 1,026 | 93% |
| BBC News | 32,591 | 1,843 | 94% |
| Wikipedia | 28,103 | 3,412 | 88% |
| Hacker News | 5,230 | 631 | 88% |
| Xiaohongshu | 51,208 | 892 | 98% |

If you're running an AI agent that reads 50 pages per task, this is the difference between spending $3,000/month and $200/month on GPT-4o input tokens. I wrote a detailed cost breakdown if you want the math.
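The back-of-envelope math is easy to reproduce. The sketch below uses the average token counts from the table above (~26,400 raw vs. ~1,560 purified) and assumes GPT-4o input pricing of $2.50 per million tokens and a volume of 45,000 pages per month; all three numbers are my assumptions, so plug in your own.

```go
package main

import "fmt"

// Assumed inputs: $2.50 per 1M GPT-4o input tokens, plus the average
// per-page token counts from the table above.
const (
	pricePerMTok = 2.50
	rawTokens    = 26400 // avg raw HTML tokens per page
	cleanTokens  = 1560  // avg tokens after extraction
)

// monthlyCost converts a page volume into dollars of input tokens.
func monthlyCost(pages, tokensPerPage int) float64 {
	return float64(pages*tokensPerPage) / 1e6 * pricePerMTok
}

func main() {
	pages := 45000 // e.g. ~30 agent tasks/day at 50 pages per task
	fmt.Printf("raw HTML: $%.0f/month\n", monthlyCost(pages, rawTokens))
	fmt.Printf("purified: $%.0f/month\n", monthlyCost(pages, cleanTokens))
}
```

At that volume the raw-HTML bill lands near $3,000/month and the purified bill under $200/month, which is where the headline numbers come from.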

The aggressive extraction comes at a cost though — Purify strips harder than Firecrawl. Sidebar content, related article links, embedded widgets, and some dynamically inserted elements get removed. For RAG pipelines and agent workflows, this is usually what you want. For cases where you need the full page context, Firecrawl or Crawl4AI might serve you better.

Built-in MCP server

Purify ships with a built-in MCP (Model Context Protocol) server. If you use Claude Desktop or Cursor, connecting takes one config file:

{
  "mcpServers": {
    "purify": {
      "command": "npx",
      "args": ["-y", "purify-mcp"],
      "env": {
        "PURIFY_API_KEY": "your-api-key"
      }
    }
  }
}

After that, your AI assistant can scrape any URL directly. Ask Claude to "read the docs at https://example.com/api" and it fetches clean Markdown through Purify, automatically. No code, no browser extensions, no copy-pasting.

Firecrawl has a community-maintained MCP server. Crawl4AI and Jina Reader don't have official ones yet. I baked MCP in from the start because I think this is how most AI tools will access the web within a year or two.

What Purify can't do

I'd rather you know the trade-offs before you invest time setting it up.

No recursive crawling. Purify scrapes one URL per request. If you need to crawl an entire documentation site following links, use Firecrawl or Crawl4AI. I might add this eventually, but it's not a priority — the single-page use case is what most AI agents need.

Smaller ecosystem. Firecrawl has 40,000+ GitHub stars, an active Discord, and years of Stack Overflow answers. Purify is new. If you hit a weird edge case, you're probably filing the first issue for it.

Over-aggressive extraction on some sites. I've seen Purify strip content that was actually relevant — embedded tweets, interactive data visualizations, content loaded through <iframe> elements. The heuristics are improving with every release, but today there are sites where Firecrawl produces better output because it keeps more of the page.

No async job queue. Firecrawl lets you submit a batch of URLs and get a webhook when they're done. Purify is synchronous — one request, one response. For large batch jobs, you'd need to build your own queue around it.
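"Build your own queue" is less scary than it sounds in Go. A client-side worker pool covers most batch jobs; this is a generic sketch, where `scrapeFn` stands in for whatever function calls your Purify instance (it is not part of Purify's API).

```go
package main

import "sync"

// scrapeAll fans a list of URLs out to a fixed number of workers and
// collects results in a map. scrapeFn is the caller's function that hits
// a Purify instance (hypothetical here); errors are left out for brevity.
func scrapeAll(urls []string, workers int, scrapeFn func(string) string) map[string]string {
	jobs := make(chan string)
	results := make(map[string]string, len(urls))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				md := scrapeFn(u)
				mu.Lock()
				results[u] = md
				mu.Unlock()
			}
		}()
	}
	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
	return results
}
```

Because goroutines are cheap, the worker count here is really a politeness limit on the target server, not a resource limit on your box.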

When to use Purify (and when not to)

Use Purify when:

  • You need to scrape individual URLs and get clean Markdown back
  • Self-hosting simplicity matters (single binary, 30MB RAM)
  • Token cost is a priority for your LLM workflows
  • You want MCP integration out of the box
  • AGPL licensing is a dealbreaker (Purify is Apache 2.0)

Use something else when:

  • You need recursive site crawling → Firecrawl
  • You're a Python shop and want a native library → Crawl4AI
  • You need zero-setup prototyping → Jina Reader (r.jina.ai/url)
  • You need built-in async job management → Firecrawl

Try it

The whole thing is open source under Apache 2.0:

# Docker (easiest)
docker run -p 8080:8080 ghcr.io/easonliuliang/purify
 
# Then scrape something
curl "http://localhost:8080/api/v1/scrape?url=https://news.ycombinator.com"

Or use the hosted API — 1,000 free requests/month, no credit card.

GitHub: github.com/Easonliuliang/purify

If you find a URL where Purify gives you garbage output, I want to know. Those edge cases are how the extraction gets better. Open an issue or email me at hello@purify.verifly.pro.

Eason Liu

Builder of Purify. Turning messy HTML into clean Markdown so AI agents can read the web.
