How to reduce AI token costs when scraping the web
Most of what your LLM processes from a web page is useless HTML. Here's how much you can save by cleaning it up first, with real numbers from 5 popular sites.
Last month I noticed one of my agent workflows was burning through $400/week in OpenAI API credits. The agent reads 20-30 web pages per task, feeds them into GPT-4o for analysis, then generates a report. Nothing fancy.
The problem wasn't the model or the prompts. It was the input. I was feeding raw HTML into the model — every <nav>, every <script>, every cookie consent banner. The actual content was maybe 5% of what the model processed. The other 95% was garbage that the model dutifully read, charged me for, and then ignored.
After switching to clean Markdown input, the same workflow costs about $30/week. Same output quality. Here's what I learned.
The benchmark: 5 popular websites
I scraped 5 sites and measured token counts with tiktoken, using the cl100k_base encoding (GPT-4 tokenizer). "Raw" is the full HTML response. "Clean" is the extracted Markdown content.
| Website | Raw HTML | Clean Markdown | Savings |
|---------|----------|----------------|---------|
| GitHub README | 14,847 tokens | 1,026 tokens | 93.1% |
| BBC News article | 32,591 | 1,843 | 94.3% |
| Wikipedia page | 28,103 | 3,412 | 87.9% |
| Hacker News front page | 5,230 | 631 | 87.9% |
| Xiaohongshu post | 51,208 | 892 | 98.3% |
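The measurement itself is only a few lines. The sketch below uses the common ~4 characters ≈ 1 token heuristic so it runs with no dependencies; the benchmark numbers above come from tiktoken's exact cl100k_base counts, and the sample strings are made up for illustration.

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # For exact counts, use tiktoken's cl100k_base encoding instead.
    return max(1, len(text) // 4)

def token_savings(raw_html: str, clean_md: str) -> float:
    """Fraction of input tokens saved by sending Markdown instead of HTML."""
    return 1 - approx_tokens(clean_md) / approx_tokens(raw_html)

raw = ("<html><head><script>/* hydration blob */</script></head>"
       "<body><nav>Home | About</nav>"
       "<article>The actual content.</article></body></html>")
clean = "The actual content."
print(f"{token_savings(raw, clean):.0%} saved")
```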
The Xiaohongshu result surprised me. 51,000 tokens of HTML for a post that's basically a few paragraphs of text and some images. All that bloat comes from inline styles, tracking scripts, and React hydration data that gets baked into the server-rendered HTML.
Hacker News, which is famously minimal, still has an 88% waste rate. Even a page with almost no styling carries enough HTML structure to inflate the token count roughly eightfold (5,230 raw vs. 631 clean).
What this actually costs
I'll use current API pricing as of February 2026:
- GPT-4o: $2.50/1M input tokens
- GPT-4o-mini: $0.15/1M input tokens
- Claude 3.5 Sonnet: $3.00/1M input tokens (per Anthropic's pricing)
For an agent processing 50,000 pages/month (the Pro tier limit):
| | Raw HTML | Clean Markdown | Savings |
|---|---|---|---|
| Avg tokens/page | ~25,000 | ~1,500 | 94% |
| Total tokens/mo | 1.25B | 75M | |
| GPT-4o cost | $3,125/mo | $187/mo | $2,938 |
| GPT-4o-mini cost | $187/mo | $11/mo | $176 |
| Claude 3.5 cost | $3,750/mo | $225/mo | $3,525 |
Add the Purify Pro subscription ($29/mo) to the clean column. You're still saving thousands.
Even on the cheap GPT-4o-mini model, cleaning saves $176/month. On Claude 3.5 Sonnet, it saves $3,525/month. The ROI is absurd.
One thing I want to be honest about: these are best-case numbers. They assume every page has the same waste ratio as my benchmark set. In practice, some pages (especially simple APIs or text-heavy blogs) have less waste. The savings might be 70-80% instead of 90%+. Still worth it, just don't expect 98% on every URL.
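The monthly figures are just pages × tokens per page × price per million tokens, so you can sanity-check them for your own volume. A quick check using the 50,000-page volume and the GPT-4o price quoted above:

```python
def monthly_input_cost(pages: int, tokens_per_page: int, usd_per_million: float) -> float:
    # Input-token spend per month, in USD.
    return pages * tokens_per_page * usd_per_million / 1_000_000

PAGES = 50_000  # the Pro tier limit used in the table

raw_4o = monthly_input_cost(PAGES, 25_000, 2.50)    # GPT-4o on raw HTML
clean_4o = monthly_input_cost(PAGES, 1_500, 2.50)   # GPT-4o on clean Markdown
print(raw_4o, clean_4o, raw_4o - clean_4o)  # 3125.0 187.5 2937.5
```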
Where the tokens go
To understand why the waste is so bad, I broke down a typical news article page:
```
Navigation menu + header        ~3,000 tokens
Inline CSS / <style> blocks     ~4,000 tokens
JavaScript / <script> blocks    ~8,000 tokens
Sidebar (related articles)      ~2,500 tokens
Footer + legal links            ~1,500 tokens
Ad slots and tracking pixels    ~2,000 tokens
Cookie consent + popups         ~1,200 tokens
──────────────────────────────────────────────
Non-content total              ~22,200 tokens (85%)
Actual article text             ~3,800 tokens (15%)
```
Your LLM reads all 26,000 tokens. It has no way to skip the navigation or ignore the JavaScript. It processes everything sequentially, and you pay for every token.
How content extraction works
A good scraper doesn't just strip HTML tags. I tried that early on with BeautifulSoup's .get_text() — the output was an unreadable mess. You lose all structure: headings become inline text, tables become comma-separated gibberish, code blocks lose their formatting.
What you need is intelligent extraction:
- DOM analysis — identify the main content container (usually an `<article>`, `<main>`, or the largest text-heavy `<div>`)
- Noise removal — strip `<nav>`, `<footer>`, `<aside>`, `<script>`, `<style>`, and known ad/tracking patterns
- Markdown conversion — convert remaining HTML to structured Markdown while preserving headings, lists, tables, links, and code blocks
- Metadata extraction — pull the page title, author, and publication date when available
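To make the noise-removal step concrete, here's a toy sketch using Python's stdlib HTML parser. This is illustrative only, not Purify's code (which is Go), and it only covers step 2 — it flattens the surviving text rather than converting it to structured Markdown:

```python
from html.parser import HTMLParser

NOISE = {"nav", "footer", "aside", "script", "style", "header"}

class NoiseStripper(HTMLParser):
    """Collects text while skipping anything inside noise elements."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside noise tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def strip_noise(html: str) -> str:
    p = NoiseStripper()
    p.feed(html)
    return "\n".join(p.chunks)

html_doc = ("<body><nav>Home | About</nav>"
            "<article><h1>Title</h1><p>Body text.</p></article>"
            "<footer>© 2026</footer></body>")
print(strip_noise(html_doc))  # Title\nBody text.
```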
Purify does this in Go with zero external dependencies. The source is on GitHub under Apache 2.0. You can also use the hosted API:
```bash
curl "https://purify.verifly.pro/api/v1/scrape?url=https://news.ycombinator.com" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

Other tools that do similar extraction: Firecrawl (TypeScript, AGPL), Crawl4AI (Python, Apache 2.0), and Jina Reader (hosted API). I compared them in detail in the Firecrawl alternatives guide.
Why this matters beyond cost
Token savings aren't just about your API bill. Three other things improve when you clean your input:
Response latency drops. An LLM processing 2,000 tokens generates its first output token noticeably faster than one processing 50,000 tokens. For agent workflows that chain multiple web reads, this compounds. I measured a 3-step research agent going from ~45 seconds total to ~12 seconds after switching to clean input.
Output quality improves. When your context window is full of navigation links and ad copy, the model sometimes references that noise in its output. I've had GPT-4 include "Related Articles" and "Cookie Settings" in its summaries because that text was in the context. Clean input eliminates this.
Effective context window expands. If you're using a 128k context window and filling it with raw HTML, you fit maybe 3-5 pages before hitting the limit. With clean Markdown, the same window holds 50+ pages. That's the difference between an agent that can compare 3 sources and one that can synthesize 50.
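The page-count arithmetic behind that claim, using a 128k window and the benchmark averages of ~25,000 raw vs. ~1,500 clean tokens per page:

```python
CONTEXT_WINDOW = 128_000

print(CONTEXT_WINDOW // 25_000)  # 5  raw-HTML pages fit
print(CONTEXT_WINDOW // 1_500)   # 85 clean-Markdown pages fit
```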
Quick implementation
If you want the fastest path to savings:
Option A: MCP server (zero code, works with Claude/Cursor)
Add one config file and your AI assistant can scrape any URL. Setup guide here.
Option B: API call (any language)
```python
import requests

def clean_scrape(url: str) -> str:
    resp = requests.get(
        "https://purify.verifly.pro/api/v1/scrape",
        params={"url": url},
        headers={"Authorization": "Bearer YOUR_API_KEY"},
    )
    resp.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
    return resp.json()["markdown"]

# Before: feed raw HTML to your LLM (25,000 tokens)
# After: feed clean Markdown (1,500 tokens)
```

Option C: Self-host (unlimited, no per-request cost)
Download the Purify binary from GitHub releases, run it, and point your code at http://localhost:8080. No API key, no usage limits. Costs you ~$5/month on the smallest DigitalOcean or Hetzner VPS.
What I'd do differently
If I were starting over with my agent project, I'd clean the web input from day one. I spent two months optimizing prompts and trying different models before realizing the input was the problem. The single biggest cost reduction wasn't switching from GPT-4 to GPT-4o-mini — it was cleaning the HTML before sending it to any model.
The token cost is the most visible benefit, but the quality improvement is what actually matters. Clean input, clean output.
Eason Liu
Builder of Purify. Turning messy HTML into clean Markdown so AI agents can read the web.