
How to set up an MCP server for web scraping

A step-by-step walkthrough for connecting your AI agent to the web via MCP. Works with Claude Desktop, Cursor, and any MCP-compatible client.

Eason Liu · February 3, 2026 · 6 min read

I kept copy-pasting URLs into ChatGPT and Claude, asking them to "read this page." It worked sometimes. Other times the model hallucinated the page content or told me it couldn't access URLs. The real problem: even when it worked, I was feeding it raw HTML and burning through tokens on <nav> elements and cookie banners.

MCP fixed this for me. The Model Context Protocol is an open standard that lets AI assistants call external tools — think function calling, but standardized across clients. Once you set up an MCP server for web scraping, your AI assistant can read any URL and get clean Markdown back, without you copy-pasting anything.

This guide walks through the setup. It took me about 5 minutes the first time, and about 90 seconds every time after that.

Prerequisites

You need two things:

  • Node.js 18+ — for the npx command that launches the MCP server. Check with node --version. If you don't have it, grab it from nodejs.org.
  • An MCP-compatible AI client — Claude Desktop, Cursor, or any client that supports the MCP spec.

You'll also need a Purify API key if you're using the hosted API. The free tier at purify.verifly.pro gives you 1,000 requests/month, no credit card required. If you're self-hosting Purify, you don't need a key at all.

Step 1: Find your MCP config file

The config file location depends on your client:

| Client | Config file path |
|--------|------------------|
| Claude Desktop (macOS) | ~/Library/Application Support/Claude/claude_desktop_config.json |
| Claude Desktop (Windows) | %APPDATA%\Claude\claude_desktop_config.json |
| Cursor | .cursor/mcp.json in your project root |

If the file doesn't exist yet, create it. Start with an empty JSON object: {}.
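If you'd rather script that step, here's a small sketch. The `ensure_config` helper is my own, not part of any client's tooling, and the path shown is the macOS Claude Desktop one from the table above — swap in your client's path:

```python
import json
from pathlib import Path

def ensure_config(path: Path) -> dict:
    """Create the MCP config file with an empty JSON object if it's
    missing, then return its parsed contents."""
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():
        path.write_text("{}")  # start with an empty JSON object
    return json.loads(path.read_text())

# Claude Desktop on macOS; adjust for your client (see the table above).
config = ensure_config(
    Path.home() / "Library/Application Support/Claude/claude_desktop_config.json"
)
```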

Step 2: Add the Purify MCP server

Open the config file and add:

{
  "mcpServers": {
    "purify": {
      "command": "npx",
      "args": ["-y", "purify-mcp"],
      "env": {
        "PURIFY_API_KEY": "your-api-key-here"
      }
    }
  }
}

That's the entire config. npx -y purify-mcp downloads and runs the MCP server automatically. No global install, no version management.
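One caveat: if your config file already lists other MCP servers, merge the Purify entry in rather than overwriting the whole file. A minimal sketch of that merge (the `add_purify_server` helper is hypothetical; the server entry matches the JSON above):

```python
import json
from pathlib import Path

def add_purify_server(config_path: str, api_key: str) -> None:
    """Merge the Purify MCP server entry into the config file
    without clobbering any servers already listed there."""
    path = Path(config_path)
    config = json.loads(path.read_text()) if path.exists() else {}
    config.setdefault("mcpServers", {})["purify"] = {
        "command": "npx",
        "args": ["-y", "purify-mcp"],
        "env": {"PURIFY_API_KEY": api_key},
    }
    path.write_text(json.dumps(config, indent=2))
```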

If you're running a self-hosted Purify instance instead, point it at your local endpoint:

{
  "mcpServers": {
    "purify": {
      "command": "npx",
      "args": ["-y", "purify-mcp"],
      "env": {
        "PURIFY_BASE_URL": "http://localhost:8080"
      }
    }
  }
}

Step 3: Restart your AI client

Close and reopen Claude Desktop or Cursor. The MCP server starts automatically when the client launches. You should see a tools icon or indicator confirming the server connected.

Step 4: Test it

Ask your AI assistant something like:

"Read the front page of Hacker News and give me a 3-sentence summary."

The assistant calls the Purify MCP scrape tool, fetches the page, strips the HTML, and returns clean Markdown. Your model never sees the navigation bars, ads, or script tags — just the content.

What happens under the hood

The MCP server exposes one tool: scrape. It takes a URL and returns Markdown.

The conversion isn't just tag stripping. Purify's Go-based parser identifies the main content area of the page, removes navigation, ads, scripts, and boilerplate, then converts the remaining HTML to structured Markdown. Headings, links, tables, lists, and code blocks are preserved. Everything else is gone.
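To make the idea concrete, here's a toy version of the cleanup step in Python — not Purify's actual Go parser, just an illustration of the principle. It skips boilerplate elements wholesale and keeps the text of everything else; the real parser additionally scores elements to locate the main content area and emits structured Markdown rather than plain text:

```python
from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    """Toy content extractor: ignore text inside boilerplate elements,
    collect text from everything else."""
    SKIP = {"nav", "script", "style", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside skipped elements
        self.chunks = []  # text fragments worth keeping

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

page = ("<nav>Home | About</nav><article><h1>Title</h1>"
        "<p>Body text.</p></article><script>track()</script>")
p = ContentExtractor()
p.feed(page)
print(" ".join(p.chunks))  # → Title Body text.
```

The navigation links and the tracking script never reach the output — which is exactly the token savings the next table measures.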

The token savings are significant. I ran a few comparisons using tiktoken (the GPT-4 tokenizer):

| Page | Raw HTML (tokens) | Clean Markdown (tokens) | Reduction |
|------|-------------------|-------------------------|-----------|
| BBC News article | 32,591 | 1,843 | 94% |
| GitHub README | 14,847 | 1,026 | 93% |
| Hacker News | 5,230 | 631 | 88% |

At OpenAI's GPT-4o pricing ($2.50/1M input tokens), scraping 1,000 pages with raw HTML costs about $75 in tokens alone. With clean Markdown, it's about $5. That adds up fast when you're building an agent that reads dozens of pages per conversation.
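The arithmetic behind those numbers, using round figures at roughly the BBC article's scale:

```python
PRICE_PER_M = 2.50   # GPT-4o input pricing, $ per 1M tokens
PAGES = 1_000

raw_tokens_per_page = 30_000  # roughly the raw-HTML size above
md_tokens_per_page = 2_000    # roughly the clean-Markdown size

raw_cost = PAGES * raw_tokens_per_page / 1_000_000 * PRICE_PER_M
md_cost = PAGES * md_tokens_per_page / 1_000_000 * PRICE_PER_M

print(f"raw HTML: ${raw_cost:.2f}, markdown: ${md_cost:.2f}")
# → raw HTML: $75.00, markdown: $5.00
```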

Troubleshooting

Here are the issues I ran into during setup:

"Command not found" or npx hangs. This usually means Node.js isn't on your PATH, or you're running an old version. Verify with node --version — you need 18+. On macOS, if you installed via Homebrew, make sure /opt/homebrew/bin is in your PATH.

Empty or blank responses. Some heavily client-side rendered sites (pure React SPAs with no server-side rendering) return minimal HTML. Purify handles most JavaScript-rendered pages, but if a site loads everything via client-side API calls after page load, the initial HTML might be empty. This is a limitation shared by all scraping tools — the workaround is to check if the target site has an API or RSS feed instead.

Rate limit errors. The free tier allows 1,000 requests/month. During development, I burned through this pretty fast by testing the same URLs repeatedly. Two options: self-host Purify (it's a single Go binary — download and run, no Docker needed), or upgrade to Pro ($29/mo for 50,000 requests).

MCP server doesn't appear in Claude Desktop. Make sure the JSON is valid (no trailing commas — a common mistake). Use a JSON validator if you're not sure. Also check that the config file is in the right location for your OS.
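You don't need a separate validator, either — Python's json module rejects trailing commas, which is exactly the mistake you're checking for. The `check_config_text` helper below is my own illustration:

```python
import json

def check_config_text(text: str):
    """Return None if the text is valid JSON, else a readable error."""
    try:
        json.loads(text)
        return None
    except json.JSONDecodeError as e:
        return f"line {e.lineno}, column {e.colno}: {e.msg}"

good = '{"mcpServers": {"purify": {"command": "npx"}}}'
bad = '{"mcpServers": {"purify": {"command": "npx",}}}'  # trailing comma

print(check_config_text(good))  # → None
print(check_config_text(bad))   # points at the offending position
```

From a shell, `python -m json.tool claude_desktop_config.json` does the same check in one line.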

What you can do with this

Once the MCP server is running, your AI assistant can:

  • Research: "Read these 5 documentation pages and compare their authentication approaches"
  • Summarize: "What are the top stories on TechCrunch right now?"
  • Extract: "Read this product page and list the pricing tiers"
  • Monitor: "Read this changelog page and tell me what changed since last week"

The key advantage over copy-pasting HTML or using browser extensions: the content is pre-cleaned, so your model's context window isn't wasted on junk. You can fit more pages into a single conversation and get better responses because the model only processes relevant content.

Limitations worth knowing

MCP isn't magic. A few things it can't do:

  • Pages behind login walls — the scraper can only access publicly available pages. It doesn't handle authentication or sessions.
  • CAPTCHAs — if a site shows a CAPTCHA challenge, the scrape will fail or return the CAPTCHA page.
  • Extremely dynamic content — live dashboards, real-time feeds, or content that loads via WebSocket connections won't be captured in a single scrape.

For most use cases — documentation, news articles, blog posts, product pages — it works well.

Next steps

The MCP spec is still evolving. Anthropic publishes updates at modelcontextprotocol.io. As more tools adopt MCP, the same config pattern will work for databases, APIs, file systems, and other external data sources.

Eason Liu

Builder of Purify. Turning messy HTML into clean Markdown so AI agents can read the web.
