
Web scraping for RAG: a complete guide

How to build a web scraping pipeline that feeds clean data into your RAG system. Covers content extraction, chunking strategies, embedding, and the mistakes that tank retrieval quality.

Eason Liu · March 3, 2026 · 9 min read

I spent three weeks debugging a RAG system that gave terrible answers. The retrieval scores looked fine — the right chunks were being found. The generation model was GPT-4o, plenty capable. But the answers were noisy, sometimes referencing "Cookie Settings" or "Related Articles" in the middle of a technical explanation.

The problem was upstream. I was scraping documentation pages and embedding the raw HTML. My vector database was full of navigation menus, sidebar links, and footer text. The retriever would match these fragments because they contained relevant keywords ("API", "documentation", "reference") even though they weren't actual documentation content.

Cleaning the HTML before embedding fixed it. This guide covers everything I learned about building a web scraping pipeline for RAG — the parts that worked and the parts that didn't.

Why RAG scraping is different from regular scraping

In traditional scraping, you extract specific data points: prices, emails, product names. Precision matters, coverage doesn't. You define selectors and pull exactly what you need.

RAG scraping is the opposite. You need all the meaningful content from a page — full paragraphs, headings, lists, tables, code blocks — in a format that produces good embeddings. You don't want specific data points. You want the complete knowledge on the page, without the noise.

This distinction matters because it changes which tools work well. A tool optimized for data extraction (like Scrapy or Beautiful Soup with custom selectors) doesn't give you what RAG needs. You need content extraction: the full article text in structured Markdown.

The pipeline, step by step

Step 1: Source discovery

Before scraping, figure out which URLs to include. Three approaches:

Sitemap parsing is the cleanest. Most sites publish a /sitemap.xml file that lists all pages. Parse it, filter to the URL patterns you care about (maybe just /docs/* or /blog/*), and you have your list.

import requests
import xml.etree.ElementTree as ET
 
def get_sitemap_urls(sitemap_url: str) -> list[str]:
    resp = requests.get(sitemap_url, timeout=30)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    ns = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in root.findall(".//ns:loc", ns)]
 
urls = get_sitemap_urls("https://docs.example.com/sitemap.xml")
# Filter to just the pages you want
doc_urls = [u for u in urls if "/docs/" in u]

Recursive crawling works when there's no sitemap. Start from a root URL, follow internal links, stop at a depth limit. Firecrawl and Crawl4AI handle this well. The downside: you might crawl pages you don't need, and you might miss pages that aren't linked from your starting point.
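If you want to see what a depth-limited crawl involves before reaching for a tool, here is a minimal sketch using only `requests` and the standard library. The `LinkParser` class and `crawl` function are illustrative names, not part of any library; a production crawler would also respect robots.txt and rate limits.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

import requests

class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(root_url: str, max_depth: int = 2) -> set[str]:
    """Breadth-first crawl of same-domain links, stopping at max_depth."""
    domain = urlparse(root_url).netloc
    seen = {root_url}
    queue = deque([(root_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue
        try:
            resp = requests.get(url, timeout=30)
        except requests.RequestException:
            continue
        parser = LinkParser()
        parser.feed(resp.text)
        for href in parser.links:
            absolute = urljoin(url, href).split("#")[0]
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return seen
```

The same-domain check is what keeps the crawl from wandering off into external links; the depth limit is what keeps it from crawling the entire site.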

Manual curation sounds crude, but for small corpora (under 100 pages) it gives you the most control. Paste the URLs into a text file and iterate. I've built several production RAG systems that started this way.

Step 2: Scrape and clean

This is where most RAG pipelines fail quietly. The difference between embedding raw HTML and embedding clean Markdown is massive:

| | Raw HTML | Clean Markdown |
|---|---|---|
| Tokens per page | ~25,000 | ~1,500 |
| Noise in vectors | ~85-95% | ~0% |
| Retrieval quality | Poor (matches noise) | Good (matches content) |

Here's the cleaning step:

import requests
 
def scrape_clean(url: str, api_key: str) -> str:
    """Scrape a URL and return clean Markdown."""
    resp = requests.get(
        "https://purify.verifly.pro/api/v1/scrape",
        params={"url": url},
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["markdown"]

The output preserves structure that matters for RAG:

# Authentication
 
All API requests require a Bearer token in the Authorization header.
 
## Getting a token
 
1. Sign up at dashboard.example.com
2. Navigate to Settings → API Keys
3. Click "Create new key"
 
## Token formats
 
| Type | Prefix | Expiry |
|------|--------|--------|
| Test | test_  | 24 hours |
| Live | live_  | Never |

Headings, lists, tables, and code blocks are preserved. Navigation, ads, and JavaScript are gone. When you embed this, every vector represents actual documentation content.

Other tools that do similar extraction: Firecrawl, Crawl4AI, Jina Reader. I compared them in the Firecrawl alternatives guide.

Step 3: Chunk the content

Chunking strategy has a bigger impact on retrieval quality than most people realize. I've tested several approaches and here's what I've settled on:

Heading-based chunking works best for structured content (documentation, tutorials, guides). Split on ## boundaries so each chunk covers one topic.

def chunk_by_heading(markdown: str, source_url: str) -> list[dict]:
    """Split markdown by H2 headings, preserving metadata."""
    chunks = []
    current_lines = []
    current_heading = "Introduction"
 
    for line in markdown.split("\n"):
        # Start a new chunk at every H2, including one on the first line
        if line.startswith("## "):
            if current_lines:
                chunks.append({
                    "text": "\n".join(current_lines),
                    "heading": current_heading,
                    "source": source_url,
                })
            current_heading = line.lstrip("#").strip()
            current_lines = [line]
        else:
            current_lines.append(line)
 
    if current_lines:
        chunks.append({
            "text": "\n".join(current_lines),
            "heading": current_heading,
            "source": source_url,
        })
 
    return chunks

Fixed-size chunking with overlap is simpler and works for long-form content where headings are sparse. I use 800 tokens per chunk with 100-token overlap. The overlap prevents cutting a sentence or concept in half.
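The sliding-window logic can be sketched tokenizer-agnostically: encode the page with your tokenizer of choice (tiktoken's `cl100k_base` is the usual pairing with OpenAI embeddings), window the token IDs, and decode each window back to text before embedding. The `window_tokens` name is mine, not a library function.

```python
from typing import TypeVar

T = TypeVar("T")

def window_tokens(tokens: list[T], chunk_size: int = 800, overlap: int = 100) -> list[list[T]]:
    """Slide a fixed-size window over a token sequence with overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return windows
```

With the defaults, each window starts 700 tokens after the last one, so every chunk shares its first 100 tokens with the tail of the previous chunk.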

What I learned the hard way about chunk size:

Chunks over 1,500 tokens hurt retrieval. The embedding captures the "average" semantic meaning of the chunk, and when a chunk covers multiple topics, that average doesn't match any specific query well. The retriever finds the chunk, but half the content is irrelevant, and the LLM has to filter through noise.

Chunks under 200 tokens lose context. A single sentence like "The rate limit is 100 requests per minute" embeds well, but without the surrounding context about which API and which plan, the LLM can't give a useful answer.

800-1,000 tokens is my sweet spot. Specific enough to match queries, long enough to provide context.

Step 4: Embed and store

from openai import OpenAI
 
client = OpenAI()
 
def embed_and_store(chunks: list[dict]):
    """Embed chunks and store with metadata."""
    for chunk in chunks:
        embedding = client.embeddings.create(
            model="text-embedding-3-small",
            input=chunk["text"]
        ).data[0].embedding
 
        # Store in your vector database
        # (Pinecone, Weaviate, Qdrant, Chroma, pgvector, etc.)
        # vector_db.upsert({
        #     "id": f"{chunk['source']}_{chunk['heading']}",
        #     "vector": embedding,
        #     "metadata": {
        #         "text": chunk["text"],
        #         "source": chunk["source"],
        #         "heading": chunk["heading"],
        #     }
        # })

Always store metadata. Source URL, section heading, and the chunk text itself. You need all three for:

  • Citations: showing users where the answer came from
  • Debugging: when the RAG gives wrong answers, you need to find which chunk was retrieved
  • Deduplication: checking if a page has already been indexed
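For deduplication, a deterministic chunk ID built from that metadata does most of the work. A minimal sketch (the `chunk_id` helper is illustrative, assuming the chunk dicts produced earlier): the (source, heading) pair locates the section, and a content hash detects when that section has been edited and needs re-embedding.

```python
import hashlib

def chunk_id(chunk: dict) -> str:
    """Deterministic ID: source URL + heading, plus a hash of the text."""
    key = f"{chunk['source']}#{chunk['heading']}"
    content_hash = hashlib.sha256(chunk["text"].encode("utf-8")).hexdigest()[:12]
    return f"{key}::{content_hash}"
```

Upserting with this ID means re-indexing an unchanged page is a no-op, while an edited section gets a new ID and a fresh vector.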

I use text-embedding-3-small ($0.02/1M tokens) for most projects. It's cheap, fast, and the quality is good enough. text-embedding-3-large is better for high-stakes applications but 6x more expensive.

Step 5: Query and generate

def ask(question: str, top_k: int = 5) -> str:
    """Answer a question using RAG."""
    # Embed the question
    q_vec = client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    ).data[0].embedding
 
    # Retrieve relevant chunks from your vector database
    # results = vector_db.query(vector=q_vec, top_k=top_k)
    # chunks = [r.metadata for r in results]
    chunks: list[dict] = []  # placeholder so the sketch runs; use the query above
 
    # Build context with sources
    context_parts = []
    for chunk in chunks:
        context_parts.append(
            f"[Source: {chunk['source']}]\n{chunk['text']}"
        )
    context = "\n\n---\n\n".join(context_parts)
 
    # Generate answer
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the question using the provided context. "
                    "Cite sources using [Source: URL] format. "
                    "If the context doesn't contain the answer, say so."
                ),
            },
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    ).choices[0].message.content

The system prompt matters. Telling the model to cite sources and to admit when it doesn't know is a simple addition that significantly improves output quality.

Mistakes that tank RAG quality

After building several RAG systems and helping others debug theirs, these are the most common failure modes:

Embedding raw HTML. I already covered this, but it's the #1 mistake. If your vectors contain <nav>, <footer>, and <script> content, your retrieval will return noise. Clean first, always. This single change often improves answer quality more than switching embedding models or retuning hyperparameters.

No refresh pipeline. Web pages change. Documentation gets updated, blog posts get edited, pricing pages change quarterly. If your corpus is a one-time scrape, your RAG answers drift out of date. Build a cron job or webhook that re-scrapes and re-embeds on a schedule. Monthly is fine for most content; weekly for fast-changing pages.
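The refresh check itself can be cheap: hash each page's Markdown at index time, and on the next scheduled run, only re-embed pages whose hash changed. A minimal sketch, assuming you persist a URL-to-hash mapping alongside your vector database (the `needs_reindex` helper is illustrative):

```python
import hashlib

def needs_reindex(url: str, fresh_markdown: str, stored_hashes: dict[str, str]) -> bool:
    """Return True if the page changed since it was last indexed.

    stored_hashes maps URL -> sha256 of the markdown at index time.
    """
    new_hash = hashlib.sha256(fresh_markdown.encode("utf-8")).hexdigest()
    if stored_hashes.get(url) == new_hash:
        return False
    stored_hashes[url] = new_hash
    return True
```

On a 10,000-page corpus where a few dozen pages change per month, this turns the refresh job from a full re-embed into a handful of updates.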

Testing with copy-pasted queries. People test their RAG by asking questions using the exact phrasing from their documents. Of course that works — the query embedding is almost identical to the chunk embedding. Real users ask questions differently. "How do I authenticate?" and "Where do I put my API key?" are the same question, and your RAG needs to handle both.

Ignoring chunk boundaries. When a concept spans two chunks (the explanation starts at the end of one chunk and continues into the next), retrieval might find only one half. Overlapping chunks help, but for critical content, check that your chunking doesn't split important sections.

Cost comparison

For indexing 10,000 web pages into a RAG corpus:

| Step | Raw HTML | Clean Markdown |
|------|----------|----------------|
| Tokens to embed | ~300M | ~20M |
| Embedding cost (text-embedding-3-small) | ~$6.00 | ~$0.40 |
| Per-query context cost (GPT-4o, 5 chunks) | ~$0.075 | ~$0.005 |
| 1,000 queries/month | $75 | $5 |

Clean Markdown is 15x cheaper for embedding and 15x cheaper per query. Over thousands of queries per month, that's the difference between a viable product and a money pit.
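The embedding line items are easy to reproduce from the token counts and the $0.02/1M price:

```python
def embedding_cost(tokens: int, price_per_million: float = 0.02) -> float:
    """Embedding cost in dollars at text-embedding-3-small pricing."""
    return tokens / 1_000_000 * price_per_million

raw_html = embedding_cost(300_000_000)  # $6.00
clean_md = embedding_cost(20_000_000)   # $0.40
```

The 15x ratio comes straight from the token counts: 300M vs. 20M tokens for the same 10,000 pages.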

Getting started

If you're building a RAG system with web data:

  1. Pick 10 URLs you want in your corpus
  2. Get a free API key at purify.verifly.pro (1,000 req/month, no credit card)
  3. Scrape → chunk by heading → embed with text-embedding-3-small
  4. Test with 10 questions your actual users would ask (not copy-pasted from the docs)
  5. Check which chunks were retrieved for wrong answers — that tells you if the problem is scraping, chunking, or generation

The scraping and chunking steps take an afternoon to set up. The iteration and tuning takes longer, but the foundation has to be right first. No amount of prompt engineering fixes a vector database full of HTML noise.

Eason Liu

Builder of Purify. Turning messy HTML into clean Markdown so AI agents can read the web.
