
Chunking Strategies for RAG: How to Optimize Document Retrieval

Feb 24, 2026

StackAI

AI Agents for the Enterprise


Retrieval-augmented generation lives or dies by what it retrieves. If your system keeps pulling irrelevant passages, missing key definitions, or returning near-duplicates, the issue often isn’t the model or the vector database. It’s chunking.


This guide breaks down practical chunking strategies for RAG that improve retrieval quality across real enterprise document types, from PDFs and policies to Markdown, HTML, and code. You’ll learn how to choose a strategy, tune RAG chunk size and chunk overlap, add metadata for RAG retrieval, and evaluate results with a repeatable workflow.


What “Chunking” Means in RAG (and Why It Matters)

Chunking in RAG is the process of splitting source documents into smaller pieces (chunks) that get embedded and stored for retrieval. Those chunks are the unit your retriever can fetch, rerank, and pass into the model as context.


Why chunking matters is simple: embeddings represent the meaning of whatever text you feed them. If a chunk is incoherent, too broad, or missing critical context, the retriever is more likely to surface the wrong evidence. And when retrieval is wrong, generation becomes guesswork.


Common symptoms of poor chunking include:

  • You retrieve passages that are adjacent to the right answer but don’t contain it

  • Definitions are split across chunk boundaries, so the model misses the key line

  • Chunks are too large and become “topic soup,” diluting semantic signal

  • Chunks are too small and lose the conditions, exceptions, or surrounding context

  • Chunk overlap causes duplicate results, increasing cost without improving accuracy

  • PDF chunking breaks tables or merges boilerplate headers into every chunk


Two hard constraints shape chunking strategies for RAG:

  • Embedding model input limits: you can’t embed arbitrarily long text

  • Context window budgets: even if you retrieve the right chunks, you may not have space to include them all


That’s why chunking is not a one-time setting. It’s an engineering trade-off between retrieval precision, context completeness, latency, and cost.


The 4 Core Chunking Strategies (With Pros/Cons)

There’s no universal best choice among chunking strategies for RAG. The right approach depends on document structure, query patterns, and how you evaluate retrieval. Most production systems start with a baseline splitter and evolve into hybrid approaches as they see failure modes in logs.


Fixed-size chunking (token or character windows)

Fixed-size chunking splits text into uniform windows. Token-based chunking is usually preferable to character-based chunking because it better matches how embedding models and LLMs measure length.


How it works:

  • Choose a chunk size (for example, 500 tokens)

  • Optionally choose an overlap (for example, 50 tokens)

  • Slide a window across the text and emit chunks


Pros:

  • Simple, fast, and predictable

  • Easy to tune with a size/overlap grid

  • Works reasonably well for uniform prose


Cons:

  • Splits semantics: you can cut mid-sentence, mid-list, mid-table, or mid-code block

  • Can separate a definition from the term it defines

  • Often requires overlap to avoid boundary losses, increasing storage and duplication


When it’s good enough:

  • MVPs and prototypes

  • Homogeneous content (e.g., articles with consistent paragraph structure)

  • Corpora where you expect reranking to fix some retrieval noise
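For illustration, here is a minimal sliding-window splitter over a pre-tokenized list; the function name and defaults are ours, not a library API:

```python
def fixed_size_chunks(tokens, size=500, overlap=50):
    """Emit windows of `size` tokens, each repeating `overlap` tokens
    from the end of the previous window."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

In practice you would tokenize with the same tokenizer your embedding model uses, so chunk sizes match the model's notion of length.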


Sentence-based chunking

Sentence-based chunking groups complete sentences together up to a maximum length. The goal is to preserve semantic coherence by avoiding mid-sentence splits.


Pros:

  • Produces chunks that read naturally

  • Improves retrieval for Q&A style queries and conversational content

  • Reduces the need for overlap in many cases


Cons:

  • Some domains have extremely long sentences (legal and compliance text)

  • OCR and PDF extraction often produce broken punctuation, harming sentence detection

  • Sentence-only segmentation can still mix topics if the document has long sections without clear breaks


Where it shines:

  • Support knowledge bases and FAQs

  • Short product documentation pages

  • Internal wikis with clean text


Recursive chunking (structure-aware fallback)

Recursive chunking tries to split on larger separators first (like headings or paragraphs), then falls back to smaller separators (lines, sentences, words) until the chunk fits within the size limit.


A typical recursive hierarchy might be:

  • Double newline (paragraph) → newline (line) → sentence boundary → word boundary


Pros:

  • Strong default strategy for mixed corpora

  • Less likely to break paragraphs or lists compared to fixed windows

  • Works well with Markdown and reasonably well with HTML-to-text conversions


Cons:

  • Still doesn’t “understand” meaning; it uses heuristics

  • Requires careful separator choices per content type

  • Can behave unpredictably with messy extraction (common in PDFs)


Practical separators to consider:

  • Markdown/HTML: split on headers first, then paragraphs

  • Code: try splitting on class/function boundaries before falling back to lines


Recursive chunking is often the baseline because it balances quality and simplicity without heavy compute.


Semantic chunking (topic-shift aware)

Semantic chunking aims to keep each chunk focused on a single topic. Instead of splitting based on characters or punctuation, it detects topic shifts. A common approach is to embed smaller units (like sentences or short paragraphs) and group them into chunks until semantic similarity drops below a threshold.


Pros:

  • Higher coherence per chunk

  • Often improves retrieval for complex docs with multiple subtopics per page

  • Reduces “topic soup” chunks that match too broadly


Cons:

  • More compute during ingestion

  • Threshold tuning is non-trivial and corpus-specific

  • Poor extraction quality (especially from PDFs) can confuse topic detection


When to use it:

  • High-value assistants where accuracy matters more than ingestion cost

  • Large corpora with dense, multi-topic sections (handbooks, policies, long technical guides)

  • Systems that already use reranking and want to improve the candidate set quality


Chunk Size & Overlap: The Highest-Leverage Parameters

Even with a solid strategy, RAG chunk size and chunk overlap are the knobs that most directly change retrieval behavior.


Chunk size trade-offs (precision vs context)

Smaller chunks tend to improve precision because each chunk’s embedding focuses on a narrower idea. But you pay in other ways: more chunks to store, more candidates to retrieve, and a higher chance that critical context is missing.


Larger chunks preserve context and reduce the risk of missing adjacent conditions or exceptions. But large chunks often dilute the embedding signal. If a chunk contains several topics, it may match a query weakly and get outranked by a smaller, more focused chunk elsewhere.


Practical starting points:

  • FAQ/support content: 200–400 tokens

  • Technical docs: 400–800 tokens

  • Legal/regulatory text: 800–1,200 tokens (prefer structure-first splitting)

  • Code: function/class sized chunks, often 200–600 tokens depending on style


If you’re unsure, a common baseline in chunking strategies for RAG is 400–600 tokens because it’s large enough to hold a meaningful paragraph cluster but small enough to avoid heavy topic mixing in most prose.


Overlap (sliding window): when it helps vs hurts

Chunk overlap adds repeated text between adjacent chunks. The purpose is to prevent boundary loss. For example, if a definition starts at the end of one chunk and finishes at the beginning of the next, overlap increases the chance at least one chunk contains the full definition.


When overlap helps:

  • Fixed-size chunking that frequently splits mid-thought

  • Documents with critical cross-sentence context (definitions, requirements, step-by-step procedures)

  • Corpora where users ask for edge cases or exceptions often located near boundaries


When overlap hurts:

  • Duplicate retrieval results (you get the same evidence phrased slightly differently)

  • Higher storage and embedding costs

  • Increased prompt token usage if duplicates make it into context

  • Rerankers can become less effective because many candidates are near-identical


Heuristics that work well in practice:

  • Start at 10–20% overlap

  • Prefer sentence-based overlap when possible (overlap whole sentences rather than raw tokens)

  • Watch retrieval logs for “adjacent duplicate chunks” appearing together in top results; it’s often a sign overlap is too high
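Sentence-based overlap from the second heuristic can be sketched like this; the function name and defaults are illustrative:

```python
def sentence_chunks(sentences, max_sents=8, overlap_sents=1):
    """Group whole sentences into chunks, repeating the last
    `overlap_sents` sentences at each boundary."""
    if overlap_sents >= max_sents:
        raise ValueError("overlap must be smaller than chunk size")
    step = max_sents - overlap_sents
    chunks = []
    for i in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[i:i + max_sents]))
        if i + max_sents >= len(sentences):
            break
    return chunks
```

Overlapping whole sentences rather than raw tokens keeps the repeated text coherent, so duplicated context at boundaries still reads as a complete thought.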


Top-K retrieval and prompt budget math (often overlooked)

Chunking decisions should be tied to how many chunks you plan to retrieve and how much context you can afford to include.


A simple budget check:


top_k × avg_chunk_tokens + system_instructions + user_query + tool_output < context_window


If you retrieve 8 chunks averaging 700 tokens each, you’re at 5,600 tokens before the model has generated anything. That may be fine for a large context window, but it can force you to truncate or reduce top_k, both of which can harm answer quality.
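The budget check above can be wrapped in a small helper; the reserved output margin is an assumed safety buffer, not a fixed rule:

```python
def fits_budget(top_k, avg_chunk_tokens, system_tokens, query_tokens,
                tool_tokens, context_window, reserve_for_output=1024):
    """Return (fits, tokens_used_by_prompt_inputs) for a retrieval plan,
    keeping `reserve_for_output` tokens free for the model's answer."""
    used = top_k * avg_chunk_tokens + system_tokens + query_tokens + tool_tokens
    return used + reserve_for_output <= context_window, used
```

Running the article's example (8 chunks at 700 tokens each) makes the trade-off concrete: the retrieved context alone costs 5,600 tokens before instructions or the query are counted.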


How chunking interacts with reranking:

  • If you use a cross-encoder reranker, you can retrieve a larger candidate set (higher top_k) and then select fewer chunks for the prompt. That often allows slightly smaller chunks without losing recall.

  • Without reranking, chunk coherence matters more because you’re relying on vector similarity ranking alone.


Structure-First Chunking for Real-World Documents (PDFs, Markdown, HTML, Tables, Code)

In enterprise settings, documents are messy. The best chunking strategies for RAG often start by respecting structure and only then enforcing size limits.


Markdown/HTML: chunk by headers, then enforce max size

For Markdown and HTML-derived text, the most reliable structure is the heading hierarchy. Chunk first by header sections, then split oversized sections recursively.


Two practical improvements:

  • Include a breadcrumb in the chunk text, such as: Document Title → H1 → H2 → H3. This helps embeddings disambiguate similar subsections (for example, “Retention” under “Security” vs “Retention” under “Data Management”).

  • Store the header path as metadata (section_path) so you can filter or display it in results.


This approach typically improves retrieval specificity because queries often implicitly refer to where something lives in the document.


PDFs: page-aware and layout-aware chunking

PDF chunking is hard because PDFs aren’t structured like HTML. They’re positioned layouts. Extraction can introduce problems like:

  • Broken reading order in multi-column layouts

  • Headers and footers repeated on every page and merged into body text

  • Words hyphenated across line breaks

  • Tables flattened into unstructured text


Practical guidance:

  • Chunk page by page first, then merge or split within pages based on detected sections

  • Store page numbers as metadata so answers can cite their location

  • Strip repeated headers and footers before embedding

  • Treat tables as atomic units and never split them across chunks


When PDF extraction quality is low, semantic chunking often performs worse than expected because topic boundaries become noise. In those cases, page/section chunking plus cleanup can outperform more sophisticated methods.


Code: AST- or symbol-aware chunking (when available)

For codebases, the most useful unit of retrieval is rarely “every 500 tokens.” It’s symbols: functions, classes, and modules.


Best practices:

  • Chunk at function, class, or module boundaries rather than fixed token windows

  • Keep a symbol’s signature and docstring in the same chunk as its body

  • Store the file path and symbol name as metadata so results link back to source


If you can use an AST parser, do it. If not, regex-based heuristics that detect function signatures and class declarations can still help.


Do-not-split rules to adopt early:

  • Do not split inside fenced code blocks

  • Do not split inside tables

  • Do not split inside admonitions/callouts (common in docs)

  • Do not split between a heading and the first paragraph under it
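One way to enforce the first rule is to mask fenced blocks before splitting and restore them afterwards; this sketch assumes Markdown-style triple-backtick fences:

```python
import re

def protect_fences(text):
    """Replace fenced code blocks with placeholders so a splitter
    can't cut through them. Returns (masked_text, blocks) so the
    blocks can be restored after splitting."""
    blocks = []

    def stash(match):
        blocks.append(match.group(0))
        return f"\x00BLOCK{len(blocks) - 1}\x00"

    masked = re.sub(r"```.*?```", stash, text, flags=re.S)
    return masked, blocks
```

The same placeholder trick works for tables and callouts once you can match their boundaries.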


Cleaning steps that often pay off more than tuning:

  • Remove nav menus and repeated sidebars from scraped HTML

  • Normalize whitespace and fix hyphenation in PDFs

  • Deduplicate repeated boilerplate across documents (legal disclaimers, copyright lines)


Advanced Patterns That Often Beat “Bigger/Smaller Chunks”

Once you’ve tuned basic chunk size and chunk overlap, the next gains usually come from changing the retrieval architecture, not squeezing chunk sizes.


Parent–child (hierarchical) chunking

Hierarchical chunking indexes small child chunks for precision, but retrieves larger parent chunks for generation context.


How it works:

* Split each document into larger parent sections (for example, by headers)

* Split each parent into small child chunks and embed only the children

* At query time, match the query against child embeddings, then return each matched child’s parent as generation context


Metadata linking pattern:

* parent_id, child_id, section_path, chunk_index


Why it’s powerful:

* You get precise matching from child embeddings

* You provide coherent context from parent sections

* You reduce the chance that the model sees fragmented evidence



This is one of the most effective chunking strategies for RAG in policy-heavy or handbook-heavy corpora.
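A minimal retrieval sketch, assuming a vector store with a `search(query_vec, top_k)` method that returns `(metadata, score)` pairs; the method name and shapes are illustrative:

```python
def retrieve_parents(query_vec, child_index, parents, top_k=8, max_parents=3):
    """Match small child chunks for precision, then return their
    larger parent sections for coherent generation context."""
    hits = child_index.search(query_vec, top_k)  # [(child_meta, score), ...]
    seen, out = set(), []
    for meta, _score in hits:
        pid = meta["parent_id"]
        if pid not in seen:  # deduplicate children of the same parent
            seen.add(pid)
            out.append(parents[pid])
        if len(out) == max_parents:
            break
    return out
```

Deduplicating by parent_id is the key step: several children of the same section often rank highly, and without it you would waste prompt budget on the same parent repeated.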


Contextual headers / chunk prefaces

Even without parent–child retrieval, you can often improve retrieval by adding a short preface to each chunk before embedding.


Good preface fields:

* document title

* section path

* product name and version

* effective date (for policies)

* source type (SOP, contract, runbook, release note)



Keep it short and consistent. The goal is to anchor the chunk semantically so the embedding reflects both content and context.
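A preface builder might look like this; the field names and the `" | "` separator are illustrative choices, not a standard:

```python
def with_preface(chunk_text, meta):
    """Prepend a short contextual header to a chunk before embedding,
    so the embedding reflects both content and where it lives."""
    parts = [
        meta.get("title", ""),
        " > ".join(meta.get("section_path", [])),
    ]
    preface = " | ".join(p for p in parts if p)
    return f"{preface}\n\n{chunk_text}" if preface else chunk_text
```

Whatever fields you choose, embed the prefaced text consistently; mixing prefaced and raw chunks in one index skews similarity scores.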


Late chunking / global-context embeddings (conceptual)

Late chunking is the idea of preserving broader document context while still retrieving smaller units. Implementations vary, but the principle is to reduce the penalty of splitting by allowing chunk representations to reflect more of the surrounding document.


When to consider it:

* Very long documents with heavy cross-references

* Manuals where definitions appear early and get referenced later

* Corpora where users ask questions that require combining evidence from multiple sections



If you’re not ready to adopt it, hierarchical chunking and good metadata can capture much of the benefit.


Query-aware chunking (dynamic)

Query-aware chunking adjusts chunk boundaries at query time. It can be effective in small corpora where you can afford dynamic processing, but it’s rare in production because you can’t precompute embeddings for every possible chunk.


When it’s worth it:

* Small, high-value corpora

* Investigations and research workflows where latency is less critical

* Systems that already run heavier pipelines (reranking, multi-step reasoning) per query



Metadata: The Hidden Half of Retrieval Quality

Many teams focus on chunking strategies for RAG and forget that retrieval is not just similarity search. Real systems need filtering, traceability, and reliable context selection. Metadata makes that possible.


Must-have metadata fields:

* source file or URL

* document title

* author/owner (if relevant)

* created_at and updated_at

* section headers (section_path)

* page number (for PDFs)

* chunk index within document

* version/product tags (product, version, locale)

* access control tags (department, tenant_id, permission level)



What metadata enables:

* Filtered retrieval: “only the 2025 policy,” “only EU region,” “only version 3.2 docs”

* Better user trust: showing where the answer came from

* Multi-tenant isolation: preventing cross-tenant leakage in shared indexes

* Cleaner evaluation: grouping errors by document type, owner, or version



If you’re building for enterprise workflows, metadata design is often as important as RAG chunk size.
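Filtered retrieval can be as simple as an exact-match metadata pre-filter; most vector stores expose this natively, but the idea in plain Python is:

```python
def filter_chunks(chunks, **required):
    """Keep only chunks whose metadata matches every required field.
    Field names (version, locale, ...) are illustrative."""
    return [
        c for c in chunks
        if all(c["meta"].get(k) == v for k, v in required.items())
    ]
```

For example, `filter_chunks(candidates, version="3.2", locale="eu")` narrows the candidate set before (or alongside) vector similarity ranking.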


How to Evaluate Chunking Strategies (Not Guess)

Chunking is an experiment, not a belief. The only reliable way to improve is to measure retrieval quality and iterate.


Build a “golden set” of questions

Start with 30–100 representative queries. Pull them from:

* search logs

* support tickets

* internal Slack questions

* common onboarding requests

* domain expert interviews



For each query, label:

* the relevant source document(s)

* the specific passage(s) that contain the answer (or the section where it lives)



You don’t need perfect labeling. Lightweight, consistent labeling beats intuition.


Offline retrieval evaluation metrics to track

Track a small set of retrieval evaluation metrics that reflect different failure modes:

* Recall@K: did any of the top K chunks contain the relevant evidence?

This often sets the ceiling on answer quality.

* Precision@K: how many of the retrieved chunks are actually relevant?

This controls noise and prompt waste.

* MRR (Mean Reciprocal Rank): how high does the first relevant chunk appear?

This reflects ranking quality.

* nDCG: rewards ranking relevant chunks higher, especially when there are multiple relevant chunks.
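Recall@K and MRR are straightforward to compute once queries are labeled; a minimal sketch over retrieved chunk IDs:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """1.0 if any of the top-k retrieved chunks is relevant, else 0.0."""
    return float(any(r in relevant_ids for r in retrieved_ids[:k]))

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant chunk (0.0 if none found)."""
    for rank, r in enumerate(retrieved_ids, start=1):
        if r in relevant_ids:
            return 1.0 / rank
    return 0.0
```

Average each metric over the golden set; per-query scores are too noisy to compare chunkers on their own.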



Also track operational metrics:

* average chunks per document (index growth)

* retrieval latency (especially with reranking)

* token usage per answer (prompt cost)

* duplicate chunk rate in top results (overlap side effect)



Run A/B tests across chunkers

Test chunkers like you’d test models: hold everything else constant.


A practical bake-off:

1. Choose a baseline (recursive chunking is a common start)

2. Run a chunk size × overlap grid (e.g., 300/500/800 tokens × 0/10%/20% overlap)

3. Compare recursive vs sentence-based vs semantic chunking

4. Compare flat chunks vs parent–child hierarchical chunking

5. If you use reranking, evaluate with and without it
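The grid in step 2 can be driven by a small helper; `evaluate` stands in for whatever function rebuilds the index and scores it (for example, Recall@K on the golden set):

```python
from itertools import product

def grid_search(sizes, overlaps, evaluate):
    """Score every (chunk_size, overlap) pair with a caller-supplied
    evaluate(size, overlap) -> score, and return the best pair."""
    scores = {(s, o): evaluate(s, o) for s, o in product(sizes, overlaps)}
    return max(scores, key=scores.get), scores
```

Keeping the full `scores` dict, not just the winner, is what makes the follow-up error analysis possible.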



Then do error analysis. Look for patterns like:

* Boundary splits that remove definitions or exceptions

* Tables being broken across chunks

* Duplicate near-neighbor chunks dominating top results

* Overly broad chunks matching too many queries weakly

* Boilerplate text being retrieved (headers, footers, nav)



This process turns chunking strategies for RAG into a controlled optimization problem rather than a guessing game.


Decision Framework: Which Chunking Strategy Should You Use?

Use this as a text-based decision tree:

1. If your documents have strong structure (Markdown, clean HTML, manuals with headings):

Start with header-based chunking, then apply recursive splitting for oversized sections.

2. If your corpus is mixed formats (PDFs, docs, wikis, code):

Route by file type and apply the best splitter per type (PDF page/section rules, header rules for Markdown, symbol rules for code).

3. If accuracy is the top priority and you can spend more on ingestion and retrieval:

Use semantic chunking plus reranking, or adopt parent–child hierarchical chunking.

4. If speed and scale are the priority:

Use recursive chunking with tuned separators and modest overlap, then add reranking later if needed.



Starter configurations that work for many teams:

* Default baseline: recursive chunking, 400–600 tokens, 10–15% overlap

* FAQ baseline: sentence-based or Q&A unit chunks, minimal overlap

* PDF baseline: page/section-based splitting, aggressive header/footer cleanup, table preservation rules



Implementation Notes (LangChain/LlamaIndex + Practical Tips)

You don’t need to tie yourself to one framework, but most stacks follow the same pattern: parse → clean → split → add metadata → embed → index.


Recursive splitter pseudo-code


separators = ["\n\n", "\n", ". ", " ", ""]

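That separator list can drive a minimal recursive splitter; this sketch measures length in characters for simplicity, though token-aware measurement is preferable:

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; recurse into oversized
    pieces with progressively finer separators."""
    if len(text) <= max_len:
        return [text]
    if not separators:
        # No separator left: fall back to a hard character cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_len:
            current = candidate  # piece still fits; keep accumulating
        else:
            if current:
                chunks.append(current)
            if len(piece) > max_len:
                # Single piece is oversized: recurse with finer separators.
                chunks.extend(recursive_split(piece, max_len, rest))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Production splitters also re-attach the separator text and count tokens rather than characters, but the control flow is the same.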


Tip: tokenization matters. If your splitter estimates tokens differently than your embedding model, you’ll get inconsistent chunk sizes. Prefer token-aware splitting or measure with the embedding tokenizer.


Semantic splitter pseudo-code (conceptual)


units = split_into_sentences(text)
unit_embeddings = embed(units)

chunks = []
current = [units[0]]

for i in range(1, len(units)):
    if similarity(unit_embeddings[i], unit_embeddings[i-1]) < threshold:
        chunks.append(join(current))
        current = [units[i]]
    else:
        current.append(units[i])

chunks.append(join(current))  # don't drop the final chunk


Thresholds are corpus-specific. Tune them using your golden set and retrieval metrics, not intuition.


Metadata injection pattern
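A minimal sketch of attaching metadata to each chunk before embedding; the field names follow the must-have list above, and `doc` is an assumed parsed-document dict:

```python
def make_record(chunk_text, doc, chunk_index, section_path):
    """Build an index record: breadcrumb-prefaced text for embedding,
    plus metadata for filtering and citation."""
    breadcrumb = " > ".join([doc["title"], *section_path])
    return {
        "text": f"{breadcrumb}\n\n{chunk_text}",  # embed THIS, not chunk_text alone
        "metadata": {
            "source": doc["source"],
            "title": doc["title"],
            "section_path": section_path,
            "chunk_index": chunk_index,
            "updated_at": doc.get("updated_at"),
        },
    }
```

Whichever shape you use, embed the same text you will later show or augment at generation time; the first gotcha below is exactly this mismatch.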







Two gotchas that cause silent quality regressions:

* Embedding the wrong text (for example, embedding raw chunk text but displaying a different augmented text during generation)

* Letting duplicated boilerplate dominate embeddings (common in PDFs and scraped HTML)



Conclusion: A Practical Loop for Better Retrieval

The best chunking strategies for RAG aren’t about finding a magic chunk size. They’re about building a repeatable loop that improves retrieval quality for your specific corpus.


Use this three-step cycle:

1. Pick a baseline chunking strategy for RAG (recursive or structure-first)

2. Tune RAG chunk size and chunk overlap based on your prompt budget and query patterns

3. Evaluate with a golden set and retrieval evaluation metrics, then iterate based on error analysis



When your chunking is right, everything downstream improves: reranking gets easier, prompts get smaller, and answers become more grounded in the actual source material.


Book a StackAI demo: https://www.stack-ai.com/demo

Deploy custom AI Assistants, Chatbots, and Workflow Automations to make your company 10x more efficient.