Chunking Strategies for RAG

Chunking turns raw, normalized page content into retrieval units. The wrong approach inflates cost (too many fragments), lowers recall (blocks too large), or dilutes precision (boundaries that cut through meaning). There is no universally best method; the strategy has to fit the corpus structure, its volatility, and the query patterns. This guide covers the design space, the trade-offs, an evaluation workflow, and the optimization levers for production RAG pipelines.

Why Chunking Matters

Goals:

Maximize the chance that relevant facts appear in top-k retrieval.
Preserve semantic cohesion so that generated answers stay grounded.
Optimize token utilization and avoid re-embedding boilerplate.
Support deterministic incremental updates through stable chunk IDs.

Poorly matched chunking shows up as high redundancy, low Recall@k, hallucinated facts at chunk boundaries, and inflated embedding spend.

Fixed Window Chunking

Simple N-token windows, for example 500 tokens. Pros: deterministic, easy to implement, stable update behavior. Cons: boundaries split concepts, and reducing truncation requires redundant overlap, which raises cost. Use with care: it is a good baseline for heterogeneous or poorly structured content where semantic signals are unreliable.
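
As a minimal sketch of the fixed-window baseline, the function below slices a token list into consecutive, non-overlapping windows. The name `fixed_window_chunks` is hypothetical, and a whitespace split stands in for a real tokenizer, which is an assumption made only to keep the example self-contained.

```python
def fixed_window_chunks(tokens, window=500):
    """Split a token list into consecutive, non-overlapping windows."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]


# Whitespace splitting is a stand-in for a real tokenizer (assumption).
tokens = "the quick brown fox jumps over the lazy dog".split()
chunks = fixed_window_chunks(tokens, window=4)
for chunk in chunks:
    print(chunk)
```

The last window is simply shorter than the rest; the same input always produces the same boundaries, which is what keeps update behavior stable.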

Overlapping Sliding Windows

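A window size W plus an overlap O, for example 500 / 50 tokens, reduces fact truncation at boundaries. Beyond roughly 15% overlap, recall gains diminish while index size keeps growing: the indexed token count scales by about W / (W - O). Track duplication_ratio = 1 - distinct_token_count / total_token_count, where distinct_token_count counts each source token once and total_token_count sums tokens over all chunks, and use it to tune O down.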
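
The sliding window and its duplication measurement can be sketched as follows. The function names are hypothetical. Duplication is computed here as the share of indexed tokens that are repeats (1 minus source tokens over indexed tokens), so a higher value means more overlap; that reading of the ratio is an assumption of this sketch.

```python
def sliding_window_chunks(tokens, window=500, overlap=50):
    """Windows of `window` tokens, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the source
    return chunks


def duplication_ratio(chunks, source_token_count):
    """Share of indexed tokens that are repeats of tokens already indexed."""
    total = sum(len(c) for c in chunks)
    return 1 - source_token_count / total if total else 0.0


tokens = list(range(1000))  # stand-in for 1000 real tokens
chunks = sliding_window_chunks(tokens, window=500, overlap=50)
print(len(chunks), round(duplication_ratio(chunks, len(tokens)), 4))
```

Re-running this with different values of `overlap` on a sample of the corpus gives a direct view of how much index growth each extra token of overlap costs.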

Semantic Boundary Detection

Segment along structural signals: H2/H3 headings, list groupings, code blocks, table boundaries. Enforce min/max token bounds: merge undersized siblings and split oversized sections. Pros: higher cohesion, fewer overlaps. Risks: malformed markup and inconsistent heading hierarchy. Mitigate with hierarchy repair and a fallback to paragraph splitting when headings are missing.
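
A rough sketch of heading-based segmentation with token bounds is shown below. It assumes Markdown-style `##` / `###` heading markers and a whitespace tokenizer. The bounds `min_tokens` and `max_tokens` are hypothetical defaults, not recommended values. Lists, code blocks, tables, and hierarchy repair are left out for brevity.

```python
import re

HEADING = re.compile(r"^#{2,3}\s+")  # assumes Markdown-style H2/H3 markers


def n_tokens(text):
    return len(text.split())  # whitespace tokenizer as a stand-in


def split_on_headings(text):
    """Start a new section at every H2/H3 heading line."""
    sections, current = [], []
    for line in text.splitlines():
        if HEADING.match(line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections


def enforce_bounds(sections, min_tokens=120, max_tokens=800):
    """Split oversized sections on blank lines, then merge undersized pieces forward."""
    out = []
    for sec in sections:
        pieces = [sec]
        if n_tokens(sec) > max_tokens:
            # Paragraph splitting; this is also what happens when no headings exist.
            pieces = [p for p in sec.split("\n\n") if p.strip()]
        for piece in pieces:
            if out and n_tokens(out[-1]) < min_tokens:
                out[-1] += "\n\n" + piece  # note: a merge may exceed max_tokens
            else:
                out.append(piece)
    return out
```

A document with no headings at all comes back from `split_on_headings` as a single section, so the paragraph split in `enforce_bounds` doubles as the fallback path.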

Hierarchical Chunking

Two-tier index: coarse section embeddings (for example, a whole tutorial section) plus fine-grained subchunks. Retrieval flow: coarse ANN search -> keep the top N sections -> fine retrieval inside them. Pros: a smaller global search space and better latency on large corpora. Complexity: more moving parts, and it needs cascade scoring logic.
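
The coarse-to-fine flow can be illustrated with a toy two-tier search. Brute-force cosine similarity stands in for an ANN index here, and the data layout (`id`, `vec`, `chunks`), the function names, and the two-dimensional vectors are all assumptions made for the example.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hierarchical_search(query_vec, sections, top_sections=2, top_k=3):
    """Rank sections first, then rank only the chunks inside the best sections."""
    coarse = sorted(sections, key=lambda s: cosine(query_vec, s["vec"]), reverse=True)
    candidates = [
        (cosine(query_vec, chunk["vec"]), chunk["id"])
        for section in coarse[:top_sections]
        for chunk in section["chunks"]
    ]
    candidates.sort(reverse=True)
    return [chunk_id for _, chunk_id in candidates[:top_k]]


# Toy index with made-up vectors.
sections = [
    {"id": "install", "vec": [1.0, 0.0], "chunks": [
        {"id": "install/1", "vec": [0.9, 0.1]},
        {"id": "install/2", "vec": [0.8, 0.6]},
    ]},
    {"id": "billing", "vec": [0.0, 1.0], "chunks": [
        {"id": "billing/1", "vec": [0.1, 0.9]},
    ]},
]
print(hierarchical_search([1.0, 0.1], sections, top_sections=1, top_k=2))
```

This sketch ranks by the fine score alone; a real cascade might blend the coarse and fine scores, which is the extra scoring logic the section refers to.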

Adaptive / Dynamic Chunking

Adjust chunk sizes according to local semantic density and structural cues. Example: start from a heading section; if it exceeds 800 tokens, split it into paragraph clusters based on semantic similarity scoring; if it is under 120 tokens, merge it with the next sibling unless topic divergence exceeds a threshold. This needs an embedding or similarity pre-pass: a one-time cost at ingestion in exchange for long-term retrieval efficiency.
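
The example rule above might be sketched like this. Jaccard word overlap is used as a cheap stand-in for embedding similarity. The function names and the `threshold` value are hypothetical. A real pipeline would swap in its own similarity pre-pass.

```python
def n_tokens(text):
    return len(text.split())  # whitespace tokenizer as a stand-in


def jaccard(a, b):
    """Word-set overlap; a placeholder for embedding similarity (assumption)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def split_by_similarity(section, threshold):
    """Group consecutive paragraphs; open a new group when similarity drops."""
    groups = []
    for para in (p for p in section.split("\n\n") if p.strip()):
        if groups and jaccard(groups[-1][-1], para) >= threshold:
            groups[-1].append(para)
        else:
            groups.append([para])
    return ["\n\n".join(g) for g in groups]


def adaptive_chunks(sections, max_tokens=800, min_tokens=120, threshold=0.2):
    # Pass 1: split oversized sections into paragraph clusters.
    pieces = []
    for sec in sections:
        if n_tokens(sec) > max_tokens:
            pieces.extend(split_by_similarity(sec, threshold))
        else:
            pieces.append(sec)
    # Pass 2: merge an undersized piece with its next sibling unless topics diverge.
    out = []
    for piece in pieces:
        if out and n_tokens(out[-1]) < min_tokens and jaccard(out[-1], piece) >= threshold:
            out[-1] += "\n\n" + piece
        else:
            out.append(piece)
    return out
```

Because both passes depend only on the section text and the thresholds, the output stays deterministic for unchanged input, which matters for incremental updates.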

Embedding Considerations

Preserve metadata: token_count, model_version, content_hash. Avoid truncation: pre-compute token counts and split before the model call. Dense models degrade when there is too much boilerplate, so strip navigation artifacts before chunking. Monitor vector_density (unique terms / tokens) to find low-signal fragments that are candidates for re-merging.
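
One possible shape for the per-chunk metadata record is sketched below. The field names follow the text. The function name, the `max_tokens` limit of 512, and the `embed-v1` version label are hypothetical, and whitespace splitting again stands in for the embedding model's tokenizer.

```python
import hashlib


def chunk_record(text, model_version, max_tokens=512):
    """Build the metadata stored next to a chunk's embedding."""
    tokens = text.split()  # stand-in for the model's real tokenizer
    if len(tokens) > max_tokens:
        # Refuse rather than let the model silently truncate the input.
        raise ValueError("chunk exceeds max_tokens; split before embedding")
    unique_terms = {t.lower() for t in tokens}
    return {
        "content_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "token_count": len(tokens),
        "model_version": model_version,
        "vector_density": len(unique_terms) / len(tokens) if tokens else 0.0,
    }


record = chunk_record("Home Home Home docs docs", model_version="embed-v1")
print(record["token_count"], record["vector_density"])
```

A low `vector_density`, as in this repetitive navigation-style fragment, is the signal that a chunk is a candidate for re-merging or removal.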

Evaluation Methodology

Benchmarks per strategy:

Metric	Purpose
Recall@k	Fact retention
Precision@k	Context noise
Chunk Count	Cost indicator
Duplication Ratio	Overlap tuning
Avg Tokens per Chunk	Window utilization
Latency (Retrieval)	Index efficiency

Run these on a gold query set; adopt a strategy only when its recall gains outweigh the cost and latency deltas.
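
A small harness for the two retrieval-quality metrics could look like the sketch below. It assumes the gold set maps each query to the set of chunk IDs that contain the answer, and that `retrieve` is whatever function returns ranked chunk IDs for a query. Both names are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunk IDs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(set(relevant))


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant) / k


def evaluate(gold, retrieve, k=5):
    """Average both metrics over a gold set of {query: relevant chunk IDs}."""
    recalls, precisions = [], []
    for query, relevant in gold.items():
        retrieved = retrieve(query)
        recalls.append(recall_at_k(retrieved, relevant, k))
        precisions.append(precision_at_k(retrieved, relevant, k))
    n = len(gold)
    return {"recall@k": sum(recalls) / n, "precision@k": sum(precisions) / n}


# Toy gold set and a canned retriever, for illustration only.
gold = {"q1": {"a", "b"}, "q2": {"c"}}
canned = {"q1": ["a", "x", "b"], "q2": ["y", "z", "c"]}
print(evaluate(gold, canned.get, k=2))
```

Running the same gold set against each chunking strategy, with chunk count and retrieval latency recorded alongside, gives the comparison the table calls for.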

Implementation Playbook

Baseline: fixed 500-token windows + 10% overlap; collect benchmarks.
Introduce Semantic Boundaries: replace windows where headings are reliable; re-measure.
Add a Hierarchical Layer if the corpus exceeds 250k chunks or latency exceeds target.
Deploy Adaptive logic where section sizes vary widely.
Quarterly Reassessment: compare cost per quality delta against new model capabilities.

Store a chunk manifest diff with every iteration so that rollback is easy.
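
A manifest diff can be as simple as comparing two mappings from chunk ID to content hash. The sketch below assumes that manifest shape, and the chunk IDs shown are made up. The manifest format is not specified in this guide.

```python
def manifest_diff(old, new):
    """Compare two manifests of the form {chunk_id: content_hash}."""
    old_ids, new_ids = set(old), set(new)
    return {
        "added": sorted(new_ids - old_ids),
        "removed": sorted(old_ids - new_ids),
        "changed": sorted(cid for cid in old_ids & new_ids if old[cid] != new[cid]),
    }


# Hypothetical manifests from two consecutive iterations.
previous = {"guide#intro": "h1", "guide#setup": "h2", "guide#faq": "h3"}
current = {"guide#intro": "h1", "guide#setup": "h2b", "guide#usage": "h4"}
print(manifest_diff(previous, current))
```

Only the `added` and `changed` IDs need re-embedding, and storing the diff together with the previous manifest is enough to reverse the change.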

Key Takeaways

Semantic boundaries usually beat pure fixed windows on precision and cost.
Overlap is a dial: measure duplication instead of guessing.
Hierarchical retrieval helps scale without linear latency growth.
Stable chunk IDs enable safe incremental embedding refreshes.
Evaluate strategy changes like code deploys: benchmark, compare, log.