Chunking Strategies for RAG

Chunking turns raw, normalized page content into retrieval units. Poor choices raise cost, lower recall, or dilute precision. There is no universally best method; choose a strategy that fits the corpus structure, its volatility, and the query patterns.

Why Chunking Matters

Goals:

Increase the probability that relevant facts appear in top-k retrieval.
Preserve semantic cohesion so that generated answers stay grounded.
Optimize token utilization by not embedding the same boilerplate repeatedly.
Enable deterministic incremental updates through stable chunk IDs.

Misaligned chunking shows up as high redundancy, low Recall@k, hallucinated facts at chunk boundaries, and inflated embedding spend.

Fixed Window Chunking

Simple windows of N tokens, for example 500 tokens. Pros: deterministic, easy to implement, stable update behavior. Cons: concepts can be cut in the middle; reducing truncation requires overlap, which raises cost. Use it only as a baseline for heterogeneous content where semantic signals are unreliable.
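As a minimal sketch of the fixed-window baseline, the function below cuts a text into consecutive windows of N tokens. The name `fixed_window_chunks` is hypothetical, and whitespace splitting is only a stand-in for the embedding model's real tokenizer, so actual token counts will differ.

```python
def fixed_window_chunks(text: str, window: int = 500) -> list[str]:
    """Split text into consecutive, non-overlapping windows of `window` tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    # Assumption: whitespace-separated words approximate model tokens.
    tokens = text.split()
    return [" ".join(tokens[i:i + window]) for i in range(0, len(tokens), window)]


example_chunks = fixed_window_chunks("alpha beta gamma delta epsilon zeta eta", window=3)
print(example_chunks)
# ['alpha beta gamma', 'delta epsilon zeta', 'eta']
```

The same input always yields the same windows, which is what makes this strategy deterministic; the trailing short chunk ("eta") shows how a concept can end up isolated at a boundary.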

Overlapping Sliding Windows

A window size W with an overlap O (for example 500 / 50 tokens) reduces fact truncation at boundaries. Overlap above roughly 15% yields diminishing recall gains while growing the index. Track duplication_ratio = distinct_token_count / total_token_count (1.0 means no token is indexed twice; lower values mean more duplication) and tune O against it.
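The sliding window and its duplication metric could be sketched as follows. The function names are hypothetical, and the sketch assumes that distinct_token_count means the number of token positions in the source and total_token_count means the tokens summed over all chunks.

```python
def sliding_window_chunks(tokens: list[str], window: int = 500, overlap: int = 50) -> list[list[str]]:
    """Windows of `window` tokens where each window repeats the last `overlap` tokens."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


def duplication_ratio(chunks: list[list[str]], source_token_count: int) -> float:
    """distinct / total: 1.0 means no duplication, lower means more overlap."""
    total = sum(len(chunk) for chunk in chunks)
    return source_token_count / total if total else 1.0


tokens = [f"t{i}" for i in range(1000)]
chunks = sliding_window_chunks(tokens, window=500, overlap=50)
ratio = duplication_ratio(chunks, len(tokens))
print([len(c) for c in chunks], round(ratio, 3))
# [500, 500, 100] 0.909
```

Re-running this with different values of O on a sample of the corpus gives a measured ratio to weigh against the recall gain, rather than a guessed one.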

Semantic Boundary Detection

Segment on structural signals such as H2/H3 headings, list groupings, code blocks, and table boundaries. Enforce min/max token bounds: merge undersized siblings and split oversized sections. Benefits: higher cohesion, less overlap. Risks: malformed markup and inconsistent heading hierarchy. Hierarchy repair, plus a paragraph-splitting fallback when headings are absent, helps.
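A rough sketch of heading-based segmentation with min/max bounds is shown here. It assumes Markdown-style `##` / `###` heading lines and whitespace token counts; the function names and the bounds used in the example are illustrative, not values from this article.

```python
import re

# Assumption: H2/H3 headings appear as Markdown "## " or "### " lines.
HEADING = re.compile(r"^#{2,3}\s+\S")


def n_tokens(text: str) -> int:
    return len(text.split())  # stand-in for a real tokenizer


def split_on_headings(text: str) -> list[str]:
    """Start a new section at every H2/H3 heading line."""
    sections, current = [], []
    for line in text.splitlines():
        if HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


def enforce_bounds(sections: list[str], min_tokens: int, max_tokens: int) -> list[str]:
    # Pass 1: an undersized section absorbs its next sibling.
    merged = []
    for sec in sections:
        if merged and n_tokens(merged[-1]) < min_tokens:
            merged[-1] = merged[-1] + "\n\n" + sec
        else:
            merged.append(sec)
    # Pass 2: oversized sections are repacked paragraph by paragraph.
    # A single paragraph larger than max_tokens is left as one chunk.
    out = []
    for sec in merged:
        if n_tokens(sec) <= max_tokens:
            out.append(sec)
            continue
        buf = []
        for para in sec.split("\n\n"):
            if buf and n_tokens("\n\n".join(buf + [para])) > max_tokens:
                out.append("\n\n".join(buf))
                buf = []
            buf.append(para)
        if buf:
            out.append("\n\n".join(buf))
    return out


doc = "\n".join([
    "## Setup", "install it", "",
    "## Usage", "run the tool now", "",
    "## Details", "a b c d e", "", "f g h i j",
])
bounded = enforce_bounds(split_on_headings(doc), min_tokens=5, max_tokens=10)
print(len(bounded))  # 3
```

In this example the short "Setup" section is merged into "Usage", and the oversized "Details" section is split at its paragraph break. A document without any headings comes back from `split_on_headings` as one section, so the paragraph repacking in pass 2 doubles as the fallback.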

Hierarchical Chunking

Two-tier index: coarse section embeddings plus fine-grained subchunks. Retrieval flow: coarse ANN search → filter to the top N sections → fine retrieval within them. On large corpora this shrinks the global search space and improves latency. Complexity: it needs cascade scoring logic.
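The two-stage flow might look like the sketch below. It uses exact cosine similarity over tiny hand-written vectors in place of a real ANN index, and the data layout (`id`, `vector`, `chunks`) and the function name are assumptions made for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hierarchical_search(query: list[float], sections: list[dict],
                        top_sections: int = 2, top_k: int = 3) -> list[str]:
    """Stage 1 ranks coarse section vectors; stage 2 ranks subchunks of the survivors."""
    ranked = sorted(sections, key=lambda s: cosine(query, s["vector"]), reverse=True)
    candidates = [chunk for sec in ranked[:top_sections] for chunk in sec["chunks"]]
    scored = sorted(((cosine(query, c["vector"]), c["id"]) for c in candidates), reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_k]]


sections = [
    {"id": "s1", "vector": [1.0, 0.0], "chunks": [
        {"id": "s1-a", "vector": [0.9, 0.1]},
        {"id": "s1-b", "vector": [0.6, 0.4]},
    ]},
    {"id": "s2", "vector": [0.0, 1.0], "chunks": [
        {"id": "s2-a", "vector": [0.1, 0.9]},
    ]},
    {"id": "s3", "vector": [0.7, 0.7], "chunks": [
        {"id": "s3-a", "vector": [0.8, 0.6]},
    ]},
]
hits = hierarchical_search([1.0, 0.0], sections, top_sections=2, top_k=3)
print(hits)  # ['s1-a', 's1-b', 's3-a']
```

Here the final ranking uses only the fine-grained score. Blending the section score into the chunk score is one possible form of cascade scoring, and which form works best is something to settle on the benchmark set.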

Adaptive / Dynamic Chunking

Adjust chunk size based on local semantic density and structural cues. Start from a heading section; if it exceeds 800 tokens, split it into paragraph clusters using semantic similarity; if it is under 120 tokens, merge it with the next sibling unless doing so crosses the topic divergence threshold. This pays an extra one-time cost at ingestion in exchange for long-term retrieval efficiency.
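The split and merge rules above can be sketched as follows. Word-set Jaccard similarity is used here purely as a cheap stand-in for embedding similarity, and the function names and the `threshold` value are assumptions; only the 800 and 120 token bounds come from the text.

```python
def n_tokens(text: str) -> int:
    return len(text.split())  # stand-in for a real tokenizer


def jaccard(a: str, b: str) -> float:
    """Assumption: word overlap approximates semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def split_by_similarity(paragraphs: list[str], threshold: float) -> list[str]:
    """Open a new cluster whenever neighbouring paragraphs diverge in topic."""
    clusters = [[paragraphs[0]]]
    for prev, para in zip(paragraphs, paragraphs[1:]):
        if jaccard(prev, para) >= threshold:
            clusters[-1].append(para)
        else:
            clusters.append([para])
    return ["\n\n".join(cluster) for cluster in clusters]


def adaptive_chunks(sections: list[str], max_tokens: int = 800,
                    min_tokens: int = 120, threshold: float = 0.2) -> list[str]:
    # Oversized sections are split into paragraph clusters.
    sized = []
    for sec in sections:
        if n_tokens(sec) > max_tokens:
            sized.extend(split_by_similarity(sec.split("\n\n"), threshold))
        else:
            sized.append(sec)
    # Undersized chunks merge with the next sibling only when topically close.
    out, i = [], 0
    while i < len(sized):
        cur = sized[i]
        if (n_tokens(cur) < min_tokens and i + 1 < len(sized)
                and jaccard(cur, sized[i + 1]) >= threshold):
            out.append(cur + "\n\n" + sized[i + 1])
            i += 2
        else:
            out.append(cur)
            i += 1
    return out


sections = [
    "cache keys expire\n\ncache keys refresh\n\nbilling invoices monthly",
    "billing invoices overdue",
    "deploy pipeline stages run tests before release",
]
# Tiny bounds so the toy input exercises both rules.
result = adaptive_chunks(sections, max_tokens=8, min_tokens=4, threshold=0.3)
print(len(result))  # 3
```

In the toy run the first section is split where the topic changes from caching to billing, and the small billing fragment then merges with the billing section that follows it. With real embeddings the similarity calls are where the extra ingestion cost comes from.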

Embedding Considerations

Keep metadata with each chunk: token_count, model_version, content_hash. Pre-compute token counts and split before calling the model. Strip navigation artifacts before chunking. Monitor vector_density to detect low-signal fragments.
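One way to carry this metadata is a small record per chunk, sketched below. The `ChunkRecord` layout, the `doc_id#position` ID scheme, the helper names, and the `embed-v1` / `embed-v2` version strings are all hypothetical; only the three metadata fields come from the text.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkRecord:
    chunk_id: str
    text: str
    token_count: int
    model_version: str
    content_hash: str


def make_record(doc_id: str, position: int, text: str, model_version: str) -> ChunkRecord:
    # Normalize whitespace so cosmetic edits do not change the hash.
    normalized = " ".join(text.split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return ChunkRecord(
        chunk_id=f"{doc_id}#{position}",          # assumed ID scheme
        text=normalized,
        token_count=len(normalized.split()),      # whitespace tokens as a stand-in
        model_version=model_version,
        content_hash=digest,
    )


def needs_embedding(record: ChunkRecord, previous: Optional[ChunkRecord]) -> bool:
    """Re-embed only when the content or the embedding model changed."""
    return (
        previous is None
        or previous.content_hash != record.content_hash
        or previous.model_version != record.model_version
    )


old = make_record("guide.md", 0, "Install the  CLI first.", "embed-v1")
same = make_record("guide.md", 0, "Install the CLI first.", "embed-v1")
upgraded = make_record("guide.md", 0, "Install the CLI first.", "embed-v2")
print(needs_embedding(same, old), needs_embedding(upgraded, old))
# False True
```

Because token_count is computed before any model call, oversized chunks can be caught and split without spending embedding budget on them.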

Evaluation Methods

Metric	Purpose
Recall@k	Fact retention
Precision@k	Context noise
Chunk Count	Cost indicator
Duplication Ratio	Overlap tuning
Avg Tokens per Chunk	Window utilization
Latency (Retrieval)	Index efficiency

Benchmark strategies on a gold query set; adopt a change only when the recall gain outweighs the cost and latency deltas.
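The two retrieval metrics in the table can be computed as in this sketch. It assumes the gold set maps each query to the set of relevant chunk IDs and that a run maps each query to a ranked list of retrieved IDs; the function names and sample data are illustrative only.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / k


def evaluate(gold: dict[str, set[str]], run: dict[str, list[str]], k: int) -> dict[str, float]:
    """Mean Recall@k and Precision@k over all gold queries."""
    recalls = [recall_at_k(run.get(q, []), rel, k) for q, rel in gold.items()]
    precisions = [precision_at_k(run.get(q, []), rel, k) for q, rel in gold.items()]
    n = len(gold) or 1
    return {"recall@k": sum(recalls) / n, "precision@k": sum(precisions) / n}


gold = {"q1": {"c1", "c2"}, "q2": {"c9"}}
run = {"q1": ["c1", "c5", "c2", "c7"], "q2": ["c3", "c9", "c4", "c8"]}
scores = evaluate(gold, run, k=2)
print(scores)  # {'recall@k': 0.75, 'precision@k': 0.5}
```

Since chunk IDs change between strategies, a gold set keyed on source passages or facts, mapped to chunk IDs per strategy, is likely easier to reuse across benchmark runs.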

Implementation Playbook

Baseline: fixed 500-token windows with 10% overlap; collect benchmarks.
If headings are reliable, introduce Semantic Boundaries and re-measure.
If the corpus exceeds 250k chunks or latency exceeds its target, add a Hierarchical Layer.
Deploy Adaptive logic for sections with high size variance.
Quarterly reassessment: compare cost per quality delta against new model capabilities.

Store a chunk manifest diff for every iteration to allow rollback.
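A manifest diff that supports rollback could be as simple as the sketch below. It assumes a manifest is a mapping from chunk ID to content hash; the diff layout and both function names are hypothetical.

```python
import json


def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, dict]:
    """Record added, removed, and changed chunks, keeping old hashes for rollback."""
    return {
        "added": {k: new[k] for k in sorted(new.keys() - old.keys())},
        "removed": {k: old[k] for k in sorted(old.keys() - new.keys())},
        "changed": {
            k: {"old": old[k], "new": new[k]}
            for k in sorted(old.keys() & new.keys())
            if old[k] != new[k]
        },
    }


def rollback(new: dict[str, str], diff: dict[str, dict]) -> dict[str, str]:
    """Rebuild the previous manifest from the current one plus the stored diff."""
    restored = {k: v for k, v in new.items() if k not in diff["added"]}
    restored.update(diff["removed"])
    restored.update({k: v["old"] for k, v in diff["changed"].items()})
    return restored


old_manifest = {"doc#0": "aaa", "doc#1": "bbb", "doc#2": "ccc"}
new_manifest = {"doc#0": "aaa", "doc#1": "bbx", "doc#3": "ddd"}
diff = diff_manifests(old_manifest, new_manifest)
print(json.dumps(diff, sort_keys=True))
```

The diff is plain JSON, so it can be stored next to the benchmark results for each iteration. The "added" and "changed" entries also identify exactly which chunks need new embeddings.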

Key Takeaways

Semantic boundaries generally beat fixed windows on precision and cost.
Overlap is a dial; measure duplication instead of guessing.
Hierarchical retrieval helps scale without linear latency growth.
Stable chunk IDs make incremental embedding refreshes safe.
Evaluate strategy changes like code deploys: benchmark, compare, log.