Hybrid Retrieval: Vector + Keyword + Metadata

Single-modality retrieval tends to miss edge cases: dense vectors miss rare tokens and IDs, while pure lexical search misses paraphrase and semantic similarity. Hybrid retrieval fuses complementary signals (dense semantic, sparse lexical, structured metadata, and temporal freshness) to build candidate sets that are stable and high in precision. This article covers architecture, normalization, scoring fusion, failure handling, and evaluation.

Motivation

Failure scenarios:

Proper nouns and SKU codes are missed by the dense model
Pricing-change queries retrieve stale snapshots when there is no temporal boost
Long natural-language questions are overweighted on stopwords in sparse-only systems
Vector false positives on semantically broad pages, such as marketing fluff with no lexical anchoring

Hybrid retrieval reduces these problems by capturing orthogonal dimensions of evidence.

Component Layering

Recommended flow:

Query Embedding → ANN search (k_vec)
Lexical Search (BM25 / SPLADE / Elasticsearch) (k_lex)
Union → Score Normalization (per source scaling)
Metadata Filter Pass (locale, access_tier, page_type)
Diversity & Freshness Adjustments
Optional Cross/Mono Re-Ranker
Final Truncation (top K)

Keep the raw pre-fusion scores for auditing.
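The union step with retained raw scores can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Candidate` class, the `union_candidates` function, and the `"vec"` / `"lex"` source tags are hypothetical names, and each retriever is assumed to return a list of `(doc_id, score)` pairs.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    doc_id: str
    # Raw, un-normalized scores keyed by source tag, kept for later audit.
    raw_scores: dict = field(default_factory=dict)


def union_candidates(vec_hits, lex_hits):
    """Merge (doc_id, score) hits from both retrievers into one candidate map."""
    merged = {}
    for source, hits in (("vec", vec_hits), ("lex", lex_hits)):
        for doc_id, score in hits:
            cand = merged.setdefault(doc_id, Candidate(doc_id))
            cand.raw_scores[source] = score
    return merged


if __name__ == "__main__":
    vec = [("doc-a", 0.91), ("doc-b", 0.74)]   # e.g. cosine similarities
    lex = [("doc-b", 12.3), ("doc-c", 8.5)]    # e.g. BM25 scores
    for cand in union_candidates(vec, lex).values():
        print(cand.doc_id, cand.raw_scores)
```

A document found by only one retriever simply has one entry in `raw_scores`, which makes it easy to see later which source contributed each candidate.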

Query Normalization

Steps:

Unicode normalize NFKC
Lowercase, keeping a snapshot of the original casing for answer formatting if needed
Tokenize and keep stopwords, because semantic embeddings can use the context
Synonym / Alias Expansion: add alternative tokens to map internal product codenames; keep them out of the model prompt and use them only for sparse retrieval
Numeric & Version Extraction: capture X.Y.Z patterns for targeted lexical scoring
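The normalization steps above could look roughly like this in Python. It is a sketch under stated assumptions: the `ALIASES` map and its single codename entry are invented for illustration, tokenization is plain whitespace splitting, and the returned field names are hypothetical.

```python
import re
import unicodedata

# Matches X.Y.Z version strings such as 2.3.1.
VERSION_RE = re.compile(r"\b\d+\.\d+\.\d+\b")

# Hypothetical internal codename -> public name mapping (illustration only).
ALIASES = {"bluejay": ["acme sync"]}


def normalize_query(raw):
    text = unicodedata.normalize("NFKC", raw)   # fold compatibility characters
    lowered = text.lower()
    tokens = lowered.split()                    # stopwords are deliberately kept
    expansions = [alt for tok in tokens for alt in ALIASES.get(tok, [])]
    return {
        "original": text,                       # casing snapshot for answer formatting
        "dense_text": lowered,                  # goes to the embedding model, no aliases
        "sparse_terms": tokens + expansions,    # aliases feed sparse retrieval only
        "versions": VERSION_RE.findall(lowered),
    }


if __name__ == "__main__":
    print(normalize_query("Bluejay 2.3.1 pricing for the EU"))
```

Keeping `dense_text` and `sparse_terms` as separate fields is one way to make sure alias expansions never reach the model prompt.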

Applying filters after the initial candidate union reduces recall loss. Common fields: locale, access_tier, page_type, product_area, updated_bucket. Enforce security filters (tenant / tier) before scoring fusion so that leaked documents cannot influence re-ranking. Provide a debug mode that returns the filtered_out set for inspection.
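A minimal sketch of this filter pass, assuming candidates are dictionaries carrying `tenant`, `access_tier`, and `locale` fields, and that `access_tier` is an integer where a higher value means more restricted. The function names and the shape of the debug output are hypothetical.

```python
def split_by(candidates, predicate, reason):
    """Partition candidates into (kept, dropped); dropped entries record the reason."""
    kept, dropped = [], []
    for doc in candidates:
        if predicate(doc):
            kept.append(doc)
        else:
            dropped.append({"id": doc["id"], "reason": reason})
    return kept, dropped


def filter_candidates(candidates, user, locale, debug=False):
    # Security filter first, so unauthorized documents never reach scoring fusion.
    allowed, dropped_sec = split_by(
        candidates,
        lambda d: d["tenant"] == user["tenant"] and d["access_tier"] <= user["access_tier"],
        "security",
    )
    # Ordinary metadata filter runs on what is left.
    kept, dropped_meta = split_by(allowed, lambda d: d["locale"] == locale, "locale")
    result = {"candidates": kept}
    if debug:
        result["filtered_out"] = dropped_sec + dropped_meta
    return result


if __name__ == "__main__":
    docs = [
        {"id": "d1", "tenant": "t1", "access_tier": 1, "locale": "en"},
        {"id": "d2", "tenant": "t2", "access_tier": 1, "locale": "en"},
        {"id": "d3", "tenant": "t1", "access_tier": 1, "locale": "de"},
    ]
    print(filter_candidates(docs, {"tenant": "t1", "access_tier": 2}, "en", debug=True))
```

Recording a reason per dropped document keeps the debug output useful when a query unexpectedly returns nothing.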

Re-Ranking Strategy

Use a lightweight cross-encoder, such as a distilled model, on the top N (10-20). If latency exceeds the budget, degrade: skip the re-rank, or reduce the candidate count and raise the lexical weight. Track re_rank_delta = MRR_post - MRR_pre to justify the cost. Cache re-rank results for identical union sets within a short TTL.
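One way to compute re_rank_delta offline is sketched below. It assumes rankings are stored per query as ordered lists of document IDs and that a gold set of relevant IDs exists for each query; the function names and data shapes are illustrative, not part of any particular library.

```python
def mrr(rankings, relevant):
    """Mean reciprocal rank of the first relevant document per query.

    rankings: {query_id: [doc_id, ...]} in ranked order
    relevant: {query_id: set of relevant doc_ids}
    Queries with no relevant document in the ranking contribute 0.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for query_id, ranking in rankings.items():
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant.get(query_id, set()):
                total += 1.0 / rank
                break
    return total / len(rankings)


def re_rank_delta(pre, post, relevant):
    """MRR after re-ranking minus MRR before, over the same queries."""
    return mrr(post, relevant) - mrr(pre, relevant)


if __name__ == "__main__":
    gold = {"q1": {"b"}, "q2": {"x"}}
    pre = {"q1": ["a", "b", "c"], "q2": ["x", "y"]}
    post = {"q1": ["b", "a", "c"], "q2": ["x", "y"]}
    print(re_rank_delta(pre, post, gold))
```

A delta near zero on the evaluation set suggests the re-ranker is not earning its latency for that configuration.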

Freshness & Temporal Signals

Compute freshness_weight = exp(-lambda * age_days), tuning lambda per content type: higher for pricing, lower for stable API content. Combine the scores: final_score = w_sem * sem_score + w_lex * lex_score + w_fresh * freshness_weight + w_meta * meta_priors. Normalize each component first (z-score or min-max) so that no single component dominates.
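The formula can be sketched as below, using min-max normalization for the semantic and lexical scores. Several choices here are assumptions for illustration: candidates are dictionaries with `sem_score`, `lex_score`, `age_days`, and `meta_prior` fields, a missing score is treated as 0.0, and the freshness weight and metadata prior are taken to be in [0, 1] already, so they are not rescaled. The weights and lambda in the example are arbitrary.

```python
import math


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def freshness_weight(age_days, lam):
    """Exponential decay: 1.0 for brand-new content, approaching 0 with age."""
    return math.exp(-lam * age_days)


def fuse(candidates, weights, lam):
    """Return (doc_id, final_score) pairs sorted by descending fused score."""
    if not candidates:
        return []
    sem = min_max([c.get("sem_score", 0.0) for c in candidates])
    lex = min_max([c.get("lex_score", 0.0) for c in candidates])
    scored = []
    for cand, s, l in zip(candidates, sem, lex):
        final = (
            weights["sem"] * s
            + weights["lex"] * l
            + weights["fresh"] * freshness_weight(cand["age_days"], lam)
            + weights["meta"] * cand.get("meta_prior", 0.0)
        )
        scored.append((cand["id"], final))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    docs = [
        {"id": "old", "sem_score": 0.9, "lex_score": 12.0, "age_days": 400, "meta_prior": 0.5},
        {"id": "new", "sem_score": 0.8, "lex_score": 10.0, "age_days": 2, "meta_prior": 0.5},
        {"id": "weak", "sem_score": 0.5, "lex_score": 2.0, "age_days": 30, "meta_prior": 0.5},
    ]
    w = {"sem": 0.4, "lex": 0.3, "fresh": 0.2, "meta": 0.1}
    print(fuse(docs, w, lam=0.01))
```

In this example the slightly less similar but much fresher document can overtake the older one, which is the effect the freshness term is meant to produce.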

Failure Modes

Failure	Cause	Mitigation
Popularity Bias	Lexical tf-idf weighted too heavily	Cap the term-frequency contribution
Stale Results	Freshness weight mistuned	Recalibrate lambda against an evaluation set
Locale Leakage	Filter applied too late	Move security filters earlier
Semantic Drift	Embedding model upgrade	Dual-index and A/B compare before rollout
Over-fusion Noise	Unbounded union size	Cap the union, diversity pruning

Evaluation Framework

Experiments:

Ablation: (vector only, lexical only, hybrid w/o rerank, full), measuring Recall@k and MRR
Fusion Weight Tuning: grid search the weights against a validation gold set
Latency Budget: track mean and P95 retrieval latency per configuration
Drift: monitor the weekly relative change in recall for head vs. tail queries

Maintain an evaluation manifest with config hashes.
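A manifest entry might be produced along these lines. This is a sketch only: the entry fields, the 12-character truncated SHA-256 hash, and the example configuration values are assumptions chosen for illustration.

```python
import hashlib
import json


def config_hash(config):
    """Stable short hash of a config dict, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def recall_at_k(ranking, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & relevant) / len(relevant)


def manifest_entry(name, config, rankings, gold, k=10):
    """Summarize one evaluated configuration for the manifest."""
    recalls = [recall_at_k(rankings[q], gold[q], k) for q in gold]
    return {
        "name": name,
        "config_hash": config_hash(config),
        "config": config,
        f"recall@{k}": sum(recalls) / len(recalls),
    }


if __name__ == "__main__":
    cfg = {"w_sem": 0.4, "w_lex": 0.3, "w_fresh": 0.2, "w_meta": 0.1, "rerank": False}
    gold = {"q1": {"b", "d"}}
    rankings = {"q1": ["a", "b", "c", "d"]}
    print(json.dumps(manifest_entry("hybrid_no_rerank", cfg, rankings, gold, k=2), indent=2))
```

Hashing the canonical JSON form means two runs with the same settings get the same hash even if the configuration keys were written in a different order.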

Optimization Loop

Cycle:

Log retrieval traces (query, candidates, scores, source_tag)
Find mis-hits (low downstream faithfulness or low citation count) → classify the root cause (missing lexical candidate, semantic false positive, stale content)
Adjust weights / thresholds; run the offline suite
Canary the new fusion weights behind a feature flag
Promote when the improvement is statistically significant
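The article does not say which significance test to use for the promotion decision. One common option is a paired sign-flip permutation test over per-query metric values, sketched here as an assumption rather than a prescribed method. The function name, the resample count, and the fixed seed are illustrative.

```python
import random


def paired_permutation_pvalue(baseline, candidate, n_resamples=2000, seed=0):
    """One-sided p-value that candidate beats baseline on paired per-query scores.

    baseline, candidate: per-query metric values (e.g. reciprocal rank),
    in the same query order.
    """
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if sum(flipped) / len(flipped) >= observed:
            hits += 1
    # Add-one smoothing so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    old_weights = [0.5, 0.33, 1.0, 0.25, 0.5, 0.2, 1.0, 0.5]
    new_weights = [1.0, 0.5, 1.0, 0.5, 1.0, 0.25, 1.0, 0.5]
    print(paired_permutation_pvalue(old_weights, new_weights))
```

The threshold for promotion (for example p < 0.05) is a team decision and should be fixed before looking at the canary results.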

Key Takeaways

Hybrid retrieval is a system of tunable knobs; instrument it heavily
Apply security and access filters early to keep leakage out of scoring
Re-ranking must justify its latency with a measurable MRR / Recall lift
Temporal decay keeps old but high-authority pages from dominating the results
Treat fusion changes like code: version, evaluate, roll forward or back