Hybrid Retrieval: Vector + Keyword + Metadata

Single-modality retrieval fails on edge cases: dense vectors miss rare tokens and IDs, while pure lexical search misses paraphrases and semantic similarity. Hybrid retrieval fuses complementary signals (dense semantic, sparse lexical, structured metadata, temporal freshness) to produce stable, high-precision candidate sets. This article covers architecture, normalization, scoring fusion, failure handling, and evaluation.

Motivation

Failure scenarios:

The dense model misses proper nouns and SKU codes.
Pricing-change queries pull a stale snapshot because there is no temporal boost.
Long natural-language questions have their stopwords overweighted in a sparse-only system.
Semantically broad pages, such as marketing fluff, produce vector false positives because nothing anchors them lexically.

Hybrid retrieval mitigates these failures by capturing mutually independent dimensions of evidence.

Component Layering

Recommended flow:

Query Embedding -> ANN search (k_vec)
Lexical Search (BM25 / SPLADE / Elasticsearch) (k_lex)
Union -> Score Normalization (per source scaling)
Metadata Filter Pass (locale, access_tier, page_type)
Diversity & Freshness Adjustments
Optional Cross/Mono Re-Ranker
Final Truncation (top K)

Keep the raw pre-fusion scores for auditing.
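
The union and per-source scaling steps of the flow can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Candidate` class, the `min_max` and `union_candidates` helpers, and the `{doc_id: score}` input shape are all assumptions made for the example. Min-max is used here only because it is the simplest per-source scaling; z-score would slot in the same way.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    doc_id: str
    raw: dict = field(default_factory=dict)   # raw pre-fusion scores, kept for audit
    norm: dict = field(default_factory=dict)  # per-source normalized scores


def min_max(scores):
    """Scale a {doc_id: score} map to [0, 1]; a constant map becomes all 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def union_candidates(vec_hits, lex_hits):
    """Merge ANN and lexical hits into one candidate set keyed by doc_id."""
    merged = {}
    for source, hits in (("vec", vec_hits), ("lex", lex_hits)):
        normalized = min_max(hits)  # each source is scaled independently
        for doc_id, score in hits.items():
            cand = merged.setdefault(doc_id, Candidate(doc_id))
            cand.raw[source] = score
            cand.norm[source] = normalized[doc_id]
    return merged


# Hypothetical scores: cosine similarity for "vec", BM25 for "lex".
merged = union_candidates({"a": 0.9, "b": 0.5}, {"b": 12.0, "c": 4.0})
```

A document found by only one source simply has no entry for the other source, which lets the fusion step decide how to treat the missing signal.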

Query Normalization

Steps:

Apply Unicode NFKC normalization.
Lowercase, but keep a snapshot of the original casing for answer formatting where needed.
Tokenize and preserve stopwords, because semantic embeddings can make use of the context.
Synonym / alias expansion: append alternative tokens, for example to map internal product codenames. These are used for sparse retrieval only and are not inserted into the model prompt.
Numeric and version extraction: capture X.Y.Z patterns for targeted lexical scoring.
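
The steps above could look roughly like this in Python. The `ALIASES` table, the `normalize_query` function, the tokenizer regex, and the returned field names are hypothetical; only `unicodedata.normalize` and `re` are standard library calls.

```python
import re
import unicodedata

# Hypothetical alias table; a real one would come from product metadata.
ALIASES = {"bluebird": ["billing-v2"]}

VERSION_RE = re.compile(r"\b\d+\.\d+\.\d+\b")     # X.Y.Z patterns
TOKEN_RE = re.compile(r"\w+(?:[.\-]\w+)*")        # keeps "2.3.1" and "billing-v2" whole


def normalize_query(query, aliases=ALIASES):
    original = unicodedata.normalize("NFKC", query)  # casing snapshot for formatting
    lowered = original.lower()
    tokens = TOKEN_RE.findall(lowered)               # stopwords are kept
    expansions = [alt for tok in tokens for alt in aliases.get(tok, [])]
    return {
        "original": original,
        "dense_text": lowered,                  # goes to the embedding model, no aliases
        "sparse_tokens": tokens + expansions,   # aliases only reach sparse retrieval
        "versions": VERSION_RE.findall(lowered),
    }


# Full-width digits in the input are folded to ASCII by NFKC.
q = normalize_query("What changed in Bluebird \uff12.\uff13.\uff11?")
```

Keeping `dense_text` and `sparse_tokens` separate is what keeps alias expansions out of anything that later reaches the model prompt.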

Metadata Filtering

Applying filters after the initial candidate union reduces recall loss. Common fields: locale, access_tier, page_type, product_area, updated_bucket. Security filters (tenant / tier) must be enforced before scoring fusion, so that leaked candidates cannot influence re-ranking. Provide a debug mode that returns the filtered_out set for inspection.
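
A sketch of such a filter pass, with the security checks evaluated before any other field. The `apply_filters` function, the candidate dictionary shape, and the convention that `access_tier` is an integer where a higher value means more restricted content are assumptions for illustration.

```python
def apply_filters(candidates, user, wanted, debug=False):
    """Return (kept, filtered_out); filtered_out is only populated in debug mode.

    candidates: list of dicts with "doc_id", "tenant", "access_tier" and metadata.
    user:       dict with "tenant" and "access_tier" (assumed: higher = more access).
    wanted:     metadata constraints, e.g. {"locale": "en"}.
    """
    kept, filtered_out = [], []
    for cand in candidates:
        # Security checks run first so unauthorized docs never reach fusion.
        if cand["tenant"] != user["tenant"]:
            reason = "tenant"
        elif cand["access_tier"] > user["access_tier"]:
            reason = "access_tier"
        else:
            # First non-matching metadata field, or None if all match.
            reason = next(
                (f for f, v in wanted.items() if cand.get(f) != v), None
            )
        if reason is None:
            kept.append(cand)
        elif debug:
            filtered_out.append((cand["doc_id"], reason))
    return kept, filtered_out


docs = [
    {"doc_id": "a", "tenant": "t1", "access_tier": 0, "locale": "en"},
    {"doc_id": "b", "tenant": "t2", "access_tier": 0, "locale": "en"},
    {"doc_id": "c", "tenant": "t1", "access_tier": 2, "locale": "en"},
    {"doc_id": "d", "tenant": "t1", "access_tier": 1, "locale": "de"},
]
kept, dropped = apply_filters(
    docs, {"tenant": "t1", "access_tier": 1}, {"locale": "en"}, debug=True
)
```

Recording the reason next to each dropped `doc_id` makes it easy to tell a security rejection from an ordinary metadata mismatch during inspection.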

Re-Ranking Strategy

Run a lightweight cross-encoder, such as a distilled model, over the top N (10-20) candidates. If latency exceeds the budget, degrade: skip the re-rank, or reduce the candidate count while raising the lexical weight. Track re_rank_delta = MRR_post - MRR_pre to justify the cost. Cache re-rank results for identical union sets within a short TTL.
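
The degrade policy and the short-TTL cache might be sketched as below. Everything here is an assumption made for the example: the `RerankCache` class, the `plan_rerank` function, the 0.1 lexical boost, and the premise that cross-encoder latency grows roughly linearly with the number of candidates.

```python
import hashlib
import time


class RerankCache:
    """Tiny TTL cache keyed by the query plus the exact union set."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    @staticmethod
    def key(query, doc_ids):
        # Sorting makes the key independent of candidate order.
        payload = query + "\x00" + "\x00".join(sorted(doc_ids))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or self.clock() - entry[0] > self.ttl:
            return None
        return entry[1]

    def put(self, key, ranked_ids):
        self._store[key] = (self.clock(), ranked_ids)


def plan_rerank(est_latency_ms, budget_ms, top_n=20, min_n=10):
    """Choose between full re-rank, a smaller candidate set, or skipping."""
    if est_latency_ms <= budget_ms:
        return {"rerank": True, "top_n": top_n, "lex_boost": 0.0}
    # Assumed: latency scales roughly linearly with candidate count.
    affordable = int(top_n * budget_ms / est_latency_ms)
    if affordable >= min_n:
        # Fewer candidates, compensated with a (placeholder) lexical boost.
        return {"rerank": True, "top_n": affordable, "lex_boost": 0.1}
    return {"rerank": False, "top_n": 0, "lex_boost": 0.0}
```

The injectable `clock` exists only so that expiry can be tested without sleeping.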

Freshness & Temporal Signals

Compute freshness_weight = exp(-lambda * age_days), with lambda tuned per content type: higher for pricing, lower for stable API pages. Combine: final_score = w_sem * sem_score + w_lex * lex_score + w_fresh * freshness_weight + w_meta * meta_priors. Normalize each component first, with z-score or min-max, so that no single component dominates.
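
The two formulas translate directly into code. The decay rates in `LAMBDA_BY_TYPE`, the default rate, and the values in `WEIGHTS` are placeholders chosen for the example, not recommendations; in practice they would come from tuning on an evaluation set.

```python
import math

# Placeholder decay rates per content type (per day); tune on real data.
LAMBDA_BY_TYPE = {"pricing": 0.05, "api_reference": 0.002}
DEFAULT_LAMBDA = 0.01

# Placeholder fusion weights; they sum to 1 only for readability.
WEIGHTS = {"sem": 0.5, "lex": 0.3, "fresh": 0.15, "meta": 0.05}


def freshness_weight(age_days, content_type):
    """exp(-lambda * age_days); equals 1.0 for brand-new content."""
    lam = LAMBDA_BY_TYPE.get(content_type, DEFAULT_LAMBDA)
    return math.exp(-lam * age_days)


def final_score(sem, lex, age_days, content_type, meta_prior=0.0, weights=WEIGHTS):
    """Weighted sum of components; sem, lex and meta_prior must already be normalized."""
    return (
        weights["sem"] * sem
        + weights["lex"] * lex
        + weights["fresh"] * freshness_weight(age_days, content_type)
        + weights["meta"] * meta_prior
    )


# Same relevance signals, different age: the fresher pricing page scores higher.
fresh_page = final_score(0.8, 0.6, age_days=2, content_type="pricing")
stale_page = final_score(0.8, 0.6, age_days=200, content_type="pricing")
```

Because freshness_weight already lies in (0, 1], it needs no further scaling when the other components are min-max normalized.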

Failure Modes

Failure	Cause	Mitigation
Popularity Bias	Overweight lexical tf-idf	Cap term frequency contribution
Stale Results	Freshness weight mis-tuned	Recalibrate lambda using evaluation set
Locale Leakage	Late filter application	Move security filters earlier
Semantic Drift	Embedding model upgrade	Dual-index and A/B compare before rollout
Over-fusion Noise	Unbounded union size	Limit union, diversity pruning
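
As an illustration of the last mitigation, the union can be capped and pruned for diversity in a single pass. The `prune_union` function, its default limits, and the choice of `page_type` as the grouping field are assumptions; any field that identifies near-duplicate sources could serve as the group key.

```python
from collections import Counter


def prune_union(candidates, max_size=50, max_per_group=3, group_key="page_type"):
    """Keep the highest-scoring candidates, at most max_per_group per group."""
    kept, seen = [], Counter()
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        group = cand.get(group_key)
        if seen[group] >= max_per_group:
            continue  # this group already has its share of slots
        kept.append(cand)
        seen[group] += 1
        if len(kept) == max_size:
            break
    return kept


union = [
    {"doc_id": "m1", "score": 0.95, "page_type": "marketing"},
    {"doc_id": "m2", "score": 0.90, "page_type": "marketing"},
    {"doc_id": "m3", "score": 0.85, "page_type": "marketing"},
    {"doc_id": "d1", "score": 0.80, "page_type": "docs"},
    {"doc_id": "p1", "score": 0.70, "page_type": "pricing"},
]
pruned = prune_union(union, max_size=3, max_per_group=2)
```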

Evaluation Framework

Experiments:

Ablation: measure Recall@k and MRR for vector only, lexical only, hybrid without re-rank, and full.
Fusion weight tuning: grid search the weights on a validation gold set.
Latency budget: track mean and P95 retrieval latency for each configuration.
Drift: monitor the relative change in recall for head vs tail queries weekly.

Keep an evaluation manifest that records config hashes.
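
A minimal sketch of the two metrics and a manifest entry keyed by a config hash. The function names, the `{query: ranked ids}` and `{query: relevant ids}` input shapes, and the 12-character truncated SHA-256 are choices made for this example only.

```python
import hashlib
import json


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, gold, k=10):
    """runs: {query: ranked doc ids}; gold: {query: set of relevant doc ids}."""
    if not gold:
        raise ValueError("gold set is empty")
    n = len(gold)
    recall = sum(recall_at_k(runs.get(q, []), rel, k) for q, rel in gold.items()) / n
    mrr = sum(reciprocal_rank(runs.get(q, []), rel) for q, rel in gold.items()) / n
    return {"recall_at_k": recall, "mrr": mrr}


def config_hash(config):
    """Stable short hash: identical settings give the same hash regardless of key order."""
    blob = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:12]


def manifest_entry(name, config, runs, gold, k=10):
    return {"name": name, "config_hash": config_hash(config), "k": k,
            **evaluate(runs, gold, k)}


gold = {"q1": {"a"}, "q2": {"b", "c"}}
runs = {"q1": ["x", "a"], "q2": ["b", "y", "z"]}
entry = manifest_entry("hybrid_no_rerank", {"w_sem": 0.5, "w_lex": 0.3}, runs, gold, k=2)
```

Running the same function once per ablation arm yields comparable manifest rows, and the difference in `mrr` between a re-ranked and a non-re-ranked arm is the re_rank_delta described earlier.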

Optimization Loop

Cycle:

Log retrieval traces: query, candidates, scores, source_tag.
Find mis-hits (low downstream faithfulness or low citation count), then classify the root cause, for example missing lexical candidate, semantic false positive, or stale content.
Adjust weights / thresholds; run the offline suite.
Canary the new fusion weights behind a feature flag.
Promote if the improvement is statistically significant.
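
The text does not say which significance test to use for the promotion step. One simple option, shown here purely as an illustration, is a paired sign-flip permutation test on per-query metric values (for example reciprocal rank) from the baseline and the canary. The function names, the 2000 trials, and the 0.05 threshold are assumptions.

```python
import random


def paired_permutation_p(baseline, canary, trials=2000, seed=0):
    """Two-sided p-value for the mean per-query difference (canary - baseline)."""
    if len(baseline) != len(canary) or not baseline:
        raise ValueError("need two equal-length, non-empty metric lists")
    deltas = [c - b for b, c in zip(baseline, canary)]
    n = len(deltas)
    observed = abs(sum(deltas) / n)
    rng = random.Random(seed)  # fixed seed keeps the decision reproducible
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis each difference is equally likely to flip sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in deltas)
        if abs(flipped / n) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)


def should_promote(baseline, canary, alpha=0.05):
    """Promote only if the canary is better and the gain is unlikely to be noise."""
    improved = sum(canary) > sum(baseline)
    return improved and paired_permutation_p(baseline, canary) < alpha
```

Both lists must cover the same queries in the same order, since the test relies on the pairing.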

Key Takeaways

Hybrid retrieval is a set of tunable dials that needs continuous instrumentation.
Apply security and access filters early; keep leakage out of scoring.
Re-ranking must justify its latency with a measurable MRR / Recall lift.
Temporal decay keeps outdated, high-authority pages from dominating results.
Treat fusion changes like code: version, evaluate, roll forward or back.