Single-modality retrieval fails on edge cases: dense vectors miss rare tokens and IDs, while pure lexical search misses paraphrases and semantic similarity. Hybrid retrieval fuses complementary signals (dense semantic, sparse lexical, structured metadata, and temporal freshness) to produce stable, high-precision candidate sets. This article covers architecture, normalization, score fusion, failure handling, and evaluation.
Motivation
Failure scenarios:
- Dense models miss proper nouns and SKU codes.
- Pricing-change queries pull a stale snapshot because there is no temporal boost.
- In a sparse-only system, long natural-language questions let stopwords carry too much weight.
- Semantically broad pages, such as marketing fluff, produce vector false positives when there is no lexical anchoring.
Hybrid retrieval mitigates these failures by capturing independent dimensions of evidence.
Component Layering
Recommended flow:
- Query Embedding -> ANN search (k_vec)
- Lexical Search (BM25 / SPLADE / Elasticsearch) (k_lex)
- Union -> Score Normalization (per source scaling)
- Metadata Filter Pass (locale, access_tier, page_type)
- Diversity & Freshness Adjustments
- Optional Cross/Mono Re-Ranker
- Final Truncation (top K)
Keep the raw pre-fusion scores for auditing.
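The union step and the audit requirement can be sketched together. This is a minimal illustration, not a reference implementation: the `Candidate` class, the `union_candidates` function, and the `"vec"` / `"lex"` source tags are hypothetical names, and each retriever is assumed to return `(doc_id, score)` pairs.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    doc_id: str
    # source_tag -> raw score as returned by that retriever (kept for audit)
    raw_scores: dict = field(default_factory=dict)


def union_candidates(vec_hits, lex_hits):
    """Merge ANN and lexical hits by doc_id, keeping each source's raw score.

    vec_hits / lex_hits are assumed to be lists of (doc_id, score) pairs.
    """
    merged = {}
    for source_tag, hits in (("vec", vec_hits), ("lex", lex_hits)):
        for doc_id, score in hits:
            cand = merged.setdefault(doc_id, Candidate(doc_id))
            cand.raw_scores[source_tag] = score
    return list(merged.values())
```

Because the raw scores are stored per source, later normalization and fusion steps can write their outputs to separate fields without overwriting what each retriever originally returned.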
Query Normalization
Steps:
- Apply Unicode NFKC normalization
- Lowercase, keeping a snapshot of the original casing for answer formatting if needed
- Tokenize and preserve stopwords, since semantic embeddings can use the context
- Synonym / Alias Expansion: append alternative tokens for internal product codename mapping; use them only for sparse retrieval and do not insert them into the model prompt.
- Numeric & Version Extraction: capture X.Y.Z patterns for targeted lexical scoring.
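The steps above could look roughly like the following. It is a sketch under assumptions: the `ALIASES` table and its entry are invented for illustration, the tokenizer is a plain whitespace split, and `normalize_query` and the keys of its return value are hypothetical names.

```python
import re
import unicodedata

# Hypothetical alias table; a real one would come from an internal catalog.
ALIASES = {"bluebird": ["sku-4471"]}

# Matches X.Y.Z version patterns such as 2.1.0.
VERSION_RE = re.compile(r"\b\d+\.\d+\.\d+\b")


def normalize_query(raw: str) -> dict:
    nfkc = unicodedata.normalize("NFKC", raw)
    lowered = nfkc.lower()
    tokens = lowered.split()  # naive whitespace tokenizer; stopwords are kept
    expansions = [alias for tok in tokens for alias in ALIASES.get(tok, [])]
    return {
        "original": nfkc,                      # casing snapshot for formatting
        "dense_text": lowered,                 # embedding input, no aliases
        "sparse_tokens": tokens + expansions,  # aliases go to sparse side only
        "versions": VERSION_RE.findall(lowered),
    }
```

Keeping `dense_text` and `sparse_tokens` as separate fields is one way to make sure alias expansions reach only the lexical retriever.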
Metadata & Attribute Filters
Applying filters after the initial candidate union can reduce recall loss. Common fields: locale, access_tier, page_type, product_area, updated_bucket. Security filters (tenant / tier) must be enforced before score fusion, so that documents the user cannot access never influence re-ranking. Provide a debug mode that returns the filtered_out set for inspection.
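A filter pass with a debug mode might be sketched as follows. The function name `apply_filters`, the `user_tiers` parameter, and the dict-shaped candidates are assumptions made for the example; only the field names `access_tier`, `locale`, and `filtered_out` come from the text.

```python
def apply_filters(candidates, user_tiers, locale, debug=False):
    """Drop candidates the user may not see or that are in another locale.

    candidates: list of dicts with 'doc_id', 'access_tier', 'locale'.
    The security check runs first and must happen before score fusion.
    Returns (kept, filtered_out); filtered_out is empty unless debug=True.
    """
    kept, filtered_out = [], []
    for cand in candidates:
        if cand["access_tier"] not in user_tiers:
            reason = "access_tier"
        elif cand["locale"] != locale:
            reason = "locale"
        else:
            kept.append(cand)
            continue
        filtered_out.append({"doc_id": cand["doc_id"], "reason": reason})
    return kept, (filtered_out if debug else [])
```

Recording a reason alongside each dropped `doc_id` makes it easier to tell a misconfigured locale filter from a legitimate access denial when inspecting the debug output.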
Re-Ranking Strategy
Run a lightweight cross-encoder, such as a distilled model, over the top N (10-20) candidates. If latency exceeds the budget, degrade: skip the re-rank, or reduce the candidate count while raising the lexical weight. Track re_rank_delta = MRR_post - MRR_pre to justify the cost. Cache re-rank results for identical union sets within a short TTL.
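The re_rank_delta metric can be computed offline from two sets of rankings. The sketch below assumes rankings are stored as a mapping from a query ID to an ordered list of document IDs, and relevance judgments as a mapping from a query ID to a set of relevant IDs; the function names are illustrative.

```python
def mrr(rankings, relevant):
    """Mean reciprocal rank of the first relevant document per query.

    rankings: {query_id: [doc_id, ...]} in ranked order.
    relevant: {query_id: set of relevant doc_ids}.
    Queries with no relevant document in the list contribute 0.
    """
    total = 0.0
    for qid, docs in rankings.items():
        for rank, doc_id in enumerate(docs, start=1):
            if doc_id in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(rankings)


def re_rank_delta(pre, post, relevant):
    """MRR_post - MRR_pre over the same query set."""
    return mrr(post, relevant) - mrr(pre, relevant)
```

A delta near zero on a representative query set suggests the re-ranker is not earning its latency cost for that configuration.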
Freshness & Temporal Signals
Compute freshness_weight = exp(-lambda * age_days), tuning lambda per content type, for example higher for pricing and lower for stable API pages. Combine: final_score = w_sem * sem_score + w_lex * lex_score + w_fresh * freshness_weight + w_meta * meta_priors. Normalize each component first, with z-score or min-max, so that no single component dominates.
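A minimal sketch of the decay and the weighted sum, using min-max normalization for the semantic and lexical scores. It assumes candidates are dicts with `sem_score`, `lex_score`, `age_days`, and `meta_prior` keys, that `meta_prior` is already in [0, 1], and that the freshness weight needs no further scaling because the exponential already lies in (0, 1]. The function names and the `weights` dict layout are illustrative.

```python
import math


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def freshness_weight(age_days, lam):
    return math.exp(-lam * age_days)


def fuse(cands, weights, lam):
    """Return (doc_id, final_score) pairs sorted by descending score."""
    sem = min_max([c["sem_score"] for c in cands])
    lex = min_max([c["lex_score"] for c in cands])
    scored = []
    for c, s, l in zip(cands, sem, lex):
        final = (
            weights["sem"] * s
            + weights["lex"] * l
            + weights["fresh"] * freshness_weight(c["age_days"], lam)
            + weights["meta"] * c["meta_prior"]
        )
        scored.append((c["doc_id"], final))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Without the normalization step, a raw BM25 score of 10.0 would swamp a cosine similarity of 0.9 regardless of the weights chosen.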
Failure Modes
| Failure | Cause | Mitigation |
|---|---|---|
| Popularity Bias | Overweight lexical tf-idf | Cap term frequency contribution |
| Stale Results | Freshness weight mis-tuned | Recalibrate lambda using evaluation set |
| Locale Leakage | Late filter application | Move security filters earlier |
| Semantic Drift | Embedding model upgrade | Dual-index and A/B compare before rollout |
| Over-fusion Noise | Unbounded union size | Limit union, diversity pruning |
Evaluation Framework
Experiments:
- Ablation: measure Recall@k and MRR for vector only, lexical only, hybrid without re-rank, and full.
- Fusion Weight Tuning: grid search the weights on a gold validation set.
- Latency Budget: track mean and P95 retrieval latency for each configuration.
- Drift: monitor the relative change in recall for head vs tail queries weekly.
Keep an evaluation manifest with config hashes.
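One way to tie metrics to the exact configuration that produced them is to hash a canonical serialization of the config. The sketch below is an assumption about how such a manifest entry might look: the 12-character truncated SHA-256, the entry layout, and the function names are illustrative choices.

```python
import hashlib
import json


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant doc_ids that appear in the top k of ranked."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def config_hash(config):
    """Stable short hash of a JSON-serializable config dict."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def manifest_entry(name, config, metrics):
    """One manifest row: which config produced which metrics."""
    return {"name": name, "config_hash": config_hash(config), "metrics": metrics}
```

Sorting keys before hashing means two configs that differ only in key order produce the same hash, so reruns of the same ablation can be matched across manifests.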
Optimization Loop
Cycle:
- Log retrieval traces: query, candidates, scores, source_tag.
- Identify mis-hits (low downstream faithfulness or low citation count), then classify the root cause, for example missing lexical candidate, semantic false positive, or stale content.
- Adjust weights / thresholds; run the offline suite.
- Canary the new fusion weights behind a feature flag.
- Promote if the improvement is statistically significant.
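The text does not say which significance test to use for the promotion decision. One common option for per-query metrics is a paired sign-flip permutation test, sketched here as an assumption rather than the prescribed method. The inputs are per-query scores (for example reciprocal ranks) for the baseline and the canary on the same queries, in the same order.

```python
import random


def paired_permutation_pvalue(baseline, candidate, n_iter=10000, seed=0):
    """One-sided p-value for mean(candidate - baseline) > 0.

    Randomly flips the sign of each per-query difference and counts how
    often the permuted mean is at least as large as the observed mean.
    """
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    hits = 0
    for _ in range(n_iter):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if permuted / len(diffs) >= observed:
            hits += 1
    # Add-one smoothing so the estimate is never exactly zero.
    return (hits + 1) / (n_iter + 1)
```

A promotion rule could then compare the returned value against a threshold agreed in advance, such as 0.05, alongside the latency checks from the evaluation suite.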
Key Takeaways
- Hybrid retrieval is a set of tunable dials that needs continuous instrumentation.
- Apply security and access filters early; keep leakage out of scoring.
- Re-ranking must justify its latency with a measurable MRR / Recall lift.
- Temporal decay keeps outdated, high-authority pages from dominating results.
- Treat fusion changes like code: version, evaluate, roll forward or back.