
Hybrid Retrieval: Vector + Keyword + Metadata

Engineering resilient hybrid retrieval: blending vector, lexical, metadata, and temporal signals for website RAG.

retrieval • hybrid • search • rag

Single-modality retrieval fails on edge cases: dense vectors miss rare tokens and IDs, while pure lexical search misses paraphrases and semantic similarity. Hybrid retrieval fuses complementary signals (dense semantic, sparse lexical, structured metadata, temporal freshness) to produce stable, high-precision candidate sets. This article covers architecture, normalization, scoring fusion, failure handling, and evaluation.

Motivation

Failure scenarios:

  • Dense models miss proper nouns and SKU codes.
  • Pricing-change queries pull a stale snapshot because there is no temporal boost.
  • In a sparse-only system, long natural-language questions are overweighted by stopwords.
  • Semantically broad pages, such as marketing fluff, produce vector false positives because nothing anchors them lexically.

Hybrid retrieval mitigates these failures by capturing independent dimensions of evidence.

Component Layering

Recommended flow:

  1. Query Embedding -> ANN search (k_vec)
  2. Lexical Search (BM25 / SPLADE / Elasticsearch) (k_lex)
  3. Union -> Score Normalization (per source scaling)
  4. Metadata Filter Pass (locale, access_tier, page_type)
  5. Diversity & Freshness Adjustments
  6. Optional Cross/Mono Re-Ranker
  7. Final Truncation (top K)

Retain the raw pre-fusion scores for auditing.
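Steps 1 to 3 of the flow can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `Candidate` class, the `min_max` helper, and the choice of min-max scaling are assumptions made for the example, and the ANN and lexical hits are taken to arrive as plain `{doc_id: score}` mappings.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Hypothetical container for one document in the fused candidate set."""
    doc_id: str
    raw: dict = field(default_factory=dict)   # raw pre-fusion scores, kept for audit
    norm: dict = field(default_factory=dict)  # per-source scores scaled to [0, 1]


def min_max(scores):
    """Scale a {doc_id: score} mapping to [0, 1]; constant input maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def union_and_normalize(vec_hits, lex_hits):
    """Union the two result sets and normalize each source independently."""
    sources = {"vec": vec_hits, "lex": lex_hits}
    merged = {}
    for name, hits in sources.items():
        normalized = min_max(hits)
        for doc_id, score in hits.items():
            cand = merged.setdefault(doc_id, Candidate(doc_id))
            cand.raw[name] = score
            cand.norm[name] = normalized[doc_id]
    # A document missing from one source contributes 0.0 for that source.
    for cand in merged.values():
        for name in sources:
            cand.norm.setdefault(name, 0.0)
    return merged
```

Keeping `raw` next to `norm` means an audit can later reconstruct why a document ranked where it did, even after the fusion weights change.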

Query Normalization

Steps:

  • Apply Unicode NFKC normalization.
  • Lowercase the query, but keep a snapshot of the original casing if answer formatting needs it.
  • Tokenize and preserve stopwords, because semantic embeddings can make use of the context they carry.
  • Synonym / Alias Expansion: append alternative tokens to map internal product codenames. Use these only for sparse retrieval; do not insert them into the model prompt.
  • Numeric & Version Extraction: capture X.Y.Z patterns for targeted lexical scoring.
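The steps above might look like this in practice. It is a sketch under stated assumptions: the `ALIASES` table and its single entry are invented for illustration, the tokenizer is a simple regex rather than a production analyzer, and the returned field names are hypothetical.

```python
import re
import unicodedata

# Hypothetical alias table: internal codename -> public product names.
ALIASES = {"bluebird": ["acme sync"]}

# Matches X.Y.Z version strings such as 2.4.1.
VERSION_RE = re.compile(r"\b\d+\.\d+\.\d+\b")
# Word tokens, keeping dotted tokens like 2.4.1 in one piece.
TOKEN_RE = re.compile(r"\w+(?:\.\w+)*")


def normalize_query(raw):
    """Return the normalized views of a query used by each retriever."""
    original = unicodedata.normalize("NFKC", raw).strip()
    lowered = original.lower()
    tokens = TOKEN_RE.findall(lowered)  # stopwords are deliberately kept
    expansions = [alt for tok in tokens for alt in ALIASES.get(tok, [])]
    return {
        "original": original,                  # casing snapshot for answer formatting
        "dense_text": lowered,                 # fed to the embedding model, no aliases
        "sparse_terms": tokens + expansions,   # aliases reach sparse retrieval only
        "versions": VERSION_RE.findall(lowered),
    }
```

Keeping `dense_text` and `sparse_terms` separate is what stops alias expansions from reaching the embedding model or the prompt.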

Metadata & Attribute Filters

Applying filters after the initial candidate union reduces recall loss. Common fields: locale, access_tier, page_type, product_area, updated_bucket. Security filters (tenant / tier) must be enforced before scoring fusion so that leakage cannot influence re-ranking. Provide a debug mode that returns the filtered_out set for inspection.
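One way to keep the two kinds of filter apart is sketched below. The candidate shape (dicts carrying `tenant`, a numeric `access_tier`, and other metadata fields) and both function names are assumptions for the example; a real system would likely push the security check into the index query itself.

```python
def enforce_security(candidates, user):
    """Hard filter, run before fusion: drop anything the caller may not see."""
    return [
        c for c in candidates
        if c["tenant"] == user["tenant"] and c["access_tier"] <= user["access_tier"]
    ]


def apply_metadata_filters(candidates, required):
    """Keep candidates whose fields match every key in `required`.

    Returns (kept, filtered_out); a debug mode can expose filtered_out,
    which records the fields each dropped document failed on.
    """
    kept, filtered_out = [], []
    for c in candidates:
        failed = [key for key, value in required.items() if c.get(key) != value]
        if failed:
            filtered_out.append({"doc_id": c["doc_id"], "failed": failed})
        else:
            kept.append(c)
    return kept, filtered_out
```

Security drops are intentionally left out of `filtered_out` here, so the debug output cannot itself reveal documents the caller is not entitled to see.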

Re-Ranking Strategy

Run a lightweight cross-encoder, such as a distilled model, over the top N (10-20) candidates. If latency exceeds the budget, degrade: either skip re-ranking, or reduce the candidate count while raising the lexical weight. Track re_rank_delta = MRR_post - MRR_pre to justify the cost. Cache re-rank results for identical union sets within a short TTL.
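The re_rank_delta metric can be computed offline from rankings captured before and after the re-ranker. The sketch below assumes rankings are stored as `{query: [doc_id, ...]}` and relevance judgments as `{query: set_of_doc_ids}`; both shapes are illustrative choices.

```python
def mrr(rankings, relevant):
    """Mean reciprocal rank of the first relevant document per query."""
    if not rankings:
        return 0.0
    total = 0.0
    for query, ranked in rankings.items():
        gold = relevant.get(query, set())
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


def re_rank_delta(pre, post, relevant):
    """MRR_post - MRR_pre; positive values mean the re-ranker helped."""
    return mrr(post, relevant) - mrr(pre, relevant)
```

A delta near zero on the evaluation set suggests the re-ranker is adding latency without improving ordering.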

Freshness & Temporal Signals

freshness_weight = exp(-lambda * age_days), with lambda tuned per content type: higher for pricing, lower for stable API content. Combine: final_score = w_sem * sem_score + w_lex * lex_score + w_fresh * freshness_weight + w_meta * meta_priors. Normalize each component first (z-score or min-max) so that no single component dominates.
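The two formulas translate directly into code. In this sketch the lambda values in `LAMBDA_BY_TYPE` and the example weights are placeholders picked for illustration, not recommended settings, and the inputs to `final_score` are assumed to be already normalized.

```python
import math

# Hypothetical decay rates per content type (per day); tune against an evaluation set.
LAMBDA_BY_TYPE = {"pricing": 0.05, "api_reference": 0.005}


def freshness_weight(age_days, lam):
    """Exponential decay: 1.0 for brand-new content, approaching 0 with age."""
    return math.exp(-lam * age_days)


def final_score(sem_score, lex_score, fresh, meta_prior, weights):
    """Weighted linear fusion of already-normalized component scores."""
    return (
        weights["sem"] * sem_score
        + weights["lex"] * lex_score
        + weights["fresh"] * fresh
        + weights["meta"] * meta_prior
    )
```

With exponential decay the half-life is ln(2) / lambda, so the placeholder pricing rate of 0.05 halves a page's freshness weight roughly every 14 days.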

Failure Modes

Failure | Cause | Mitigation
Popularity Bias | Overweight lexical tf-idf | Cap term frequency contribution
Stale Results | Freshness weight mis-tuned | Recalibrate lambda using evaluation set
Locale Leakage | Late filter application | Move security filters earlier
Semantic Drift | Embedding model upgrade | Dual-index and A/B compare before rollout
Over-fusion Noise | Unbounded union size | Limit union, diversity pruning

Evaluation Framework

Experiments:

  • Ablation: compare vector only, lexical only, hybrid without re-rank, and full; measure Recall@k and MRR.
  • Fusion Weight Tuning: grid-search the weights against a validation gold set.
  • Latency Budget: track mean and P95 retrieval latency for each configuration.
  • Drift: monitor weekly the relative change in recall for head vs tail queries.

Keep an evaluation manifest that records config hashes.
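A small harness for the weight-tuning experiment and the manifest might look like the following. It is a sketch: the gold-set shape (a list of `(candidates, relevant_ids)` pairs), the two-weight grid, and the 12-character SHA-256 prefix used as a config hash are all assumptions for the example.

```python
import hashlib
import itertools
import json


def recall_at_k(ranked_ids, relevant, k):
    """Fraction of relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def config_hash(config):
    """Stable short hash of a config dict, independent of key order."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def grid_search(gold, grid, k=5):
    """Try every (w_sem, w_lex) pair; return the best entry and the full manifest."""
    if not gold:
        raise ValueError("gold set is empty")
    manifest = []
    for w_sem, w_lex in itertools.product(grid, grid):
        weights = {"sem": w_sem, "lex": w_lex}
        recalls = []
        for candidates, relevant in gold:
            ranked = sorted(
                candidates,
                key=lambda c: w_sem * c["sem"] + w_lex * c["lex"],
                reverse=True,
            )
            recalls.append(recall_at_k([c["doc_id"] for c in ranked], relevant, k))
        manifest.append({
            "config_hash": config_hash(weights),
            "weights": weights,
            "recall_at_k": sum(recalls) / len(recalls),
        })
    best = max(manifest, key=lambda entry: entry["recall_at_k"])
    return best, manifest
```

Storing the whole manifest, not just the winner, makes it possible to compare a later run against the same config hash and spot drift.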

Optimization Loop

Cycle:

  1. Log retrieval traces: query, candidates, scores, source_tag.
  2. Identify mis-hits (low downstream faithfulness or low citation count) and classify the root cause, for example a missing lexical candidate, a semantic false positive, or stale content.
  3. Adjust weights / thresholds; run the offline suite.
  4. Canary the new fusion weights behind a feature flag.
  5. Promote if the improvement is statistically significant.
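The cycle does not say which significance test to use for step 5. One possible choice, shown here purely as an assumption, is a paired bootstrap over per-query metric differences between control and canary. The function names, the 2000 resamples, and the 95% interval are illustrative defaults.

```python
import random


def bootstrap_ci(deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `deltas`."""
    if not deltas:
        raise ValueError("no paired observations")
    rng = random.Random(seed)  # fixed seed keeps the decision reproducible
    n = len(deltas)
    means = sorted(
        sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def should_promote(control_scores, canary_scores, **kwargs):
    """Promote only if the whole interval for the mean improvement is above zero.

    Both lists hold one metric value per query, in the same query order.
    """
    deltas = [new - old for old, new in zip(control_scores, canary_scores)]
    lo, _ = bootstrap_ci(deltas, **kwargs)
    return lo > 0
```

Pairing by query removes most of the variance that comes from some queries simply being harder than others, which matters when canary traffic is small.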

Key Takeaways

  • Hybrid retrieval is a set of tunable dials that needs continuous instrumentation.
  • Apply security and access filters early so leakage never reaches scoring.
  • Re-ranking must justify its latency with a measurable MRR / Recall lift.
  • Temporal decay keeps outdated, high-authority pages from dominating results.
  • Treat fusion changes like code: version, evaluate, and roll forward or back.