Single-modality retrieval fails on edge cases: dense vectors miss rare tokens and IDs, while pure lexical search misses paraphrase and semantic similarity. Hybrid retrieval combines complementary signals (dense semantic, sparse lexical, structured metadata, temporal freshness) to produce stable, high-precision candidate sets. This article covers the architecture, normalization, scoring fusion, failure handling, and evaluation.
Motivation
Failure scenarios:
- Proper nouns and SKU codes that the dense model fails to retrieve.
- Pricing-change queries that pull a stale snapshot because the temporal boost is too weak.
- Long natural-language questions whose stopwords are over-weighted in a sparse-only system.
- Vector false positives on semantically broad pages, such as marketing fluff, that have no lexical anchoring.
Hybrid retrieval reduces these failures by drawing on different dimensions of evidence.
Component Layering
Recommended flow:
- Query Embedding -> ANN search (k_vec)
- Lexical Search (BM25 / SPLADE / Elasticsearch) (k_lex)
- Union -> Score Normalization (per source scaling)
- Metadata Filter Pass (locale, access_tier, page_type)
- Diversity & Freshness Adjustments
- Optional Cross/Mono Re-Ranker
- Final Truncation (top K)
Keep the raw pre-fusion scores for auditing.
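The union step can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Candidate` class, the `union_candidates` helper, and the `"vec"` / `"lex"` source tags are hypothetical names, and each retriever is assumed to return a list of `(doc_id, score)` pairs.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Candidate:
    doc_id: str
    # source_tag -> raw, un-normalized score, kept for later auditing
    raw_scores: dict[str, float] = field(default_factory=dict)


def union_candidates(vec_hits, lex_hits):
    """Merge hits from both retrievers, keeping the raw score from each source."""
    merged: dict[str, Candidate] = {}
    for source_tag, hits in (("vec", vec_hits), ("lex", lex_hits)):
        for doc_id, score in hits:
            cand = merged.setdefault(doc_id, Candidate(doc_id))
            cand.raw_scores[source_tag] = score
    return list(merged.values())


# Hypothetical retriever outputs: cosine similarity vs. BM25-style scores.
vec_hits = [("doc-a", 0.82), ("doc-b", 0.74)]
lex_hits = [("doc-b", 11.3), ("doc-c", 9.1)]
candidates = union_candidates(vec_hits, lex_hits)
```

Because the raw scores travel with each candidate, any later normalization or fusion step can be replayed from a logged trace.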
Query Normalization
Steps:
- Apply Unicode NFKC normalization.
- Lowercase the query, but keep a snapshot of the original casing for answer formatting if needed.
- Tokenize and preserve stopwords, because semantic embeddings make use of their context.
- Synonym / Alias Expansion: Append alternative tokens for internal product codename mappings. These are used only for sparse retrieval and are not included in the model prompt.
- Numeric & Version Extraction: Extract X.Y.Z patterns for targeted lexical scoring.
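The steps above might look like this in Python. It is a sketch under stated assumptions: the `ALIASES` table and the `bluebird` codename are invented for illustration, tokenization is plain whitespace splitting, and the output field names are hypothetical.

```python
import re
import unicodedata

# Hypothetical alias table: internal codename -> public product tokens.
ALIASES = {"bluebird": ["checkout", "api"]}
# Matches X.Y.Z version patterns such as 2.10.3.
VERSION_RE = re.compile(r"\b\d+\.\d+\.\d+\b")


def normalize_query(raw: str) -> dict:
    text = unicodedata.normalize("NFKC", raw)
    lowered = text.lower()
    tokens = lowered.split()  # whitespace tokenization; stopwords are kept
    expansions = [alias for tok in tokens for alias in ALIASES.get(tok, [])]
    return {
        "original_casing": text,                 # snapshot for answer formatting
        "dense_query": lowered,                  # goes to the embedding model
        "sparse_tokens": tokens + expansions,    # aliases reach only the sparse side
        "versions": VERSION_RE.findall(lowered),
    }


# "\uff22" is a fullwidth "B"; NFKC folds it to ASCII "B".
result = normalize_query("\uff22luebird upgrade to 2.10.3")
```

Keeping `dense_query` and `sparse_tokens` separate is what lets alias expansion help lexical recall without changing the text sent to the model.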
Metadata & Attribute Filters
Applying attribute filters after the initial candidate union reduces recall loss. Common fields: locale, access_tier, page_type, product_area, updated_bucket. Enforce security filters (tenant / tier) BEFORE scoring fusion so that leaked documents cannot influence re-ranking. Provide a debug mode that returns the filtered_out set for inspection.
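A security pass with a debug mode could be sketched like this. The function name `apply_security_filters`, the `tenant` field, and the dictionary shape of each candidate are assumptions made for the example; only `access_tier` and `filtered_out` come from the text above.

```python
def apply_security_filters(candidates, user_tenant, user_tiers, debug=False):
    """Split candidates into those the caller may see and those removed.

    Intended to run before scoring fusion, so removed documents never
    contribute to normalization statistics or re-ranking.
    """
    kept, filtered_out = [], []
    for cand in candidates:
        allowed = (
            cand["tenant"] == user_tenant
            and cand["access_tier"] in user_tiers
        )
        (kept if allowed else filtered_out).append(cand)
    # Outside debug mode the removed set is not exposed to the caller.
    return (kept, filtered_out) if debug else (kept, [])


candidates = [
    {"doc_id": "d1", "tenant": "acme", "access_tier": "free"},
    {"doc_id": "d2", "tenant": "acme", "access_tier": "enterprise"},
    {"doc_id": "d3", "tenant": "other", "access_tier": "free"},
]
kept, filtered_out = apply_security_filters(
    candidates, user_tenant="acme", user_tiers={"free"}, debug=True
)
```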
Re-Ranking Strategy
Apply a lightweight cross-encoder (a distilled model) to the top N candidates (10-20). When latency exceeds the budget, degrade gracefully: skip the re-rank, or reduce the candidate count while raising the lexical weight. Track re_rank_delta = MRR_post - MRR_pre to justify the cost. Cache re-rank results for identical union sets within a short TTL.
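One way to sketch the degrade decision and the short-TTL cache is shown below. `RerankCache`, `plan_rerank`, and the per-document cost estimate are hypothetical; the cross-encoder call itself is left out, and the lexical-weight increase on degrade is not modeled here.

```python
import hashlib
import time


class RerankCache:
    """Tiny TTL cache keyed by the query plus the set of candidate IDs."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    @staticmethod
    def key(query, doc_ids):
        # Sorting makes the key independent of candidate order.
        payload = query + "\x00" + "\x00".join(sorted(doc_ids))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired entry
            return None
        return value

    def put(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)


def plan_rerank(elapsed_ms, budget_ms, cost_ms_per_doc, top_n=20):
    """Return how many candidates to re-rank; 0 means skip the re-ranker."""
    remaining = budget_ms - elapsed_ms
    affordable = int(remaining // cost_ms_per_doc) if remaining > 0 else 0
    return min(top_n, affordable)
```

With 60 ms of budget left and an assumed 5 ms per document, `plan_rerank(40, 100, 5)` trims the re-rank set to 12 candidates; once the budget is already spent it returns 0 and the pipeline falls back to the fused order.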
Freshness & Temporal Signals
Compute freshness_weight = exp(-lambda * age_days), where lambda is tuned per content type: higher for pricing, lower for stable API content. Combine: final_score = w_sem * sem_score + w_lex * lex_score + w_fresh * freshness_weight + w_meta * meta_priors. Normalize each component first (z-score or min-max) so that no single signal dominates.
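The fusion formula can be sketched with min-max normalization as follows. The weights, the lambda value, and the candidate field names are illustrative assumptions, not tuned values; z-score normalization would slot into the same place as `min_max`.

```python
import math


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def freshness_weight(age_days, lam):
    return math.exp(-lam * age_days)


def fuse(candidates, weights, lam):
    """Return (doc_id, final_score) pairs sorted by descending score."""
    sem = min_max([c["sem_score"] for c in candidates])
    lex = min_max([c["lex_score"] for c in candidates])
    fresh = min_max([freshness_weight(c["age_days"], lam) for c in candidates])
    meta = min_max([c["meta_prior"] for c in candidates])
    scored = []
    for i, c in enumerate(candidates):
        final = (
            weights["sem"] * sem[i]
            + weights["lex"] * lex[i]
            + weights["fresh"] * fresh[i]
            + weights["meta"] * meta[i]
        )
        scored.append((c["doc_id"], final))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


candidates = [
    {"doc_id": "a", "sem_score": 0.9, "lex_score": 2.0, "age_days": 0, "meta_prior": 0.0},
    {"doc_id": "b", "sem_score": 0.5, "lex_score": 8.0, "age_days": 30, "meta_prior": 0.0},
]
weights = {"sem": 0.5, "lex": 0.3, "fresh": 0.2, "meta": 0.0}
ranked = fuse(candidates, weights, lam=0.1)
```

In this toy input the fresh, semantically strong document "a" outranks the older, lexically strong document "b" because the semantic and freshness weights together exceed the lexical weight.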
Failure Modes
| Failure | Cause | Mitigation |
|---|---|---|
| Popularity Bias | Overweight lexical tf-idf | Cap term frequency contribution |
| Stale Results | Freshness weight mis-tuned | Recalibrate lambda using evaluation set |
| Locale Leakage | Late filter application | Move security filters earlier |
| Semantic Drift | Embedding model upgrade | Dual-index and A/B compare before rollout |
| Over-fusion Noise | Unbounded union size | Limit union, diversity pruning |
Evaluation Framework
Experiments:
- Ablation: Measure Recall@k and MRR for each configuration (vector only, lexical only, hybrid without re-rank, full).
- Fusion Weight Tuning: Grid-search the weights against a validation gold set.
- Latency Budget: Track mean and P95 retrieval latency per configuration.
- Drift: Monitor the relative change in recall weekly for head vs. tail queries.
Maintain an evaluation manifest that records config hashes.
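The metrics and the config hash for the manifest could be computed along these lines. The function names, the 12-character hash truncation, and the shape of `runs` and `gold` (query ID mapped to a ranked list and to a set of relevant IDs) are assumptions for this sketch.

```python
import hashlib
import json


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, gold, k):
    recalls = [recall_at_k(runs[q], gold[q], k) for q in gold]
    rrs = [reciprocal_rank(runs[q], gold[q]) for q in gold]
    return {"recall_at_k": sum(recalls) / len(recalls), "mrr": sum(rrs) / len(rrs)}


def config_hash(config):
    """Stable short hash of a config dict, independent of key order."""
    blob = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:12]


gold = {"q1": {"a"}, "q2": {"b", "c"}}
runs = {"q1": ["x", "a"], "q2": ["b", "y", "c"]}
metrics = evaluate(runs, gold, k=2)
manifest_entry = {
    "config_hash": config_hash({"w_sem": 0.5, "w_lex": 0.3, "k_vec": 50}),
    **metrics,
}
```

Running `evaluate` once per ablation configuration and storing each result under its `config_hash` gives a manifest in which every metric can be traced back to the exact settings that produced it.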
Optimization Loop
Cycle:
- Log retrieval traces: query, candidates, scores, source_tag.
- Identify mis-hits (low downstream faithfulness or low citation count) and classify the root cause, such as a missing lexical candidate, a semantic false positive, or stale content.
- Adjust weights / thresholds, then run the offline suite.
- Canary the new fusion weights behind a feature flag.
- Promote when the improvement is statistically significant.
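The text does not say which significance test to use for the promotion step. One common option is a paired sign-flip permutation test on per-query metric deltas, sketched here as an assumption rather than as the prescribed method. The function name, the resample count, and the example scores are all illustrative.

```python
import random


def paired_permutation_p_value(baseline, candidate, n_resamples=2000, seed=0):
    """Two-sided p-value for the mean per-query difference.

    baseline and candidate hold one metric value per query (for example,
    reciprocal rank), aligned by index. Under the null hypothesis the sign
    of each per-query delta is arbitrary, so signs are flipped at random.
    """
    deltas = [c - b for b, c in zip(baseline, candidate)]
    observed = abs(sum(deltas))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in deltas)
        if abs(flipped) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-query reciprocal ranks from the canary comparison.
baseline_rr = [0.5] * 10
candidate_rr = [0.75] * 10
p_value = paired_permutation_p_value(baseline_rr, candidate_rr)
promote = p_value < 0.05
```

A threshold such as 0.05 is a convention, not a requirement; teams running many weight experiments may want a stricter cutoff to account for repeated testing.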
Key Takeaways
- Hybrid retrieval is a system of tunable dials; instrument it continuously.
- Apply security and access filters early; avoid leakage into scoring.
- Re-ranking must justify its latency with a measurable MRR / Recall lift.
- Temporal decay prevents outdated, high-authority pages from dominating.
- Treat fusion changes like code: version, evaluate, roll forward or back.