Hybrid Retrieval: Vector + Keyword + Metadata

Single-modality retrieval fails on edge cases: dense vectors miss rare tokens and IDs, while pure lexical search misses paraphrase and semantic similarity. Hybrid retrieval combines complementary signals (dense semantic, sparse lexical, structured metadata, temporal freshness) to produce stable, high-precision candidate sets. This article covers architecture, normalization, scoring fusion, failure handling, and evaluation.

Motivation

Failure scenarios:

Proper nouns and SKU codes that the dense model fails to capture.
Pricing-change queries that pull a stale snapshot because the temporal boost is too weak.
Long natural-language questions whose stopwords are overweighted in a sparse-only system.
Vector false positives on semantically broad pages, such as marketing fluff, that lack lexical anchoring.

Hybrid retrieval reduces these failures by drawing on different dimensions of evidence.

Component Layering

Recommended flow:

Query Embedding -> ANN search (k_vec)
Lexical Search (BM25 / SPLADE / Elasticsearch) (k_lex)
Union -> Score Normalization (per source scaling)
Metadata Filter Pass (locale, access_tier, page_type)
Diversity & Freshness Adjustments
Optional Cross/Mono Re-Ranker
Final Truncation (top K)

Keep the raw pre-fusion scores for auditing.
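
The union step can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Candidate` class, the `union_candidates` function, and the `"vec"` / `"lex"` source tags are hypothetical names, and each retriever is assumed to return `(doc_id, score)` pairs. The point is that each candidate carries its untouched per-source scores, so later normalization and fusion stay auditable.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    doc_id: str
    # source_tag -> raw, un-normalized score from that retriever
    raw_scores: dict = field(default_factory=dict)


def union_candidates(vec_hits, lex_hits):
    """Merge ANN and lexical hits by doc_id, keeping each source's raw score."""
    merged = {}
    for source_tag, hits in (("vec", vec_hits), ("lex", lex_hits)):
        for doc_id, score in hits:
            cand = merged.setdefault(doc_id, Candidate(doc_id))
            cand.raw_scores[source_tag] = score
    return list(merged.values())


# Hypothetical hits: cosine similarities from ANN, BM25 scores from lexical search.
candidates = union_candidates(
    vec_hits=[("a", 0.9), ("b", 0.7)],
    lex_hits=[("b", 12.0), ("c", 8.5)],
)
```

A document found by both retrievers (here `"b"`) ends up with two entries in `raw_scores`, which also makes the source overlap easy to inspect.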

Query Normalization

Steps:

Apply Unicode NFKC normalization.
Lowercase, but keep a snapshot of the original casing for answer formatting if needed.
Tokenize and preserve stopwords, because semantic embeddings make use of the context.
Synonym / Alias Expansion: Append alternative tokens for internal product codename mapping. These are not included in the model prompt; they are used only for sparse retrieval.
Numeric & Version Extraction: Extract X.Y.Z patterns for targeted lexical scoring.
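
The steps above can be sketched in a few lines of Python. The alias table, the codename `bluejay`, the returned field names, and the whitespace tokenizer are all assumptions made for illustration; a production system would plug in its own tokenizer and alias source.

```python
import re
import unicodedata

# Matches X.Y.Z version patterns such as 2.1.0.
VERSION_RE = re.compile(r"\b\d+\.\d+\.\d+\b")

# Hypothetical codename -> alias mapping; used for sparse retrieval only.
ALIASES = {"bluejay": ["search-api-v2"]}


def normalize_query(raw):
    text = unicodedata.normalize("NFKC", raw)  # fold compatibility forms
    lowered = text.lower()
    tokens = lowered.split()  # naive whitespace tokenizer; stopwords are kept
    expansions = [alias for tok in tokens for alias in ALIASES.get(tok, [])]
    return {
        "original": text,           # casing snapshot for answer formatting
        "dense_text": lowered,      # text sent to the embedding model
        "sparse_tokens": tokens + expansions,  # aliases never reach the prompt
        "versions": VERSION_RE.findall(lowered),
    }


# The digits in the version are full-width; NFKC folds them to ASCII.
query = normalize_query("What changed in BlueJay ２.１.０ for the API?")
```

Keeping the dense and sparse views separate is what allows alias expansion to help lexical recall without altering the text the embedding model or the prompt sees.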

Applying filters after the initial candidate union reduces recall loss. Common fields: locale, access_tier, page_type, product_area, updated_bucket. Enforce security filters (tenant / tier) BEFORE scoring fusion so that leakage cannot affect re-ranking. Provide a debug mode that returns the filtered_out set for inspection.
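
A filter pass with a debug mode might look like the sketch below. The function name `apply_filters`, the dict-based candidate shape, and the `allowed` mapping are hypothetical; only the field names `locale` and `access_tier` and the `filtered_out` set come from the text above.

```python
def apply_filters(candidates, allowed, debug=False):
    """Keep candidates whose metadata matches every allowed-value set.

    Returns (kept, filtered_out); filtered_out is populated only in debug mode.
    """
    kept, filtered_out = [], []
    for cand in candidates:
        failed = [name for name, values in allowed.items()
                  if cand.get(name) not in values]
        if failed:
            filtered_out.append({"doc_id": cand["doc_id"], "failed": failed})
        else:
            kept.append(cand)
    return kept, (filtered_out if debug else [])


docs = [
    {"doc_id": "a", "locale": "en-US", "access_tier": "free"},
    {"doc_id": "b", "locale": "de-DE", "access_tier": "free"},
    {"doc_id": "c", "locale": "en-US", "access_tier": "enterprise"},
]
allowed = {"locale": {"en-US"}, "access_tier": {"free", "pro"}}

# Run this before fusion so restricted documents never receive a fused score.
kept, filtered_out = apply_filters(docs, allowed, debug=True)
```

Recording which field caused each rejection makes it straightforward to tell a locale mismatch from an access-tier block when inspecting the debug output.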

Re-Ranking Strategy

Use a lightweight cross-encoder (a distilled model) on the top N candidates (10-20). When latency exceeds the budget, degrade gracefully: skip the re-rank, or reduce the candidate count while raising the lexical weight. Track re_rank_delta = MRR_post - MRR_pre to justify the cost. Cache re-rank results for identical union sets within a short TTL.
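
As a small worked example of re_rank_delta, the sketch below computes MRR before and after re-ranking over a labeled query set. The data shapes (query id mapped to a ranked list of doc ids, plus a set of relevant ids per query) and the function names are assumptions for illustration.

```python
def mrr(rankings, relevant):
    """Mean reciprocal rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for query_id, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant[query_id]:
                total += 1.0 / rank
                break  # only the first relevant document counts
    return total / len(rankings)


def re_rank_delta(pre, post, relevant):
    """MRR_post - MRR_pre; positive values mean the re-ranker helped."""
    return mrr(post, relevant) - mrr(pre, relevant)


relevant = {"q1": {"b"}, "q2": {"x"}}
pre = {"q1": ["a", "b", "c"], "q2": ["x", "y"]}
post = {"q1": ["b", "a", "c"], "q2": ["x", "y"]}

delta = re_rank_delta(pre, post, relevant)  # (1.0 + 1.0)/2 - (0.5 + 1.0)/2 = 0.25
```

Queries with no relevant document in the list contribute zero, so a re-ranker cannot earn credit on queries where retrieval already failed.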

Freshness & Temporal Signals

Compute freshness_weight = exp(-lambda * age_days), where lambda is tuned per content type: higher for pricing, lower for stable API content. Combine: final_score = w_sem * sem_score + w_lex * lex_score + w_fresh * freshness_weight + w_meta * meta_priors. Normalize each component first (z-score or min-max) so that no single signal dominates.
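
The formulas above can be sketched as follows. This version uses min-max normalization for the semantic and lexical scores; the freshness weight is left as-is because exp(-lambda * age_days) already lies in (0, 1], and the metadata prior is assumed to be supplied in [0, 1]. The candidate dict keys, the weight values, and the lambda value are illustrative assumptions, not tuned settings.

```python
import math


def freshness_weight(age_days, lam):
    return math.exp(-lam * age_days)


def min_max(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # no spread: the signal carries no ranking info
    return [(v - lo) / (hi - lo) for v in values]


def fuse(candidates, weights, lam):
    """Return (doc_id, final_score) pairs sorted best-first."""
    sem = min_max([c["sem_score"] for c in candidates])
    lex = min_max([c["lex_score"] for c in candidates])
    scored = []
    for c, s, l in zip(candidates, sem, lex):
        final = (weights["sem"] * s
                 + weights["lex"] * l
                 + weights["fresh"] * freshness_weight(c["age_days"], lam)
                 + weights["meta"] * c.get("meta_prior", 0.0))
        scored.append((c["doc_id"], final))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


candidates = [
    {"doc_id": "a", "sem_score": 0.9, "lex_score": 2.0, "age_days": 400},
    {"doc_id": "b", "sem_score": 0.5, "lex_score": 10.0, "age_days": 1},
]
weights = {"sem": 0.4, "lex": 0.4, "fresh": 0.2, "meta": 0.0}
ranked = fuse(candidates, weights, lam=0.05)
```

In this toy case the two documents tie on the semantic and lexical terms after normalization, so the freshness term decides the order and the one-day-old document ranks first.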

Failure Modes

Failure	Cause	Mitigation
Popularity Bias	Overweight lexical tf-idf	Cap term frequency contribution
Stale Results	Freshness weight mis-tuned	Recalibrate lambda using evaluation set
Locale Leakage	Late filter application	Move security filters earlier
Semantic Drift	Embedding model upgrade	Dual-index and A/B compare before rollout
Over-fusion Noise	Unbounded union size	Limit union, diversity pruning

Evaluation Framework

Experiments:

Ablation: Compare vector only, lexical only, hybrid without re-rank, and full; measure Recall@k and MRR.
Fusion Weight Tuning: Grid search the weights against a validation gold set.
Latency Budget: Track mean and P95 retrieval latency per configuration.
Drift: Monitor the relative change in recall weekly for head vs tail queries.

Maintain an evaluation manifest with config hashes.
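
One way to compute Recall@k and a stable config hash for the manifest is sketched below. The config keys, the 12-character hash prefix, and the manifest entry layout are assumptions; the essential idea is to serialize the config canonically so that the same settings always produce the same hash.

```python
import hashlib
import json


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def config_hash(config):
    """Short, order-independent fingerprint of a retrieval configuration."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical configuration for one ablation arm.
config = {"k_vec": 50, "k_lex": 50, "w_sem": 0.4, "w_lex": 0.4, "rerank": False}
manifest_entry = {
    "config_hash": config_hash(config),
    "recall_at_3": recall_at_k(["a", "b", "c", "d"], {"b", "d", "z"}, k=3),
}
```

Sorting keys before hashing means two configs that differ only in key order map to the same manifest entry, which keeps ablation results comparable across runs.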

Optimization Loop

Cycle:

Log retrieval traces: query, candidates, scores, source_tag.
Identify mis-hits (low faithfulness downstream or low citation count), then classify the root cause, such as a missing lexical candidate, a semantic false positive, or stale content.
Adjust weights / thresholds; run the offline suite.
Canary the new fusion weights behind a feature flag.
Promote when the improvement is statistically significant.
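
The text does not say which significance test to use for the promotion step. One common option, shown here purely as an assumption, is a paired bootstrap over per-query metric values (for example, reciprocal rank per query under the baseline and candidate weights). The function name, resample count, and seed are illustrative.

```python
import random


def paired_bootstrap(baseline, candidate, n_resamples=2000, seed=0):
    """Fraction of resamples where the candidate's mean gain is <= 0.

    baseline and candidate hold per-query metric values in the same query order.
    A small return value suggests the improvement is unlikely to be noise.
    """
    rng = random.Random(seed)  # fixed seed keeps the report reproducible
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample queries with replacement and average the paired differences.
        sample_mean = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        if sample_mean <= 0:
            not_better += 1
    return not_better / n_resamples


# Toy per-query reciprocal ranks: the candidate is better on every query.
baseline_rr = [0.5] * 20
candidate_rr = [1.0] * 20
p_value = paired_bootstrap(baseline_rr, candidate_rr)
```

Pairing by query matters because head and tail queries differ widely in difficulty; comparing unpaired means would bury a consistent per-query gain in that variance.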

Key Takeaways

Hybrid retrieval is a system of tunable dials; instrument it continuously.
Apply security and access filters early; avoid leakage into scoring.
Re-ranking must justify its latency with a measurable MRR / Recall lift.
Temporal decay prevents outdated, high-authority pages from dominating.
Treat fusion changes like code: version, evaluate, roll forward or back.