Hybrid retrieval: vector + keyword + metadata

Engineering robust hybrid retrieval: combining vector, lexical, metadata, and temporal signals for website RAG.

retrieval • hybrid • search • rag

Single-modality retrieval fails on edge cases: dense vectors miss rare tokens and IDs; lexical search alone misses paraphrase and semantic similarity. Hybrid retrieval combines complementary signals - dense semantic, sparse lexical, structured metadata, and temporal freshness - to produce stable, high-precision candidate sets. This article covers architecture, normalization, scoring fusion, failure handling, and evaluation.

Motivation

Failure examples:

  • Proper names or SKU codes are missed by the dense model.
  • Price-change queries retrieve a stale snapshot when there is no temporal boost.
  • Long, generic questions put too much weight on stopwords in a sparse-only system.
  • Vector false positives on broad marketing pages that have no lexical anchoring.

Hybrid retrieval mitigates these by capturing independent dimensions of evidence.

Component layout

Recommended flow:

  1. Query Embedding -> ANN search (k_vec)
  2. Lexical Search (BM25 / SPLADE / Elasticsearch) (k_lex)
  3. Union -> Score Normalization (per source scaling)
  4. Metadata Filter Pass (locale, access_tier, page_type)
  5. Diversity & Freshness Adjustments
  6. Optional Cross/Mono Re-Ranker
  7. Final Truncation (top K)

Retain the raw pre-fusion scores for auditing.
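
As an illustration of steps 1-3 and the audit note, here is a minimal sketch that merges vector and lexical hits by document ID, scales each source separately with min-max, and keeps the raw scores next to the scaled ones. The `Candidate` class, the `union_candidates` helper, and the sample scores are assumptions made for this example, not part of any specific search engine.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    doc_id: str
    raw: dict = field(default_factory=dict)         # pre-fusion scores, kept for audit
    normalized: dict = field(default_factory=dict)  # per-source scaled scores


def min_max(scores):
    """Scale a {doc_id: score} mapping to [0, 1]; constant input maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def union_candidates(vec_hits, lex_hits):
    """Merge ANN and lexical hits by doc_id, normalizing each source on its own."""
    merged = {}
    for source, hits in (("vec", vec_hits), ("lex", lex_hits)):
        scaled = min_max(hits)
        for doc_id, score in hits.items():
            cand = merged.setdefault(doc_id, Candidate(doc_id))
            cand.raw[source] = score
            cand.normalized[source] = scaled[doc_id]
    return merged


vec_hits = {"a": 0.91, "b": 0.80, "c": 0.75}  # illustrative cosine similarities
lex_hits = {"b": 12.4, "d": 7.1}              # illustrative BM25 scores
candidates = union_candidates(vec_hits, lex_hits)
```

A document found by only one source simply has no entry for the other; how to score that gap (zero, a floor value, or a rank-based fusion instead) is a design choice left open here.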

Query normalization

Steps:

  • Unicode normalize with NFKC
  • Lowercase, keeping a casing snapshot for answer formatting if needed
  • Tokenize and keep stopwords, because semantic embeddings can use the context
  • Synonym / Alias Expansion: add alternative tokens by mapping internal product codenames; do not inject them into the model prompt, use them for sparse retrieval only
  • Numeric & Version Extraction: capture X.Y.Z patterns for targeted lexical scoring
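
The steps above could be sketched roughly as follows. The `ALIASES` table, the `NormalizedQuery` fields, and whitespace tokenization are assumptions for illustration; a real system would use the tokenizer of its lexical engine.

```python
import re
import unicodedata
from dataclasses import dataclass

# Hypothetical alias table: internal codename -> public product terms.
ALIASES = {"bluebird": ["search api", "search-v2"]}

VERSION_RE = re.compile(r"\b\d+\.\d+\.\d+\b")


@dataclass
class NormalizedQuery:
    original: str       # casing snapshot for answer formatting
    dense_text: str     # text for the embedding model, stopwords intact
    sparse_terms: list  # tokens plus alias expansions, lexical side only
    versions: list      # X.Y.Z patterns for targeted lexical scoring


def normalize_query(query: str) -> NormalizedQuery:
    text = unicodedata.normalize("NFKC", query)
    lowered = text.lower()
    tokens = lowered.split()  # naive whitespace tokenization (assumption)
    sparse_terms = list(tokens)
    for token in tokens:
        # Aliases go to the sparse side only, never into dense_text.
        sparse_terms.extend(ALIASES.get(token, []))
    return NormalizedQuery(
        original=query,
        dense_text=lowered,
        sparse_terms=sparse_terms,
        versions=VERSION_RE.findall(lowered),
    )


# "\uff12" is a full-width digit 2; NFKC folds it to ASCII "2".
q = normalize_query("What changed in Bluebird \uff12.1.0?")
```

Keeping `dense_text` and `sparse_terms` as separate fields makes it hard to leak alias expansions into the embedding input or the model prompt by accident.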

Metadata and attribute filters

Applying attribute filters after the initial candidate union limits recall loss. Common fields: locale, access_tier, page_type, product_area, updated_bucket. Enforce security filters (tenant / tier) BEFORE scoring fusion so that leaked documents cannot influence re-ranking. Provide a debug mode that returns the filtered_out set for inspection.
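
A minimal sketch of this two-stage filtering, assuming a hypothetical `Doc` record and an integer `access_tier` where lower means more public: the security filter is a hard gate that runs first, and the attribute filter can report what it rejected and why.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    tenant: str
    access_tier: int
    locale: str
    page_type: str


def security_filter(docs, tenant, max_tier):
    """Hard filter: runs before fusion so restricted docs never get scored."""
    return [d for d in docs if d.tenant == tenant and d.access_tier <= max_tier]


def attribute_filter(docs, debug=False, **required):
    """Filter on attributes such as locale or page_type.

    With debug=True the rejected docs are returned too, each paired with
    the name of the first field that failed, so they can be inspected.
    """
    kept, filtered_out = [], []
    for d in docs:
        failed = next((k for k, v in required.items() if getattr(d, k) != v), None)
        if failed is None:
            kept.append(d)
        else:
            filtered_out.append((d.doc_id, failed))
    return (kept, filtered_out) if debug else kept


docs = [
    Doc("pricing-en", "acme", 0, "en", "pricing"),
    Doc("pricing-de", "acme", 0, "de", "pricing"),
    Doc("internal-roadmap", "acme", 2, "en", "docs"),
    Doc("other-tenant", "globex", 0, "en", "pricing"),
]
allowed = security_filter(docs, tenant="acme", max_tier=1)
kept, filtered_out = attribute_filter(allowed, debug=True, locale="en")
```

Note that the debug output only ever contains documents that already passed the security gate, so inspecting `filtered_out` does not itself expose restricted content.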

Re-ranking strategy

Run a lightweight cross-encoder (a distilled model) over the top N (10-20) candidates. If latency exceeds the budget, degrade: skip the re-rank OR reduce the candidate count while increasing the lexical weight. Track re_rank_delta = MRR_post - MRR_pre to justify the cost. Cache re-rank results for matching union sets within a short TTL.
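
To make the degradation policy and the re_rank_delta metric concrete, here is a small sketch. The thresholds in `plan_rerank` (degrade when over budget, skip when more than twice over) and the 0.1 lexical-weight bump are illustrative assumptions, not recommended values.

```python
def mrr(ranked_lists, relevant):
    """Mean reciprocal rank; both arguments are keyed by query id."""
    total = 0.0
    for qid, ranking in ranked_lists.items():
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


def plan_rerank(observed_p95_ms, budget_ms, top_n=20, min_n=10, w_lex=0.3):
    """Pick a degradation step from the observed re-rank latency."""
    if observed_p95_ms <= budget_ms:
        return {"rerank": True, "top_n": top_n, "w_lex": w_lex}
    if observed_p95_ms <= 2 * budget_ms:
        # Moderately over budget: fewer candidates, lean harder on lexical.
        return {"rerank": True, "top_n": min_n, "w_lex": w_lex + 0.1}
    # Far over budget: skip the re-ranker entirely.
    return {"rerank": False, "top_n": 0, "w_lex": w_lex}


relevant = {"q1": {"a"}, "q2": {"x"}}
pre = {"q1": ["b", "a", "c"], "q2": ["x", "y", "z"]}   # before re-ranking
post = {"q1": ["a", "b", "c"], "q2": ["x", "z", "y"]}  # after re-ranking
re_rank_delta = mrr(post, relevant) - mrr(pre, relevant)
```

In this toy data the re-ranker moves the relevant document for q1 from rank 2 to rank 1, so MRR rises from 0.75 to 1.0 and re_rank_delta is 0.25.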

Freshness and temporal signals

Compute freshness_weight = exp(-lambda * age_days), where lambda is tuned per content type (high for pricing, low for stable API docs). Combine: final_score = w_sem * sem_score + w_lex * lex_score + w_fresh * freshness_weight + w_meta * meta_priors. Normalize each component first (z-score or min-max) so that no single component dominates.
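
The decay and fusion formulas can be sketched as below, using z-score normalization per component. The `LAMBDA` rates, the `WEIGHTS`, and the three sample pages are made-up values for illustration; real values would come from tuning against an evaluation set.

```python
import math

# Hypothetical per-day decay rates: pricing ages fast, stable API docs slowly.
LAMBDA = {"pricing": 0.05, "api_reference": 0.002}
# Hypothetical fusion weights.
WEIGHTS = {"sem": 0.5, "lex": 0.3, "fresh": 0.15, "meta": 0.05}


def freshness_weight(age_days, content_type):
    return math.exp(-LAMBDA[content_type] * age_days)


def z_scores(values):
    """Standardize a list; a constant list maps to all zeros."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def fuse(rows):
    """rows: dicts holding the sem, lex, fresh and meta components per candidate."""
    columns = {name: z_scores([row[name] for row in rows]) for name in WEIGHTS}
    return [
        sum(WEIGHTS[name] * columns[name][i] for name in WEIGHTS)
        for i in range(len(rows))
    ]


# (doc_id, sem_score, lex_score, age_days); A and B differ only in age.
pages = [("A", 0.80, 10.0, 400), ("B", 0.80, 10.0, 5), ("C", 0.40, 2.0, 1)]
rows = [
    {"sem": sem, "lex": lex, "fresh": freshness_weight(age, "pricing"), "meta": 0.0}
    for _, sem, lex, age in pages
]
scores = fuse(rows)
order = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
ranking = [pages[i][0] for i in order]
```

Here B outranks A purely on freshness, while C, the freshest page, still ranks last because its semantic and lexical evidence is weak. That is the intended effect of normalizing before weighting: freshness breaks ties without overriding relevance.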

Failure modes

Failure | Cause | Mitigation
Popularity Bias | Lexical tf-idf weighted too heavily | Cap the term-frequency contribution
Stale Results | Freshness weight mistuned | Retune lambda using an evaluation set
Locale Leakage | Filter applied too late | Move security filters earlier
Semantic Drift | Embedding model upgraded | Dual-index and A/B compare before rollout
Over-fusion Noise | Union size unbounded | Set a union limit and apply diversity pruning
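
As one example of these mitigations, the union limit with diversity pruning might look like the sketch below. Grouping by page type and the default limits are assumptions for illustration; any grouping key (site section, product area) would work the same way.

```python
from collections import Counter


def prune_union(ranked, max_union=50, max_per_group=2):
    """ranked: (doc_id, group) pairs sorted best-first.

    Keeps at most max_union docs overall and at most max_per_group per group.
    """
    kept, per_group = [], Counter()
    for doc_id, group in ranked:
        if len(kept) >= max_union:
            break
        if per_group[group] >= max_per_group:
            continue  # this group already has its share of slots
        per_group[group] += 1
        kept.append(doc_id)
    return kept


ranked = [
    ("m1", "marketing"),
    ("m2", "marketing"),
    ("m3", "marketing"),
    ("d1", "docs"),
    ("p1", "pricing"),
]
pruned = prune_union(ranked, max_union=3, max_per_group=2)
```

The third marketing page is dropped in favor of the docs page, and the pricing page falls outside the union limit.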

Evaluation framework

Experiments:

  • Ablation: (vector only, lexical only, hybrid without rerank, full) measure Recall@k and MRR.
  • Fusion Weight Tuning: grid search over weights using a validation gold set.
  • Latency Budget: track mean + P95 retrieval latency for each configuration.
  • Drift: track week-over-week recall changes for head versus tail queries.

Maintain an evaluation manifest with config hashes.
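
A minimal sketch of a manifest entry, assuming the configuration is a JSON-serializable dict: the hash is taken over a canonical serialization so that key order does not change it. The `manifest_entry` shape and the truncated 12-character hash are illustrative choices.

```python
import hashlib
import json


def recall_at_k(ranking, relevant, k):
    """Fraction of the relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & relevant) / len(relevant)


def config_hash(config):
    """Stable short hash of a config dict, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def manifest_entry(name, config, metrics):
    return {"name": name, "config_hash": config_hash(config), "metrics": metrics}


config = {"w_sem": 0.5, "w_lex": 0.3, "k_vec": 50, "k_lex": 50, "rerank": False}
entry = manifest_entry(
    "hybrid_no_rerank",
    config,
    {"recall@2": recall_at_k(["a", "b", "c"], {"a", "d"}, k=2)},
)
```

Storing the hash next to each metric row makes it possible to tell later whether two evaluation runs actually used the same fusion settings.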

Optimization loop

Loop:

  1. Record retrieval traces (query, candidates, scores, source_tag).
  2. Identify mis-hits (low downstream faithfulness or low citation count) -> classify the root cause (missing lexical candidate, semantic false positive, stale content).
  3. Adjust weights / thresholds; run the offline suite.
  4. Canary the new fusion weights behind a feature flag.
  5. Promote if the improvement is statistically significant.
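
For step 5, one possible way to check significance is a paired bootstrap over per-query metrics such as reciprocal rank. This is an illustrative choice, not a method prescribed here; the resample count, seed, and alpha below are assumptions.

```python
import random


def bootstrap_p_value(baseline, candidate, n_resamples=2000, seed=0):
    """One-sided paired bootstrap over per-query metric values.

    Returns the fraction of resamples in which the candidate shows no
    improvement over the baseline (mean delta <= 0).
    """
    deltas = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)
    no_gain = 0
    for _ in range(n_resamples):
        sample = rng.choices(deltas, k=len(deltas))
        if sum(sample) <= 0:
            no_gain += 1
    return no_gain / n_resamples


def should_promote(baseline, candidate, alpha=0.05):
    return bootstrap_p_value(baseline, candidate) < alpha


# Per-query reciprocal ranks for the same queries under both configurations.
baseline_rr = [0.5, 0.5, 0.25, 1 / 3]
candidate_rr = [1.0, 1.0, 0.5, 0.5]
promote = should_promote(baseline_rr, candidate_rr)
```

Four queries are far too few for a real decision; the point is only that the comparison is paired per query, so easy and hard queries do not mask each other.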

Key takeaways

  • Hybrid retrieval is a system of tunable dials: instrument relentlessly.
  • Apply security and access filters early; avoid leakage into scoring.
  • Re-ranking must justify its latency with a measurable MRR / Recall lift.
  • Temporal decay keeps old, high-authority pages from dominating.
  • Treat fusion changes like code: version, evaluate, roll forward or back.