This chapter is the build guide for the discovery surface of an AI-native CMS: how a reader (and increasingly an AI agent) actually finds content. We move concretely from keyword search engines (Typesense, Meilisearch, Postgres FTS, Elasticsearch/OpenSearch) through vector search, hybrid fusion (RRF and weighted), and cross-encoder reranking, then up to "chat with the site" (RAG), recommendations, and faceting. The throughline is the indexing pipeline — how content flows from the CMS into a fresh, cheap, queryable index — because in a CMS the hard part is not the query, it is keeping the index correct as editors publish. Everything here is stack-agnostic: we compare options and call out where each one earns its place.
Content
The two-stage mental model
Modern site search is best understood as a funnel, not a single query. Every serious 2025–2026 architecture is some version of a two-stage pipeline:
Candidate retrieval (recall-optimized). Cast a wide net — pull the top ~50–200 documents using fast, cheap retrieval. This is where keyword (BM25/sparse) and vector (dense) search live. The job here is "don't lose the right answer," not "rank it #1."
Reranking (precision-optimized). Take that candidate pool and rescore it with a slower, smarter model — a cross-encoder — that reads query and document together. This is where final ordering is decided.
The reason this pattern dominates is that sparse and dense retrieval fail in complementary ways. Keyword search nails exact tokens (SKUs, product names, error codes, acronyms) but misses paraphrase ("laptop" vs "notebook"). Vector search captures meaning but hallucinates exact identifiers into "approximate nonsense" (ParadeDB's phrase). Combining them via hybrid fusion, then reranking, is what takes retrieval precision from the ~60% range of pure-vector to the ~84% range reported across multiple Postgres/RRF write-ups (jkatz05, ParadeDB, DEV community benchmarks). That delta is the entire business case for hybrid.
Stage 0: the indexing pipeline (the part everyone underestimates)
Before any query runs, content must get into an index. In a CMS this is the load-bearing wall, and it has four sub-problems.
Extraction & normalization. Strip your structured content (headless CMS JSON, Markdown, rich-text AST) into clean text plus metadata. Preserve title, section headings, URL, locale, content type, tags, author, publish date, and access level — these become both search fields and facets later.
Chunking (for vector/RAG only — keyword search indexes whole docs). The consensus 2025 recipe: split on semantic boundaries (H2/H3 headings), target 200–800 tokens per chunk, and prepend the document title and section heading into each chunk so an isolated chunk still carries context. Over-large chunks dilute the embedding; tiny chunks lose context. For a CMS, heading-based ("Markdown-aware") chunking beats fixed-size character splitting because your editors already wrote the structure.
Embedding. Each chunk is sent to an embedding model. Batch these calls — embedding APIs bill per token and reward batching.
Freshness & incremental updates. This is the make-or-break operational concern. A full reindex of a real docs site (~12,000 files) takes roughly 22 minutes and costs ~$8.50 in embedding API calls; an incremental update of the 10 files that actually changed takes ~45 seconds and costs ~$0.07 (figures from a 2025 incremental-indexing write-up). The architecture that achieves this:
Webhook-driven, not cron-driven. Wire the CMS publish/update/delete events to a webhook that enqueues an indexing job. Editors expect their change to be searchable in seconds, not at the next nightly batch.
Content hashing for idempotency. Store a content_hash (or updatedAt version) per chunk. On a publish event, re-embed only chunks whose hash changed. Unchanged chunks skip the embedding call entirely — this is where most of the cost savings come from.
Make the pipeline idempotent. A webhook can fire twice; a queue can redeliver. Upsert by stable chunk ID so replays are harmless.
Handle deletes and unpublishes. A draft reverted to private must leave the index immediately, or you leak unpublished content into search and into the RAG answer box.
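The hash-diff, idempotent-upsert, and delete rules above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Chunk` fields, the `doc_id#position` ID scheme, and the `plan_update` / `plan_unpublish` names are all assumptions made for the example, and the actual embedding and upsert calls are left to whichever engine you use.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str     # CMS document ID (hypothetical field name)
    position: int   # order of the chunk within the document
    text: str

    @property
    def chunk_id(self) -> str:
        # Stable ID: the same document and position always map to the same key,
        # so a replayed webhook overwrites instead of duplicating.
        return f"{self.doc_id}#{self.position}"

    @property
    def content_hash(self) -> str:
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()


def plan_update(indexed: dict[str, str], fresh: list[Chunk]) -> tuple[list[Chunk], list[str]]:
    """Diff one document's freshly extracted chunks against the index.

    indexed maps chunk_id -> content_hash for what is currently stored.
    Returns (chunks to embed and upsert, chunk IDs to delete).
    """
    to_embed = [c for c in fresh if indexed.get(c.chunk_id) != c.content_hash]
    fresh_ids = {c.chunk_id for c in fresh}
    to_delete = [cid for cid in indexed if cid not in fresh_ids]
    return to_embed, to_delete


def plan_unpublish(indexed: dict[str, str]) -> list[str]:
    # An unpublished or deleted document must leave the index entirely.
    return list(indexed)
```

Running `plan_update` twice with the same input yields an empty plan the second time, which is the property that makes webhook replays harmless. One trade-off of position-based IDs is that inserting a section near the top shifts every later chunk and forces those chunks to be re-embedded; keying on a heading slug instead would avoid that at the cost of handling renamed headings.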
For larger or multi-source systems, change-data-capture (CDC) tools (Debezium, or managed connectors via Airbyte/Fivetran) stream row-level changes within seconds, replacing batch syncs that miss updates between runs.
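Returning to the chunking recipe described earlier in this section, a minimal sketch of heading-based chunking might look like the following. It assumes Markdown input with `##` / `###` headings and approximates tokens by counting whitespace-separated words, which a real tokenizer will not match exactly. The function names are illustrative and do not come from any library.

```python
import re

HEADING = re.compile(r"^(#{2,3})\s+(.*)$")  # matches H2 and H3 lines only


def split_sections(markdown: str) -> list[tuple[str, str]]:
    """Split a Markdown body into (heading, text) pairs at H2/H3 boundaries."""
    sections: list[tuple[str, list[str]]] = [("", [])]  # text before the first heading
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append((match.group(2).strip(), []))
        else:
            sections[-1][1].append(line)
    result = []
    for heading, body in sections:
        text = "\n".join(body).strip()
        if text:
            result.append((heading, text))
    return result


def chunk_document(title: str, markdown: str, max_tokens: int = 800) -> list[str]:
    """Return chunks with the document title and section heading prepended."""
    chunks = []
    for heading, text in split_sections(markdown):
        prefix = f"{title}\n{heading}\n" if heading else f"{title}\n"
        current: list[str] = []
        count = 0
        for para in re.split(r"\n\s*\n", text):
            n = len(para.split())  # crude token estimate: whitespace words
            if current and count + n > max_tokens:
                chunks.append(prefix + "\n\n".join(current))
                current, count = [], 0
            current.append(para)
            count += n
        if current:
            chunks.append(prefix + "\n\n".join(current))
    return chunks
```

Sections that fit the budget become one chunk each; longer sections are split on paragraph boundaries, and every piece repeats the title and heading so it still carries context in isolation. A single paragraph larger than `max_tokens` is kept whole in this sketch rather than split mid-sentence.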
Keyword engines compared
For most CMS projects you will pick exactly one keyword/primary engine. Here is the 2026 landscape.
| Engine | Hosting | Vector/Hybrid | Auto-embedding | As-you-type / typo tolerance | Best fit |
|---|---|---|---|---|---|
| Postgres FTS + pgvector | Your DB | Yes (manual SQL + RRF) | No (you call the API) | Weak typo handling | You already run Postgres; modest catalog; want one system |
| Meilisearch | Self-host / Meilisearch Cloud | Hybrid built-in | Yes (can generate embeddings) | Excellent, instant | Instant-search UX, content sites, has admin dashboard |
| Typesense | Self-host / Typesense Cloud | Built-in vector + hybrid | Yes (S-BERT/E5, or Azure OpenAI/GCP) | Excellent, in-RAM | Sub-50ms latency, built-in RAG/conversational search, Raft clustering in OSS |
| Elasticsearch / OpenSearch | Self-host / Elastic Cloud / AWS | Yes (dense_vector + BM25) | Via inference API | Strong but heavier | Large scale, complex aggregations, log+search reuse |
| Algolia | SaaS only | NeuralSearch (keyword+vector) | Managed | Best-in-class | No-ops, e-commerce, willing to pay; NeuralSearch is top-tier ("Elevate") plan only |
Key architectural differences worth knowing:
Typesense keeps the entire index in RAM, which is why sub-50ms is normal — but you size servers by index size. It ships built-in vector/semantic search (S-BERT, E5), conversational search with built-in RAG, and Raft-based clustering in the open-source build. It is API-first with no admin UI.
Meilisearch uses LMDB (memory-mapped on disk), so the OS pages in only hot data — gentler on RAM, slightly different latency profile. Its differentiator is a built-in admin dashboard for browsing indexes and tuning settings visually. Horizontal sharding exists but is Enterprise-Edition; OSS is single-node.
Postgres FTS (tsvector/tsquery) is "functional for basic text matching" but its ranking is weak and typo tolerance is poor. Its superpower is that it lives in the same transaction as your content, so consistency is free. For BM25-quality lexical scoring in Postgres, the ParadeDB extension (pg_search) adds a real BM25 index; for billion-scale vectors, pgvectorscale adds StreamingDiskANN (disk-resident graph index, hot part in RAM).
Algolia is pure SaaS, fastest to ship, with the best instant-search and faceting ergonomics — but it is the priciest, NeuralSearch is gated to the top plan, and you cannot self-host.
Decision heuristic: If you already run Postgres and your catalog is under a few million rows, start with Postgres FTS + pgvector + RRF — one system, transactional freshness, no extra ops. If you need instant-search UX or sub-50ms at scale, reach for Typesense or Meilisearch. Choose Algolia when you want zero search ops and will pay for it; choose Elasticsearch/OpenSearch when you also need heavy aggregations/logging or are already invested in it.
Vector search and embedding model choice
Vector search turns the query and each chunk into high-dimensional vectors and finds nearest neighbors (cosine/dot-product), usually via an HNSW or DiskANN index. The quality ceiling is set by your embedding model.
| Model | Dimensions | Context | Price per 1M tokens | Notes (2025–26) |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | up to 1536 (Matryoshka) | 8K | ~$0.02 | Cheap default, strong baseline |
| OpenAI text-embedding-3-large | 256–3072 (configurable) | 8K | ~$0.13 | Higher quality, 2–3× storage |
| Google text-embedding-005 / Gemini | 768 / up to 3072 | long | ~$0.006 | Cheapest production API |
| Cohere Embed v4 | 256–1536 | 128K | ~$0.12 | Long-context, multimodal, top MTEB |
| Voyage voyage-3-large | 1024 (Matryoshka) | 32K | ~$0.18 | Best on code/legal/medical/finance (+4–6 MTEB on domain retrieval) |
| BGE / open models | varies | varies | self-host | Free at inference; you run the GPU |
Practical guidance:
Dimensions are a storage/quality knob. 768–1024 dims serve most sites well. Matryoshka models (OpenAI, Cohere, Voyage) let you truncate to 256/512 with minor quality loss — halve your vector storage and speed up search for a small recall hit. A 3072-dim model costs 3× the storage of a 1024-dim one.
Match the model to the corpus. Generic blog/marketing content: text-embedding-3-small or Google's model. Technical docs, code, legal, or financial content: Voyage measurably wins. Long documents you don't want to chunk aggressively: Cohere Embed v4's 128K window.
Pin the model version. Embeddings from different models (or even versions) are not comparable. Re-embedding the whole corpus on a model change is a deliberate, budgeted migration — store the model name alongside each vector.
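The storage/quality knob above comes down to simple arithmetic, sketched here. The truncation helper assumes the vector came from a Matryoshka-trained model (truncating an ordinary embedding this way would degrade it badly) and assumes float32 storage; both function names are illustrative. Where a provider exposes an output-dimension parameter, requesting the smaller size directly is usually the cleaner option.

```python
import math


def truncate_embedding(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and re-normalize to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def vector_storage_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone (float32 by default), ignoring index overhead."""
    return num_vectors * dims * bytes_per_value
```

For one million chunks, 1024 dimensions come to about 4.1 GB of raw vectors and 3072 dimensions to about 12.3 GB, which is the 3× ratio mentioned above. Re-normalizing after truncation keeps cosine and dot-product scores consistent with each other.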
Hybrid fusion: RRF vs weighted
Once you have a sparse list (BM25) and a dense list (vector), you must merge them. Two methods dominate.
Reciprocal Rank Fusion (RRF). Ignore the raw scores; use only rank position. For each document, sum 1 / (k + rank) over every result list it appears in (k≈60 is the standard constant), then sort by that sum. RRF is the robust default because it sidesteps the problem of comparing a BM25 score (unbounded) with a cosine similarity (bounded, between −1 and 1): the two live on incomparable scales. Azure AI Search, Supabase's hybrid recipe, and most Postgres examples all default to RRF.
Weighted (convex) fusion. Normalize both scores to 0–1, then final = α·dense + (1−α)·sparse. This lets you tune the balance (e.g., α=0.7 leans semantic). More powerful when you know your corpus, but fragile — you have to re-tune α and re-normalize when distributions shift. Typesense and Meilisearch expose a hybrid weight along these lines.
Rule of thumb: ship RRF first (robust, no tuning), move to weighted fusion only if you have an eval set proving a specific α beats RRF on your queries.
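Both fusion methods fit in a short sketch. The version below assumes each retriever returns document IDs (for RRF, as ordered lists; for weighted fusion, as ID-to-score maps) and uses min-max scaling as the normalization step, which is one common choice rather than the only one. Function names are illustrative.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every list; ranks start at 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a deterministic order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to the 0..1 range; a constant list maps every document to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fuse(dense: dict[str, float], sparse: dict[str, float],
                  alpha: float = 0.7) -> list[tuple[str, float]]:
    """final = alpha * dense + (1 - alpha) * sparse, on min-max normalized scores."""
    d, s = min_max(dense), min_max(sparse)
    fused = {
        doc_id: alpha * d.get(doc_id, 0.0) + (1 - alpha) * s.get(doc_id, 0.0)
        for doc_id in set(d) | set(s)
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

Note how `rrf_fuse` never looks at a raw score, while `weighted_fuse` depends on the score distribution of each list through `min_max`. That dependence is why the weighted variant needs re-tuning when the corpus or the model changes.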
A minimal Postgres hybrid query is roughly: run a lexical query (ts_rank, or BM25 via pg_search) and a pgvector cosine-distance query (the <=> operator) as two CTEs, assign row_number() ranks in each, full-outer-join on document id, and order by the RRF sum. Supabase ships this as a documented SQL function.
Reranking: the precision stage
Reranking takes the fused top-N (say 50–100) and rescores with a cross-encoder — a model that reads query+document jointly, far more accurate than the bi-encoder embeddings used for retrieval (which encode each side independently). This is consistently the highest-ROI quality upgrade in retrieval.
| Reranker | Type | Latency (typical) | Notable |
|---|---|---|---|
| Cohere Rerank 3.5 / Rerank 4 | API | ~595–603ms; +100–300ms/query | Industry standard, strong multilingual |
| Voyage rerank-2.5 | API | ~595ms; instruction-following | +7.94% over Cohere v3.5 across 93 datasets (Voyage); steerable via prompt |
| bge-reranker-v2-m3 | Open-source cross-encoder | GPU-dependent, often faster | Free, self-host; near-Cohere accuracy |
| FlashRank / MiniLM cross-encoders | Open-source, tiny | Very low (CPU-capable) | Cheap, good enough for many sites |
| LLM-as-reranker (GPT/Claude/Gemini) | LLM | +4–6 seconds | 5–8% accuracy upside on some tasks but slow + costly; Voyage argues against it for production |
Notes from 2025–26 vendor and third-party benchmarks:
The new capability in 2025 is instruction-following reranking (Voyage rerank-2.5): you prepend a natural-language instruction ("prefer recent posts," "favor how-to guides") to steer relevance.
Voyage reports rerank-2.5 is 9× / 36× / 48× faster than using Claude Sonnet 4.5 / GPT-5 / Gemini 2.5 Pro as rerankers — quantifying why dedicated cross-encoders beat LLM reranking for latency-sensitive site search.
Open-source bge-reranker-v2-m3 matches Cohere closely on NDCG@10; the real differentiator is the operating model (hosted dependency vs. you running a GPU), not raw accuracy.
When to skip reranking: small corpora where hybrid alone is already precise, or hard latency budgets under ~150ms with no GPU. Otherwise, a reranker on the top-50 is the single best quality lever.
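The reranking stage itself is a small amount of glue code around whichever scorer you choose. The sketch below keeps the scorer pluggable; `overlap_scorer` is a deliberately crude stand-in so the example runs without a model, and it is not a cross-encoder. In a real build you would swap in a hosted rerank API or a self-hosted cross-encoder behind the same `(query, text) -> float` signature. All names here are illustrative.

```python
from typing import Callable

Scorer = Callable[[str, str], float]  # (query, document text) -> relevance score


def rerank(query: str, candidates: list[dict], score: Scorer, top_k: int = 10) -> list[dict]:
    """Score each candidate jointly with the query and keep the best top_k."""
    scored = [(score(query, doc["text"]), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep fused order
    return [dict(doc, rerank_score=s) for s, doc in scored[:top_k]]


def overlap_scorer(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query terms present in the text."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0
```

Because the sort is stable, candidates that the scorer cannot separate keep the order the fusion step gave them, so the reranker only overrides fusion where it has an opinion.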
"Chat with the site" (RAG over your content)
Conversational search is hybrid retrieval + reranking with an LLM generation step bolted on the end. The pipeline:
Query understanding — optionally rewrite the user's question (expand acronyms, resolve "it"/"that" against chat history) before retrieval.
Retrieve — hybrid search over your chunked content → top ~50.
Rerank → top 5–10 chunks.
Generate — feed those chunks + the question to an LLM with a grounding prompt: answer only from the provided context; cite sources; say "I don't know" if not covered.
Cite — return the source URLs/titles so the answer is verifiable and clickable back into the CMS.
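The generate and cite steps can be sketched as a prompt builder plus a function that maps citation markers back to sources. The chunk fields (`title`, `url`, `text`), the prompt wording, and the `[n]` citation format are assumptions for illustration; the call to the LLM itself is omitted because it depends on your provider.

```python
import re


def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a prompt that restricts the model to numbered, citable sources."""
    sources = "\n\n".join(
        f"[{i}] {c['title']} ({c['url']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the sources below.\n"
        "Cite sources by their bracketed number, e.g. [1].\n"
        'If the sources do not cover the question, reply "I don\'t know".\n\n'
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )


def cited_sources(answer: str, chunks: list[dict]) -> list[dict]:
    """Map [n] markers in the model's answer back to source title/URL pairs."""
    seen: list[int] = []
    for number in re.findall(r"\[(\d+)\]", answer):
        index = int(number)
        # Ignore markers that point outside the supplied sources.
        if 1 <= index <= len(chunks) and index not in seen:
            seen.append(index)
    return [{"title": chunks[i - 1]["title"], "url": chunks[i - 1]["url"]} for i in seen]
```

Dropping out-of-range markers is a cheap guard against the model inventing a citation number; an answer with no valid citations at all is a reasonable signal to show a refusal instead.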
CMS-specific disciplines that separate a working "chat with the site" from a liability:
Reuse the exact same index as site search. One chunked, embedded, freshness-managed corpus serves both the search box and the chat box. Do not build a second pipeline.
Respect access control at retrieval time. Filter chunks by the requesting user's permissions and by published status before the LLM sees them. A RAG bot that quotes an unpublished draft is a content leak.
Grounding + refusal beats fluency. One 2025 retailer case cited by Forbes saw a 25% engagement lift from RAG-driven search and recommendations, but the cost of a confident wrong answer is brand damage. Force citations; constrain to retrieved context.
The LLM completion, not the rerank, is the dominant cost. Cache answers for common queries; use a small/cheap model for query rewriting and a stronger one only for final generation.
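Two of the disciplines above interact in a way that is easy to get wrong: if answers are cached, the cache key has to include the permission scope, or a cached answer built from restricted chunks can be served to a user who should not see it. A minimal sketch, assuming each chunk carries hypothetical `status` and `access_level` fields and each user has a set of role names:

```python
import hashlib


def visible_chunks(chunks: list[dict], user_roles: set[str]) -> list[dict]:
    """Keep only published chunks whose access level the requesting user holds."""
    return [
        c for c in chunks
        if c["status"] == "published" and c["access_level"] in user_roles
    ]


def answer_cache_key(query: str, user_roles: set[str]) -> str:
    """Cache key scoped by permissions, so answers never cross access boundaries."""
    normalized = " ".join(query.lower().split())  # fold case and whitespace
    scope = ",".join(sorted(user_roles))
    return hashlib.sha256(f"{scope}|{normalized}".encode("utf-8")).hexdigest()
```

In practice the permission filter is better pushed down into the engine's own filter syntax so that restricted chunks are never retrieved at all; the in-memory filter here only shows the rule being enforced.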
Recommendations & discovery
Search is pull; recommendations are push. Three buildable tiers:
Content-based ("more like this"). Reuse your embeddings: nearest-neighbor on the current article's vector gives "related articles" for free, with zero extra infrastructure. This is the highest-leverage, lowest-cost win in a CMS.
Behavioral / collaborative. "People who read this also read…" from click co-occurrence. Algolia Recommend ships four prebuilt models (related products, frequently-bought-together, trending, looking-similar) if you don't want to build it; you feed it front-end events.
Personalized ranking. Build per-user affinity profiles from interaction history (clicks, reads, purchases) and weight search results by them. Algolia Personalization and Elastic both expose this; it requires reliable front-end event capture and raises privacy/consent obligations.
Start with content-based (you already have the vectors), add behavioral once you have traffic, and treat personalization as a later, consent-gated phase.
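The content-based tier is small enough to show in full. This sketch does a brute-force cosine scan over one vector per article, which is an assumption for illustration (for example, a title-plus-summary embedding, or the mean of the article's chunk vectors); a production build would issue the same query against the engine's HNSW or DiskANN index instead of scanning in Python.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def related(current_id: str, vectors: dict[str, list[float]], top_k: int = 3) -> list[str]:
    """Return the IDs of the top_k articles most similar to the current one."""
    target = vectors[current_id]
    scored = [
        (cosine(target, vec), doc_id)
        for doc_id, vec in vectors.items()
        if doc_id != current_id  # never recommend the article being read
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]
```

The same access-control and published-status filters used for search should be applied to these candidates before display, since "related articles" is another path by which unpublished content can leak.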
Faceting & filtered search
Facets are the filterable categories (type, tag, author, date, price) with live result counts that let users drill down. Implementation notes:
Index facets as filterable/faceted attributes — every engine above supports this (Algolia facets with distribution counts, Typesense/Meilisearch filterable attributes, Postgres via indexed columns + GROUP BY counts, Elasticsearch aggregations).
Filters must apply at retrieval, including in vector search. pgvector and the dedicated engines support metadata pre/post-filtering on the vector query, which is critical so that "semantic search within type:tutorial and locale:en" returns correct counts and results.
Counts are a UX contract. If a facet says "Tutorials (42)," clicking it must yield 42. Compute counts from the same filtered candidate set, post-fusion, to avoid the classic mismatch.
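The count contract is easiest to keep when counts and results come from one function path. The sketch below computes facet counts from the same filtered candidate list that produces the results; the field names and the exact-match filter semantics are assumptions, and dedicated engines return these counts natively alongside the hits.

```python
from collections import Counter


def apply_filters(docs: list[dict], filters: dict[str, str]) -> list[dict]:
    """Keep documents whose fields equal every requested filter value."""
    return [
        d for d in docs
        if all(d.get(field) == value for field, value in filters.items())
    ]


def facet_counts(docs: list[dict], field: str) -> Counter:
    """Count facet values over the candidate set that is actually being displayed."""
    return Counter(d[field] for d in docs if field in d)
```

If `facet_counts` is run over the fused, filtered candidates, then clicking a value and re-running `apply_filters` with that value added returns exactly the number of documents the count promised.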
Putting it together: a stack-agnostic reference build
Index: Webhook on CMS publish → queue → extract + normalize → heading-based chunking (200–800 tokens, title/heading prepended) → hash-diff → embed only changed chunks → upsert into engine with full metadata + access level.
Surface: search results page (with facets), "related" via vector kNN, and a RAG chat box that reuses the same hybrid retrieval and reranking stages and adds grounded generation with citations.
Govern: access-control filter at retrieval, delete-on-unpublish, pinned embedding model version, eval set to justify any tuning.
This same skeleton runs on Postgres-only (cheapest, one system), Typesense/Meilisearch (instant UX), or Algolia/Elastic (managed scale) — only the engine swaps; the architecture holds.
Key Takeaways
Treat search as a two-stage funnel: wide hybrid retrieval (recall) → cross-encoder rerank (precision). Hybrid fusion followed by reranking is reported to lift retrieval precision from ~60% (pure vector) to ~84%.
The indexing pipeline is the hard part in a CMS: webhook-driven incremental updates with content hashing turn an $8.50/22-minute full reindex into a $0.07/45-second incremental one.
Sparse and dense search fail in complementary ways — keyword for exact tokens/IDs, vectors for meaning — which is exactly why hybrid + RRF wins.
RRF is the robust default fusion (rank-based, no scale problem); switch to weighted (α-tuned) only when an eval set proves it helps.
Match the engine to the constraint: Postgres+pgvector (one system, transactional), Typesense/Meilisearch (sub-50ms instant UX), Algolia (no-ops, premium), Elastic/OpenSearch (scale + aggregations).
Match the embedding model to the corpus (Voyage for code/legal/medical, Cohere v4 for long context, OpenAI/Google for cheap general) and pin the version; Matryoshka truncation trades storage for minor quality.
Reranking is the top quality lever — Cohere/Voyage APIs (~600ms) or self-hosted bge-reranker-v2-m3; avoid LLM-as-reranker for latency-sensitive search (+4–6s).
"Chat with the site" reuses the same index as search, must filter by access control before generation, and must cite sources and refuse when ungrounded.
Recommendations start free: nearest-neighbor on existing embeddings gives "related content"; add behavioral and personalized tiers later, consent-gated.