Building Your Own AI-Powered CMS (2026) — A Stack-Agnostic Architecture & Blueprint

Chapter 14: Search & Discovery Implementation

Chapter 14 of 22 · ~15 min read

Overview

This chapter is the build guide for the discovery surface of an AI-native CMS: how a reader (and increasingly an AI agent) actually finds content. We move concretely from keyword search engines (Typesense, Meilisearch, Postgres FTS, Elasticsearch/OpenSearch) through vector search, hybrid fusion (RRF and weighted), and cross-encoder reranking, then up to "chat with the site" (RAG), recommendations, and faceting. The throughline is the indexing pipeline — how content flows from the CMS into a fresh, cheap, queryable index — because in a CMS the hard part is not the query, it is keeping the index correct as editors publish. Everything here is stack-agnostic: we compare options and call out where each one earns its place.

Content

The two-stage mental model

Modern site search is best understood as a funnel, not a single query. Every serious 2025–2026 architecture is some version of a two-stage pipeline:

Candidate retrieval (recall-optimized). Cast a wide net — pull the top ~50–200 documents using fast, cheap retrieval. This is where keyword (BM25/sparse) and vector (dense) search live. The job here is "don't lose the right answer," not "rank it #1."
Reranking (precision-optimized). Take that candidate pool and rescore it with a slower, smarter model — a cross-encoder — that reads query and document together. This is where final ordering is decided.
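Here is a minimal Python sketch of the funnel. It assumes an in-memory corpus and two toy scorers: `term_overlap` stands in for BM25 or vector retrieval, and `phrase_bonus` stands in for a cross-encoder. All names and pool sizes are illustrative. This is not any engine's API.

```python
from typing import Callable, Sequence

Doc = dict  # hypothetical shape: {"id": str, "text": str}
Scorer = Callable[[str, Doc], float]

def two_stage_search(
    query: str,
    corpus: Sequence[Doc],
    cheap_score: Scorer,
    expensive_score: Scorer,
    pool_size: int = 100,  # stage-1 recall budget (the ~50-200 range)
    top_k: int = 10,       # results actually shown to the reader
) -> list[Doc]:
    # Stage 1 (recall): rank everything cheaply and keep a wide pool.
    # A real engine gets this pool from an inverted or ANN index instead of a full scan.
    pool = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:pool_size]
    # Stage 2 (precision): rescore only the pool with the slow, query-aware model.
    reranked = sorted(pool, key=lambda d: expensive_score(query, d), reverse=True)
    return reranked[:top_k]

def term_overlap(query: str, doc: Doc) -> float:
    # Toy stand-in for BM25: number of shared lowercase tokens.
    return float(len(set(query.lower().split()) & set(doc["text"].lower().split())))

def phrase_bonus(query: str, doc: Doc) -> float:
    # Toy stand-in for a cross-encoder: reads query and doc together, rewards the exact phrase.
    return term_overlap(query, doc) + (5.0 if query.lower() in doc["text"].lower() else 0.0)

corpus = [
    {"id": "a", "text": "Reset your password from settings"},
    {"id": "b", "text": "Password reset email not arriving"},
    {"id": "c", "text": "Billing and invoices"},
]
hits = two_stage_search("password reset", corpus, term_overlap, phrase_bonus, pool_size=2, top_k=2)
print([d["id"] for d in hits])  # prints ['b', 'a']
```

Stage 1 cannot separate "a" and "b" because both share the same two tokens. Stage 2 reads the exact phrase and promotes "b". To swap in real components, you only replace the two scoring callables. The shape of the pipeline stays the same: a wide pool, then a narrow rerank.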

The reason this pattern dominates is that sparse and dense retrieval fail in complementary ways. Keyword search nails exact tokens (SKUs, product names, error codes, acronyms) but misses paraphrase ("laptop" vs "notebook"). Vector search captures meaning but blurs exact identifiers into "approximate nonsense" (ParadeDB's phrase). For example, a query for one SKU surfaces other products with similar descriptions. Combining the two via hybrid fusion and then reranking moves retrieval precision from roughly 60% for pure vector search to roughly 84%. Those figures are reported across multiple Postgres/RRF write-ups (jkatz05, ParadeDB, DEV community benchmarks). That delta is the entire business case for hybrid.
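Reciprocal Rank Fusion (RRF) is the simplest way to combine the two lists. It scores each document as the sum, over every ranking it appears in, of 1 / (k + rank). The sketch below uses k = 60, the constant from the original RRF paper and a commonly used default. It assumes each retriever returns an ordered list of document IDs; the IDs themselves are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of doc IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only rank position is used, so BM25 and cosine scores never need a shared scale.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

keyword_hits = ["sku-4417", "faq-returns", "blog-laptops"]    # BM25: exact SKU match first
vector_hits = ["blog-notebooks", "blog-laptops", "sku-4417"]  # dense: paraphrase match first
for doc_id, score in reciprocal_rank_fusion([keyword_hits, vector_hits]):
    print(f"{doc_id:15} {score:.5f}")
```

`sku-4417` and `blog-laptops` were found by both retrievers, so they rise above the single-list hits. This is the "don't lose the right answer" behavior stage 1 is after.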

Stage 0: the indexing pipeline (the part everyone underestimates)

Before any query runs, content must get into an index. In a CMS this is the load-bearing wall, and it has four sub-problems.

Extraction & normalization. Flatten your structured content (headless CMS JSON, Markdown, rich-text AST) into clean text plus metadata. Preserve title, section headings, URL, locale, content type, tags, author, publish date, and access level; these become both search fields and facets later.
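Here is a minimal Python sketch of this step. It assumes a generic rich-text AST in which leaf nodes carry `text` and containers carry `children`, plus a `meta` object with the fields named above. That document shape and every function name here are hypothetical, so map them onto your CMS's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class IndexRecord:
    # One searchable record; metadata fields double as filters and facets later.
    id: str
    title: str
    url: str
    locale: str
    content_type: str
    author: str
    publish_date: str  # ISO 8601, e.g. "2026-01-15"
    access_level: str
    tags: list[str] = field(default_factory=list)
    headings: list[str] = field(default_factory=list)
    body: str = ""

def clean(text: str) -> str:
    # Collapse runs of whitespace (editor-inserted newlines, double spaces).
    return " ".join(text.split())

def node_text(node: dict) -> str:
    # Leaf nodes carry "text"; container nodes carry "children".
    if "text" in node:
        return node["text"]
    return "".join(node_text(child) for child in node.get("children", []))

def normalize(doc: dict) -> IndexRecord:
    meta = doc["meta"]
    blocks = doc["body"]["children"]  # top-level blocks of the hypothetical rich-text AST
    return IndexRecord(
        id=doc["id"],
        title=clean(meta["title"]),
        url=meta["url"],
        locale=meta.get("locale", "en"),             # assumption: site default locale
        content_type=meta["type"],
        author=meta.get("author", ""),
        publish_date=meta["published_at"],
        access_level=meta.get("access", "private"),  # fail closed: never default to public
        tags=sorted(set(meta.get("tags", []))),
        headings=[clean(node_text(b)) for b in blocks if b.get("type") == "heading"],
        body="\n".join(clean(node_text(b)) for b in blocks),
    )
```

Two choices here are deliberate:

- Whitespace is collapsed, so editor formatting never leaks into tokens.
- A missing access level fails closed to `private`, so a content-model bug cannot expose restricted pages through search.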

Keyword engines compared

Engine	Hosting	Vector/hybrid	Auto-embedding	Typo tolerance	Best fit
Postgres FTS + pgvector	Your DB	Yes (manual SQL + RRF)	No (you call the API)	Weak typo handling	You already run Postgres; modest catalog; want one system
Meilisearch	Self-host / Meilisearch Cloud	Hybrid built-in	Yes (can generate embeddings)	Excellent, instant	Instant-search UX, content sites, has admin dashboard
Typesense	Self-host / Typesense Cloud	Built-in vector + hybrid	Yes (S-BERT/E5, or Azure OpenAI/GCP)	Excellent, in-RAM	Sub-50ms latency, built-in RAG/conversational search, Raft clustering in OSS
Elasticsearch / OpenSearch	Self-host / Elastic Cloud / AWS	Yes (dense_vector + BM25)	Via inference API	Strong but heavier	Large scale, complex aggregations, log+search reuse
Algolia	SaaS only	NeuralSearch (keyword+vector)	Managed	Best-in-class	No-ops, e-commerce, willing to pay; NeuralSearch is top-tier ("Elevate") plan only
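Typo tolerance is the column where Meilisearch and Typesense pull ahead of Postgres FTS. In practice it usually means allowing more edits for longer words.

The sketch below assumes two things:

- Thresholds modeled on Meilisearch's default `minWordSizeForTypos`: one typo from 5 characters, two from 9.
- Plain optimal-string-alignment edit distance, which counts insertions, deletions, substitutions, and adjacent swaps.

Real engines use their own automata plus prefix matching. Treat this as an illustration of the policy, not of either engine's internals.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insert, delete, substitute, adjacent swap."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Adjacent transposition ("serach" -> "search") counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

def allowed_typos(word: str, one_typo: int = 5, two_typos: int = 9) -> int:
    # Thresholds modeled on Meilisearch's default minWordSizeForTypos.
    if len(word) >= two_typos:
        return 2
    if len(word) >= one_typo:
        return 1
    return 0

def fuzzy_match(query_word: str, indexed_word: str) -> bool:
    q, w = query_word.lower(), indexed_word.lower()
    return osa_distance(q, w) <= allowed_typos(q)

for q, w in [("serach", "search"), ("recomendation", "recommendation"),
             ("sku", "skv"), ("laptop", "notebook")]:
    print(f"{q!r} ~ {w!r}: {fuzzy_match(q, w)}")
```

Short tokens get no tolerance, so `sku` never fuzzes into `skv`. This is the same instinct behind keeping exact identifiers on the keyword side of a hybrid stack. Note also that `laptop` still never matches `notebook`: typo tolerance is not synonym handling.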

Vector search and embedding model choice

Model	Dimensions	Context window	Price per 1M tokens	Notes (2025–26)
OpenAI text-embedding-3-small	up to 1536 (Matryoshka)	8K	~$0.02	Cheap default, strong baseline
OpenAI text-embedding-3-large	256–3072 (configurable)	8K	~$0.13	Higher quality, 2–3× storage
Google text-embedding-005 / Gemini	768 / up to 3072	long	~$0.006	Cheapest production API
Cohere Embed v4	256–1536	128K	~$0.12	Long-context, multimodal, top MTEB
Voyage voyage-3-large	1024 (Matryoshka)	32K	~$0.18	Best on code/legal/medical/finance (+4–6 MTEB on domain retrieval)
BGE / open models	varies	varies	self-host	Free at inference; you run the GPU
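Several models in the table are marked as Matryoshka-trained. Their vectors can be shortened by keeping a prefix and re-normalizing, and raw storage scales linearly with dimension count. With OpenAI's v3 models, the API's `dimensions` parameter does this shortening server-side.

Below is a back-of-envelope Python sketch. It assumes float32 storage (4 bytes per dimension), ignores ANN index overhead, and uses the approximate list prices from the table. The corpus size and token count in the example are made up.

```python
import math

def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components of a Matryoshka embedding and L2-renormalize."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    # Re-normalizing keeps cosine and dot-product scores comparable after truncation.
    return [x / norm for x in head] if norm > 0 else head

def raw_vector_gb(num_chunks: int, dims: int, bytes_per_dim: int = 4) -> float:
    # float32 by default; excludes HNSW/IVF graph overhead, metadata, and replicas.
    return num_chunks * dims * bytes_per_dim / 1e9

def embedding_cost_usd(total_tokens: int, price_per_million: float) -> float:
    return total_tokens / 1_000_000 * price_per_million

chunks, tokens = 2_000_000, 800_000_000  # hypothetical corpus: 2M chunks, 800M tokens
for model, dims, price in [
    ("text-embedding-3-small", 1536, 0.02),
    ("text-embedding-3-large", 3072, 0.13),
    ("text-embedding-3-large @1024", 1024, 0.13),
]:
    print(f"{model:30} {raw_vector_gb(chunks, dims):6.1f} GB  ${embedding_cost_usd(tokens, price):,.2f}")
```

Truncating text-embedding-3-large from 3072 to 1024 dimensions cuts raw vector storage to a third, but it does not change the embedding bill. Whether recall holds up at the smaller size is something to measure on your own queries.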

Reranking: the precision stage

Reranker	Type	Latency (typical)	Notable
Cohere Rerank 3.5 / Rerank 4	API	~595–603ms; +100–300ms/query	Industry standard, strong multilingual
Voyage rerank-2.5	API	~595ms; instruction-following	+7.94% over Cohere v3.5 across 93 datasets (Voyage); steerable via prompt
bge-reranker-v2-m3	Open-source cross-encoder	GPU-dependent, often faster	Free, self-host; near-Cohere accuracy
FlashRank / MiniLM cross-encoders	Open-source, tiny	Very low (CPU-capable)	Cheap, good enough for many sites
LLM-as-reranker (GPT/Claude/Gemini)	LLM	+4–6 seconds	5–8% accuracy upside on some tasks but slow + costly; Voyage argues against it for production
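Rerankers sit on the query's critical path, so it helps to cap how many candidates they see and to bound how long you wait. This minimal Python sketch assumes a batch scoring callable `score_fn(query, passages)` that returns one relevance score per passage.

With the open-source cross-encoders listed above, `score_fn` could wrap sentence-transformers' `CrossEncoder.predict` over (query, passage) pairs. With an API reranker, it would be a client call. The `top_n` and `budget_s` defaults are placeholders, not recommendations.

```python
import concurrent.futures
from typing import Callable, Sequence

ScoreFn = Callable[[str, Sequence[str]], Sequence[float]]

# Shared worker pool; a timed-out call keeps running in the background but no longer blocks the request.
_RERANK_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def rerank_with_budget(
    query: str,
    candidates: Sequence[str],  # fused hybrid order, best first
    score_fn: ScoreFn,
    top_n: int = 50,            # only the head of the list is worth the reranker's cost
    budget_s: float = 0.3,      # latency budget for the whole rerank call
) -> list[str]:
    head, tail = list(candidates[:top_n]), list(candidates[top_n:])
    future = _RERANK_POOL.submit(score_fn, query, head)
    try:
        scores = future.result(timeout=budget_s)
    except Exception:
        # Timeout or reranker failure: degrade to the fused order rather than failing the search.
        return list(candidates)
    # Sort the head by reranker score (stable for ties); the unreranked tail keeps fused order.
    order = sorted(range(len(head)), key=lambda i: scores[i], reverse=True)
    return [head[i] for i in order] + tail

def length_scorer(query: str, passages: Sequence[str]) -> list[float]:
    # Toy scorer for the example only: longer passage = "more relevant".
    return [float(len(p)) for p in passages]

print(rerank_with_budget("q", ["aa", "a", "aaaa", "b"], length_scorer, top_n=3))  # ['aaaa', 'aa', 'a', 'b']
```

Falling back to the fused order means a slow or unavailable reranker costs you precision, not availability.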
