Building Your Own AI-Powered CMS (2026) — A Stack-Agnostic Architecture & Blueprint

Chapter 12: Retrieval & Knowledge Layer (RAG + MCP)

Chapter 12 of 22 · ~16 min read

Overview

This chapter covers the part of an AI-native CMS that turns your content into something a model can ground itself in. It walks through embeddings, the vector store, chunking your own structured and unstructured content, the retrieval pipeline (hybrid search, reranking, and contextual retrieval) that keeps generation factual, and the Model Context Protocol (MCP) server that lets editors, agents, and external assistants query your CMS as a first-class data source. The treatment is deliberately stack-agnostic: it compares pgvector, Pinecone, Qdrant, Weaviate, and Turbopuffer and contrasts embedding vendors, while giving concrete defaults, numbers, and the pitfalls that bite teams in production.

Content

1. Where the knowledge layer sits in an AI-native CMS

In a 2026 AI-native CMS, the retrieval layer is the bridge between your canonical content (the database of record — entries, fields, media, taxonomy) and any generative surface (an editor assistant, an on-site answer box, a chatbot, an internal knowledge agent, an MCP-connected IDE). The job is narrow and important: given a question or task, return the smallest set of trustworthy, current, permission-aware content snippets that let a model answer without hallucinating.

A clean mental model has four moving parts:

Ingestion & chunking — break content into retrievable units and enrich them.
Embedding — turn each unit into a vector (and keep the text for lexical search).
Vector + lexical store — index for fast filtered nearest-neighbour and keyword search.
Retrieval orchestration — query rewriting, hybrid search, reranking, and (optionally) an agentic control loop, surfaced to apps directly and to agents over MCP.

The cardinal rule: the vector store is a derived index, never the source of truth. Your CMS database owns the content; embeddings are a cache you can always rebuild. This single decision keeps re-embedding (new model, new chunking) a background job rather than a migration crisis.
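
One way to keep the index rebuildable is to store a fingerprint next to every vector and diff it against the database of record. The sketch below is illustrative only: `EMBEDDING_MODEL`, `CHUNKER_VERSION`, `IndexRow`, and `plan_reindex` are hypothetical names, and plain dicts stand in for the CMS database and the vector store.

```python
import hashlib
from dataclasses import dataclass

EMBEDDING_MODEL = "example-embedding-v1"  # hypothetical model identifier
CHUNKER_VERSION = "2"                     # hypothetical chunking-config version


@dataclass(frozen=True)
class IndexRow:
    entry_id: str
    fingerprint: str  # hash of (model, chunker version, text)


def fingerprint(text: str, model: str = EMBEDDING_MODEL,
                chunker: str = CHUNKER_VERSION) -> str:
    """Stable key that changes whenever the text, model, or chunking changes."""
    payload = "\x1f".join([model, chunker, text]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def plan_reindex(entries: dict, index: dict) -> dict:
    """Diff the database of record against the derived index."""
    to_embed = [
        entry_id
        for entry_id, text in entries.items()
        if entry_id not in index
        or index[entry_id].fingerprint != fingerprint(text)
    ]
    # Rows whose source entry no longer exists are stale and can be dropped.
    to_delete = [entry_id for entry_id in index if entry_id not in entries]
    return {"embed": sorted(to_embed), "delete": sorted(to_delete)}


# Toy data: "b" was edited after indexing, "c" was deleted from the CMS.
entries = {"a": "Pricing page copy", "b": "Updated FAQ"}
index = {
    "a": IndexRow("a", fingerprint("Pricing page copy")),
    "b": IndexRow("b", fingerprint("Old FAQ")),
    "c": IndexRow("c", fingerprint("Deleted entry")),
}
print(plan_reindex(entries, index))  # {'embed': ['b'], 'delete': ['c']}
```

Because the model name and chunker version are part of the fingerprint, changing either one marks every row for re-embedding, which can then run as an ordinary background job.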

2. Embeddings: choosing the model

An embedding model maps text to a fixed-length float vector where semantic similarity ≈ geometric proximity. For a CMS, the choice trades off retrieval quality, dimension count (storage + latency), multilingual coverage, and cost. The 2026 landscape:

Model (2026)	Dims (default / min)	~Price per 1M tokens (USD)	Notes
OpenAI `text-embedding-3-small`	1536 / 512	~$0.02	Cheapest safe default; Matryoshka truncation
OpenAI `text-embedding-3-large`	3072 / 256	~$0.13	Strong general MTEB; truncate to 1024 for storage savings
Voyage `voyage-4-large` (Jan 2026, MoE)	up to 2048	~$0.18	Top RTEB scores; ~14% over OpenAI-3-large on NDCG@10 per Voyage
Voyage `voyage-4-lite`	1024	~$0.06	Same vector space as large → embed docs large, query lite, no re-index
Cohere `embed-v4.0`	1536 (configurable)	~$0.12	Best-in-class multilingual; pairs with Cohere Rerank
BGE-M3 / Jina v3 (open weights)	1024	self-host	No per-token cost; multilingual; good for privacy/air-gap
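
To make "similarity ≈ proximity" and the Matryoshka truncation mentioned in the table concrete, here is a minimal sketch. The four-dimensional vectors are invented for illustration and do not come from any real model, and truncation is only meaningful for models trained to support it. `truncate_embedding` and `rank` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def truncate_embedding(vector, dims):
    """Keep the first `dims` components, then L2-renormalize the result."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("truncated vector has zero norm")
    return [x / norm for x in head]


def rank(query, docs, dims=None):
    """Return doc names ordered by similarity, optionally after truncation."""
    if dims is not None:
        query = truncate_embedding(query, dims)
        docs = {name: truncate_embedding(vec, dims) for name, vec in docs.items()}
    return sorted(docs, key=lambda name: cosine_similarity(query, docs[name]),
                  reverse=True)


# Made-up toy vectors, not real model output.
query = [0.9, 0.1, 0.0, 0.2]
docs = {
    "refund-policy": [0.8, 0.2, 0.1, 0.1],
    "office-locations": [0.0, 0.3, 0.9, 0.1],
}
print(rank(query, docs))          # ranking at full dimensionality
print(rank(query, docs, dims=2))  # ranking after truncating to 2 dims
```

With real embeddings, the useful check is whether top-k results on your own content stay stable after truncation, measured on a held-out set of queries before committing to the smaller dimension.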

3. Vector store choice

Store	Deployment model	Best at	Watch-outs
pgvector (0.8.x, Postgres ext.)	self-host / any managed PG	Vectors next to your content; one DB, transactional, joins	Memory-bound past ~50–100M vectors without pgvectorscale
Qdrant	OSS + managed cloud	Fastest filtered search (p50 < 5 ms), payload filters in the index; 1 GB free	Separate service to sync & operate
Weaviate	OSS + managed	Native hybrid search, rich schema, good DX	Heavier to run; tempting to treat as a second system of record
Pinecone (serverless)	managed only	Zero ops, scales past 100M	Proprietary; less recall tuning; serverless adds latency
Turbopuffer	managed, S3-first	Cheapest at scale (~$70/TB/mo vs ~$1,600/TB RAM); native hybrid	$64/mo floor; cold-tier latency on rarely-read data
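
A quick back-of-envelope calculation helps when reading the table. The sketch below assumes float32 vectors (4 bytes per dimension), decimal terabytes, and that both price figures from the table are monthly; it counts raw vector bytes only and ignores index overhead, metadata, replicas, and query charges. The function names are hypothetical.

```python
BYTES_PER_FLOAT32 = 4
BYTES_PER_TB = 1e12  # decimal terabyte


def index_size_tb(num_vectors, dims):
    """Raw size of the vectors alone, in TB."""
    return num_vectors * dims * BYTES_PER_FLOAT32 / BYTES_PER_TB


def monthly_cost(size_tb, price_per_tb, floor=0.0):
    """Storage cost per month, never below the plan's minimum charge."""
    return max(floor, size_tb * price_per_tb)


# 100M chunks at 1024 dims, using the approximate prices from the table.
size = index_size_tb(100_000_000, 1024)
object_storage = monthly_cost(size, 70.0, floor=64.0)  # S3-first tier with floor
ram_resident = monthly_cost(size, 1600.0)              # RAM-resident index

print(f"raw vectors: {size:.4f} TB")
print(f"object storage: ${object_storage:.2f}/mo")
print(f"RAM-resident: ${ram_resident:.2f}/mo")
```

At this size the raw vectors alone are roughly 410 GB, which illustrates why a single Postgres node starts to feel memory pressure in the range the pgvector row warns about.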

5. The retrieval pipeline: hybrid search + reranking

query → (LLM) query rewrite/expand
      → hybrid search: BM25 + vector, RRF fuse, filter by locale/status/role → top 30
      → cross-encoder rerank → top 6
      → assemble prompt with parent-chunk context + citations
      → generate
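
The fusion step in the pipeline above can be sketched with Reciprocal Rank Fusion (RRF), where each document scores the sum of 1 / (k + rank) across the lists it appears in. This is a minimal sketch, not a production implementation: `k=60` is a commonly used constant, the metadata filter is applied in Python for illustration (a real store would apply it inside the index), and `hybrid_candidates` plus the toy IDs are hypothetical. The cross-encoder rerank is left out.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked ID lists with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def hybrid_candidates(bm25_ids, vector_ids, metadata, locale,
                      status="published", top_n=30):
    """Filter both result lists by locale/status, fuse them, keep top_n."""
    def allowed(doc_id):
        meta = metadata.get(doc_id, {})
        return meta.get("locale") == locale and meta.get("status") == status

    fused = rrf_fuse([
        [doc_id for doc_id in bm25_ids if allowed(doc_id)],
        [doc_id for doc_id in vector_ids if allowed(doc_id)],
    ])
    return fused[:top_n]


# Toy inputs: ranked IDs from a lexical search and a vector search.
bm25_ids = ["faq", "pricing", "draft-post"]
vector_ids = ["pricing", "refunds", "faq"]
metadata = {
    "faq": {"locale": "en", "status": "published"},
    "pricing": {"locale": "en", "status": "published"},
    "refunds": {"locale": "en", "status": "published"},
    "draft-post": {"locale": "en", "status": "draft"},
}
print(hybrid_candidates(bm25_ids, vector_ids, metadata, locale="en"))
# ['pricing', 'faq', 'refunds']
```

RRF works on ranks rather than raw scores, so BM25 and cosine scores never need to be normalized against each other. The fused list would then go to the reranker, which trims it to the handful of chunks that reach the prompt.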

7. An MCP server over your CMS

MCP primitive	What it is	CMS mapping
Resources	Read-only data the model/user can load	A published entry by ID/slug; a media asset; a rendered page; a taxonomy term
Tools	Functions the model can invoke	`search_content(query, filters)`, `get_entry(id)`, `list_entries(type, filter)`, `create_draft(...)`, `find_similar(entry_id)`
Prompts	Reusable templated workflows	"Draft a release note from these entries", "Localize this entry to {locale}"
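
To show what the Tools row looks like on the wire, here is a hand-rolled sketch of the two JSON-RPC methods a client uses for tools, `tools/list` and `tools/call`. It is not an MCP SDK and skips the initialization handshake, transports, resources, and prompts; a real server would normally use an official SDK. The in-memory `_ENTRIES` list and the substring match in `search_content` are hypothetical stand-ins for the CMS search backend.

```python
import json

# Hypothetical in-memory stand-in for the CMS search backend.
_ENTRIES = [
    {"id": "e1", "title": "Pricing overview", "locale": "en"},
    {"id": "e2", "title": "Preisübersicht", "locale": "de"},
]

# Tool descriptor: name, description, and a JSON Schema for the arguments.
SEARCH_TOOL = {
    "name": "search_content",
    "description": "Search published CMS entries.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "filters": {"type": "object"},
        },
        "required": ["query"],
    },
}


def search_content(query, filters=None):
    """Toy search: substring match on title plus exact-match filters."""
    filters = filters or {}
    needle = query.lower()
    return [
        entry for entry in _ENTRIES
        if needle in entry["title"].lower()
        and all(entry.get(key) == value for key, value in filters.items())
    ]


def _error(request_id, code, message):
    return {"jsonrpc": "2.0", "id": request_id,
            "error": {"code": code, "message": message}}


def handle_request(request):
    """Dispatch a single JSON-RPC request for the tools methods."""
    request_id = request.get("id")
    method = request.get("method")
    if method == "tools/list":
        result = {"tools": [SEARCH_TOOL]}
    elif method == "tools/call":
        params = request.get("params", {})
        if params.get("name") != SEARCH_TOOL["name"]:
            return _error(request_id, -32602, f"Unknown tool: {params.get('name')}")
        hits = search_content(**params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": json.dumps(hits)}],
                  "isError": False}
    else:
        return _error(request_id, -32601, f"Method not found: {method}")
    return {"jsonrpc": "2.0", "id": request_id, "result": result}


response = handle_request({
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "search_content",
               "arguments": {"query": "pricing", "filters": {"locale": "en"}}},
})
print(json.dumps(response, indent=2))
```

Permission checks belong inside `search_content` (or whatever backs it), so that an agent calling the tool can never see more than the user it acts for.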
