Building Your Own AI-Powered CMS (2026) — A Stack-Agnostic Architecture & Blueprint

Chapter 17: Migrating 1000+ Pages off WordPress

Chapter 17 of 22 · ~18 min read

Overview

This chapter is the field manual for moving a real, messy, decade-old WordPress estate — 1,000+ pages, tens of thousands of media files, custom post types, hand-built page-builder layouts, and accumulated SEO equity — into your new AI-native CMS without losing content, traffic, or sleep. We treat migration as an ETL pipeline (extract → parse → transform → validate → load) wrapped in an SEO/operations envelope (URL mapping, redirects, QA diffing, phased cutover, rollback). The hard parts are never the happy-path posts; they are the long tail of malformed HTML, serialized post-meta, orphaned media, and bespoke shortcodes. Everything below is stack-agnostic: the target could be Sanity, Strapi 5, Payload, Contentful, a Postgres-backed custom CMS, or a Git-based store — the pipeline shape is the same.

Content

17.1 The mental model: migration is ETL with an SEO envelope

A 1,000-page migration is not a "press export, press import" task. The built-in WordPress export (Tools → Export, which produces a WXR XML file) is fine for a 30-minute blog migration but breaks down on large, heavily customized estates (DreamHost KB; thepublive publisher guide). Treat the migration instead as a repeatable, idempotent pipeline that you can run dozens of times against staging before you run it once against production:

[WordPress source]
   │  (1) EXTRACT  — WXR XML + REST API + direct DB read + media files
   ▼
[Raw mirror on disk]  ← canonical, version-controlled, re-runnable input
   │  (2) PARSE    — XML/HTML/JSON/URL parsers; normalize to an intermediate model
   ▼
[Intermediate Representation (IR) — JSON, one record per entity]
   │  (3) TRANSFORM — HTML → structured content (blocks/Portable Text/MDX), LLM-assisted
   ▼
[Target-shaped documents, schema-valid]
   │  (4) VALIDATE — schema, links, media refs, taxonomy, diff vs source
   ▼
[Target CMS ingest API]  ← idempotent upserts keyed by a stable legacy ID

The two rules that save the project: (a) every stage writes to disk so you can re-run any later stage without re-hitting WordPress, and (b) every target document carries a legacyId (the WP post ID) so the load step is an idempotent upsert, never a blind create. You will run the loader 20+ times.
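
The two rules can be sketched in a few lines of Python. This is a minimal illustration rather than a real loader: `InMemoryTarget` is a hypothetical stand-in for whatever ingest API your target CMS exposes, and the file naming simply mirrors the `legacyId` idea above.

```python
import json
from pathlib import Path


def write_stage_output(stage_dir: Path, record: dict) -> Path:
    """Rule (a): persist each record so later stages can re-run without WordPress."""
    stage_dir.mkdir(parents=True, exist_ok=True)
    safe_type = record["type"].replace(":", "-")  # "cpt:product" -> "cpt-product"
    path = stage_dir / f"{safe_type}-{record['legacyId']}.json"
    path.write_text(json.dumps(record, sort_keys=True, indent=2), encoding="utf-8")
    return path


class InMemoryTarget:
    """Hypothetical stand-in for a target CMS ingest API (assumption, not a real SDK)."""

    def __init__(self) -> None:
        self.docs: dict[int, dict] = {}

    def upsert(self, doc: dict) -> str:
        """Rule (b): create or replace by legacyId, never blind-create."""
        key = doc["legacyId"]
        if key not in self.docs:
            action = "created"
        elif self.docs[key] == doc:
            action = "unchanged"
        else:
            action = "updated"
        self.docs[key] = doc
        return action


def load_all(stage_dir: Path, target: InMemoryTarget) -> dict[str, int]:
    """Load every record found on disk and report what this run actually did."""
    counts = {"created": 0, "updated": 0, "unchanged": 0}
    for path in sorted(stage_dir.glob("*.json")):
        doc = json.loads(path.read_text(encoding="utf-8"))
        counts[target.upsert(doc)] += 1
    return counts
```

Running `load_all` twice against the same directory should report only `unchanged` documents on the second pass. That is a cheap way to check that a rehearsal run is safe to repeat.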

17.2 Stage 1 — Extract: WXR, REST, and the database (use all three)

There is no single complete source of truth in WordPress. Use the right extractor per data class:

| Source | Best for | Watch out for |
| --- | --- | --- |
| WXR XML export (`Tools → Export`, or WP-CLI `wp export`) | Posts, pages, CPTs, taxonomy terms, authors, comments, structure | Media is referenced, not embedded: only attachment URLs, not files (WordPress/data-liberation Discussion #56). Serialized post-meta can be double-serialized/corrupted (Trac #10619). Large sites may need `--max_file_size` chunking. |
| REST API (`/wp-json/wp/v2/posts?per_page=100&page=N`, plus `/pages`, `/media`, custom `/wp/v2/<cpt>`) | Rendered HTML, ACF/meta via plugins (`acf-to-rest-api`), predictable JSON, public + (authenticated) private content (Sanity Learn) | Default page cap (`per_page` max 100; paginate). `rendered` HTML runs shortcodes/blocks through filters: good for fidelity, bad if you want raw. Custom fields hidden unless `show_in_rest` is true. |
| Direct DB read (mysqldump or read-only SQL on `wp_posts`, `wp_postmeta`, `wp_terms`, `wp_term_relationships`) | Ground truth, raw `post_content`, every meta row, draft/trash states, exact timestamps | Serialized PHP in `meta_value`; the page-builder soup (Elementor/Divi/WPBakery store layout as serialized blobs or shortcodes). |
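
Because the REST API caps `per_page` at 100, any REST extractor has to paginate. The sketch below shows one way to do that with the standard library. It assumes the site returns the `X-WP-TotalPages` response header. `fetch_page_http` and `iter_all` are illustrative names, and authentication, retries, and rate limiting are left out.

```python
import json
import urllib.request
from typing import Callable, Iterator

Page = tuple[list[dict], int]  # (items on this page, total number of pages)


def fetch_page_http(base_url: str, route: str, page: int, per_page: int = 100) -> Page:
    """Fetch one page of a collection route such as "posts", "pages" or "media"."""
    url = f"{base_url}/wp-json/wp/v2/{route}?per_page={per_page}&page={page}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        # Fall back to a single page if the header is missing (assumption).
        total_pages = int(resp.headers.get("X-WP-TotalPages", "1"))
        return json.load(resp), total_pages


def iter_all(fetch: Callable[[int], Page]) -> Iterator[dict]:
    """Yield every item across all pages; `fetch` takes a 1-based page number."""
    page = 1
    while True:
        items, total_pages = fetch(page)
        yield from items
        if page >= total_pages or not items:
            break
        page += 1
```

A call such as `iter_all(lambda p: fetch_page_http("https://old.site", "posts", p))` would walk every page of posts. Taking the fetcher as a parameter keeps the loop testable without network access, and makes it easy to swap in an authenticated client.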

17.3 Stage 2 — Parse: normalize to an Intermediate Representation

// raw → IR, one file per entity: ir/post-1234.json
{
  "legacyId": 1234,
  "type": "post",                // post | page | cpt:product | ...
  "status": "publish",
  "slug": "how-we-cut-latency",
  "legacyUrl": "/2019/04/how-we-cut-latency/",
  "title": "How we cut latency",
  "html": "<!-- raw post_content / rendered HTML -->",
  "excerpt": "...",
  "author": "jdoe",
  "publishedAt": "2019-04-03T09:00:00Z",
  "modifiedAt": "2023-11-02T14:21:00Z",
  "parent": 12,
  "terms": { "category": ["engineering"], "post_tag": ["performance","cdn"] },
  "meta": { "acf_hero_subtitle": "...", "_yoast_wpseo_metadesc": "..." },
  "mediaRefs": [ {"legacyId": 5567, "url": "https://old.site/wp-content/uploads/2019/04/graph.png"} ]
}
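
As a rough illustration of how one WXR `<item>` could be mapped onto the record above, the sketch below uses `xml.etree.ElementTree`. The namespace URIs are those declared by WXR 1.2 exports, so check the root element of your own file. `item_to_ir` and its helpers are hypothetical names, and `meta`, `modifiedAt`, and `mediaRefs` are omitted for brevity.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from urllib.parse import urlparse

# Namespace URIs as declared by WXR 1.2 files; adjust if your export differs.
NS = {
    "wp": "http://wordpress.org/export/1.2/",
    "content": "http://purl.org/rss/1.0/modules/content/",
    "excerpt": "http://wordpress.org/export/1.2/excerpt/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def text(item, path, default=""):
    """Return the text of a child element, or `default` if missing or empty."""
    node = item.find(path, NS)
    return node.text if node is not None and node.text is not None else default


def to_iso(wp_gmt):
    """Convert 'YYYY-MM-DD HH:MM:SS' (GMT) to ISO 8601; None for unset dates."""
    try:
        parsed = datetime.strptime(wp_gmt, "%Y-%m-%d %H:%M:%S")
    except ValueError:  # e.g. '0000-00-00 00:00:00' on drafts, or an empty string
        return None
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


def item_to_ir(item):
    """Map one WXR <item> element to an IR dict shaped like the example above."""
    terms = {}
    for cat in item.findall("category"):
        terms.setdefault(cat.get("domain", "category"), []).append(cat.get("nicename", ""))
    return {
        "legacyId": int(text(item, "wp:post_id", "0")),
        "type": text(item, "wp:post_type"),
        "status": text(item, "wp:status"),
        "slug": text(item, "wp:post_name"),
        "legacyUrl": urlparse(text(item, "link")).path,
        "title": text(item, "title"),
        "html": text(item, "content:encoded"),
        "excerpt": text(item, "excerpt:encoded"),
        "author": text(item, "dc:creator"),
        "publishedAt": to_iso(text(item, "wp:post_date_gmt")),
        "parent": int(text(item, "wp:post_parent", "0")) or None,  # 0 means no parent
        "terms": terms,
    }


def parse_wxr(xml_text):
    """Parse a whole WXR document and return one IR dict per <item>."""
    return [item_to_ir(item) for item in ET.fromstring(xml_text).iter("item")]
```

For multi-hundred-megabyte exports, a streaming parse with `ET.iterparse` would likely be a better fit than loading the whole document. The per-item mapping stays the same.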

17.4 Stage 3 — Transform: HTML → structured content (the LLM-assisted core)

| HTML / WP construct | Target block |
| --- | --- |
| `<p>`, `<h2>`–`<h4>`, `<ul>/<ol>`, `<blockquote>` | corresponding text blocks (lossless) |
| `<img>` / WP image block | `image` block with asset ref + alt/caption |
| `<figure><img><figcaption>` / `[caption]` | `image` block, caption preserved |
| `<table>` | `table` block |
| `<iframe src="youtube...">`, oEmbed | `embed`/`video` block |
| `[gallery ids="..."]` | `gallery` block resolving to migrated assets |
| `<pre><code>` | `code` block (preserve language hint) |
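
A deterministic pass over part of this mapping might look like the sketch below, built on the standard-library `html.parser`. It covers only paragraphs, `<h2>` to `<h4>`, blockquotes, `<pre><code>`, and `<img>`. The block shapes (`type`, `text`, `level`, `language`) are assumptions, not any particular CMS schema, and inline marks such as links and bold are flattened to plain text. Tags it does not understand are collected in `unhandled` so those documents can go to a slower path such as the LLM fallback.

```python
from html.parser import HTMLParser

TEXT_TAGS = {"p": "paragraph", "h2": "heading", "h3": "heading", "h4": "heading",
             "blockquote": "quote", "pre": "code"}


class BlockParser(HTMLParser):
    """Collects a flat list of blocks plus any top-level tags it cannot map."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self.unhandled = []
        self._open = None      # block currently collecting text
        self._open_tag = None  # tag that opened it

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            # Emitted immediately, so an image inside a <p> precedes that paragraph.
            self.blocks.append({"type": "image", "src": attrs.get("src", ""),
                                "alt": attrs.get("alt", "")})
        elif self._open is None and tag in TEXT_TAGS:
            self._open = {"type": TEXT_TAGS[tag], "text": ""}
            if TEXT_TAGS[tag] == "heading":
                self._open["level"] = int(tag[1])
            self._open_tag = tag
        elif self._open_tag == "pre" and tag == "code":
            # Preserve a language hint such as class="language-python".
            for cls in (attrs.get("class") or "").split():
                if cls.startswith("language-"):
                    self._open["language"] = cls[len("language-"):]
        elif self._open is None:
            self.unhandled.append(tag)  # table, iframe, div, ... need another handler

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"] += data

    def handle_endtag(self, tag):
        if self._open is None or tag != self._open_tag:
            return
        if self._open["type"] != "code":
            # Collapse whitespace everywhere except inside code blocks.
            self._open["text"] = " ".join(self._open["text"].split())
        if self._open["text"]:
            self.blocks.append(self._open)
        self._open = self._open_tag = None


def html_to_blocks(html):
    """Return (blocks, unhandled_tags) for one HTML fragment."""
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    return parser.blocks, parser.unhandled
```

A document whose `unhandled` list comes back empty was converted entirely by rule. Anything else is a candidate for a dedicated handler or for manual review.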

17.11 Pitfalls at scale (the long-tail killers)

| Pitfall | Why it bites at 1,000+ pages | Mitigation |
| --- | --- | --- |
| Page-builder content (Elementor/Divi/WPBakery) | Layout stored as serialized blobs/shortcodes, not semantic HTML | Render via REST (`rendered`) to get real HTML, or write per-builder shortcode handlers; LLM-fallback the residue |
| Double-serialized post-meta | Corrupted custom fields on import (Trac #10619) | Read meta from DB, unserialize defensively, validate before trusting |
| Missing media in WXR | Files referenced, not exported (data-liberation) | Separate media pipeline via `/wp/v2/media` + download |
| Hard-coded old-domain URLs in content | Site silently hot-links the dead origin | Post-upload find-replace using the media/URL maps |
| LLM rewriting/dropping text | Invisible content loss across hundreds of docs | Schema-constrained output + verbatim instruction + text-fidelity diff gate |
| Lost redirects for archive/feed/attachment URLs | Backlinks 404, equity evaporates | Full crawl + sitemap to enumerate every old URL, not just posts |
| Non-idempotent loader | Re-runs create duplicates, can't rehearse | Upsert by `legacyId`, two-pass references |
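
The text-fidelity diff gate in the table can be approximated with the standard library alone. The sketch below compares the visible words of the source HTML with the words in the converted blocks, using `difflib.SequenceMatcher`. The 0.98 threshold and the function names are assumptions to tune against your own corpus, and the comparison ignores case and punctuation.

```python
import difflib
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def words(text):
    """Lower-cased word tokens; punctuation and whitespace are ignored."""
    return re.findall(r"\w+", text.lower())


def visible_words(html):
    """Word tokens of the text a reader would see in the source HTML."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return words(" ".join(parser.parts))


def fidelity(source_html, block_texts):
    """Similarity in [0, 1] between source words and converted-block words."""
    src = visible_words(source_html)
    out = words(" ".join(block_texts))
    return difflib.SequenceMatcher(None, src, out, autojunk=False).ratio()


def passes_gate(source_html, block_texts, threshold=0.98):
    """True when the converted text is close enough to the source to auto-accept."""
    return fidelity(source_html, block_texts) >= threshold
```

Documents that fail the gate would be held back for review instead of loaded. Because the score is word-level, a silently dropped paragraph shows up even when the rest of the conversion is verbatim.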
