This chapter is the field manual for moving a real, messy, decade-old WordPress estate — 1,000+ pages, tens of thousands of media files, custom post types, hand-built page-builder layouts, and accumulated SEO equity — into your new AI-native CMS without losing content, traffic, or sleep. We treat migration as an ETL pipeline (extract → parse → transform → validate → load) wrapped in an SEO/operations envelope (URL mapping, redirects, QA diffing, phased cutover, rollback). The hard parts are never the happy-path posts; they are the long tail of malformed HTML, serialized post-meta, orphaned media, and bespoke shortcodes. Everything below is stack-agnostic: the target could be Sanity, Strapi 5, Payload, Contentful, a Postgres-backed custom CMS, or a Git-based store — the pipeline shape is the same.
A 1,000-page migration is not a "press export, press import" task. The WordPress built-in export (Tools → Export, producing a WXR XML file) is fine for a 30-minute blog migration but breaks down for large, custom estates (DreamHost KB; thepublive publisher guide). Treat the migration as a repeatable, idempotent pipeline you can run dozens of times against staging before you run it once against production:

    [WordPress source]
       │  (1) EXTRACT: WXR XML + REST API + direct DB read + media files
       ▼
    [Raw mirror on disk]  ← canonical, version-controlled, re-runnable input
       │  (2) PARSE: XML/HTML/JSON/URL parsers; normalize to an intermediate model
       ▼
    [Intermediate Representation (IR): JSON, one record per entity]
       │  (3) TRANSFORM: HTML → structured content (blocks/Portable Text/MDX), LLM-assisted
       ▼
    [Target-shaped documents, schema-valid]
       │  (4) VALIDATE: schema, links, media refs, taxonomy, diff vs source
       ▼
    [Target CMS ingest API]  ← idempotent upserts keyed by a stable legacy ID

The two rules that save the project: (a) every stage writes to disk so you can re-run any later stage without re-hitting WordPress, and (b) every target document carries a legacyId (the WP post ID) so the load step is an idempotent upsert, never a blind create. You will run the loader 20+ times.
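The two rules can be sketched in a few lines of Python. This is a minimal illustration, not a loader for any particular CMS: the `store` dict stands in for the target system, and `write_stage`, `upsert`, and the `post-{legacyId}.json` file naming are hypothetical names chosen for the example.

```python
import json
import tempfile
from pathlib import Path


def write_stage(out_dir: Path, doc: dict) -> Path:
    """Rule (a): persist a stage's output so later stages can be re-run alone."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"post-{doc['legacyId']}.json"
    path.write_text(json.dumps(doc, sort_keys=True, indent=2), encoding="utf-8")
    return path


def upsert(store: dict, doc: dict) -> str:
    """Rule (b): create or update keyed by legacyId, never blind-create.

    `store` is an in-memory stand-in for the target CMS (an assumption).
    """
    key = doc["legacyId"]
    if key not in store:
        store[key] = dict(doc)
        return "created"
    if store[key] == doc:
        return "unchanged"
    store[key] = dict(doc)
    return "updated"


store: dict = {}
doc = {"legacyId": 1234, "title": "How we cut latency"}

# Running the loader twice with the same input must not duplicate anything.
outcomes = [
    upsert(store, doc),
    upsert(store, doc),
    upsert(store, {**doc, "title": "How we cut latency (v2)"}),
]

with tempfile.TemporaryDirectory() as tmp:
    saved = json.loads(write_stage(Path(tmp) / "ir", doc).read_text(encoding="utf-8"))
```

Returning an outcome string per document makes each rehearsal run auditable: a second run against unchanged input should report only "unchanged".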
There is no single complete source of truth in WordPress. Use the right extractor per data class:
| Source | Best for | Watch out for |
|---|---|---|
| WXR XML export (Tools → Export, or WP-CLI wp export) | Posts, pages, CPTs, taxonomy terms, authors, comments, structure | Media is referenced, not embedded: only attachment URLs, not files (WordPress/data-liberation Discussion #56). Serialized post-meta can be double-serialized/corrupted (Trac #10619). Large sites may need --max_file_size chunking. |
| REST API (/wp-json/wp/v2/posts?per_page=100&page=N, plus /pages, /media, custom /wp/v2/<cpt>) | Rendered HTML, ACF/meta via plugins (acf-to-rest-api), predictable JSON, public + (authenticated) private content (Sanity Learn) | Default page cap (per_page max 100; paginate). rendered HTML runs shortcodes/blocks through filters: good for fidelity, bad if you want raw. Custom fields hidden unless show_in_rest is true. |
| Direct DB read (mysqldump or read-only SQL on wp_posts, wp_postmeta, wp_terms, wp_term_relationships) | Ground truth, raw post_content, every meta row, draft/trash states, exact timestamps | Serialized PHP in meta_value; the page-builder soup (Elementor/Divi/WPBakery store layout as serialized blobs or shortcodes). |
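The per_page cap in the REST row means every collection has to be paginated, and a long pull needs retries. The sketch below shows one way to structure that loop. The `fetch_page` callable is a hypothetical injection point (so the loop can be exercised without a network); against a real site it would issue the HTTP request and read the page count from the `X-WP-TotalPages` response header.

```python
import time
from typing import Callable, Iterator

# Hypothetical contract: fetch_page(page) returns (items, total_pages).
FetchPage = Callable[[int], tuple[list[dict], int]]


def fetch_all(fetch_page: FetchPage, max_retries: int = 4,
              base_delay: float = 1.0, sleep=time.sleep) -> Iterator[dict]:
    """Yield every item from a paginated collection, retrying with backoff."""
    page, total_pages = 1, 1
    while page <= total_pages:
        for attempt in range(max_retries + 1):
            try:
                items, total_pages = fetch_page(page)
                break
            except OSError:  # covers urllib's URLError and socket errors
                if attempt == max_retries:
                    raise
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        yield from items
        page += 1


# Demo with a fake endpoint that fails once on page 2.
_calls = {"n": 0}


def _fake_page(page: int):
    _calls["n"] += 1
    if page == 2 and _calls["n"] == 2:
        raise ConnectionError("flaky upstream")
    return [{"id": page * 10}], 3


delays: list = []
pulled = list(fetch_all(_fake_page, sleep=delays.append))
```

Passing `sleep` as a parameter is a design convenience for rehearsal runs and tests; production code can leave the default.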
Practical extract recipe for 1,000+ pages:
1. Take a mysqldump snapshot first: it is your immutable ground truth and your rollback anchor.
2. Run wp post list --post_type=any --format=json --fields=ID,post_type,post_status,post_name,post_date,post_parent > index.json. This gives you a complete entity inventory and the redirect-map seed.
3. Pull every entity through the REST API and write each response to disk (raw/posts/{id}.json). Add retry/backoff; a 1,000-page pull is thousands of requests.
4. Mirror /wp-content/uploads/ via the Media REST endpoint (/wp/v2/media): enumerate every attachment with its source_url and metadata, then download originals to raw/media/. Capture alt_text, caption, title, and the image-size variants WP generated.

The "four parsers" warning from the Data Liberation project is real: a single field can be a URL inside JSON inside an HTML comment inside the WXR XML, requiring an XML parser, an HTML parser, a JSON parser, and a WHATWG URL parser to fully resolve (WordPress/data-liberation). Plan for nested decoding, not regex.
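To make the nested-decoding point concrete, here is a minimal sketch that peels the four layers with Python's standard library. The sample WXR fragment is invented for illustration, and `urllib.parse.urlsplit` is used as a stand-in for a WHATWG-compliant URL parser, which it is not; treat that as an assumption to revisit for edge-case URLs.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urlsplit

CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"

# Invented sample: a Gutenberg image block inside a WXR <content:encoded> node.
WXR_SAMPLE = """<rss xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel><item><content:encoded><![CDATA[
<!-- wp:image {"id":5567,"url":"https://old.site/wp-content/uploads/2019/04/graph.png"} -->
<figure><img src="https://old.site/wp-content/uploads/2019/04/graph.png"/></figure>
<!-- /wp:image -->
]]></content:encoded></item></channel></rss>"""


class BlockCommentParser(HTMLParser):
    """Collects (block name, attrs) from opening Gutenberg block comments."""

    def __init__(self):
        super().__init__()
        self.blocks = []

    def handle_comment(self, data):
        text = data.strip()
        if text.endswith("/"):            # self-closing block: <!-- wp:x {...} /-->
            text = text[:-1].rstrip()
        if not text.startswith("wp:"):    # skips "/wp:..." closers and other comments
            return
        name, _, attrs = text[3:].partition(" ")
        self.blocks.append((name, json.loads(attrs) if attrs.strip() else {}))


def media_paths(wxr_xml: str) -> list:
    root = ET.fromstring(wxr_xml)                          # layer 1: XML
    paths = []
    for node in root.iter(f"{CONTENT_NS}encoded"):
        parser = BlockCommentParser()
        parser.feed(node.text or "")                       # layer 2: HTML comments
        parser.close()
        for _name, attrs in parser.blocks:                 # layer 3: JSON (decoded above)
            if "url" in attrs:
                paths.append(urlsplit(attrs["url"]).path)  # layer 4: URL
    return paths


found = media_paths(WXR_SAMPLE)
```

Each layer is handled by a real parser, so an escaped quote or an odd URL fails loudly at the layer that owns it instead of silently slipping past a regex.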
Do not transform straight from WordPress shapes to target shapes. Insert an Intermediate Representation (IR) — a flat, boring JSON record per entity — so the extract side and the load side evolve independently.

    // raw → IR, one file per entity: ir/post-1234.json
    {
      "legacyId": 1234,
      "type": "post",                 // post | page | cpt:product | ...
      "status": "publish",
      "slug": "how-we-cut-latency",
      "legacyUrl": "/2019/04/how-we-cut-latency/",
      "title": "How we cut latency",
      "html": "<!-- raw post_content / rendered HTML -->",
      "excerpt": "...",
      "author": "jdoe",
      "publishedAt": "2019-04-03T09:00:00Z",
      "modifiedAt": "2023-11-02T14:21:00Z",
      "parent": 12,
      "terms": { "category": ["engineering"], "post_tag": ["performance", "cdn"] },
      "meta": { "acf_hero_subtitle": "...", "_yoast_wpseo_metadesc": "..." },
      "mediaRefs": [
        { "legacyId": 5567, "url": "https://old.site/wp-content/uploads/2019/04/graph.png" }
      ]
    }

At this stage you also decode the WordPress-isms: PHP-unserialize meta_value (carefully — guard against the double-serialize corruption in Trac #10619), expand Gutenberg block comments (<!-- wp:paragraph -->), and inventory every shortcode ([gallery], [caption], builder shortcodes) so nothing surprises you in transform. Produce a shortcode-census.csv (shortcode name → count) early; the long tail is where migrations die.
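A shortcode census can be produced with a short script like the one below. It is a rough sketch: the regex is a heuristic that will also count bracketed prose such as "[sic]", so the output is best cross-checked against the shortcodes actually registered on the site. The function names are illustrative.

```python
import csv
import io
import re
from collections import Counter

# Matches an opening shortcode tag such as [gallery ids="1,2"] or [caption].
# Closing tags like [/caption] are skipped because "/" cannot start the name.
SHORTCODE_RE = re.compile(r"\[([A-Za-z][\w-]*)(?=[\s\]/])")


def shortcode_census(docs) -> Counter:
    """Count opening shortcode tags across an iterable of raw post_content strings."""
    counts = Counter()
    for html in docs:
        counts.update(SHORTCODE_RE.findall(html))
    return counts


def census_csv(counts: Counter) -> str:
    """Render the census as CSV text, most frequent shortcode first."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["shortcode", "count"])
    writer.writerows(counts.most_common())
    return buf.getvalue()


sample_docs = [
    '[gallery ids="1,2"] intro [caption id="x"]<img src="a.png">[/caption]',
    "[gallery]",
]
census = shortcode_census(sample_docs)
report = census_csv(census)
```

Sorting by frequency puts the handful of shortcodes worth a dedicated handler at the top and leaves the long tail visible at the bottom.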
This is the chapter's centre of gravity. Your target CMS stores structured content — an array of typed blocks (Portable Text, Gutenberg-style blocks, Strapi blocks, MDX, or your own JSON schema) — not a blob of HTML. WordPress hands you a blob of HTML. Bridging that gap has two layers.
A unified/AST pipeline converts well-formed HTML to structured blocks deterministically and for free — no LLM, no per-token cost, fully reproducible. The JavaScript ecosystem standard is unified + rehype: rehype-parse turns HTML into a HAST syntax tree; you then map nodes to your target shape. For Sanity specifically, rehype-portable-text (by rexxars) outputs an array of Portable Text blocks directly; rehype-remark converts HTML → Markdown for MDX/Git targets (unifiedjs.com; remarkjs/remark-rehype). In Python, BeautifulSoup/lxml + a custom visitor does the same job.
Map known structures deterministically:
| HTML / WP construct | Target block |
|---|---|
| <p>, <h2>–<h4>, <ul>/<ol>, <blockquote> | corresponding text blocks (lossless) |
| <img> / WP image block | image block with asset ref + alt/caption |
| <figure><img><figcaption> / [caption] | image block, caption preserved |
| <table> | table block |
| <iframe src="youtube...">, oEmbed | embed/video block |
| [gallery ids="..."] | gallery block resolving to migrated assets |
| <pre><code> | code block (preserve language hint) |
Anything you map deterministically is provably correct and needs no review. Maximize this layer.
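As a rough illustration of the deterministic layer, the sketch below maps a small whitelist of tags to blocks using only Python's standard `html.parser` (swapped in here for BeautifulSoup/lxml to keep the example dependency-free). The block shape (`type`, `text`, `level`, `src`, `alt`) is a hypothetical target schema, and inline marks such as bold or links are flattened to text and reported rather than preserved.

```python
from html.parser import HTMLParser

# Hypothetical mapping from HTML tag to target block type.
TEXT_TAGS = {"p": "paragraph", "h2": "heading", "h3": "heading",
             "h4": "heading", "blockquote": "quote"}


class BlockMapper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self.unmapped = set()   # constructs seen but not mapped, for the review queue
        self._current = None    # text block currently being filled
        self._open_tag = None   # tag that opened the current block

    def handle_starttag(self, tag, attrs):
        if tag in TEXT_TAGS and self._current is None:
            self._current = {"type": TEXT_TAGS[tag], "text": ""}
            self._open_tag = tag
            if tag.startswith("h"):
                self._current["level"] = int(tag[1])
        elif tag == "img":
            a = dict(attrs)
            self.blocks.append({"type": "image", "src": a.get("src") or "",
                                "alt": a.get("alt") or ""})
        elif tag not in TEXT_TAGS:
            self.unmapped.add(tag)

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data
        elif data.strip():
            self.unmapped.add("#text")  # loose text outside any mapped block

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._open_tag:
            # Collapse runs of whitespace; the words themselves are untouched.
            self._current["text"] = " ".join(self._current["text"].split())
            self.blocks.append(self._current)
            self._current = self._open_tag = None


def html_to_blocks(html: str):
    mapper = BlockMapper()
    mapper.feed(html)
    mapper.close()
    return mapper.blocks, mapper.unmapped


blocks, unmapped = html_to_blocks(
    '<h2>Results</h2><p>Latency fell &amp; stayed low.</p>'
    '<img src="/a.png" alt="Graph"><div>x</div>'
)
```

A document whose `unmapped` set is empty went through entirely deterministically; anything else is a candidate for a new mapping rule or for the fallback layer.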
The remaining content — page-builder soup, custom HTML embeds, semantically-meaningful-but-structurally-ambiguous markup, "this two-column div is really a callout" — is where an LLM earns its keep. The 2026 best practice is schema-constrained extraction: never ask the model for free-form output; force it to emit JSON that conforms to your block schema.
rawHtml fallback block and set needsReview: true." The verbatim-preservation clause is non-negotiable: an LLM's instinct to "improve" prose is a content-integrity bug at migration scale.

Cost/throughput at 1,000+ pages: route by difficulty. If 90% goes through Layer A (deterministic, $0) and only the ~10% hard tail hits an LLM, you are calling the model ~100–150 times, not 1,000+. Use a cheap fast model (e.g. a Haiku/Flash-class tier) for the bulk and reserve a frontier model only for the few documents that fail validation. Batch APIs cut cost further for non-interactive runs. The deterministic-first split is what makes LLM migration affordable.
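One way to enforce the schema and verbatim constraints after the model responds is a small acceptance check like the sketch below. It makes no model call; `llm_blocks` is whatever JSON the model returned. The allowed block types and the `text` field are a hypothetical schema, and the comparison ignores whitespace, so it catches added, dropped, or reworded text but not spacing changes.

```python
import re
from html.parser import HTMLParser

# Hypothetical block schema: the set of types the target accepts.
ALLOWED_TYPES = {"paragraph", "heading", "quote", "image", "callout", "rawHtml"}


class _TextOnly(HTMLParser):
    """Collects the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(html: str) -> str:
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def squash(text: str) -> str:
    """Remove all whitespace so only the characters of the prose are compared."""
    return re.sub(r"\s+", "", text)


def accept_or_fallback(source_html: str, llm_blocks):
    """Return llm_blocks if they pass both checks, else a rawHtml fallback."""
    fallback = [{"type": "rawHtml", "html": source_html, "needsReview": True}]
    if not isinstance(llm_blocks, list):
        return fallback
    for block in llm_blocks:
        if not isinstance(block, dict) or block.get("type") not in ALLOWED_TYPES:
            return fallback
    emitted = "".join(str(b.get("text", "")) for b in llm_blocks)
    if squash(emitted) != squash(visible_text(source_html)):
        return fallback  # the model added, dropped, or reworded text
    return llm_blocks


SRC = '<div class="col"><strong>Note:</strong> Cache at the edge.</div>'
ok = accept_or_fallback(SRC, [{"type": "callout", "text": "Note: Cache at the edge."}])
reworded = accept_or_fallback(SRC, [{"type": "callout", "text": "Note: cache aggressively."}])
```

Because a failed check degrades to a flagged rawHtml block rather than an error, the run always completes and the review queue becomes the list of documents to retry with a stronger model.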
Media is the most underestimated workstream because WXR omits the actual files (WordPress/data-liberation). Plan it as its own pipeline:
1. Inventory every attachment from /wp/v2/media (URL, mime, alt, caption, dimensions) into media-manifest.json. Cross-reference with mediaRefs harvested from content so you catch images referenced inline but not in the Media Library, and flag library assets referenced by nothing (candidates to drop).
2. Download originals only (not the -300x200 thumbnail explosion).
3. Upload to a DAM/CDN/object store and record a map of legacyAttachmentId → {newAssetId, newUrl}. This map is consumed by Stage 3 so every image/gallery block resolves to the migrated asset, and by Stage 4's redirect logic for hot-linked image URLs.
4. Preserve metadata: alt, caption, title, credit.

Pitfall: inline images often hard-code the full old domain (https://old.site/wp-content/uploads/...). A find-replace pass over content using the media ID map must run after upload, or you ship a site that hot-links the soon-to-be-dead origin.
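The post-upload rewrite pass might look like the sketch below. It assumes the ID map has already been joined with the manifest into a lookup keyed by original upload URL (`url_map`, a hypothetical name), and it treats a trailing `-WIDTHxHEIGHT` before the extension as a WordPress size variant of the original. The `cdn.example` host is a placeholder.

```python
import re

# A trailing "-300x200" style suffix before the extension marks a generated size.
SIZE_SUFFIX = re.compile(r"-\d+x\d+(?=\.[A-Za-z0-9]+$)")
# Absolute URLs that point into any host's /wp-content/uploads/ tree.
UPLOAD_URL = re.compile(r"https?://[^\s\"'<>)]+/wp-content/uploads/[^\s\"'<>)]+")


def rewrite_media_urls(html: str, url_map: dict):
    """Swap old upload URLs for migrated ones.

    Returns (rewritten_html, missing), where `missing` lists URLs with no
    entry in url_map so they can be fixed before launch.
    """
    missing = []

    def swap(match):
        url = match.group(0)
        # Try the exact URL first, so a file really named "x-1920x1080.png" still maps.
        new_url = url_map.get(url) or url_map.get(SIZE_SUFFIX.sub("", url))
        if new_url is None:
            missing.append(url)
            return url
        return new_url

    return UPLOAD_URL.sub(swap, html), missing


OLD = "https://old.site/wp-content/uploads/2019/04/"
url_map = {OLD + "graph.png": "https://cdn.example/assets/a1/graph.png"}
html_in = f'<img src="{OLD}graph-300x200.png"> <img src="{OLD}lost.png">'
html_out, missing = rewrite_media_urls(html_in, url_map)
```

Returning the unmatched URLs instead of raising keeps the pass re-runnable: an empty `missing` list is a natural launch gate.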
Validation has four independent gates; a document ships only if it passes all four.
For site-level QA, crawl-diff staging against production. Screaming Frog (or Sitebulb) supports crawl comparison with URL Mapping, letting you compare two different URL structures (different hostnames/directories) and surface every status-code change, missing title/H1/meta, broken canonical, and lost hreflang; its Content tab reports near-duplicate similarity so a faithfully migrated page shows ~100% match and a mangled one stands out (Screaming Frog tutorials). Baseline a full pre-migration crawl (every URL, status, canonical, title, H1, meta, internal-link count) so you have something to diff against (BrightEdge 2025 migration guide). For rendered/visual proof, layer screenshot-based regression (e.g. a Playwright visual-diff pass, or Litmus for email/render checks) on the top-traffic templates.
This is where migrations win or lose traffic. A 301 redirect map — a spreadsheet of oldUrl → newUrl — preserves link equity and ranking power across the move; 301s pass PageRank and apply equally to server-level or CMS-integrated redirects (Duplicator; PS Branding). Enterprise sites that map 1:1 and monitor Search Console recover 95%+ of rankings within ~90 days; generic "redirect everything to the homepage" is the classic traffic-killer (BrightEdge; americaneagle).
Build the map programmatically, not by hand:
1. Enumerate every old URL from index.json, the XML sitemap(s), and a full Screaming Frog crawl (catches paginated archives, attachment pages, feed URLs, and pretty-permalink variants).
2. Because every IR record carries legacyUrl and the new slug, most of the map auto-generates.
3. Cover the WordPress URL patterns: date-based permalinks (/2019/04/slug/), ?p=ID ugly permalinks, /category/.../, /tag/.../, /author/.../, attachment pages, and feed URLs (/feed/). Each pattern needs a rule.
4. Update llms.txt if present, and keep canonical tags consistent.

Implement redirects at the edge/server for performance (Nginx/Apache rules, Vercel/Netlify _redirects, Cloudflare Rules) rather than a slow per-request CMS plugin, and keep redirect chains to a single hop.
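Generating the map from IR records and flattening chains to a single hop can be sketched as follows. `new_path_for` is a hypothetical callback that encodes the new site's URL scheme; the `/blog/{slug}/` pattern in the demo is invented.

```python
def build_redirects(ir_records, new_path_for) -> dict:
    """Seed an oldUrl -> newUrl map from IR records.

    Covers the pretty permalink and the ?p=ID variant; archive, feed, and
    attachment patterns would need their own rules.
    """
    redirects = {}
    for rec in ir_records:
        target = new_path_for(rec)
        if rec["legacyUrl"] != target:          # an unchanged URL needs no redirect
            redirects[rec["legacyUrl"]] = target
        redirects[f"/?p={rec['legacyId']}"] = target
    return redirects


def collapse_chains(redirects: dict) -> dict:
    """Point every source at its final destination so each redirect is one hop."""
    flat = {}
    for source, target in redirects.items():
        seen = {source}
        while target in redirects:              # the target is itself redirected
            if target in seen:
                raise ValueError(f"redirect loop through {target}")
            seen.add(target)
            target = redirects[target]
        flat[source] = target
    return flat


records = [{"legacyId": 1234, "slug": "how-we-cut-latency",
            "legacyUrl": "/2019/04/how-we-cut-latency/"}]
redirect_map = build_redirects(records, lambda rec: f"/blog/{rec['slug']}/")
single_hop = collapse_chains({"/a": "/b", "/b": "/c"})
```

Running `collapse_chains` over the merged map (generated rules plus any redirects inherited from the old site) catches loops before they reach the edge configuration.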
WordPress taxonomy (category, post_tag, custom taxonomies) and parent/child page hierarchies must be reconstructed in the target model. Build a taxonomy crosswalk (legacyTermId → targetTermId) before loading content, because content references terms by ID. This is also the cleanup opportunity: 1,000-page WP sites typically carry hundreds of near-duplicate, single-use, or typo tags. Use the migration to consolidate and rationalize the taxonomy — but make the merge decisions explicit in the crosswalk so they are reviewable and reversible. Resolve post_parent chains into your target's hierarchy/reference fields, and recreate author records (or map them to a byline field if authors aren't first-class in the new model).
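A crosswalk with explicit merges, plus post_parent resolution, can be sketched as below. The term IDs, target term names, and function names are invented for illustration; the only WordPress fact relied on is that a post_parent of 0 means "no parent".

```python
def remap_terms(term_ids, crosswalk: dict):
    """Translate legacy term IDs via the crosswalk.

    Merged terms collapse to one target, order is preserved, and IDs missing
    from the crosswalk are returned separately rather than dropped silently.
    """
    mapped, unknown = [], []
    for term_id in term_ids:
        if term_id not in crosswalk:
            unknown.append(term_id)
        elif crosswalk[term_id] not in mapped:
            mapped.append(crosswalk[term_id])
    return mapped, unknown


def ancestor_path(post_id: int, parent_of: dict, slug_of: dict) -> str:
    """Walk post_parent links up to the root and build the hierarchical path."""
    slugs, seen = [], set()
    while post_id:                      # 0 means "no parent" in wp_posts
        if post_id in seen:
            raise ValueError(f"parent cycle at post {post_id}")
        seen.add(post_id)
        slugs.append(slug_of[post_id])
        post_id = parent_of.get(post_id, 0)
    return "/" + "/".join(reversed(slugs)) + "/"


# Term 12 is a typo duplicate of term 11; the merge is explicit and reviewable.
crosswalk = {11: "performance", 12: "performance", 13: "cdn"}
terms, unknown_terms = remap_terms([11, 12, 13, 99], crosswalk)

page_path = ancestor_path(30, parent_of={30: 20, 20: 10, 10: 0},
                          slug_of={10: "docs", 20: "guides", 30: "setup"})
```

Keeping the crosswalk as plain data means a merge decision can be reversed by editing one entry and re-running the idempotent load.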
Push to the target via its ingest path — Contentful Management API, Sanity dataset import / mutation API, Strapi REST/GraphQL, Payload local API, or direct DB seed (thepublive; Sanity Learn). Non-negotiables:
- Upsert by legacyId, never blind-create, so re-runs converge instead of duplicating.

Do not big-bang a 1,000-page site. Sequence it:
- Rollback: keep the mysqldump, document DNS reversion steps and TTLs (lower DNS TTL before cutover so reversion is fast), and define the abort criteria and the comms matrix (greengeeks; itmonks). Because your loader is idempotent and your source is an immutable snapshot, "re-run the pipeline" is itself part of the rollback toolkit.

| Pitfall | Why it bites at 1,000+ pages | Mitigation |
|---|---|---|
| Page-builder content (Elementor/Divi/WPBakery) | Layout stored as serialized blobs/shortcodes, not semantic HTML | Render via REST (rendered) to get real HTML, or write per-builder shortcode handlers; LLM-fallback the residue |
| Double-serialized post-meta | Corrupted custom fields on import (Trac #10619) | Read meta from DB, unserialize defensively, validate before trusting |
| Missing media in WXR | Files referenced, not exported (data-liberation) | Separate media pipeline via /wp/v2/media + download |
| Hard-coded old-domain URLs in content | Site silently hot-links the dead origin | Post-upload find-replace using the media/URL maps |
| LLM rewriting/dropping text | Invisible content loss across hundreds of docs | Schema-constrained output + verbatim instruction + text-fidelity diff gate |
| Lost redirects for archive/feed/attachment URLs | Backlinks 404, equity evaporates | Full crawl + sitemap to enumerate every old URL, not just posts |
| Non-idempotent loader | Re-runs create duplicates, can't rehearse | Upsert by legacyId, two-pass references |
- Key every target document by legacyId for upserts; you will run the loader 20+ times.
- Enumerate media via /wp/v2/media, download originals only, upload to a DAM/CDN/object store, and rewrite inline URLs via a legacyAttachmentId → newAsset map.
- Keep the mysqldump warm as a tested rollback, and re-crawl immediately post-launch; clean 301s typically recover 95%+ of rankings within ~90 days.