This chapter covers how to make a custom, AI-powered CMS speak many languages without re-engineering it later. It treats i18n (the architectural plumbing) and l10n (the per-locale content work) as distinct layers, then walks through locale modeling, URL routing strategies (subpath vs. subdomain vs. ccTLD), an AI-assisted translation pipeline with human review and translation memory, the SEO/discoverability layer (hreflang, x-default), right-to-left (RTL) support, locale-aware formatting via the Intl API and Unicode CLDR, and finally the operational patterns that let you scale localization across 1,000+ pages and dozens of locales. Throughout, the guidance is stack-agnostic: examples reference Next.js, headless CMSs (Contentful, Sanity, Strapi, Payload), and localization platforms (Crowdin, Lokalise), but the principles transfer to any architecture.
The distinction matters because it tells you when to spend money. Internationalization (i18n) is the one-time architectural work that lets your system support multiple locales at all — locale-aware data models, externalized strings, direction-agnostic CSS, routing that can carry a locale, and formatting that defers to the platform. Localization (l10n) is the recurring, per-locale work — translating, transcreating, adapting imagery, and validating. The expensive mistake is bolting i18n on after launch: retrofitting RTL, splitting hard-coded English out of templates, or rewriting URLs after they are indexed all cost far more than designing for it on day one. As multiple practitioner guides put it, i18n is "the architectural work that enables translation and localization … designing your entire system so you support multiple locales without re-engineering" (Weglot, Website Internationalization Guide).
For an AI-native CMS this has an extra dimension: the same content model that serves humans must serve translation agents and LLM-driven editorial tools. Clean separation of source content, translated content, and translation metadata (status, reviewer, MT engine, confidence) is what makes an automated pipeline auditable.
A locale is more than a language. The web standard is a BCP 47 language tag: a primary language subtag (ISO 639-1, e.g. `en`, `fr`, `ar`), optionally a region (ISO 3166-1 alpha-2, e.g. `en-US`, `pt-BR`), and optionally a script (ISO 15924, e.g. `zh-Hans` vs. `zh-Hant`). Model the locale as a first-class entity in your CMS, not a string sprinkled across tables:

| Field | Example | Why it matters |
|---|---|---|
| code (BCP 47) | pt-BR | Canonical key for routing, hreflang, Intl |
| display_name | Português (Brasil) | Locale switcher UI, in the locale's own script |
| direction | rtl / ltr | Drives dir attribute and layout |
| fallback_chain | pt-BR → pt → en | What to serve when a translation is missing |
| enabled / default | bool | Soft-launch and default-locale logic |
| region/currency/date_pref | BRL, dd/mm/yyyy | Commerce + formatting defaults |
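As a rough illustration, the fields in the table could be carried by a record like the one below. The type name, property names, and sample rows are hypothetical, chosen only to mirror the table; a real schema would depend on your datastore.

```typescript
// Hypothetical shape for a locale record; properties mirror the table above.
type Direction = "ltr" | "rtl";

interface LocaleRecord {
  code: string;            // BCP 47 tag, e.g. "pt-BR"
  displayName: string;     // shown in the locale switcher, in the locale's own script
  direction: Direction;    // drives the dir attribute and layout
  fallbackChain: string[]; // tried in order when a translation is missing
  enabled: boolean;        // false = soft-launch, not publicly routable yet
  isDefault: boolean;
  currency: string;        // ISO 4217 code
}

// Sample rows (illustrative values only).
const locales: LocaleRecord[] = [
  { code: "en", displayName: "English", direction: "ltr", fallbackChain: [], enabled: true, isDefault: true, currency: "USD" },
  { code: "pt-BR", displayName: "Português (Brasil)", direction: "ltr", fallbackChain: ["pt", "en"], enabled: true, isDefault: false, currency: "BRL" },
  { code: "ar", displayName: "العربية", direction: "rtl", fallbackChain: ["en"], enabled: false, isDefault: false, currency: "SAR" },
];

// Index by code so routing, hreflang, and Intl calls share one canonical key.
const localeByCode = new Map<string, LocaleRecord>(
  locales.map((l): [string, LocaleRecord] => [l.code, l]),
);

// Only enabled locales should be routable or listed in the switcher.
function publicLocales(all: LocaleRecord[]): LocaleRecord[] {
  return all.filter((l) => l.enabled);
}
```

Keeping `enabled` separate from the record's existence is what allows a locale to be translated and reviewed before it is exposed to visitors.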
Two design decisions dominate everything downstream:
1. Field-level vs. document-level (entry-level) localization. Field-level keeps one document and stores translations per field; document-level creates a separate document per locale linked by a shared key. Field-level is leaner when most structure is identical across locales and only text changes — it gives a single source of truth and trivial fallback. Document-level wins when locales diverge structurally (different sections, region-specific products, legal blocks) or need independent publishing workflows. The major headless CMSs reflect this split: Payload and Sanity support both field-level and document-level with configurable fallback; Strapi ships field-level localization with 500+ pre-created locales; Contentful uses per-locale fields by default but supports per-locale spaces for stronger isolation at the cost of admin overhead (DEV/Sanity/Strapi platform docs, 2026). For a custom CMS, the pragmatic default is field-level with an explicit locale column and a fallback chain, escalating specific content types to document-level only when they genuinely diverge.
2. The fallback chain. Never render an empty page because a locale is missing a string. Resolve in order: requested locale → base language → default locale, and mark fallback content visibly in the editorial UI so it is not mistaken for "done." This is the difference between a half-translated site that degrades gracefully and one that shows blank regions.
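The resolution order described above (requested locale, then base language, then default locale) can be sketched for a field-level model as follows. The `LocalizedField` shape and function names are assumptions made for this example, not part of any particular CMS.

```typescript
// Field-level storage: one value per locale code for a single field (hypothetical shape).
type LocalizedField = Record<string, string | undefined>;

interface Resolved {
  value: string;
  servedLocale: string;
  isFallback: boolean; // surface this in the editorial UI so fallback is not mistaken for "done"
}

// Build the lookup order: requested locale, then its base language, then the default.
function fallbackOrder(requested: string, defaultLocale: string): string[] {
  const base = requested.split("-")[0] ?? requested;
  const order = [requested, base, defaultLocale];
  return order.filter((code, i) => order.indexOf(code) === i); // de-duplicate, keep order
}

function resolveField(
  field: LocalizedField,
  requested: string,
  defaultLocale: string,
): Resolved | undefined {
  for (const code of fallbackOrder(requested, defaultLocale)) {
    const value = field[code];
    if (value !== undefined && value.trim() !== "") {
      return { value, servedLocale: code, isFallback: code !== requested };
    }
  }
  return undefined; // nothing to serve, even in the default locale
}
```

Returning `isFallback` alongside the value lets the rendering layer serve the content while the editorial UI flags it as untranslated.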
Where the locale lives in the URL is a near-irreversible decision once pages are indexed. The three options:
| Strategy | Example | Pros | Cons | Best for |
|---|---|---|---|---|
| Subdirectory / subpath | site.com/fr/about | Inherits domain authority; one SSL cert, one DNS zone; trivial in Next.js middleware; cheapest to operate | Single server/CDN region implied; weaker per-country branding | The default for most projects |
| Subdomain | fr.site.com | Some geo/infra isolation; can point to regional hosting | Often treated as a separate site for SEO authority; more DNS + cert management | Governance/platform isolation, regional infra |
| ccTLD | site.fr | Strongest country signal; legal/trust benefits in some markets | Splits link equity across domains; highest cost; needs mature governance | Large enterprises with per-market teams and budget |
Consensus across international-SEO sources (Weglot, gtechme, DCHost, LinkGraph) is that subdirectories are the right default: they sit under the main domain and inherit its authority, keep the operational surface small (one cert, one DNS zone, one deploy), and "work well with Next.js Middleware" for locale detection and rewriting. ccTLDs and subdomains "multiply the operational surface with more DNS zones and SSL certs," while subfolders "keep most of the stack centralized." Reserve ccTLDs for cases where legal presence, payment rails, or per-market autonomy genuinely demand separate domains.
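A framework-free sketch of the two pieces that subpath routing typically needs: reading the locale from the URL prefix and matching an `Accept-Language` header against supported locales. The `SUPPORTED` list, `DEFAULT_LOCALE`, and function names are hypothetical; in Next.js this logic would usually live in middleware, and how the result is combined with a stored user preference is left to the application.

```typescript
const SUPPORTED = ["en", "fr", "pt-BR", "ar"]; // hypothetical enabled locales
const DEFAULT_LOCALE = "en";

// Subpath routing: "/fr/a-propos" -> { locale: "fr", rest: "/a-propos" }.
function localeFromPath(pathname: string): { locale: string; rest: string } | undefined {
  const [, first = "", ...others] = pathname.split("/");
  const match = SUPPORTED.find((code) => code.toLowerCase() === first.toLowerCase());
  return match ? { locale: match, rest: "/" + others.join("/") } : undefined;
}

// Parse a header such as "fr-CA,fr;q=0.9,en;q=0.8" and return the best supported locale.
function matchAcceptLanguage(header: string | null): string {
  const ranked = (header ?? "")
    .split(",")
    .map((part) => {
      const [tag = "", ...params] = part.trim().split(";");
      const q = params.map((p) => p.trim()).find((p) => p.startsWith("q="));
      return { tag: tag.trim().toLowerCase(), q: q ? Number(q.slice(2)) : 1 };
    })
    .filter((entry) => entry.tag !== "" && entry.tag !== "*" && entry.q > 0)
    .sort((a, b) => b.q - a.q); // highest preference first

  for (const { tag } of ranked) {
    // Exact tag first, then any supported locale sharing the primary language.
    const exact = SUPPORTED.find((code) => code.toLowerCase() === tag);
    if (exact) return exact;
    const base = tag.split("-")[0];
    const sameLanguage = SUPPORTED.find((code) => code.toLowerCase().split("-")[0] === base);
    if (sameLanguage) return sameLanguage;
  }
  return DEFAULT_LOCALE;
}
```

The header match should only ever produce a suggestion; a locale already present in the path reflects an explicit choice and is best left alone.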
Implementation notes for a custom CMS / Next.js:
- `Accept-Language`, then a persisted cookie. Detect, but never force, a redirect; let users override and persist their choice. Google's John Mueller has stated hreflang is a signal, not a directive, so don't fight the user's explicit selection.
- `site.com/` for the default and `site.com/fr/` for others; if you do this, make sure hreflang still lists the default explicitly.
- Localized slugs (`/fr/a-propos`, not `/fr/about`) for UX and local SEO: store a per-locale slug field and maintain redirects when slugs change.

The 2026 baseline is a hybrid pipeline, not "AI replaces translators." Surveys cited by CAT-tool vendors report ~88% of translators already use CAT tools with 30–50% efficiency gains, and the tooling has shifted from pure translation memory to LLM-assisted workflows (Trados Studio's Copilot AI Assistant, Phrase Language AI, Lokalise/Crowdin AI). The recurring finding is that raw LLM calls cannot respect glossaries deterministically: "they enforce instructions probabilistically, meaning output can vary across runs" (Lokalise, AI Translation Glossary). Determinism has to come from the infrastructure around the model.
A robust pipeline for an AI-native CMS:
Source string (extracted, with context)
│
▼
[1] Translation Memory (TM) lookup
├─ exact match → reuse approved translation (no MT cost, instant)
└─ fuzzy match → pre-fill, flag for light review
│ (miss)
▼
[2] MT / LLM translation
├─ glossary/termbase injected as constraints
├─ context: surrounding strings, page type, tone, screenshot/DOM if available
└─ engine choice: DeepL / Google / GPT / Claude / Gemini (route by language pair + content type)
│
▼
[3] Automated QA
├─ placeholder & ICU syntax checks ({count}, <b>…</b>)
├─ glossary enforcement / back-translation verification
├─ length & layout checks (pseudolocalization)
└─ MQM-style quality estimation score
│
▼
[4] Human review (risk-tiered)
├─ high-visibility / legal / brand → full human edit
├─ medium → spot review of low-confidence segments
└─ low / fuzzy-100% → auto-approve
│
▼
[5] Publish → write back to TM (the loop closes)
Engine selection. Treat the translator as pluggable. DeepL launched an LLM tuned specifically for translation; its blind tests reported outputs needing "two to three times fewer edits" than Google or GPT-4 for many language pairs. General LLMs (Claude, GPT-4o-class, Gemini) win on context, idiom, tone, and following editorial instructions, and are stronger for marketing/transcreation. A practical rule: DeepL/specialized NMT for high-volume technical strings; an instruction-following LLM for nuanced, brand, or low-resource content; route by language pair and content type. Newer research formalizes terminology-constrained MT and back-translation verification (e.g. LLM-BT, arXiv:2506.08174; dual-stage terminology-aware translation, arXiv:2511.07461) — useful when terminology consistency is contractual.
Translation memory and termbase are the durable assets. TM stores approved source→target segments and serves exact and fuzzy matches; the termbase/glossary enforces brand and domain vocabulary. Together they cut cost on every subsequent job, keep terminology consistent across 1,000 pages, and feed the LLM as deterministic constraints. In a custom CMS, store TM in your own database (segment, locale pair, hash, status, approver, last-used) so you own the asset and can use it as RAG context for the LLM step.
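A minimal in-memory sketch of the exact and fuzzy lookup described above. The entry shape, the 0.75 threshold, and the token-overlap scorer are assumptions for illustration; production TMs typically use edit-distance scoring and a database index rather than a linear scan.

```typescript
interface TmEntry {
  source: string;
  target: string;
  sourceLocale: string;
  targetLocale: string;
  status: "approved" | "draft";
}

type TmHit =
  | { kind: "exact"; target: string }
  | { kind: "fuzzy"; target: string; similarity: number }
  | { kind: "miss" };

// Collapse whitespace so trivial spacing differences still count as exact matches.
const normalize = (text: string): string => text.trim().replace(/\s+/g, " ");

// Token-overlap (Jaccard) similarity in [0, 1]; a stand-in for a real edit-distance scorer.
function similarity(a: string, b: string): number {
  const ta = new Set(normalize(a).split(" "));
  const tb = new Set(normalize(b).split(" "));
  let shared = 0;
  ta.forEach((token) => {
    if (tb.has(token)) shared++;
  });
  return shared / (ta.size + tb.size - shared);
}

function lookup(
  tm: TmEntry[],
  source: string,
  sourceLocale: string,
  targetLocale: string,
  threshold = 0.75, // illustrative cut-off for "fuzzy"
): TmHit {
  let best: { target: string; similarity: number } | undefined;
  for (const entry of tm) {
    // Only approved segments for this exact locale pair are reusable.
    if (entry.status !== "approved") continue;
    if (entry.sourceLocale !== sourceLocale || entry.targetLocale !== targetLocale) continue;
    if (normalize(entry.source) === normalize(source)) return { kind: "exact", target: entry.target };
    const score = similarity(entry.source, source);
    if (score >= threshold && (!best || score > best.similarity)) {
      best = { target: entry.target, similarity: score };
    }
  }
  return best ? { kind: "fuzzy", target: best.target, similarity: best.similarity } : { kind: "miss" };
}
```

An exact hit can be reused directly, a fuzzy hit pre-fills the segment for light review, and a miss falls through to the MT/LLM step.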
Glossary injection, done right. Because LLMs honor glossaries probabilistically, enforce at three points: (a) inject the relevant termbase entries into the prompt, (b) run a deterministic post-check that fails the segment if a forbidden term appears, and (c) for critical terms, back-translate and compare. Hybrid LLM+NMT setups have hit high MQM scores with these guardrails.
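Point (b), the deterministic post-check, might look like the sketch below. The rule shape and names are hypothetical, and the case-insensitive substring test is a simplification: inflected languages usually need stemming or lemma matching.

```typescript
interface TermRule {
  source: string;       // term as it appears in the source text
  required: string;     // approved rendering in the target locale
  forbidden?: string[]; // renderings that must never appear
}

interface GlossaryViolation {
  term: string;
  problem: "missing-required" | "forbidden-present";
  detail: string;
}

// Case-insensitive substring test (simplification, see note above).
const contains = (text: string, needle: string): boolean =>
  text.toLowerCase().includes(needle.toLowerCase());

function checkGlossary(source: string, translation: string, rules: TermRule[]): GlossaryViolation[] {
  const violations: GlossaryViolation[] = [];
  for (const rule of rules) {
    if (!contains(source, rule.source)) continue; // rule does not apply to this segment
    if (!contains(translation, rule.required)) {
      violations.push({ term: rule.source, problem: "missing-required", detail: rule.required });
    }
    for (const bad of rule.forbidden ?? []) {
      if (contains(translation, bad)) {
        violations.push({ term: rule.source, problem: "forbidden-present", detail: bad });
      }
    }
  }
  return violations; // an empty array means the segment passes
}
```

Because this check is plain string logic, it gives the same verdict on every run, which is the property the LLM step alone cannot provide.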
Human-in-the-loop, risk-tiered. Not every string deserves the same scrutiny. Auto-approve 100% TM/fuzzy matches; route only low-confidence or high-stakes segments to humans. This is where MQM-style quality estimation pays off — it tells the workflow which segments need eyes.
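The tiering can be expressed as a small routing function. The segment fields, route names, and the 0.8 confidence threshold are illustrative assumptions; the score is whatever quality-estimation signal your pipeline produces.

```typescript
type Risk = "high" | "medium" | "low";
type Route = "full-human-edit" | "spot-review" | "auto-approve";

interface Segment {
  risk: Risk;                          // set per content type: legal and brand copy are "high"
  tmMatch: "exact" | "fuzzy" | "none"; // result of the translation-memory lookup
  qeScore: number;                     // quality-estimation score in [0, 1], higher is better
}

// Illustrative threshold; tune it against samples that humans have already reviewed.
const LOW_CONFIDENCE = 0.8;

function routeSegment(seg: Segment): Route {
  if (seg.risk === "high") return "full-human-edit";                      // always gets a full edit
  if (seg.risk === "low" || seg.tmMatch === "exact") return "auto-approve"; // low stakes or already approved
  return seg.qeScore < LOW_CONFIDENCE ? "spot-review" : "auto-approve";    // medium: review only if unsure
}
```

Checking risk before the TM match means a high-stakes segment is reviewed even when an older approved translation exists.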
hreflang tells search engines which localized URL to serve to which audience. Google's own documentation and the 2025–2026 SEO guides converge on a strict set of rules:
- The set of `<link>` elements is identical on every variant. If A links to B, B must link back to A.
- Use absolute URLs: `https://site.com/fr/contact`, never `/fr/contact`.
- Use language or language-region codes such as `en`, `en-US`, `fr-CA`.
- `x-default`. Add one `hreflang="x-default"` pointing to the fallback users land on when no locale matches, typically your default-language or language-selector page. It is "insurance against uncertainty."
- Deliver via `<link>` in `<head>` (most common), HTTP `Link` headers (for non-HTML like PDFs), or an XML sitemap (best at very large scale; see scaling below). Pick one method per URL; don't mix conflicting signals.

Minimal example for three locales plus default:
<link rel="alternate" hreflang="en-us" href="https://site.com/about" />
<link rel="alternate" hreflang="fr-fr" href="https://site.com/fr/a-propos" />
<link rel="alternate" hreflang="ar" href="https://site.com/ar/about" />
<link rel="alternate" hreflang="x-default" href="https://site.com/about" />
For an AI-native CMS, generate hreflang programmatically from the locale model and the per-locale slug table — manual maintenance across 1,000 pages and N locales is the single most common source of broken international SEO. Validate continuously (Screaming Frog, Ahrefs, or a CI check) because a single non-reciprocal link can cause Google to ignore the cluster.
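One way to generate the cluster programmatically is sketched below. The `PageVariant` shape and function names are hypothetical; the key idea is that the same array is emitted on every variant of the page, which makes the links reciprocal by construction.

```typescript
interface PageVariant {
  locale: string; // BCP 47 code from the locale model
  path: string;   // per-locale slug path, e.g. "/fr/a-propos"
}

// Escape the characters that are unsafe inside a double-quoted HTML attribute.
const escapeAttr = (value: string): string =>
  value.replace(/&/g, "&amp;").replace(/"/g, "&quot;").replace(/</g, "&lt;");

function hreflangLinks(origin: string, variants: PageVariant[], defaultLocale: string): string[] {
  const fallback = variants.find((v) => v.locale === defaultLocale);
  if (!fallback) throw new Error(`No variant for default locale ${defaultLocale}`);

  const base = origin.replace(/\/$/, ""); // absolute URLs only, no trailing slash on the origin
  const link = (hreflang: string, path: string): string =>
    `<link rel="alternate" hreflang="${escapeAttr(hreflang)}" href="${escapeAttr(base + path)}" />`;

  return [
    ...variants.map((v) => link(v.locale.toLowerCase(), v.path)),
    link("x-default", fallback.path), // one x-default pointing at the fallback page
  ];
}
```

Throwing when the default variant is missing turns a silent SEO defect into a build-time failure, which is the kind of check that belongs in CI.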
Roughly a fifth of the world reads right-to-left (Arabic, Hebrew, Persian, Urdu). RTL is not "flip everything" — it is mirroring layout while keeping numbers, code, and embedded LTR strings correct. The 2026 consensus is to design direction-agnostically from the start using CSS logical properties rather than retrofitting.
Core techniques:
- `<html lang="ar" dir="rtl">`. The `lang` and `dir` attributes must reflect the rendered locale; drive both from the locale model.
- Use CSS logical properties: `margin-inline-start` / `margin-inline-end`, `padding-inline-*`, `border-inline-*`, `inset-inline-*`, and `text-align: start | end` instead of left/right. These flip automatically with `dir`, so one stylesheet serves both directions. Browser support is excellent in 2026: "safe to use without fallbacks." Flexbox and Grid also reverse automatically under `dir="rtl"`.
- Wrap embedded LTR runs in `<bdi>` or Unicode isolates to prevent reordering glitches. Set `letter-spacing: 0` for Arabic: its letters connect and spacing breaks the script.

Frameworks help: Tailwind's RTL variants and logical-property utilities, and component libraries with built-in RTL (Untitled UI, Flowbite, MUI) reduce manual work, but the discipline is yours.
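On the server side, two small helpers cover the root attributes and the Unicode-isolate technique. This is a sketch under assumptions: the RTL language list is illustrative and not exhaustive, and a CMS that stores `direction` on the locale record should read it from there rather than infer it.

```typescript
// Primary language subtags commonly written right-to-left (illustrative, non-exhaustive).
const RTL_LANGUAGES = new Set(["ar", "he", "fa", "ur"]);

// Derive the attributes for the root <html> element from a BCP 47 code.
function htmlAttributes(locale: string): { lang: string; dir: "ltr" | "rtl" } {
  const language = (locale.split("-")[0] ?? locale).toLowerCase();
  return { lang: locale, dir: RTL_LANGUAGES.has(language) ? "rtl" : "ltr" };
}

// Wrap an embedded run (order number, SKU, username) in Unicode isolates:
// U+2068 FIRST STRONG ISOLATE before, U+2069 POP DIRECTIONAL ISOLATE after.
// This is the plain-text counterpart of wrapping the run in <bdi>.
function isolate(text: string): string {
  return "\u2068" + text + "\u2069";
}
```

The isolate characters are useful where markup is unavailable, such as email subjects, push notifications, or `title` attributes.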
Never hand-format dates, numbers, or currencies. The browser and Node already ship the world's locale data via Unicode CLDR ("all major browsers and modern phones use CLDR"), exposed through the standard Intl API (ECMA-402):
- `Intl.NumberFormat(locale, { style: 'currency', currency: 'EUR' })`: handles grouping separators, decimal marks, and currency placement per locale (ISO 4217 codes).
- `Intl.DateTimeFormat(locale, {...})`: resolves dd/mm vs. mm/dd, calendars, and time zones.
- `Intl.RelativeTimeFormat`, `Intl.ListFormat`, `Intl.PluralRules`, `Intl.Collator`: relative times, lists, plural categories, and language-correct sorting.

For interpolated, pluralized, and gendered UI strings, use MessageFormat 2 (MF2). The Unicode MF2 spec reached Stable status in CLDR 47 (March 2025) and continues to evolve (LDML 48, October 2025); the JS messageformat 4.0 library provides an MF2 formatter plus a polyfill of the runtime API proposed for ECMA-402. MF2 replaces brittle string concatenation with a single message that encodes plural and select logic, which is critical because plural rules differ wildly (English has 2 categories, Arabic has 6, Polish 4). Store source messages in MF2/ICU syntax, validate placeholder integrity in your QA step, and never let translators "translate around" a placeholder.
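A short demonstration of the standard `Intl` formatters and of how plural categories vary by language. Variable names and sample values are arbitrary; the exact rendered strings depend on the CLDR data shipped with the runtime, so the sketch prints them rather than asserting a particular spelling.

```typescript
const price = 1234.5;

// Same number, same currency: grouping, decimal mark, and symbol placement follow the locale.
const usPrice = new Intl.NumberFormat("en-US", { style: "currency", currency: "EUR" }).format(price);
const dePrice = new Intl.NumberFormat("de-DE", { style: "currency", currency: "EUR" }).format(price);

// Fixing the time zone keeps the output independent of the server's local zone.
const launch = new Date(Date.UTC(2025, 2, 14)); // 14 March 2025, UTC (months are zero-based)
const usDate = new Intl.DateTimeFormat("en-US", { timeZone: "UTC" }).format(launch);
const gbDate = new Intl.DateTimeFormat("en-GB", { timeZone: "UTC" }).format(launch);

// Collect the distinct plural categories a locale assigns to a set of sample numbers.
function pluralCategories(locale: string, numbers: number[]): string[] {
  const rules = new Intl.PluralRules(locale);
  const seen: string[] = [];
  for (const n of numbers) {
    const category = rules.select(n);
    if (seen.indexOf(category) === -1) seen.push(category);
  }
  return seen;
}

const sampleNumbers = [0, 1, 2, 3, 11, 100, 1.5];
console.log(usPrice, "|", dePrice, "|", usDate, "|", gbDate);
console.log(
  pluralCategories("en", sampleNumbers),
  pluralCategories("pl", sampleNumbers),
  pluralCategories("ar", sampleNumbers),
);
```

Each category returned for a locale is a branch the translated message must supply, which is why a single English "one/other" pair cannot simply be copied into Polish or Arabic.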
A subtle point for AI pipelines: ICU/MF2 syntax must survive the LLM. Lock placeholders ({count}, {name}) as non-translatable tokens, pass them through verbatim, and fail QA if the count or names change.
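One way to lock placeholders is to mask them before the model call and restore them afterwards, then verify that the set is unchanged. This sketch handles flat named arguments only; nested plural or select constructs need a real ICU/MF2 parser. The `__PHn__` token format and function names are assumptions for illustration.

```typescript
// Matches simple named placeholders such as {count} or {name} (flat arguments only).
const PLACEHOLDER = /\{[A-Za-z_][A-Za-z0-9_]*\}/g;

// Sorted list of placeholders, so order changes in translation do not count as errors.
function placeholders(message: string): string[] {
  return (message.match(PLACEHOLDER) ?? []).slice().sort();
}

// Swap placeholders for opaque tokens before the LLM call...
function protect(message: string): { masked: string; tokens: string[] } {
  const tokens: string[] = [];
  const masked = message.replace(PLACEHOLDER, (found: string) => {
    tokens.push(found);
    return `__PH${tokens.length - 1}__`;
  });
  return { masked, tokens };
}

// ...and put the originals back afterwards, leaving unknown tokens untouched.
function restore(translated: string, tokens: string[]): string {
  return translated.replace(/__PH(\d+)__/g, (whole: string, index: string) => tokens[Number(index)] ?? whole);
}

// QA gate: the translation must carry exactly the same placeholders as the source.
function placeholdersIntact(source: string, target: string): boolean {
  return placeholders(source).join("|") === placeholders(target).join("|");
}
```

A typical flow is `protect` on the source, translate the masked text, `restore` the result, and fail the segment if `placeholdersIntact` returns false.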
At small scale, manual translation in the CMS works. Past a few hundred pages and a handful of locales, you need continuous localization — the same CI/CD discipline applied to content.
| Concern | Pattern |
|---|---|
| String extraction | Externalize all UI text into resource files / a content store; never hard-code. New strings are detected automatically. |
| Source of truth | Keep source strings in one place (repo or CMS); treat target locales as derived artifacts. |
| Automation | Webhook on content publish → enqueue translation job → MT/LLM + TM → QA → review tier → publish back. Crowdin/Lokalise connect to GitHub/GitLab/Bitbucket, "detecting new strings automatically, syncing translations back via pull requests, and using webhooks to notify your pipeline." |
| Platforms | Crowdin (600+ integrations, strong for LSP-style multi-project, scale), Lokalise (deep dev workflows: Git SDKs, webhooks, OTA, no-code "Workflows" sequencing assign→translate→QA), GitLocalize (Git-native, community/docs), Smartling (enterprise + managed services). Or build a thin in-house orchestrator over your TM and MT engines if you want to own the loop. |
| hreflang at scale | Generate from the locale model; deliver via XML sitemap rather than per-page <head> tags when N×pages is large. |
| Cost control | TM-first (free reuse), fuzzy pre-fill, batch identical content, route low-risk segments to MT-only auto-approve, reserve human review for high-value pages. Lokalise's batching "groups content to optimize costs." |
| Quality at scale | Pseudolocalization in CI catches layout/overflow before translation; automated MQM/QE scoring flags only the segments that need humans. |
| Freshness | Source-change detection invalidates affected translations and marks them "needs update," so the system never silently serves stale localized content. |
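The freshness row can be implemented by storing, with each translation, a hash of the source text it was made from. The sketch below uses a small FNV-1a hash only to stay dependency-free; that choice, the record shape, and the function names are assumptions, and a production system would more likely use SHA-256 from its platform's crypto library.

```typescript
type TranslationStatus = "current" | "needs-update";

interface StoredTranslation {
  locale: string;
  text: string;
  sourceHash: string; // hash of the source text this translation was made from
  status: TranslationStatus;
}

// FNV-1a, 32-bit, hex-encoded. Stand-in for a cryptographic hash (see note above).
function hashSource(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // multiply by the FNV prime, keep unsigned 32 bits
  }
  return h.toString(16).padStart(8, "0");
}

// Called when source content is published: flag every translation made from an older source.
function markStale(currentSource: string, translations: StoredTranslation[]): StoredTranslation[] {
  const hash = hashSource(currentSource);
  return translations.map((t) =>
    t.sourceHash === hash ? t : { ...t, status: "needs-update" as const },
  );
}
```

Translations flagged `needs-update` can keep being served while they are re-queued, so a source edit never blanks a page but also never passes silently.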
The architectural payoff of doing locale modeling, externalized strings, and a TM-backed pipeline correctly is exactly this: adding the 25th locale or the 5,000th page is a configuration and budget question, not an engineering project. That is the entire promise of i18n done up front.
- Subdirectories (`/fr/`) are the right default routing strategy for most sites: they inherit domain authority and minimize operational surface; reserve subdomains/ccTLDs for governance or legal isolation.
- Generate hreflang (including `x-default`) programmatically from the locale model; manual maintenance is the top cause of broken international SEO.
- Set `dir`/`lang` from day one, test with real Arabic/Hebrew content, and use `<bdi>`/isolates for embedded LTR runs.
- Defer formatting to `Intl`/CLDR, and use MessageFormat 2 (stable since CLDR 47, March 2025) for plurals and gender; protect placeholders through the AI step.

`Intl.NumberFormat`/`DateTimeFormat`/`PluralRules` reference.