Home/Case Studies/R2 Digital Library

Automating XML Conversion at Scale with AI

How Zentrovia built an AI-powered XML conversion pipeline that processes 67,000+ pages across 130 titles — reducing cost per title by 69% in five months.

67,000+

Pages processed

130

Titles delivered in 5 months

69%

Cost per title reduction

200+

Target titles/month at scale

Overview

A major healthcare digital library needed to convert thousands of legacy medical textbooks and references into structured XML — at a pace and cost that traditional vendors couldn't match.

The R2 Digital Library is a leading medical and healthcare reference platform used by over 1,000 hospitals, universities, and healthcare institutions across the United States. Their catalog includes thousands of medical textbooks, clinical references, nursing guides, and allied health publications from dozens of publishers.

Each title needed to be converted from its source format (PDF, Word, InDesign, or publisher-specific XML) into the R2 platform's proprietary DocBook-based XML schema — a complex DTD with 69 target elements that must render without errors on the R2 platform.

The challenge: source files came from 290 to 347 publisher-specific input variations. Traditional conversion vendors were quoting $1 to $3 per page — meaning a single 500-page textbook could cost $500 to $1,500 to convert. At that rate, converting the full catalog was economically unviable.

The Challenge

Three problems that made traditional conversion impossible at scale.

Publisher Variation Complexity

290 to 347 distinct publisher-specific input formats — each with different XML schemas, CSS styles, and structural conventions. No two publishers format their content the same way.

Cost Prohibitive at Scale

Traditional XML conversion vendors charge $1 to $3 per page. A 500-page medical textbook costs $500–$1,500 to convert. At thousands of titles, the total investment would exceed the entire project budget.

Quality Requirements

Every output XML must render with zero errors on the R2 platform. The 69 DocBook target elements must map correctly from hundreds of source variations — with no room for structural errors in medical content.

The Solution

A three-phase AI-powered pipeline built specifically for this problem.

Phase 1 — Conversion Engine

Custom DTD Conversion Pipeline

We built a custom conversion engine that maps 290–347 publisher-specific input variations to the R2 platform's 69 DocBook target elements. The engine handles the full complexity of medical publishing: cross-references, citations, figure captions, table structures, index entries, and multi-level heading hierarchies — producing zero-error XML output on the R2 platform.

Phase 2 — Automated Ingestion

BookLoader + Table of Contents Engine

Automated ingestion pipeline with table of contents generation, metadata tagging, and batch processing. This eliminated manual data entry and enabled high-volume processing — each title automatically ingested, structured, and queued for conversion.

Phase 3 — AI-Powered QA (Live April 2026)

Agentic AI Quality Assurance

The breakthrough: AI-powered agentic QA that runs parallel quality checks on every title — completing in minutes what previously took hours. The agent performs full structural XML review, cross-reference validation, DTD compliance checks, and self-corrects known error patterns — dramatically reducing QA overhead while maintaining the same quality standard.

Results

Measurable impact at every stage.

69%

Cost per title reduction

vs. manual baseline

80%

Faster turnaround

AI-powered vs. traditional vendors

99.5%

Accuracy rate

Zero XML errors on platform

200+

Target titles/month

Scaling with AI automation

Conversion reports showing 130 total conversions, 122 completed, processing analytics

Pipeline dashboard showing manuscript processing — 100 total, 94 completed

Admin dashboard with conversion status chart and files converted per month

Publisher management — multi-publisher support with settings configuration

Timeline

From first title to 200/month in six months.

Dec–Jan 2026

Delivered

titles

~15,000

pages

Pipeline development + first production batch. Proving the conversion engine handles publisher variations.

February 2026

Delivered

titles

~20,000

pages

Scaling production with BookLoader automation. Throughput increasing as publisher patterns are mapped.

March 2026

Delivered

titles

~30,000

pages

Full pipeline operating at capacity with manual QA processes.

April 2026

In progress

titles

~40,000

pages

AI Agentic QA deployed. Significant cost and time reduction while maintaining quality.

May–June 2026

Scale plan

160–200

titles

~100,000

pages

Full agentic pipeline with AI handling majority of QA. Scaling to 200+ titles/month.

Technical Architecture

What the AI pipeline automates.

Automated First

Publisher XML variation mapping
DTD compliance validation
Common error pattern detection
Metadata + TOC generation

AI Handles (Agentic)

Full structural XML review
Cross-reference validation
Known publisher anomaly flags
Self-correction on mapped errors

Stays Human

New publisher onboarding
Edge cases flagged by AI
Final sign-off on complex titles
Pipeline maintenance + updates

Tech Stack

What powers the pipeline.

PLATFORM

BackendPython + Node.js microservices

FrontendModern React dashboard with real-time updates

AI/MLClaude-powered conversion with prompt caching + batch processing

XMLIndustry-standard XSLT-based transformation

PDFDeep extraction pipeline — text, tables, images

ePubWCAG-aware accessible ePub 3 output

INFRASTRUCTURE

DatabaseManaged document database with flexible schemas

AuthToken-based sessions with hashed credentials

Real-timeLive progress + status updates over WebSocket

ContainersContainer-native deployment across environments

Pipeline290–347 publisher input variations → 69 DocBook elements

QAAgentic AI QA · parallel processing · self-correction

Need to convert content at scale?
Let's talk.

Book a free consultation and we'll analyze a sample of your content library — at no cost.

Start a conversation

Automating XML Conversion at Scale with AI

A major healthcare digital library needed to convert thousands of legacy medical textbooks and references into structured XML — at a pace and cost that traditional vendors couldn't match.

Three problems that made traditional conversion impossible at scale.

Publisher Variation Complexity

Cost Prohibitive at Scale

Quality Requirements

A three-phase AI-powered pipeline built specifically for this problem.

Custom DTD Conversion Pipeline

BookLoader + Table of Contents Engine

Agentic AI Quality Assurance

Measurable impact at every stage.

From first title to 200/month in six months.

What the AI pipeline automates.

Automated First

AI Handles (Agentic)

Stays Human

What powers the pipeline.

Need to convert content at scale?Let's talk.

Need to convert content at scale?
Let's talk.