Automating XML Conversion at Scale with AI

How Zentrovia built an AI-powered XML conversion pipeline that processed 67,000+ pages across 130 titles and cut cost per title by 69% in five months.

67,000+ pages processed

130 titles delivered in 5 months

69% cost per title reduction

200+ target titles/month at scale

Overview

A major healthcare digital library needed to convert thousands of legacy medical textbooks and references into structured XML — at a pace and cost that traditional vendors couldn't match.

The R2 Digital Library is a leading medical and healthcare reference platform used by over 1,000 hospitals, universities, and healthcare institutions across the United States. Their catalog includes thousands of medical textbooks, clinical references, nursing guides, and allied health publications from dozens of publishers.

Each title needed to be converted from its source format (PDF, Word, InDesign, or publisher-specific XML) into the R2 platform's proprietary DocBook-based XML schema, a complex DTD with 69 target elements. The converted output must render without errors on the R2 platform.

The challenge: source files arrived in 290 to 347 publisher-specific input variations. Traditional conversion vendors were quoting $1 to $3 per page, so a single 500-page textbook could cost $500 to $1,500 to convert. At that rate, converting the full catalog was economically unviable.

The Challenge

Three problems that made traditional conversion impossible at scale.

01

Publisher Variation Complexity

290 to 347 distinct publisher-specific input formats — each with different XML schemas, CSS styles, and structural conventions. No two publishers format their content the same way.

02

Cost Prohibitive at Scale

Traditional XML conversion vendors charge $1 to $3 per page. A 500-page medical textbook costs $500–$1,500 to convert. At thousands of titles, the total investment would exceed the entire project budget.
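As a rough illustration of the arithmetic above, the sketch below computes the vendor quote range for a given page count. The function name `vendor_cost_range` is hypothetical; the per-page rates and page counts are the figures quoted in this case study.

```python
def vendor_cost_range(
    pages: int, low_rate: float = 1.0, high_rate: float = 3.0
) -> tuple[float, float]:
    """Return the (low, high) conversion cost in dollars for a page count."""
    return pages * low_rate, pages * high_rate


# A single 500-page textbook at $1 to $3 per page.
print(vendor_cost_range(500))     # (500.0, 1500.0)
# The 67,000 pages cited in this case study at the same rates.
print(vendor_cost_range(67_000))  # (67000.0, 201000.0)
```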

03

Quality Requirements

Every output XML must render with zero errors on the R2 platform. The 69 DocBook target elements must map correctly from hundreds of source variations — with no room for structural errors in medical content.

The Solution

A three-phase AI-powered pipeline built specifically for this problem.

Phase 1 — Conversion Engine

Custom DTD Conversion Pipeline

We built a custom conversion engine that maps 290–347 publisher-specific input variations to the R2 platform's 69 DocBook target elements. The engine handles the full complexity of medical publishing: cross-references, citations, figure captions, table structures, index entries, and multi-level heading hierarchies — producing zero-error XML output on the R2 platform.
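One way to picture the mapping step is a per-publisher lookup table from source tags to target element names. The sketch below is a minimal illustration using only the Python standard library, not the production engine: the publisher keys, source tags, and the `convert` helper are all hypothetical, and only a handful of standard DocBook element names stand in for the 69 targets.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping tables: one per publisher variation, keyed by source tag.
# These few entries are illustrative only.
PUBLISHER_MAPS = {
    "publisher_a": {"book-chapter": "chapter", "h1": "title", "p": "para", "i": "emphasis"},
    "publisher_b": {"chap": "chapter", "heading": "title", "text": "para", "em": "emphasis"},
}


class UnmappedElementError(ValueError):
    """Raised when a source tag has no known target element."""


def convert(source: ET.Element, tag_map: dict[str, str]) -> ET.Element:
    """Recursively rebuild the tree using target element names."""
    if source.tag not in tag_map:
        raise UnmappedElementError(f"no mapping for <{source.tag}>")
    target = ET.Element(tag_map[source.tag], dict(source.attrib))
    target.text = source.text  # text before the first child
    target.tail = source.tail  # text after the closing tag
    for child in source:
        target.append(convert(child, tag_map))
    return target


sample = ET.fromstring(
    "<chap><heading>Cardiology</heading><text>See <em>Figure 1</em>.</text></chap>"
)
converted = convert(sample, PUBLISHER_MAPS["publisher_b"])
print(ET.tostring(converted, encoding="unicode"))
```

Raising on an unmapped tag, rather than passing it through, is a design choice made for this sketch: it surfaces a new publisher variation immediately instead of emitting XML that fails later on the platform.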

Phase 2 — Automated Ingestion

BookLoader + Table of Contents Engine

An automated ingestion pipeline handles table of contents generation, metadata tagging, and batch processing. This eliminated manual data entry and enabled high-volume processing: each title is automatically ingested, structured, and queued for conversion.
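Table of contents generation can be sketched as a walk over the nested chapter and section elements, collecting each title with its depth. This is an assumption about how such a step might look, not a description of BookLoader itself; `build_toc` is a hypothetical helper and the element names are standard DocBook ones chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Structural elements that contribute an entry (illustrative subset).
STRUCTURAL_TAGS = {"chapter", "sect1", "sect2"}


def build_toc(element: ET.Element, depth: int = 0) -> list[tuple[int, str]]:
    """Collect (depth, title) pairs from nested chapter/section elements."""
    entries = []
    for child in element:
        if child.tag in STRUCTURAL_TAGS:
            title = child.findtext("title", default="(untitled)")
            entries.append((depth, title.strip()))
            entries.extend(build_toc(child, depth + 1))
    return entries


book = ET.fromstring(
    "<book>"
    "<chapter><title>Anatomy</title>"
    "<sect1><title>The Heart</title></sect1>"
    "<sect1><title>The Lungs</title></sect1>"
    "</chapter>"
    "<chapter><title>Pharmacology</title></chapter>"
    "</book>"
)
for depth, title in build_toc(book):
    print("  " * depth + title)
```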

Phase 3 — AI-Powered QA (Live April 2026)

Agentic AI Quality Assurance

The breakthrough: AI-powered agentic QA that runs parallel quality checks on every title, completing in minutes what previously took hours. The agent performs a full structural XML review, cross-reference validation, and DTD compliance checks, and it self-corrects known error patterns. This reduces QA overhead while maintaining the same quality standard.
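The deterministic part of such checks can be sketched without any AI component. The example below runs two independent checks concurrently over one document: a cross-reference check and a missing-title check. It assumes DocBook 4-style `id` and `linkend` attributes; the function names and sample document are hypothetical, and the thread pool only illustrates fanning checks out in parallel.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def check_cross_references(root: ET.Element) -> list[str]:
    """Report every linkend that does not match an id in the document."""
    known_ids = {el.get("id") for el in root.iter() if el.get("id")}
    return [
        f"unresolved linkend '{el.get('linkend')}' on <{el.tag}>"
        for el in root.iter()
        if el.get("linkend") and el.get("linkend") not in known_ids
    ]


def check_titles(root: ET.Element) -> list[str]:
    """Report chapters and sections that lack a non-empty title."""
    return [
        f"<{el.tag}> has no title"
        for el in root.iter()
        if el.tag in {"chapter", "sect1"} and not (el.findtext("title") or "").strip()
    ]


CHECKS = [check_cross_references, check_titles]


def run_qa(xml_text: str) -> list[str]:
    """Run every check concurrently and merge the findings in check order."""
    root = ET.fromstring(xml_text)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda check: check(root), CHECKS))
    return [issue for issues in results for issue in issues]


doc = (
    "<chapter><title>Dosage</title>"
    "<para>See <xref linkend='fig-2'/> and <xref linkend='tab-1'/>.</para>"
    "<table id='tab-1'><title>Doses</title></table>"
    "<sect1><para>Untitled section.</para></sect1>"
    "</chapter>"
)
for issue in run_qa(doc):
    print(issue)
```

Here `tab-1` resolves to the table, while `fig-2` and the untitled `sect1` are reported as findings.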

Results

Measurable impact at every stage.

69% cost per title reduction (vs. manual baseline)

80% faster turnaround (AI-powered vs. traditional vendors)

99.5% accuracy rate (zero XML errors on platform)

200+ target titles/month (scaling with AI automation)

Screenshot: Conversion reports showing 130 total conversions, 122 completed, processing analytics
Screenshot: Pipeline dashboard showing manuscript processing, 100 total, 94 completed
Screenshot: Admin dashboard with conversion status chart and files converted per month
Screenshot: Publisher management, multi-publisher support with settings configuration

Timeline

From first title to 200/month in six months.

Dec–Jan 2026 (delivered): 30 titles, ~15,000 pages
Pipeline development + first production batch. Proving the conversion engine handles publisher variations.

February 2026 (delivered): 40 titles, ~20,000 pages
Scaling production with BookLoader automation. Throughput increasing as publisher patterns are mapped.

March 2026 (delivered): 60 titles, ~30,000 pages
Full pipeline operating at capacity with manual QA processes.

April 2026 (in progress): 80 titles, ~40,000 pages
AI Agentic QA deployed. Significant cost and time reduction while maintaining quality.

May–June 2026 (scale plan): 160–200 titles, ~100,000 pages
Full agentic pipeline with AI handling majority of QA. Scaling to 200+ titles/month.

Technical Architecture

What the AI pipeline automates.

01

Automated First

  • Publisher XML variation mapping
  • DTD compliance validation
  • Common error pattern detection
  • Metadata + TOC generation

02

AI Handles (Agentic)

  • Full structural XML review
  • Cross-reference validation
  • Known publisher anomaly flags
  • Self-correction on mapped errors

03

Stays Human

  • New publisher onboarding
  • Edge cases flagged by AI
  • Final sign-off on complex titles
  • Pipeline maintenance + updates
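The split between automated handling and human review can be pictured as a simple routing rule. The sketch below is illustrative only: the `Finding` type, the pattern identifiers, and the publisher names are hypothetical, and the actual routing criteria are not described in this case study.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO_FIX = "self-correct"
    HUMAN = "human review"


@dataclass
class Finding:
    publisher: str
    pattern: str  # hypothetical identifier for a detected error pattern


# Hypothetical registry of error patterns with a known, mapped correction.
MAPPED_PATTERNS = {"empty-title", "duplicate-id"}
# Hypothetical set of publishers that have already been onboarded.
ONBOARDED_PUBLISHERS = {"publisher_a", "publisher_b"}


def route(finding: Finding) -> Route:
    """Send mapped errors to self-correction and everything else to a person."""
    if finding.publisher not in ONBOARDED_PUBLISHERS:
        return Route.HUMAN  # new publisher onboarding stays human
    if finding.pattern in MAPPED_PATTERNS:
        return Route.AUTO_FIX  # self-correction on mapped errors
    return Route.HUMAN  # unmapped edge cases are flagged for review


print(route(Finding("publisher_a", "duplicate-id")).value)  # self-correct
print(route(Finding("publisher_a", "odd-table")).value)     # human review
print(route(Finding("publisher_z", "empty-title")).value)   # human review
```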

Tech Stack

What powers the pipeline.

PLATFORM

Backend: Python + Node.js microservices
Frontend: Modern React dashboard with real-time updates
AI/ML: Claude-powered conversion with prompt caching + batch processing
XML: Industry-standard XSLT-based transformation
PDF: Deep extraction pipeline (text, tables, images)
ePub: WCAG-aware accessible ePub 3 output

INFRASTRUCTURE

Database: Managed document database with flexible schemas
Auth: Token-based sessions with hashed credentials
Real-time: Live progress + status updates over WebSocket
Containers: Container-native deployment across environments
Pipeline: 290–347 publisher input variations → 69 DocBook elements
QA: Agentic AI QA · parallel processing · self-correction
