Dec 20, 2025

Building PowerTTS: PPT → Audio → Video Pipeline (FastAPI + React)

PowerPoint to TTS Converter

Problem / Context

Creating professional narrated presentations is time-consuming.
Users (content creators, educators, businesses) typically need to:
- Record narration per slide manually
- Sync audio with slides
- Export to video for sharing
- Manage multiple voices and languages
The traditional workflow spans multiple tools and takes hours.
PowerTTS goal: upload a PowerPoint, select a voice, and receive a fully narrated video quickly.

Constraints

Compliance & Audit

File size limit: max 50MB per PowerPoint (avoid server overload)
Strict validation: only .ppt and .pptx (prevent malicious uploads)
Project isolation: unique UUID per project (prevent data leakage)
Cleanup: automatic removal of temporary/expired project files
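
The validation and isolation rules above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the names `validate_upload`, `new_project_dir`, and `UploadRejected` are hypothetical, and a real FastAPI handler would call something like this before writing the upload to disk.

```python
import uuid
from pathlib import Path

MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # 50MB limit from the constraint above
ALLOWED_EXTENSIONS = {".ppt", ".pptx"}


class UploadRejected(ValueError):
    """Raised when an upload fails validation (hypothetical helper type)."""


def validate_upload(filename: str, size_bytes: int) -> str:
    """Return the lowercased extension, or raise UploadRejected."""
    # Path(...).name drops directory components a client may have sent.
    ext = Path(Path(filename).name).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise UploadRejected(f"unsupported file type: {ext or '(none)'}")
    if size_bytes <= 0:
        raise UploadRejected("empty upload")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise UploadRejected("file exceeds the 50MB limit")
    return ext


def new_project_dir(root: Path) -> Path:
    """Create an isolated working directory named by a random UUID."""
    project_dir = root / str(uuid.uuid4())
    # exist_ok=False makes an (extremely unlikely) collision fail loudly.
    project_dir.mkdir(parents=True, exist_ok=False)
    return project_dir
```

An extension check alone does not prove the content is a real PowerPoint file, so this would be a first gate rather than the only one.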

Legacy Data & Compatibility

Supports .ppt (legacy) and .pptx (modern)
.ppt conversion requires LibreOffice
Cross-platform support: Windows (COM automation), Linux, macOS
Fallback slide rendering methods: LibreOffice → python-pptx
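
One way to express the fallback order above is a small runner that tries each renderer in turn and moves on when one fails or yields nothing. This is a sketch under assumptions: `render_with_fallbacks` and the `Renderer` signature are hypothetical, and the actual LibreOffice and python-pptx renderers are left to the caller.

```python
import logging
from pathlib import Path
from typing import Callable, Sequence

logger = logging.getLogger("powertts.render")  # hypothetical logger name

# A renderer takes (presentation path, output dir) and returns slide image paths.
Renderer = Callable[[Path, Path], list[Path]]


def render_with_fallbacks(
    presentation: Path,
    out_dir: Path,
    renderers: Sequence[tuple[str, Renderer]],
) -> list[Path]:
    """Try each named renderer in order; return the first non-empty result."""
    errors: list[str] = []
    for name, render in renderers:
        try:
            images = render(presentation, out_dir)
        except Exception as exc:  # any failure triggers the next fallback
            logger.warning("renderer %s failed: %s", name, exc)
            errors.append(f"{name}: {exc}")
            continue
        if images:
            return images
        errors.append(f"{name}: produced no images")
    raise RuntimeError("all slide renderers failed: " + "; ".join(errors))
```

Passing the renderers as an ordered list keeps the priority (LibreOffice first, python-pptx second) in one place and makes it easy to add another method later.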

Timelines

Real-time progress tracking for long operations
Video generation runs in background threads (avoid blocking API)
Timeout handling:
- LibreOffice conversions: 300s
- FFmpeg operations: 60s
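
The timeouts above map naturally onto the `timeout` argument of `subprocess.run`, which kills the child process when the limit is exceeded. The wrapper below is a minimal sketch; `run_with_timeout` and the constant names are assumptions for illustration, not the project's actual helpers.

```python
import subprocess

LIBREOFFICE_TIMEOUT_S = 300  # limit for LibreOffice conversions
FFMPEG_TIMEOUT_S = 60        # limit for FFmpeg operations


def run_with_timeout(cmd: list[str], timeout_s: float) -> subprocess.CompletedProcess:
    """Run cmd and return its result; raise TimeoutError if it runs too long."""
    try:
        # check=True raises CalledProcessError on a non-zero exit code.
        return subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s, check=True
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run has already killed the child at this point.
        raise TimeoutError(f"{cmd[0]} exceeded {timeout_s}s") from exc
```

A LibreOffice conversion might then be invoked as `run_with_timeout(["soffice", "--headless", "--convert-to", "pptx", "--outdir", out_dir, ppt_path], LIBREOFFICE_TIMEOUT_S)`, where `out_dir` and `ppt_path` are placeholders.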

Reliability & UX Improvements

Retry logic (3 retries) for TTS failures
Fallbacks (LibreOffice → python-pptx, MoviePy → FFmpeg)
Real-time progress updates (reduces user confusion)
Zero-config deployment with Docker Compose
Multiple export formats + broad voice selection
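
The retry behavior for TTS failures could look roughly like the helper below. It is a sketch that reads "3 retries" as one initial attempt plus up to three more; the exponential backoff with jitter, the name `retry_with_backoff`, and its parameters are assumptions rather than the project's confirmed strategy.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    retries: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); on failure, retry up to `retries` times with growing delays."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt; jitter spreads out concurrent retries.
            delay = base_delay_s * (2 ** attempt) + random.uniform(0, base_delay_s)
            sleep(delay)
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the helper can be exercised in tests without real waiting.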

Lessons Learned

Direct FFmpeg calls were faster than wrapper libraries (MoviePy was slower for this pipeline)
LibreOffice is unreliable for batch export → must have fallbacks
Background tasks require a progress-tracking channel (files/Redis/WebSockets)
Edge TTS enforces rate limits → backoff/retry is essential
File-based storage works well for MVP simplicity
Docker Compose simplifies deployment
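
To illustrate the point about background tasks and a progress channel, here is a minimal file-based version: a worker thread writes a small JSON file that a polling endpoint can read. The file name `progress.json`, the `-1` failure marker, and the functions `write_progress`, `read_progress`, and `start_job` are all hypothetical choices for this sketch.

```python
import json
import os
import threading
from pathlib import Path
from typing import Callable


def write_progress(project_dir: Path, percent: int, message: str) -> None:
    """Write progress atomically so a poller never reads a partial file."""
    tmp = project_dir / "progress.json.tmp"
    tmp.write_text(json.dumps({"percent": percent, "message": message}))
    os.replace(tmp, project_dir / "progress.json")


def read_progress(project_dir: Path) -> dict:
    """Return the latest progress, or a 'queued' default before the first write."""
    path = project_dir / "progress.json"
    if not path.exists():
        return {"percent": 0, "message": "queued"}
    return json.loads(path.read_text())


def start_job(
    project_dir: Path, steps: list[tuple[str, Callable[[], None]]]
) -> threading.Thread:
    """Run labeled steps in a background thread, reporting progress per step."""

    def worker() -> None:
        try:
            for i, (label, step) in enumerate(steps):
                write_progress(project_dir, int(100 * i / len(steps)), label)
                step()
            write_progress(project_dir, 100, "done")
        except Exception as exc:
            write_progress(project_dir, -1, f"failed: {exc}")  # -1 marks failure

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

The API handler would call `start_job(...)` and return immediately, while a separate status route returns `read_progress(project_dir)` to the polling client.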

What to Improve Next

Add database (PostgreSQL) for multi-user + project history
Replace polling with WebSocket progress updates
Add queue system (Redis + Celery) for scalable processing
Add caching (hash text+voice) to avoid regenerating same TTS
Improve slide image fidelity (e.g., Playwright rendering)
Add video customization (resolution/FPS/aspect ratio)
Add audio post-processing (normalization, background music)
Add batch/multi-project processing API
Add analytics dashboard (usage + performance metrics)
Offload encoding to cloud services for cost/scale
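
For the caching idea in the list above (hashing text + voice), a cache key could be derived as in the sketch below. The whitespace normalization, the `.mp3` extension, and the function names `tts_cache_key` and `cached_audio_path` are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def tts_cache_key(text: str, voice: str) -> str:
    """Return a stable SHA-256 hex key for a (text, voice) pair."""
    # Collapse whitespace so trivially different inputs share one cache entry.
    normalized = " ".join(text.split())
    # The NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
    payload = f"{voice}\x00{normalized}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cached_audio_path(cache_dir: Path, text: str, voice: str) -> Path:
    """Location where audio for this text and voice would be stored."""
    return cache_dir / f"{tts_cache_key(text, voice)}.mp3"
```

Before calling the TTS service, the pipeline would check whether `cached_audio_path(...)` already exists and reuse the file if so. Any other setting that changes the audio (rate, pitch) would also need to be part of the key.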