Building PowerTTS: PPT → Audio → Video Pipeline (FastAPI + React)
PowerPoint to TTS Converter
Problem / Context
- Creating professional narrated presentations is time-consuming
- Users (content creators, educators, businesses) typically need to:
  - Record narration per slide manually
  - Sync audio with slides
  - Export to video for sharing
  - Manage multiple voices and languages
- Traditional workflow involves multiple tools and takes hours
- PowerTTS goal: Upload a PowerPoint, select a voice, receive a fully narrated video quickly
Constraints
Compliance & Audit
- File size limit: max 50MB per PowerPoint (avoid server overload)
- Strict validation: only .ppt and .pptx (prevent malicious uploads)
- Project isolation: unique UUID per project (prevent data leakage)
- Cleanup: automatic removal of temporary/expired project files
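The upload rules above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: `validate_upload`, `create_project_dir`, and `UploadRejected` are hypothetical names, and "50MB" is assumed to mean 50 × 1024 × 1024 bytes. An extension check on its own is a weak signal, so a real service would likely inspect the file contents as well.

```python
import uuid
from pathlib import Path

# Assumption: "50MB" is interpreted as 50 MiB.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024
ALLOWED_EXTENSIONS = {".ppt", ".pptx"}


class UploadRejected(ValueError):
    """Raised when an upload fails validation (hypothetical name)."""


def validate_upload(filename: str, size_bytes: int) -> str:
    """Return the lower-cased extension, or raise UploadRejected."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise UploadRejected(f"unsupported file type: {ext or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise UploadRejected("file exceeds the 50MB limit")
    return ext


def create_project_dir(base_dir: Path) -> Path:
    """Create an isolated working directory named after a fresh UUID."""
    project_dir = base_dir / str(uuid.uuid4())
    # exist_ok=False: a name collision should fail loudly instead of
    # silently sharing a directory between two projects.
    project_dir.mkdir(parents=True, exist_ok=False)
    return project_dir
```

In a FastAPI handler, these checks would run before the file is written to disk, and the returned directory would hold every artifact for that project so cleanup can remove it in one step.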
Legacy Data & Compatibility
- Supports .ppt (legacy) and .pptx (modern)
- .ppt conversion requires LibreOffice
- Cross-platform support: Windows (COM automation), Linux, macOS
- Fallback slide rendering methods: LibreOffice → python-pptx
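One way to express the LibreOffice → python-pptx fallback is an ordered list of renderers tried in turn. The sketch below is an assumption about structure only: `render_with_fallbacks` and the `Renderer` signature are hypothetical, and the real renderers (a LibreOffice subprocess call, a python-pptx based routine) are passed in as plain callables instead of being implemented here.

```python
from pathlib import Path
from typing import Callable, Sequence

# A renderer takes (presentation path, output dir) and returns slide image paths.
Renderer = Callable[[Path, Path], list[Path]]


def render_with_fallbacks(
    pptx_path: Path,
    out_dir: Path,
    renderers: Sequence[tuple[str, Renderer]],
) -> tuple[str, list[Path]]:
    """Try each renderer in order; return the first non-empty result."""
    errors: list[str] = []
    for name, render in renderers:
        try:
            images = render(pptx_path, out_dir)
        except Exception as exc:  # any failure moves on to the next renderer
            errors.append(f"{name}: {exc}")
            continue
        if images:
            return name, images
        # A renderer that "succeeds" with no output is treated as a failure.
        errors.append(f"{name}: produced no images")
    raise RuntimeError("all renderers failed: " + "; ".join(errors))
```

Returning the name of the renderer that succeeded makes it easy to log how often the fallback path is actually taken.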
Timelines
- Real-time progress tracking for long operations
- Video generation runs in background threads (avoid blocking API)
- Timeout handling:
  - LibreOffice conversions: 300s
  - FFmpeg operations: 60s
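The timeouts listed above map naturally onto the `timeout` argument of `subprocess.run`, which kills the child process when the limit is exceeded. The following is a minimal sketch; the helper name `run_with_timeout` and the two constants are hypothetical, with values taken from the list above.

```python
import subprocess

LIBREOFFICE_TIMEOUT_S = 300  # from the constraints above
FFMPEG_TIMEOUT_S = 60


def run_with_timeout(cmd: list[str], timeout_s: float) -> subprocess.CompletedProcess:
    """Run an external tool, converting timeouts and failures to RuntimeError."""
    try:
        return subprocess.run(
            cmd,
            capture_output=True,  # keep stdout/stderr for diagnostics
            text=True,
            timeout=timeout_s,    # child is killed if this is exceeded
            check=True,           # non-zero exit raises CalledProcessError
        )
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"{cmd[0]} timed out after {timeout_s}s") from exc
    except subprocess.CalledProcessError as exc:
        detail = (exc.stderr or "").strip()
        raise RuntimeError(f"{cmd[0]} failed: {detail}") from exc
```

A legacy conversion might then be invoked as `run_with_timeout(["soffice", "--headless", "--convert-to", "pptx", "--outdir", out_dir, ppt_path], LIBREOFFICE_TIMEOUT_S)`, assuming the LibreOffice binary is available on the path as `soffice`; the binary name and location vary by platform.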
Reliability & UX Improvements
- Retry logic (3 retries) for TTS failures
- Fallbacks (LibreOffice → python-pptx, MoviePy → FFmpeg)
- Real-time progress updates (reduces user confusion)
- Zero-config deployment with Docker Compose
- Multiple export formats + broad voice selection
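The retry logic can be sketched as a small wrapper with exponential backoff and jitter. This is an illustration under stated assumptions: `retry_with_backoff` is a hypothetical helper, "3 retries" is read as up to three additional attempts after the first failure, and the delay schedule is an arbitrary choice, not the project's actual values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    retries: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func; on failure wait and retry, doubling the delay each time."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= retries:
                raise  # retries exhausted: surface the last error
            # Exponential backoff (1x, 2x, 4x, ...) plus a little random
            # jitter so parallel jobs do not retry in lockstep.
            delay = base_delay_s * (2 ** attempt) + random.uniform(0, base_delay_s / 2)
            sleep(delay)
            attempt += 1
```

The `sleep` parameter exists only so the wait can be stubbed out in tests; in production the default `time.sleep` applies.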
Lessons Learned
- Calling FFmpeg directly outperformed wrapper libraries (MoviePy was slower in this pipeline)
- LibreOffice is unreliable for batch export → must have fallbacks
- Background tasks require a progress-tracking channel (files, Redis, or WebSockets)
- Edge TTS rate limits → backoff/retry is essential
- File-based storage works well for MVP simplicity
- Docker Compose simplifies deployment
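The combination of background threads and file-based progress tracking can be sketched as follows. All names (`write_progress`, `read_progress`, `start_job`) and the JSON shape are hypothetical; the step list stands in for the real slide extraction, TTS, and encoding work.

```python
import json
import os
import threading
from pathlib import Path


def write_progress(progress_file: Path, percent: int, message: str) -> None:
    """Write progress to a temp file, then swap it into place."""
    tmp = progress_file.with_suffix(".tmp")
    tmp.write_text(json.dumps({"percent": percent, "message": message}))
    # Replace in one step so a polling reader does not see a half-written file.
    os.replace(tmp, progress_file)


def read_progress(progress_file: Path) -> dict:
    """Return the latest progress, or a default if the job has not reported yet."""
    if not progress_file.exists():
        return {"percent": 0, "message": "queued"}
    return json.loads(progress_file.read_text())


def start_job(progress_file: Path, steps: list[str]) -> threading.Thread:
    """Run the steps in a background thread, reporting after each one."""
    def worker() -> None:
        for i, step in enumerate(steps, start=1):
            # The real work for `step` would happen here.
            write_progress(progress_file, int(100 * i / len(steps)), step)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

An API endpoint polled by the frontend would simply return `read_progress(...)` for the project's file, which keeps the request handler non-blocking.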
What to Improve Next
- Add database (PostgreSQL) for multi-user + project history
- Replace polling with WebSocket progress updates
- Add queue system (Redis + Celery) for scalable processing
- Add caching (hash text+voice) to avoid regenerating same TTS
- Improve slide image fidelity (e.g., Playwright rendering)
- Add video customization (resolution/FPS/aspect ratio)
- Add audio post-processing (normalization, background music)
- Add batch/multi-project processing API
- Add analytics dashboard (usage + performance metrics)
- Offload encoding to cloud services for cost/scale
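The caching idea (hash text+voice) could look like the sketch below. It is an assumption about one possible design: `tts_cache_key`, `cached_audio_path`, and `get_or_generate` are hypothetical names, the `.mp3` extension is assumed, and `generate` stands in for whatever function actually calls the TTS service.

```python
import hashlib
from pathlib import Path
from typing import Callable


def tts_cache_key(text: str, voice: str) -> str:
    """Derive a stable key from the voice and the narration text."""
    # A NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
    payload = f"{voice}\x00{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cached_audio_path(cache_dir: Path, text: str, voice: str) -> Path:
    return cache_dir / f"{tts_cache_key(text, voice)}.mp3"


def get_or_generate(
    cache_dir: Path,
    text: str,
    voice: str,
    generate: Callable[[str, str], bytes],
) -> Path:
    """Return cached audio if present; otherwise generate and store it."""
    path = cached_audio_path(cache_dir, text, voice)
    if not path.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        path.write_bytes(generate(text, voice))
    return path
```

If other settings affect the audio (speaking rate, pitch), they would need to be part of the hashed payload as well, otherwise stale audio could be served.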