Training-Free Voice Conversion with Factorized Optimal Transport
Authors: Alexander Lobashev, Assel Yermekova, Maria Larchenko
๐ฏ Key Contributions
- Factorized MKL-VC replaces kNN regression with optimal transport map
- High quality cross-lingual conversion with only 5 seconds of reference audio
- Derived from Monge-Kantorovich Linear solution
- Outperforms kNN-VC, comparable to FACodec in cross-lingual domain
๐ Resources
Abstract
We introduce a novel training-free approach for voice conversion that leverages factorized optimal transport to transfer vocal characteristics between speakers without requiring parallel data or speaker-specific training. Traditional voice conversion methods rely on extensive training data and speaker-dependent models, limiting their applicability to new speakers or low-resource scenarios.
Our method factorizes the optimal transport problem into separable components corresponding to different acoustic attributes:
- Pitch contour: Fundamental frequency and intonation patterns
- Timbre: Spectral envelope and vocal tract characteristics
- Prosody: Rhythm, stress, and temporal dynamics
- Linguistic content: Phonetic and semantic information (preserved)
This factorization allows for zero-shot voice conversion while preserving linguistic content and natural speech quality. Each component is handled through separate optimal transport mappings that can be independently controlled, enabling fine-grained manipulation of converted speech characteristics.
We demonstrate state-of-the-art performance on standard benchmarks including VCC2018 and VCC2020 datasets, with particular advantages in:
- Cross-lingual voice conversion scenarios
- Conversion to previously unseen target speakers
- Preservation of emotional expressiveness and naturalness
- Computational efficiency compared to neural voice conversion methods
The training-free nature of our approach makes it immediately applicable to new speakers and languages without additional data collection or model retraining.
๐ Citation
@article{lobashev2025training,
title={Training-Free Voice Conversion with Factorized Optimal Transport},
author={Lobashev, Alexander and Yermekova, Assel and Larchenko, Maria},
journal={arXiv preprint arXiv:2506.09709},
year={2025}
}