Training-Free Voice Conversion with Factorized Optimal Transport

Authors: Alexander Lobashev, Assel Yermekova, Maria Larchenko

Interspeech 2025
optimal-transport voice-conversion audio-processing training-free zero-shot

🎯 Key Contributions

  • Factorized MKL-VC replaces kNN regression with an optimal transport map
  • High-quality cross-lingual conversion with only 5 seconds of reference audio
  • Derived from the Monge-Kantorovich linear (MKL) solution
  • Outperforms kNN-VC, comparable to FACodec in cross-lingual domain
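
For orientation, the Monge-Kantorovich linear solution has a well-known closed form when source and target features are modeled as Gaussians: T(x) = mu_t + A (x - mu_s), with A = S^(-1/2) (S^(1/2) C S^(1/2))^(1/2) S^(-1/2), where S and C are the source and target covariances. The following is a minimal NumPy sketch of that map, not the authors' implementation. It assumes frame-level feature matrices of shape (n_frames, dim); the names sqrtm_psd and mkl_map and the eps regularizer are illustrative choices.

```python
import numpy as np


def sqrtm_psd(mat):
    """Square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def mkl_map(source, target, eps=1e-6):
    """Fit the Gaussian (linear) optimal transport map from source to target.

    source, target: arrays of shape (n_frames, dim).
    eps: assumed diagonal regularizer that keeps the covariances invertible.
    Returns a function that maps an (n, dim) array of source-like features.
    """
    dim = source.shape[1]
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    cov_s = np.cov(source, rowvar=False).reshape(dim, dim) + eps * np.eye(dim)
    cov_t = np.cov(target, rowvar=False).reshape(dim, dim) + eps * np.eye(dim)

    s_half = sqrtm_psd(cov_s)
    s_inv_half = np.linalg.inv(s_half)
    # A is symmetric and satisfies A @ cov_s @ A == cov_t.
    a_mat = s_inv_half @ sqrtm_psd(s_half @ cov_t @ s_half) @ s_inv_half

    return lambda x: mu_t + (x - mu_s) @ a_mat.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.normal(size=(500, 4)) * np.array([1.0, 2.0, 0.5, 1.5])
    tgt = rng.normal(size=(400, 4)) * np.array([2.0, 1.0, 1.0, 0.5]) + 3.0
    converted = mkl_map(src, tgt)(src)
    print(converted.shape)
```

Unlike kNN regression, which replaces each source frame with an average of its nearest target frames, this map is a single affine transform fitted from first- and second-order statistics, so only two means and two covariances need to be estimated from the reference audio.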

Abstract

We introduce a novel training-free approach for voice conversion that leverages factorized optimal transport to transfer vocal characteristics between speakers without requiring parallel data or speaker-specific training. Traditional voice conversion methods rely on extensive training data and speaker-dependent models, limiting their applicability to new speakers or low-resource scenarios.

Our method factorizes the optimal transport problem into separable components corresponding to different acoustic attributes:

  • Pitch contour: Fundamental frequency and intonation patterns
  • Timbre: Spectral envelope and vocal tract characteristics
  • Prosody: Rhythm, stress, and temporal dynamics
  • Linguistic content: Phonetic and semantic information (preserved)

This factorization allows for zero-shot voice conversion while preserving linguistic content and natural speech quality. Each component is handled through separate optimal transport mappings that can be independently controlled, enabling fine-grained manipulation of converted speech characteristics.
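
One way to picture separate, independently controlled mappings is to split the feature vector into groups of columns and fit one transport map per group. The sketch below makes several assumptions that do not come from the paper: the group layout (which columns hold pitch, timbre, or content) is hypothetical, each group uses the simplest per-dimension Gaussian map x -> mu_t + (sigma_t / sigma_s) (x - mu_s), and the strength parameter interpolates linearly between the source features (0) and the fully mapped features (1).

```python
import numpy as np


def fit_diag_map(src, tgt, eps=1e-8):
    """Per-dimension Gaussian OT map parameters for one feature group."""
    mu_s, mu_t = src.mean(axis=0), tgt.mean(axis=0)
    scale = tgt.std(axis=0) / (src.std(axis=0) + eps)
    return mu_s, mu_t, scale


def convert(src, tgt, groups, strength):
    """Apply one map per feature group, each with its own strength.

    groups:   dict name -> slice of feature columns (hypothetical layout).
    strength: dict name -> value in [0, 1]; groups not listed stay unchanged.
    """
    out = src.copy()
    for name, cols in groups.items():
        alpha = strength.get(name, 0.0)
        if alpha == 0.0:
            continue  # e.g. linguistic content is preserved as-is
        mu_s, mu_t, scale = fit_diag_map(src[:, cols], tgt[:, cols])
        mapped = mu_t + scale * (src[:, cols] - mu_s)
        out[:, cols] = (1.0 - alpha) * src[:, cols] + alpha * mapped
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    src = rng.normal(size=(200, 6))
    tgt = rng.normal(loc=2.0, scale=3.0, size=(150, 6))
    # Hypothetical column layout, for illustration only.
    groups = {"pitch": slice(0, 1), "timbre": slice(1, 4), "content": slice(4, 6)}
    out = convert(src, tgt, groups, {"pitch": 1.0, "timbre": 0.5})
    print(out.shape)
```

Because each group is fitted and applied on its own, changing the strength of one attribute leaves the others untouched, which is the kind of fine-grained control the paragraph above describes.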

We report state-of-the-art performance on standard benchmarks, including the VCC2018 and VCC2020 datasets, with particular advantages in:

  • Cross-lingual voice conversion scenarios
  • Conversion to previously unseen target speakers
  • Preservation of emotional expressiveness and naturalness
  • Computational efficiency compared to neural voice conversion methods

The training-free nature of our approach makes it immediately applicable to new speakers and languages without additional data collection or model retraining.

📋 Citation

@article{lobashev2025training,
  title={Training-Free Voice Conversion with Factorized Optimal Transport},
  author={Lobashev, Alexander and Yermekova, Assel and Larchenko, Maria},
  journal={arXiv preprint arXiv:2506.09709},
  year={2025}
}