Interspeech Paper Walkthrough | Fast Learning for Non-Parallel Many-to-Many Voice Conversion with Residual Star Generative Adversarial Networks

2019-09-11

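Summary: Interspeech is the world's largest and most comprehensive top-tier conference on speech. This article covers the accepted paper by Shengkui Zhao, Trung Hieu Nguyen, Hao Wang, and Bin Ma.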

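INTERSPEECH 2019, the 20th annual conference of the International Speech Communication Association (ISCA), will be held in Graz, Austria, from September 15 to 19. Interspeech is the world's largest and most comprehensive top-tier conference in the speech field; nearly 2,000 practitioners and researchers from industry and academia will take part in keynotes, tutorials, paper sessions, and the main exhibition. Eight Alibaba papers were accepted this year. This article covers one of them, "Fast Learning for Non-Parallel Many-to-Many Voice Conversion with Residual Star Generative Adversarial Networks" by Shengkui Zhao, Trung Hieu Nguyen, Hao Wang, and Bin Ma.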

Paper Walkthrough

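The main goal of voice conversion (VC) is to convert a source speaker's speech so that it sounds like a target speaker while keeping the linguistic content of the original utterance. Voice conversion has many applications, such as speech enhancement, speaking assistance, and personalized text-to-speech (TTS) systems. The best-performing VC systems today, such as those based on Gaussian mixture models (GMMs) or neural networks (NNs), generally rely on parallel training data, that is, recordings in which the source and target speakers utter the same sentences. This restricts them to settings where parallel data can be collected and to one-to-one conversion within a single language. When parallel data are hard to collect, as in cross-lingual or many-to-many voice conversion, the parallel-data requirement severely limits how usable these methods are in practice.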

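Recently, StarGAN, a model based on generative adversarial networks (GANs), was introduced to voice conversion. It exploits StarGAN's many-to-many domain mapping and its ability to train without parallel data, takes only speech features and domain information as input, and has achieved fairly successful many-to-many conversion between different speakers. Building on this StarGAN-VC method, this paper proposes a fast-learning training framework that adds a residual training mechanism. Our method, which we call Res-StarGAN-VC, rests on the observation that the source and target speech features share linguistic content during conversion, so a residual mapping can be realized by adding shortcut connections from the input to the output.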
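In symbols, the shortcut connection described above might be written as follows. The notation is an assumption made for illustration and is not taken from the paper: $x$ is a source acoustic feature sequence, $c$ is the target speaker (domain) code, $F_\theta$ is the generator body with parameters $\theta$, and $G$ is the full generator.

```latex
\[
G(x, c) = x + F_\theta(x, c)
\]
```

Under this reading, $F_\theta$ only has to model the difference between source and target features. The shared linguistic content passes through the identity term, and the addition itself introduces no parameters beyond $\theta$.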

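Experiments show that these shortcut connections accelerate the network's learning without adding parameters or computational complexity, and that they help generate high-quality fake samples from the very start of adversarial training, which improves training quality. Experimental results and subjective evaluations show that, on both monolingual and cross-lingual many-to-many voice conversion tasks, our method offers, compared with StarGAN-VC, (1) faster convergence in adversarial training and (2) clearer pronunciation and better speaker similarity.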
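The two claims about the shortcut (no extra parameters, and usable fake samples from the start of training) can be illustrated with a toy sketch in plain Python. Everything here is an assumption made for illustration: the two-layer body, the dimensions, the initialization scale, the made-up feature frame, and all function names are hypothetical and are not the generator used in the paper. The sketch only compares a generator whose body must produce the whole output with one that adds an identity shortcut from input to output.

```python
import math
import random


def init_body(feat_dim, n_speakers, hidden_dim, scale=0.01, seed=0):
    """Small random weights for a toy two-layer body F(x, c) (hypothetical)."""
    rng = random.Random(seed)
    in_dim = feat_dim + n_speakers
    w1 = [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0.0, scale) for _ in range(hidden_dim)] for _ in range(feat_dim)]
    return w1, w2


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def body(params, x, c):
    """Toy body: features x concatenated with the target speaker code c."""
    w1, w2 = params
    hidden = [math.tanh(a) for a in matvec(w1, x + c)]
    return matvec(w2, hidden)


def plain_generator(params, x, c):
    """Baseline-style generator: the body must produce the whole output frame."""
    return body(params, x, c)


def residual_generator(params, x, c):
    """Residual generator: an identity shortcut adds the input to the body output."""
    return [xi + ri for xi, ri in zip(x, body(params, x, c))]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def count_params(params):
    """Total number of scalar weights in the body."""
    return sum(len(row) for matrix in params for row in matrix)


FEAT_DIM, N_SPEAKERS, HIDDEN_DIM = 8, 3, 16
params = init_body(FEAT_DIM, N_SPEAKERS, HIDDEN_DIM)
x = [1.0, -0.5, 0.8, 0.3, -1.2, 0.6, -0.4, 0.9]  # one made-up feature frame
c = [0.0, 1.0, 0.0]                              # one-hot code of the target speaker

plain_out = plain_generator(params, x, c)
residual_out = residual_generator(params, x, c)

# Both variants use the same weights; the shortcut is a parameter-free addition.
print("parameters (shared by both variants):", count_params(params))
print("distance to input, plain:   ", distance(plain_out, x))
print("distance to input, residual:", distance(residual_out, x))
```

With freshly initialized weights, the residual output stays close to the input frame, which already carries the linguistic content, while the plain output does not resemble it. This is one plausible reading of why the shortcut yields better fake samples early in adversarial training; it is not a reproduction of the paper's experiments.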

Abstract

This paper proposes a fast learning framework for non-parallel many-to-many voice conversion with residual Star Generative Adversarial Networks (StarGAN). In addition to the state-of-the-art StarGAN-VC approach that learns an unreferenced mapping between a group of speakers’ acoustic features for non-parallel many-to-many voice conversion, our method, which we call Res-StarGAN-VC, presents an enhancement by incorporating a residual mapping. The idea is to leverage on the shared linguistic content between source and target features during conversion. The residual mapping is realized by using identity shortcut connections from the input to the output of the generator in Res-StarGAN-VC. Such shortcut connections accelerate the learning process of the network with no increase of parameters and computational complexity. They also help generate high-quality fake samples at the very beginning of the adversarial training. Experiments and subjective evaluations show that the proposed method offers (1) significantly faster convergence in adversarial training and (2) clearer pronunciations and better speaker similarity of converted speech, compared to the StarGAN-VC baseline on both mono-lingual and cross-lingual many-to-many voice conversion tasks.
Index Terms: Voice conversion (VC), non-parallel VC, many-to-many VC, generative adversarial networks (GANs), StarGAN-VC, Res-StarGAN-VC

Compiled by the Alibaba Cloud Developer Community