Code-switching speech synthesis for Mandarin-English using FastSpeech2: A unified IPA-based approach
Author: Wang Yinqiu
Overview:
This research explores two main methods for synthesizing natural-sounding code-switched speech between Mandarin and English using the FastSpeech2 model:
Method 1: Modeling Mandarin and English phonemes directly. This required aligning the audio with the Montreal Forced Aligner and building a mixed-language pronunciation dictionary and acoustic model.
Method 2: Representing the input for both languages in a single format, as phonological features based on the International Phonetic Alphabet (IPA), implemented using the IMS-Toucan GitHub repository.
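As a rough illustration of the mixed dictionary mentioned under Method 1, the sketch below merges a Mandarin and an English pronunciation lexicon into one file in the plain "WORD PHONE PHONE ..." layout that forced aligners such as the Montreal Forced Aligner read. The function names, the toy entries, and the cmn_/eng_ phone prefixes are assumptions made for this example; the prefixes are just one possible way to keep the two phone inventories from colliding, not necessarily what was done in this project.

```python
from typing import Dict, List, Tuple

# word -> list of pronunciations, each a tuple of phone symbols
Lexicon = Dict[str, List[Tuple[str, ...]]]


def parse_lexicon(text: str) -> Lexicon:
    """Parse 'WORD PH1 PH2 ...' lines; lines without phones are skipped."""
    lexicon: Lexicon = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        word, phones = parts[0], tuple(parts[1:])
        prons = lexicon.setdefault(word, [])
        if phones not in prons:
            prons.append(phones)
    return lexicon


def merge_lexicons(mandarin: Lexicon, english: Lexicon) -> Lexicon:
    """Merge two lexicons, tagging every phone with its language.

    The language prefix is an assumption of this sketch: it keeps phones
    that happen to share a symbol in both inventories distinct.
    """
    merged: Lexicon = {}
    for tag, lexicon in (("cmn", mandarin), ("eng", english)):
        for word, prons in lexicon.items():
            for phones in prons:
                tagged = tuple(f"{tag}_{p}" for p in phones)
                entry = merged.setdefault(word, [])
                if tagged not in entry:
                    entry.append(tagged)
    return merged


def format_lexicon(lexicon: Lexicon) -> str:
    """Serialize back to one 'WORD<TAB>PH1 PH2 ...' line per pronunciation."""
    lines = []
    for word in sorted(lexicon):
        for phones in lexicon[word]:
            lines.append(word + "\t" + " ".join(phones))
    return "\n".join(lines)


# Toy entries for illustration only (pinyin-style and ARPABET-style phones).
MANDARIN_TOY = "这个 zh e4 g e4\n很 h en3"
ENGLISH_TOY = "topic T AA1 P IH0 K\nchance CH AE1 N S"

mixed = merge_lexicons(parse_lexicon(MANDARIN_TOY), parse_lexicon(ENGLISH_TOY))
print(format_lexicon(mixed))
```

With a single merged file like this, one aligner run can cover utterances that contain both Mandarin and English words.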
Experimental Setup
For Method 1, three sets of experiments were conducted:
Group A: Pre-training on a code-switched Mandarin-English dataset built for ASR tasks, then fine-tuning on 500 high-quality code-switched Mandarin-English audio clips.
Group B: Pre-training on a dataset for Mandarin TTS tasks, followed by fine-tuning on the same 500 code-switched audio clips.
Group C: Pre-training on a dataset for Mandarin TTS tasks, followed by fine-tuning with an English TTS dataset.
For the unified IPA-based approach (Method 2, Group D), the inputs for both Mandarin and English were represented as phonological features based on the IPA. This method was implemented with the IMS-Toucan repository, and the pre-trained model was fine-tuned on only 500 high-quality mixed Mandarin-English sentences.
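To make the idea of IPA-based phonological features concrete, here is a minimal sketch in which each IPA symbol becomes a binary vector of articulatory properties, so that a sound shared by Mandarin and English gets the same input vector regardless of language. The feature list, the tiny symbol table, and the function names are hand-written assumptions for illustration; this is not IMS-Toucan's actual feature set or API. Tones and vowel length are ignored, and the transcriptions are simplified.

```python
FEATURES = (
    "consonant", "vowel", "voiced", "nasal", "plosive", "fricative",
    "bilabial", "alveolar", "velar", "front", "back", "high", "low", "rounded",
)

# Toy table: IPA symbol -> set of active features (illustrative, incomplete).
TOY_IPA = {
    "p": {"consonant", "plosive", "bilabial"},
    "b": {"consonant", "plosive", "bilabial", "voiced"},
    "t": {"consonant", "plosive", "alveolar"},
    "k": {"consonant", "plosive", "velar"},
    "m": {"consonant", "nasal", "bilabial", "voiced"},
    "n": {"consonant", "nasal", "alveolar", "voiced"},
    "s": {"consonant", "fricative", "alveolar"},
    "i": {"vowel", "voiced", "front", "high"},
    "u": {"vowel", "voiced", "back", "high", "rounded"},
    "a": {"vowel", "voiced", "front", "low"},
}


def feature_vector(symbol: str) -> list:
    """Return the 0/1 feature vector for one IPA symbol in the toy table."""
    active = TOY_IPA[symbol]
    return [1 if name in active else 0 for name in FEATURES]


def featurize(ipa: str) -> list:
    """Map a string of single-character IPA symbols to feature vectors."""
    return [feature_vector(symbol) for symbol in ipa]


def hamming(a: list, b: list) -> int:
    """Number of feature positions in which two vectors differ."""
    return sum(x != y for x, y in zip(a, b))


# Mandarin 米 (simplified /mi/) and English "team" (simplified /tim/)
mandarin_mi = featurize("mi")
english_team = featurize("tim")

# The shared sounds /m/ and /i/ are encoded identically in both words.
print(mandarin_mi[0] == english_team[2], mandarin_mi[1] == english_team[1])
# Related sounds are close: /p/ and /b/ differ only in voicing.
print(hamming(feature_vector("p"), feature_vector("b")))
```

Because the model sees features rather than language-specific phoneme IDs, a representation of this kind lets knowledge from pre-training carry over to sounds of either language, which is one plausible reason a small fine-tuning set can be enough.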
Audio Samples
Groups / Sentences
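Tiktok是最近非常热门的一款APP。 (English: "TikTok is a very popular app lately.")
我对这个topic很感兴趣。 (English: "I am very interested in this topic.")
没关系,也许之后能有更好的chance。 (English: "It's okay, maybe there will be a better chance later.")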
Group A
Group B
Group C
Group D
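The test sentences above mix Han characters with Latin-script words, so a text front end has to decide which spans are handled as Mandarin and which as English before phonemization. The sketch below shows one simple way to do that by splitting on script. It is an assumption for illustration, not a step described in this project; the function name and the regular expression are hypothetical, and punctuation and digits are simply dropped.

```python
import re

# One alternative per script: a run of common CJK ideographs, or a run of
# Latin-letter words (optionally joined by a space, apostrophe, or hyphen).
_SPAN = re.compile(
    r"(?P<zh>[\u4e00-\u9fff]+)|(?P<en>[A-Za-z]+(?:[ '-][A-Za-z]+)*)"
)


def split_by_script(sentence: str) -> list:
    """Return (language_tag, text) spans in order of appearance."""
    return [(m.lastgroup, m.group()) for m in _SPAN.finditer(sentence)]


for sentence in (
    "Tiktok是最近非常热门的一款APP。",
    "我对这个topic很感兴趣。",
    "没关系,也许之后能有更好的chance。",
):
    print(split_by_script(sentence))
```

Each span could then be passed to the matching language's grapheme-to-phoneme step, after which the phonemes are rejoined in order.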
Relevance:
Successful development of code-switching TTS systems can facilitate communication across languages, with applications in education, media, and assistive technologies for enhancing accessibility in multilingual societies.