Code-switching speech synthesis for Mandarin-English using FastSpeech2: A unified IPA-based approach

Author: Wang Yinqiu

Overview:

This work compares two methods for synthesizing natural-sounding code-switched Mandarin-English speech with the FastSpeech2 model:

Method 1: Modeling Mandarin and English phonemes directly. This involved forced alignment with the Montreal Forced Aligner and building a mixed Mandarin-English dictionary and acoustic model.
Method 2: Unifying the input format of both languages as phonological features based on the International Phonetic Alphabet (IPA), implemented with the IMS-Toucan GitHub repository.
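
To make the "mixed dictionary" of Method 1 concrete, the following is a minimal sketch of how two pronunciation lexicons might be merged into one. It assumes the plain "word followed by whitespace-separated phones" layout used by Montreal Forced Aligner dictionaries. The language-tag prefix, the function names, and the toy entries are illustrative assumptions, not the procedure actually used in this work.

```python
from collections import defaultdict


def parse_lexicon(lines):
    """Parse 'WORD PHONE PHONE ...' lines into {word: [list of phones, ...]}."""
    lexicon = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, phones = parts[0], parts[1:]
        lexicon[word].append(phones)
    return dict(lexicon)


def merge_lexicons(mandarin, english):
    """Merge two lexicons, prefixing each phone with a language tag.

    The prefix (an assumption of this sketch) keeps identically named
    phones from the two phone sets distinct in the mixed phone inventory.
    """
    mixed = defaultdict(list)
    for tag, lexicon in (("zh", mandarin), ("en", english)):
        for word, pronunciations in lexicon.items():
            for phones in pronunciations:
                mixed[word].append([f"{tag}_{p}" for p in phones])
    return dict(mixed)


# Hypothetical toy entries; a real lexicon would hold thousands of words.
mandarin_lines = ["我 w o3", "对 d ui4"]
english_lines = ["topic T AA1 P IH0 K", "chance CH AE1 N S"]

mixed = merge_lexicons(parse_lexicon(mandarin_lines), parse_lexicon(english_lines))
for word, pronunciations in mixed.items():
    for phones in pronunciations:
        print(word + "\t" + " ".join(phones))
```

Tagging phones by language is only one possible design. It avoids accidental collisions between the two phone sets, at the cost of preventing the acoustic model from sharing phones that are genuinely similar across the two languages.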

Experimental Setup

For Method 1, three sets of experiments were conducted:

Group A: Pre-training on a code-switched Mandarin-English dataset built for ASR tasks, then fine-tuning on 500 high-quality code-switched Mandarin-English audio recordings.
Group B: Pre-training on a dataset for Mandarin TTS tasks, then fine-tuning on the 500 code-switched audio recordings.
Group C: Pre-training on a dataset for Mandarin TTS tasks, then fine-tuning on an English TTS dataset.

For the unified IPA-based approach (Method 2/Group D), the inputs for both Mandarin and English were represented as phonological features based on the IPA. This method was implemented with the IMS-Toucan repository, and the pre-trained model was fine-tuned on only 500 high-quality code-switched Mandarin-English sentences.
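
The idea of representing both languages as IPA-based phonological features can be illustrated with a small sketch. Each phone becomes a binary vector of articulatory properties, so a sound shared by Mandarin and English maps to the same input regardless of language. The feature inventory and the handful of symbols below are a deliberately tiny assumption for illustration. This is not the feature set or API of IMS-Toucan, and Mandarin tone is left out entirely.

```python
# Ordered feature inventory (a simplified assumption for this sketch).
FEATURES = (
    "vowel", "voiced", "bilabial", "alveolar", "plosive",
    "nasal", "fricative", "close", "open", "front", "back",
)

# Active features per IPA symbol, following standard IPA descriptions.
IPA_FEATURES = {
    "p": {"bilabial", "plosive"},
    "t": {"alveolar", "plosive"},
    "s": {"alveolar", "fricative"},
    "m": {"voiced", "bilabial", "nasal"},
    "n": {"voiced", "alveolar", "nasal"},
    "i": {"vowel", "voiced", "close", "front"},
    "a": {"vowel", "voiced", "open", "front"},
    "u": {"vowel", "voiced", "close", "back"},
}


def featurize(phone):
    """Return a binary feature vector for one IPA symbol."""
    active = IPA_FEATURES[phone]
    return [1 if name in active else 0 for name in FEATURES]


def encode(phones):
    """Encode a phone sequence as a list of feature vectors."""
    return [featurize(p) for p in phones]


# A Mandarin syllable (tone ignored) and an English word sharing the phone [m].
mandarin_ma = encode(["m", "a"])
english_me = encode(["m", "i"])

# The shared phone receives the same vector in both languages.
print(mandarin_ma[0] == english_me[0])
```

Because the model sees features rather than language-specific phone identities, phones that occur in only one language still share most of their vector with related phones in the other. This is a plausible reason why a small fine-tuning set could be sufficient, although the source does not state this explicitly.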

Audio Samples

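Groups / Sentences	Tiktok是最近非常热门的一款APP。 ("TikTok is a very popular app lately.")	我对这个topic很感兴趣。 ("I am very interested in this topic.")	没关系，也许之后能有更好的chance。 ("It's okay, maybe there will be a better chance later.")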
Group A
Group B
Group C
Group D
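
The sample sentences above mix Han characters with Latin-script words, so a text front end has to decide which spans go to Mandarin processing and which to English. A minimal sketch of such a split, based purely on script, is given below. The function name and the script-based heuristic are assumptions for illustration and are not taken from this work's pipeline.

```python
import re

# Han characters (basic CJK Unified Ideographs block) form Mandarin spans;
# runs of ASCII letters, optionally joined by a space, apostrophe, or hyphen,
# form English spans. Punctuation and digits are skipped.
SPAN_PATTERN = re.compile(
    r"(?P<zh>[\u4e00-\u9fff]+)|(?P<en>[A-Za-z]+(?:[ '-][A-Za-z]+)*)"
)


def split_by_language(sentence):
    """Split a code-switched sentence into (language, text) spans."""
    spans = []
    for match in SPAN_PATTERN.finditer(sentence):
        # lastgroup is the name of the alternative that matched: 'zh' or 'en'.
        spans.append((match.lastgroup, match.group()))
    return spans


for language, text in split_by_language("我对这个topic很感兴趣。"):
    print(language, text)
```

A script-based split is crude: it cannot handle digits, abbreviations read letter by letter, or pinyin written in Latin script, all of which would need extra rules in practice.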

Relevance:

Code-switching TTS systems can ease communication across languages, with applications in education, media, and assistive technologies that improve accessibility in multilingual societies.