Voice characteristics conversion for HMM-based speech synthesis system

Created: 2022-05-22T21:22:09-05:00

Structure

Voice data is analyzed by mel-cepstral analysis.
A phoneme-level layer models the relationship between phonemes and mel-cepstrum parameters.
Another HMM layer maps the words of a sentence to phonemes.

Voice characteristics are converted using the Maximum A Posteriori (MAP) estimation and Vector Field Smoothing (VFS) algorithms.

MAP estimation starts from the existing model parameters and shifts them incrementally towards the data observed for the new voice.
VFS interpolates parameters for the new voice where no training data is available.
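
Rough sketch of the two steps above, to make the idea concrete. This is not the paper's implementation: it only handles Gaussian mean vectors, it uses the common textbook form of the MAP mean update, and it covers only the interpolation part of VFS. The names `map_update_mean`, `vfs_interpolate`, `tau`, `k` and `smoothing` are my own, and the exponential distance weighting is an assumption.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def map_update_mean(prior_mean: Sequence[float],
                    frames: Sequence[Sequence[float]],
                    tau: float = 10.0) -> Vector:
    """MAP-style mean update: blend the prior mean with the new speaker's frames.

    tau (assumed hyperparameter) is the weight of the prior; with few frames
    the result stays close to prior_mean, with many it approaches the sample mean.
    """
    n = len(frames)
    if n == 0:
        # No data for this state: the prior mean is returned unchanged.
        return list(prior_mean)
    return [
        (tau * prior_mean[d] + sum(frame[d] for frame in frames)) / (tau + n)
        for d in range(len(prior_mean))
    ]


def vfs_interpolate(prior_means: Dict[str, Vector],
                    adapted_means: Dict[str, Vector],
                    k: int = 2,
                    smoothing: float = 1.0) -> Dict[str, Vector]:
    """Estimate means for states that had no adaptation data.

    Each adapted state defines a transfer vector (adapted mean minus prior mean).
    An unadapted state is moved by a distance-weighted average of the transfer
    vectors of its k nearest adapted states, measured between prior means.
    Assumes every key of adapted_means also appears in prior_means.
    """
    transfer = {
        state: [a - p for a, p in zip(adapted_means[state], prior_means[state])]
        for state in adapted_means
    }
    result = {state: list(mean) for state, mean in adapted_means.items()}
    for state, mean in prior_means.items():
        if state in adapted_means:
            continue
        nearest = sorted(transfer, key=lambda s: math.dist(mean, prior_means[s]))[:k]
        weights = [math.exp(-math.dist(mean, prior_means[s]) / smoothing)
                   for s in nearest]
        total = sum(weights)
        if total == 0.0:
            # No usable neighbours: keep the prior mean.
            result[state] = list(mean)
            continue
        shift = [
            sum(w * transfer[s][d] for w, s in zip(weights, nearest)) / total
            for d in range(len(mean))
        ]
        result[state] = [m + v for m, v in zip(mean, shift)]
    return result
```

Intended use: run `map_update_mean` on every state that has frames from the new voice, then pass the original means and the adapted subset to `vfs_interpolate` to fill in the rest.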