In this post we present a model for the singing voice beautifying (SVB) task. SVB aims to improve the intonation and vocal tone of an amateur singer's voice while preserving the content and vocal timbre. It is usually performed by professional sound engineers.
Most existing automatic pitch correction methods focus on intonation but ignore the overall aesthetic quality. The paper therefore introduces the Neural Singing Voice Beautifier (NSVB), which not only corrects the pitch of amateur recordings but also generates high-quality audio with improved vocal tone.
The Model
In NSVB, the authors split the SVB task into two sub-tasks, pitch correction and vocal tone improvement:
- To correct the intonation, they propose Shape-Aware Dynamic Time Warping (SADTW), a novel time-warping approach that synchronizes the amateur recording with the template pitch curve;
- To improve the vocal tone, they propose a latent mapping algorithm, which converts the latent variables of the amateur vocal tone to those of the professional vocal tone. This process is optimized by maximizing the log-likelihood of the converted latent variables.
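To make the alignment idea concrete, here is a minimal sketch of classic dynamic time warping on two short pitch sequences. This is plain DTW, not the paper's SADTW: the shape-aware distance is not reproduced here, and the function name `dtw_path`, the absolute-difference cost, and the toy pitch values are assumptions made for illustration only.

```python
import math


def dtw_path(source, target):
    """Classic DTW between two 1-D sequences.

    Returns (total_cost, path), where path is a list of
    (source_index, target_index) pairs from start to end.
    """
    n, m = len(source), len(target)
    # acc[i][j] = minimal accumulated cost of aligning source[:i] with target[:j]
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed local cost: absolute difference of the raw values.
            d = abs(source[i - 1] - target[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])

    # Backtrack from the end to the start along the cheapest predecessors.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),  # diagonal step
            (acc[i - 1][j], i - 1, j),          # advance source only
            (acc[i][j - 1], i, j - 1),          # advance target only
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return acc[n][m], path


# Toy example: the "amateur" curve holds the first and last notes longer
# than the "template" curve (values are MIDI-like note numbers).
amateur = [60, 60, 62, 64, 64]
template = [60, 62, 64]
cost, path = dtw_path(amateur, template)
print(cost, path)
```

Because the local cost here compares raw pitch values, an off-key recording would be penalized even when its contour matches the template. A shape-aware variant would compare the local shape of the pitch curves instead.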
NSVB overview:
To generate high-quality audio and learn the latent representations of vocal tone, the authors introduce a Conditional Variational AutoEncoder (CVAE) as the mel-spectrogram generator.
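For reference, a conditional VAE is typically trained by maximizing a conditional evidence lower bound. The form below is the standard textbook objective, not necessarily the exact loss used in the paper. The symbols are labels assumed here for illustration: x is the mel-spectrogram, c the conditioning inputs, and z the latent variable that would carry the vocal tone.

```latex
\log p_\theta(x \mid c) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x, c)}\big[\log p_\theta(x \mid z, c)\big]
\;-\; \mathrm{KL}\big(q_\phi(z \mid x, c)\,\|\,p(z \mid c)\big)
```

The first term rewards accurate reconstruction of the mel-spectrogram, and the KL term keeps the encoder's posterior close to the prior, which is what makes the latent space usable for a mapping between vocal tones.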
The CVAE backbone:
Comparison between the behavior of DTW, CTW and SADTW:
Experiments
Experiments were conducted on PopBuTFy, a dataset built for SVB that contains both Mandarin Chinese and English pop songs. To collect PopBuTFy, qualified singers majoring in vocal music were asked to sing a song twice, once with an amateur vocal tone and once with a professional vocal tone. Some of the amateur recordings are sung off-key by one or more semitones for the pitch correction sub-task. This parallel setting ensures that the singer's vocal timbre stays the same during the beautifying process.
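To give a sense of what "off-key by one semitone" means in terms of frequency, here is a small sketch based on twelve-tone equal temperament, where each semitone multiplies the frequency by 2^(1/12). The helper names `shift_semitones` and `semitone_offset` are hypothetical and are not taken from the paper or its code.

```python
import math


def shift_semitones(f0_hz, semitones):
    """Return the frequency obtained by shifting f0_hz by a number of semitones."""
    return f0_hz * 2.0 ** (semitones / 12.0)


def semitone_offset(f0_hz, reference_hz):
    """Return how many semitones f0_hz lies above (positive) or below reference_hz."""
    return 12.0 * math.log2(f0_hz / reference_hz)


a4 = 440.0
sharp = shift_semitones(a4, 1)      # one semitone above A4, about 466.16 Hz
print(round(sharp, 2))
print(round(semitone_offset(sharp, a4), 6))
```

A singer who is one semitone sharp on A4 is therefore about 26 Hz too high, an error of roughly 6% in frequency.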
Conclusions
In this post, we presented the Neural Singing Voice Beautifier, the first generative model for the SVB task, which is based on a CVAE model allowing semi-supervised learning. A robust alignment algorithm (SADTW) is used for pitch correction and a latent mapping algorithm for vocal tone improvement.
To retain the vocal timbre during vocal tone mapping, the authors also propose a new specialized SVB dataset named PopBuTFy, which contains parallel singing recordings in both amateur and professional versions. The experiments conducted on this dataset of Chinese and English songs show that NSVB accomplishes the SVB task.
References
- https://paperswithcode.com/paper/learning-the-beauty-in-songs-neural-singing
- https://neuralsvb.github.io/