Google uses neural net to synthesize female voice

Research at Google is making huge advances in text-to-speech (TTS) technology. Check this out:

From their Twitter post:

“Building on TTS models like ‘Tacotron’ and deep generative models of raw audio like ‘WaveNet’, we introduce ‘Tacotron 2’, a neural network architecture for speech synthesis directly from text.”

How do they do it? From their blog post:

“In a nutshell it works like this: We use a sequence-to-sequence model optimized for TTS to map a sequence of letters to a sequence of features that encode the audio. These features, an 80-dimensional audio spectrogram with frames computed every 12.5 milliseconds, capture not only pronunciation of words, but also various subtleties of human speech, including volume, speed and intonation. Finally these features are converted to a 24 kHz waveform using a WaveNet-like architecture.”
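The numbers in that description fit together neatly: one 80-dimensional frame every 12.5 ms at a 24 kHz output rate means each frame accounts for exactly 300 waveform samples. Here is a toy sketch of the two-stage pipeline that makes the arithmetic concrete; the actual networks are replaced by placeholder functions, and the five-frames-per-character figure is an invented assumption just to give the toy a length:

```python
import numpy as np

SAMPLE_RATE = 24_000    # Hz, per the blog post
FRAME_PERIOD_MS = 12.5  # one spectrogram frame every 12.5 ms
N_MELS = 80             # 80-dimensional audio spectrogram

# 12.5 ms at 24 kHz -> 300 waveform samples per spectrogram frame
SAMPLES_PER_FRAME = int(SAMPLE_RATE * FRAME_PERIOD_MS / 1000)

def text_to_features(text: str) -> np.ndarray:
    """Stand-in for the sequence-to-sequence stage: letters -> (frames, 80)."""
    n_frames = max(1, len(text) * 5)  # invented: ~5 frames per character
    rng = np.random.default_rng(0)
    return rng.standard_normal((n_frames, N_MELS))

def features_to_waveform(features: np.ndarray) -> np.ndarray:
    """Stand-in for the WaveNet-like vocoder: (frames, 80) -> 24 kHz audio."""
    n_frames = features.shape[0]
    # Each frame conditions 300 output samples; here we just emit silence.
    return np.zeros(n_frames * SAMPLES_PER_FRAME, dtype=np.float32)

waveform = features_to_waveform(text_to_features("Hello"))
print(SAMPLES_PER_FRAME)            # 300 samples per 12.5 ms frame
print(len(waveform) / SAMPLE_RATE)  # duration of the toy output in seconds
```

The point is only the interface: a sequence model turns text into a compact spectrogram, and a separate vocoder expands each frame by a fixed factor into raw audio.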

The results are amazing.

Want more? Here’s the full research paper.

The limitations? Some complex words, conveying sentiment, and real-time generation. “Each of these is an interesting research problem on its own,” they conclude.

Listen to more samples.

My take: I’ve used TTS functionality to generate speech for songs and for voice-over, and I love it! But as the quality improves to the point where it becomes indistinguishable from a human voice, I’ll admit I’m not sure what that means for a future in which we can’t tell whether the voice we’re hearing is human or robot.