Speech synthesis technology has come a long way since the robotic-sounding voices of early systems. Developers now have access to many services that can generate remarkably realistic speech from plain text input, making it easier than ever to give AI avatars their own voice.
Of course, every approach to text-to-speech (TTS) generation has its pros and cons. In this article, I want to give a high-level overview of three approaches to consider when deciding how to give your AI a voice.
1) USE PREMADE VOICES
The easiest approach is to use one of the many premade voices available through cloud-based APIs. All of the big tech companies offer their own cloud TTS service, and all of them sound quite natural; however, you are limited to the voices on offer. Providers keep adding more voices, but sometimes none of them quite fits your AI avatar’s personality! These TTS APIs are usually charged per character, often with a free monthly allowance.
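To get a feel for what per-character billing means in practice, here is a minimal cost-estimate sketch in Python. The free allowance and per-million-character rate below are made-up placeholders for illustration, not any provider's actual pricing:

```python
# Rough cost model for a per-character cloud TTS API.
# The rate and free allowance are illustrative placeholders only.

FREE_CHARS_PER_MONTH = 1_000_000   # hypothetical free monthly allowance
PRICE_PER_MILLION_CHARS = 16.00    # hypothetical USD rate for neural voices

def estimate_monthly_cost(chars_synthesized: int) -> float:
    """Estimate the monthly bill after the free allowance is used up."""
    billable = max(0, chars_synthesized - FREE_CHARS_PER_MONTH)
    return billable / 1_000_000 * PRICE_PER_MILLION_CHARS

# e.g. an avatar speaking ~3,000 characters to each of 1,000 daily users
print(estimate_monthly_cost(3_000 * 1_000 * 30))  # 90M chars/month
```

Running numbers like this early on helps you decide whether a pay-per-character service or a self-hosted model (approach 3 below) makes more sense for your usage volume.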
- Nuance Vocalizer – a pioneer in computer speech. When we started our journey designing AI avatars, this was one of the few options available, but in the past few years many other tech companies have released their own services.
2) CREATE A UNIQUE VOICE
To give your AI avatar a voice that perfectly matches its personality, you’ll want to create a custom voice. This involves recording sample audio from a voice actor and using it to train a deep neural network. The resulting “neural voice” is nearly indistinguishable from a natural one, which reduces listening fatigue when users interact with your avatar. The downside of this approach is that you need to collect high-quality audio, transcribe it into text, and use one of the available APIs to train the network on your data. These API calls are generally more expensive than the standard premade voices mentioned above. There are a few different options when going this route.
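The collect-and-transcribe step above can be sketched as a small script that pairs each recorded clip with its transcript in a training manifest. The pipe-delimited layout (similar to the common LJSpeech-style format) and the `build_manifest` helper are my own illustrative choices, not any particular vendor's API:

```python
import csv
from pathlib import Path

def build_manifest(audio_dir: str, transcripts: dict, out_path: str) -> int:
    """Pair each recorded .wav clip with its transcript in a
    pipe-delimited manifest file. Returns the number of usable pairs;
    clips with no matching transcript are skipped."""
    rows = []
    for wav in sorted(Path(audio_dir).glob("*.wav")):
        text = transcripts.get(wav.stem)
        if text:
            rows.append((wav.stem, text))
    with open(out_path, "w", newline="") as f:
        csv.writer(f, delimiter="|").writerows(rows)
    return len(rows)
```

Whatever training service you pick, expect to do some version of this pairing work: the quality of the clip/transcript alignment has a direct impact on the quality of the trained voice.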
Resemble recently released a Unity3D plugin that makes it even easier to give your avatars a unique voice. It lets users clone their voice from just a few minutes of audio, and developers can add content through the GUI within Unity and tweak speech style by applying various emotions to the text.
3) TRAIN YOUR OWN DEEP NEURAL NETWORK MODEL
If you don’t want to rely on per-character API calls, you can go straight to the source and train your own DNN TTS model. This approach is much more involved and only for developers experienced in machine learning, but the good news is that there are quite a few great open-source models readily available for you to use! The services listed above are built on these same underlying algorithms.
Of course, you’ll need access to a decent GPU to train your models, or you can pay for cloud-hosted GPU compute. You’ll also need to collect your audio samples, segment them, transcribe them, and scrub the data, all before training your deep neural network. It is quite a bit of work, but it pays off in the long run once you have your own trained speech synthesis network and no longer pay for API calls.
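The "scrub the data" step usually includes text normalization: the transcript must match what the speaker actually said, so abbreviations and digits get expanded before training. Here is a deliberately tiny sketch of that pass; the abbreviation table and single-digit handling are minimal examples I chose for illustration, not a production normalizer:

```python
import re

# Tiny illustrative tables; a real pipeline covers far more cases
# (full numbers, dates, currency, acronyms, etc.).
_ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister", "st.": "street"}
_DIGITS = ["zero", "one", "two", "three", "four",
           "five", "six", "seven", "eight", "nine"]

def normalize_transcript(text: str) -> str:
    """Lowercase, expand abbreviations and single digits,
    and strip punctuation so the text matches spoken audio."""
    text = text.lower()
    for abbr, full in _ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    text = re.sub(r"\d", lambda m: " " + _DIGITS[int(m.group())] + " ", text)
    text = re.sub(r"[^a-z' ]", "", text)      # drop remaining punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(normalize_transcript("Dr. Smith lives at 4 Main St."))
# → doctor smith lives at four main street
```

Skipping this step is a common source of poor-quality trained voices: if the network sees "4" in the text but hears "four" in the audio, it learns a noisy mapping.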
This article is a very high-level summary of different ways to give your AI avatar a voice! The field is advancing rapidly, with new services offered all the time, so stay tuned for more updates; in fact, it is moving so fast that there are likely already new and improved algorithms not included in this list. If you have a recommendation for something to add to the list above, please let me know in the comments below!