Convert text to voice with compelling, realistic AI voices that enhance your brand while saving time and money.
Take a behind-the-scenes look at how WellSaid Labs creates an AI voice.
From L.A. DJ to an AI-generated voice over: how this DJ cloned his voice.
This is an example of a real estate walkthrough video using our AI voices.
In June 2020, WellSaid became the first text-to-speech voice company to achieve Human Parity. That is, when someone hears a narration produced by an AI voice over, its quality is equivalent to a human’s voice. WellSaid Voice Avatars speak as naturally as the voice actor’s original read style. Our focus on quality and naturalness led us to record-breaking success in creating the most beautiful voices produced by an AI.
We asked participants to listen to a set of randomized recordings, produced by both synthetic voices and human voice actors. For each audio clip, participants were asked, “How natural (i.e. human-sounding) is this recording?” Each clip was then rated on a scale of 1 (Bad: Completely unnatural speech) to 5 (Excellent: Completely natural speech). Human voice actors average around 4.5 on this scale, a level of perceived naturalness also achieved by our synthetic voices.
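To make the scoring concrete, here is a minimal sketch of how ratings on a 1 to 5 scale can be averaged per voice type, an average commonly called a mean opinion score. The ratings below are invented for illustration and are not data from the WellSaid study; the function name is also hypothetical.

```python
from statistics import mean

# Hypothetical listener ratings on the 1-5 naturalness scale (not real study data).
ratings = {
    "human": [5, 4, 5, 4, 5, 4],
    "synthetic": [4, 5, 5, 4, 4, 5],
}


def mean_opinion_score(scores):
    """Average a list of 1-5 ratings, rejecting empty or out-of-range input."""
    if not scores:
        raise ValueError("at least one rating is required")
    if any(score < 1 or score > 5 for score in scores):
        raise ValueError("ratings must be between 1 and 5")
    return mean(scores)


for source, scores in ratings.items():
    print(f"{source}: {mean_opinion_score(scores):.2f}")
# human: 4.50
# synthetic: 4.50
```

With these made-up numbers both groups land on the same average, which is the kind of result a parity claim describes.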
A WellSaid Avatar is the highest quality computer generated voice available today. Your audience won’t be able to tell the difference between a WellSaid Avatar and a person. This is important because we want to empower you to create content that engages your audience and moves people. We care deeply about helping other creators to produce the best content they can. The hyperrealism of our voices aims to help you achieve just that.
Text to voice is a type of technology that reads digital text aloud. With just a click of a button, text to voice converters can translate words on a digital device into human-sounding audio. Artificial intelligence continues to evolve what can be accomplished with speech synthesis, making text-to-speech acceptable (and attractive) for creative productions.
Below are a few ways to ensure text to speech voices sound less machine-like and more life-like.
One reason early text-to-speech sounds robotic is that every word is pronounced exactly the same way each time. When humans talk, they naturally vary how they say words, even the exact same ones. They add inflections, varying tones, and different emphases.
Another quality of human speech is that there are pauses. Humans need air, so they naturally pause to inhale, exhale, swallow, and start again. These pauses create rhythmic, natural-sounding variations. Early text-to-speech forgot this nuance (robots, after all, don’t typically need to pause for oxygen), but today’s text-to-speech sounds much more life-like because it includes these pauses.
In today’s text to voice converters like WellSaid Labs, you can further simulate these pauses by adding commas, dashes, periods, and ellipses, which cue the text to speech voice to take breaks just as a human would. These punctuation marks function more like sheet music for the TTS than like grammar: they instruct the AI voice over to pause, hold, and create natural silences just like humans do.
Another challenge early text to speech voices faced was that the same word can be pronounced differently depending on usage. Take the example of ‘read’. The past tense is pronounced ‘red’ while the present tense is pronounced ‘reed’. The text-to-speech of yore may have missed the difference, but today’s AI text to speech captures the subtleties with ease.
If any words or acronyms might be unclear, you can easily add a phonetic spelling in the editor to ensure the text-to-speech picks up on the nuance. This is just like how you might help a voice actor. For example, instead of typing ‘COO’, you might spell out ‘C-O-O’ so the voice pronounces each letter rather than blending the letters together.
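If you prepare many scripts, the acronym tip above can be applied automatically before pasting text into an editor. The sketch below is a generic pre-processing helper, not a WellSaid feature; the function name and the sample sentence are assumptions made for illustration.

```python
import re


def spell_out_acronyms(script, acronyms):
    """Rewrite each listed acronym letter by letter, e.g. COO becomes C-O-O."""
    for acronym in acronyms:
        spelled = "-".join(acronym)
        # \b limits the match to whole words, so "cooperate" is left alone.
        script = re.sub(rf"\b{re.escape(acronym)}\b", spelled, script)
    return script


print(spell_out_acronyms("Our COO will cooperate with the CEO.", ["COO", "CEO"]))
# Our C-O-O will cooperate with the C-E-O.
```

The match is case-sensitive on purpose, so only the capitalized acronyms you list are changed.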
In many cases, text to voice converters like WellSaid Labs handle long words and numbers even better than human actors. For example, try to read the word ‘antidisestablishmentarianism’ in one go. A text-to-speech voice is able to naturally piece the syllables together, creating a natural-sounding pronunciation that might escape most voice actors without a few practice runs.
Traditional methods to produce a voiceover are expensive. Paying for a voiceover to change a few words every time you need a content update is downright inefficient. It’s simply not cost-effective to create and update content that way. With text to voice, you can pay a flexible fee to convert your content to audio when needed, similar to how you would easily pay to transcribe a podcast episode.
Not only is it time-consuming and expensive to manage a voice production process, but retakes and content updates can slow down your entire schedule. You can’t afford to squeeze yourself back onto an agency’s calendar every time you need a content update, and tracking down a coworker in another department to cover for you isn’t exactly efficient, either. Text-to-speech enables you to quickly upload your content online, then receive an audio version—no calendar scheduling, booking appointments, or renegotiating contracts required.
To write text to voice scripts that are easy to understand, aim to write how you speak. Avoid using overly complicated language or long sentences. Read your content back to yourself to ensure there aren’t any words that trip you up. Sometimes, sentences sound better in your head than they do when spoken aloud. Watch for those sentences and reword them. Ensure that your punctuation is on point, using commas or ellipses strategically to indicate pauses.
The first component to this is ensuring your text to speech voice over matches the context of your content. For example, if you are training nurse practitioners in a scenario-based training module, you may want your AI voice over to evoke a sense of urgency when acting out an emergency. Or, you may want a voice that sounds professional, calm, and respectable, like a physician. You don’t necessarily want the voice of a promotional video narrating a cardiac arrest training, ya know?
In another example, if you are teaching content that is dry in nature, you want to make it as tolerable as possible for your listeners. It may be challenging to communicate this to an employee narrator or voice production company. There is an art to using inflection, cadence, and voice to keep a listener’s attention, and while you love your coworkers, voiceovers may simply not be their strong suit.
That means you need to choose the appropriate voice for your scenario and, if you use a voice actor or employee narrator, you must also include these cues in your scripts.
Even if you’re on board with text to voice, the internet’s a big place. It can feel overwhelming choosing the right text to voice converter. Below, we address a few questions to help simplify your process when choosing the right text-to-speech partner. Grab a notepad, and consider each one.
Some online text to voice tools require using technical mark-up languages like SSML, which is like the HTML of text-to-speech. But when was the last time you drafted a great story in SSML? Most creatives want to (understandably) be able to use more natural language grammar, such as ellipses or commas to indicate pauses, or quotation marks to indicate emphasis. You know, the same grammar you use in your day-to-day life.
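For a sense of what that markup looks like, the sketch below turns ordinary punctuation into a simplified SSML fragment using the standard `<speak>` and `<break>` elements. The pause lengths and the function name are assumptions chosen for illustration, and real engines (WellSaid included) may interpret punctuation and SSML differently.

```python
from xml.sax.saxutils import escape

# Hypothetical pause lengths; actual engines choose their own timing.
ELLIPSIS_PAUSE = "700ms"
COMMA_PAUSE = "250ms"


def punctuation_to_ssml(text):
    """Convert ellipses and commas in plain text into SSML <break> tags."""
    # Escape &, < and > first so the script cannot break the markup.
    ssml = escape(text)
    # Handle ellipses before commas; the ellipsis is replaced by a long pause.
    ssml = ssml.replace("...", f'<break time="{ELLIPSIS_PAUSE}"/>')
    # Keep the comma and add a short pause after it.
    ssml = ssml.replace(",", f',<break time="{COMMA_PAUSE}"/>')
    return f"<speak>{ssml}</speak>"


print(punctuation_to_ssml("Wait... are you sure, Sam?"))
# <speak>Wait<break time="700ms"/> are you sure,<break time="250ms"/> Sam?</speak>
```

Comparing the input sentence with the output shows why many writers prefer to stay with plain punctuation.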
When choosing a text to voice solution, you want to ensure that you select a system that enables you to phonetically enter how you want the word to sound—then remember it for future productions.
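As a rough picture of what "remember it for future productions" can mean, here is a minimal sketch of a pronunciation library saved to a JSON file and reapplied to later scripts. The class, its methods, and the sample respelling are all hypothetical; this is not how any particular platform stores pronunciations.

```python
import json
import re
import tempfile
from pathlib import Path


class PronunciationLibrary:
    """Keeps word -> phonetic respelling pairs in a JSON file between sessions."""

    def __init__(self, path):
        self.path = Path(path)
        # Load saved entries if the file already exists.
        self.entries = json.loads(self.path.read_text()) if self.path.exists() else {}

    def add(self, word, respelling):
        """Store a respelling and save the library to disk right away."""
        self.entries[word] = respelling
        self.path.write_text(json.dumps(self.entries, indent=2))

    def apply(self, script):
        """Replace every stored word (whole-word match) with its respelling."""
        for word, respelling in self.entries.items():
            pattern = rf"\b{re.escape(word)}\b"
            # A function replacement inserts the respelling literally.
            script = re.sub(pattern, lambda match, r=respelling: r, script)
        return script


with tempfile.TemporaryDirectory() as folder:
    library_file = Path(folder) / "pronunciations.json"
    PronunciationLibrary(library_file).add("SQL", "sequel")
    # A later production reloads the same file and reuses the entry.
    print(PronunciationLibrary(library_file).apply("Learn SQL today."))
# Learn sequel today.
```

The second `PronunciationLibrary` object never calls `add`, which shows the respelling being picked up from the saved file.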
You can find a lot of amateur-built, low quality, and unsafe text-to-speech tools online, often marketed at a sweet deal. But as an enterprise company, you want to make sure that the text-to-speech platform is built for corporate content. Doing so will help ensure you have the safeguards, technology, contracts, and customer support you would expect for an enterprise tool.
When considering which text-to-speech platform to use, make sure to read the fine print. Some platforms or agencies retain the rights to the content they help create. You want to make sure that a service doesn’t claim ownership of your intellectual property. The content you render should be wholly yours—your files, your scripts, and the voiceover itself.
Depending on the size of your organization or team, you may have several people involved in the production process, from writing to editing to recording to designing and beyond. Traditional recording methods aren’t scalable, in that only one person can record at once with the same voice. But with text to voice software, you can have multiple people using your same brand avatar(s) without having to pay for additional studio time.
With AI text to voice, you can have an entire production team working on multiple scenes. You can scale your team to race ahead of deadlines, break projects into doable pieces, and have your voice avatars ready and at your service when you—or your entire team—are ready. Imagine the whole process working together like a symphony. An under-budget, ahead-of-schedule symphony.