Audio by Owen C. using WellSaid Labs
This post is from the WellSaid Research team, exploring breakthroughs and thought leadership within audio foundation model technology.
No one sums up the English language better than David Burge: “Yes, English can be weird. It can be understood through tough thorough thought, though”.
Let’s be honest—English is weird. Sometimes it seems as if there are more exceptions than there are rules. And even when these rules begin to feel like second nature, you can still be handed a new word to say aloud and have no clue where to begin.
When it comes to training a TTS model, even if we could provide machines with a list of every rule and exception in the English language, it still would not encompass the diversity of pronunciations found in natural conversation.
Think about the classic example “bagel” – is it /ˈbeɪɡl̩/, /ˈbɛɡl̩/, or /ˈbæɡl̩/? Add in another word like “bag” in the sample sentence “I bought a bag of bagels” and the variations of what is possible for a single speaker grow.
At WellSaid, it is important for us to capture the authentic accent and delivery that each individual voice actor brings to their performance when they record for us, diversity of pronunciation included. We also strive to provide correct pronunciations for our users for known and unknown words, such as branded words, within the realm of what a single speaker would authentically say. Instead of creating strict rules to guide our TTS model’s performance, we sought to engineer a flexible yet simple way for our users to guide pronunciation.
This is what we have achieved with Respellings.
The Complexities of Pronunciation: Graphemes and Phonemes
In order to discuss pronunciation, there are two important words to have in our vocabulary: graphemes and phonemes. Graphemes are how we write words and phonemes are how words sound when spoken. In other words, graphemes are the written representation of phonemes.
While this may seem straightforward, in the English language there are 229 graphemes and only 44 phonemes, meaning that a single sound can be written multiple ways. For example, to write the phoneme “AY” (or /eɪ/) you can use any of the following graphemes: a, ai, eigh, aigh, ay, er, et, ei, au, ea, ey. Likewise, to write the phoneme “IY” (or /aɪ/) you can use any of the following graphemes: i, y, igh, ie, uy, ye, ai, is, eigh. Conversely, one grapheme can also correspond to multiple phonemes. Comparing the graphemes attributed to the phonemes “AY” and “IY”, you may notice that, among others, “eigh” is shared. That is why the words “eight” and “height” have different vowel sounds despite sharing the grapheme “eigh.”
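The many-to-many relationship can be made concrete with a small Python sketch. The two grapheme lists are copied from the paragraph above; the variable and function names are illustrative only and are not part of any WellSaid tooling.

```python
from collections import defaultdict

# Grapheme lists copied from the text; the labels "AY" and "IY" follow the post.
PHONEME_TO_GRAPHEMES = {
    "AY": ["a", "ai", "eigh", "aigh", "ay", "er", "et", "ei", "au", "ea", "ey"],
    "IY": ["i", "y", "igh", "ie", "uy", "ye", "ai", "is", "eigh"],
}


def invert(phoneme_to_graphemes):
    """Build the reverse lookup: grapheme -> sorted list of phonemes it can spell."""
    grapheme_to_phonemes = defaultdict(set)
    for phoneme, graphemes in phoneme_to_graphemes.items():
        for grapheme in graphemes:
            grapheme_to_phonemes[grapheme].add(phoneme)
    return {g: sorted(p) for g, p in grapheme_to_phonemes.items()}


GRAPHEME_TO_PHONEMES = invert(PHONEME_TO_GRAPHEMES)

# Graphemes that can spell more than one phoneme are the ambiguous ones.
ambiguous = sorted(g for g, p in GRAPHEME_TO_PHONEMES.items() if len(p) > 1)
```

With only these two phonemes loaded, `ambiguous` holds “ai” and “eigh”, the two graphemes that appear in both lists.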
One grapheme corresponding to multiple phonemes also gives rise to heteronyms: words with identical spellings but different pronunciations and meanings depending on the context. One example is the word “read,” which sounds different depending on its grammatical function: it is pronounced one way in “I read that book yesterday” and another way in “I like to read.”
The complexity only deepens from here. Not only are single phonemes represented by a variety of graphemes (and vice versa), but as we begin to work at the word level, we also find that graphemes and phonemes don’t map together in a straightforward manner. Take the word “apple,” for instance. While this may be a common word for many, when broken down there are a few things you may not have previously noticed. Although there are two “p”s, we only pronounce one; the “e” at the end of the word is silent; and between the “p” and the “l” there is a vowel sound, /ə/ or schwa, that is not written.
Add to this the fact of regional and preferential variations for word pronunciations, and it’s quickly apparent that there is a tenuous link between a word’s spelling and its many varied, all correct, vocalizations. While this is the beauty of spoken language, it also poses a unique challenge for training a TTS model. We will always have different ways of pronouncing words, names, places, and so on. We will also have many words, such as industry-specific terms, brand names, or proper nouns, that the TTS will never have been exposed to.
All these words are of exceeding importance to our users, so the question at hand becomes: How can we standardize pronunciation so that, without hearing a word, there is agreement on its intended pronunciation? What do we do with non-standard words and their pronunciations?
Standardizing Sounds with the International Phonetic Alphabet (IPA)
This is where the International Phonetic Alphabet (IPA) steps in. IPA was developed in the late 19th century as a way to standardize the written representation of all of the sounds we hear, not only in English, but across all languages. More specifically, IPA is a set of symbols in which each symbol maps to a single unique sound. We have used IPA symbols already in this blog post (see the discussion of the word “bagel” above) to refer to very specific sounds!
Through one-to-one mapping, IPA can maximize consistency and minimize ambiguity in spoken language. For example, the word “apple” becomes /ˈæ.pəl/ addressing the aforementioned issues: there is only one “p”, there is no “e” at the end, and the vowel sound between the “p” and “l” is represented by /ə/. As a result, even though a word’s spelling does not equate to a single pronunciation, there is only one way to pronounce its IPA representation.
Respelling Systems as an Approach to Phonetic Transcription
Although the standardization resulting from IPA may initially seem like the best solution for disambiguating sounds and guiding TTS pronunciation, it has one big caveat: IPA consists of 107 letters (e.g. a, b, ð); 52 diacritics (such as umlauts); and four extra-lexical prosodic marks covering all languages, beyond just English.
IPA is a great solution for those who need definitive pronunciations; however, most people do not require this level of precision. While it is a system that is useful in teaching a deep learning model, it is both too precise and too cumbersome for users to apply.
Another industry-standard option for allowing users to customize and shape pronunciation is using a respelling system. These systems rely on a much smaller pool of alphabetic symbols and diacritics. Although these systems may not be as rigorous as IPA, they tend to have a much smaller learning curve, making them more suitable for everyday use. In fact, there are respelling systems that were created even before IPA, with some dating back to the mid-18th century.
Moreover, you have most likely already interacted with a modern respelling system. For instance, when you search Google for the pronunciation of a word, the result provided is no longer in an IPA format. Instead, Google provides a more readable set of symbols based on their own internal respelling system.
Beyond simplifying symbology, respelling systems have also created new ways to eliminate the need for diacritics altogether. As an example, Wikipedia created a respelling key boiling down IPA to just what is relevant to English and eliminated all need for diacritics. This includes using uppercase to indicate stress on syllables as opposed to the raised vertical line [ˈ], indicating primary stress, and lowered vertical line [ˌ], indicating secondary stress, used in IPA.
Another example of a respelling system is the ARPAbet, developed by the Advanced Research Projects Agency, which similarly maps IPA symbology to simpler transcription codes. Beyond some differences in symbology between ARPAbet and Wikipedia’s respellings, ARPAbet uses the numbers 0, 1, 2, and 3 to indicate stress on syllables. Interestingly, the ARPAbet was used to create the CMU Pronouncing Dictionary, which transcribes over 134,000 North American English words to their ARPAbet counterparts.
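To make the stress digits concrete, here is a minimal Python sketch that separates each ARPAbet symbol from its trailing digit. The transcription of “apple” is written in the CMU dictionary's style and is included for illustration only; the function names are hypothetical.

```python
from typing import List, Optional, Tuple


def split_stress(pronunciation: str) -> List[Tuple[str, Optional[int]]]:
    """Split each ARPAbet symbol into (phoneme, stress); consonants get None."""
    parsed = []
    for symbol in pronunciation.split():
        if symbol[-1].isdigit():
            # Vowels carry a trailing stress digit, e.g. "AE1".
            parsed.append((symbol[:-1], int(symbol[-1])))
        else:
            parsed.append((symbol, None))
    return parsed


def stress_pattern(pronunciation: str) -> List[int]:
    """Return only the stress digits, one per vowel (that is, one per syllable)."""
    return [stress for _, stress in split_stress(pronunciation) if stress is not None]


# Illustrative ARPAbet transcription of "apple" (assumed, CMU-dictionary style).
apple = "AE1 P AH0 L"
```

Calling `stress_pattern(apple)` yields one digit per syllable, which shows how the stress information is interleaved with the phoneme codes rather than written as a separate mark.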
The introduction of respelling systems has been incredibly impactful in the English pronunciation space, making transcriptions far more intuitive and accessible, but it is important to note that they were not necessarily created with both user-friendliness and TTS in mind. Take the Wikipedia respelling system as an example. Though it maps almost all IPA to more understandable symbology, there are still 68 symbols used.
Additionally, some of the symbols are redundant, such as “our”, which is just a combination of “ow” and “ər”. The ARPAbet narrows down to 50 symbols but creates an additional layer of complexity through the numbers appended to vowels to represent syllable stress. All in all, there was still some work to be done to create a user-friendly system that could also be learned by a TTS model.
Our solution: Respellings
At WellSaid we wanted to take the guesswork out of respelling words without the complexities that come with IPA, similar to the respelling systems introduced above. With a need for a combination of simplicity for users and rigor for a TTS model, we have engineered our very own system: Respellings.
In fact, we are the very first in the TTS industry to do this! Despite the fact that IPA is complicated, many TTS providers still offer it as a way to adjust pronunciation. And, while many IPA pronunciations can be found through a quick search, what happens when you have a unique word that does not have a readily available pronunciation to copy and paste? Without a keen understanding of each IPA character and the precise sound it’s mapped to, crafting a pronunciation for a new word is a daunting task—does that “u” vowel sound like ʊ or ɤ? This process can be frustrating and at times unfruitful.
We strove to create our own respelling system that gave users a simpler approach to customized pronunciation. In this first iteration of our Respelling system we have begun with US English graphemic-phonemic pairings and have boiled down the phonemes that we hear to 12 vowel sounds, 6 vowel combination sounds, and 27 consonant sounds. A chart mapping these initial Respelling sound pairings can be found here.
Respellings can be useful in many different cases:
- When a word does not have an intuitive pronunciation. This includes the word “charcuterie” in the sentence, “Let’s make a charcuterie board!”
Without any modification this sounds like:
With the creative spelling “sharkooteree” you can hear that the pronunciation is closer, but the emphasis is on the incorrect syllable:
When trying to fix this with the creative spelling “sharKOOteree”, the pronunciation is thrown completely off; the AI wants to treat the capital letters like an initialism.
With the respelling “SHAR-KOO-dur-ee”, the pronunciation becomes:
- Words that require cultural understanding such as “Airbnb” in the sentence, “I’ve just booked an Airbnb.”
Without any modifications this sounds like:
Respelling “Airbnb” to “ERR-bee-ehn-BEE” in the sentence now sounds like:
- Specialized words such as legal and medical terminology. An example would be the new drug “Sotagliflozin” in the sentence, “Sotagliflozin was approved by the FDA on May 26th, 2023.”
Without any modifications, the AI does not know how to handle this unknown word:
Respelling “Sotagliflozin” to “soh-tuh-glih-FLOH-zihn” yields the correct pronunciation:
- Proper nouns such as names and places. Here’s an example of the actor Timothée Chalamet in the sentence, “Timothée Chalamet was nominated for Best Actor at the 2018 Academy Awards.”
Without any modifications this sounds like:
Respelling “Timothée” to “TIH-muh-thee” and “Chalamet” to “SHA-luh-may” changes the pronunciation to:
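The examples above share a common shape: syllables joined by hyphens, each written entirely in upper-case or entirely in lower-case. The sketch below checks that shape and swaps words in a script for their respellings. Both the pattern and the helper names are assumptions inferred from the examples, and this is plain text substitution rather than the syntax or API of the WellSaid product.

```python
import re

# Assumed shape: hyphen-separated syllables, each entirely upper- or lower-case.
RESPELLING_PATTERN = re.compile(r"(?:[A-Z]+|[a-z]+)(?:-(?:[A-Z]+|[a-z]+))*")


def is_well_formed(respelling: str) -> bool:
    """True when the whole string matches the assumed respelling shape."""
    return RESPELLING_PATTERN.fullmatch(respelling) is not None


def apply_respellings(sentence: str, respellings: dict) -> str:
    """Replace each whole-word occurrence with its respelling."""
    for word, respelling in respellings.items():
        if not is_well_formed(respelling):
            raise ValueError(f"not a well-formed respelling: {respelling!r}")
        # \b keeps the match to whole words; the lambda avoids escape handling.
        sentence = re.sub(
            rf"\b{re.escape(word)}\b", lambda _m, r=respelling: r, sentence
        )
    return sentence


script = apply_respellings(
    "I've just booked an Airbnb.", {"Airbnb": "ERR-bee-ehn-BEE"}
)
```

Note that a mixed-case attempt such as “sharKOOteree” fails the shape check, which mirrors the point above that capital letters inside an ordinary spelling are not a reliable way to mark stress.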
Of course, by reducing the complexity that comes with IPA and other respelling systems, we made some tradeoffs. In an effort to eliminate the need for diacritics, one area of compromise was the representation of lexical stress. Words in the English language generally have up to two levels of stress, primary and secondary, indicating which syllables receive emphasis when spoken.
Similar to the aforementioned respelling system created by Wikipedia, we chose to use upper-case to indicate a stressed syllable. Because we switch from a system with three levels of stress (none, primary, and secondary) to a binary system (lower-case and upper-case), naturally we lose some information.
The loss of information is especially the case for words where primary stress precedes secondary stress such as the word “algorithm”. The IPA for “algorithm” is /ˈæl.ɡəˌrɪ.ðəm/ which would be respelled as “AL-guh-rih-dhuhm” rather than “AL-guh-RIH-dhuhm”, as the latter would suggest an IPA of /ˌæl.ɡəˈrɪ.ðəm/ where the third syllable is over-stressed and the first syllable is under-stressed.
Of course, we lose some information about the secondary stress of the third syllable, but this is necessary to maintain the cadence of the word and maintain the first syllable as the most stressed syllable. This is true of all cases where primary stress precedes secondary stress because for these words we want to ensure that the stress on syllables decreases after the syllable with primary stress occurs.
On the other hand, if secondary stress comes first in a word such as “academic”, which in IPA is /ˌæ.kəˈdɛ.mɪk/, the respelling would be “A-kuh-DEH-mihk”, where now both the primary and secondary stress are represented in the respelling. In addition to eliminating diacritics used to represent stress, we also eliminated tonal and inflection symbology since it is not commonly found in US English.
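The stress rule described above can be sketched in a few lines of Python. The sketch reads the stress marks from a syllabified IPA string and then decides the case of each respelled syllable: primary stress is always upper-case, and secondary stress is upper-case only when it comes before the primary stress. The respelled syllables are supplied by hand from the examples in the text; the function names, and the fallback for a word with no primary stress, are assumptions.

```python
import re

PRIMARY, SECONDARY = "ˈ", "ˌ"


def ipa_stress_levels(ipa: str) -> list:
    """Return one level per syllable: 1 primary, 2 secondary, 0 unstressed."""
    levels = []
    # Syllables are separated by "."; a stress mark also opens a new syllable.
    for mark, _body in re.findall(r"([ˈˌ]?)([^.ˈˌ]+)", ipa.strip("/")):
        levels.append(1 if mark == PRIMARY else 2 if mark == SECONDARY else 0)
    return levels


def apply_stress_case(syllables: list, levels: list) -> str:
    """Upper-case primary stress, and secondary stress only if it precedes primary."""
    # Assumption: with no primary stress, every secondary stress is kept.
    primary_index = levels.index(1) if 1 in levels else len(levels)
    cased = []
    for index, (syllable, level) in enumerate(zip(syllables, levels)):
        stressed = level == 1 or (level == 2 and index < primary_index)
        cased.append(syllable.upper() if stressed else syllable.lower())
    return "-".join(cased)


algorithm = apply_stress_case(
    ["al", "guh", "rih", "dhuhm"], ipa_stress_levels("/ˈæl.ɡəˌrɪ.ðəm/")
)
academic = apply_stress_case(
    ["a", "kuh", "deh", "mihk"], ipa_stress_levels("/ˌæ.kəˈdɛ.mɪk/")
)
```

Under these assumptions, `algorithm` comes out as “AL-guh-rih-dhuhm” and `academic` as “A-kuh-DEH-mihk”, matching the two respellings discussed above.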
In order to stick to alphabetic symbology, we also had to forgo representing the glottal stop, a compromise made in many other respelling systems as well. The glottal stop is what is heard in the phrase “uh-oh” after “uh” and is represented by the symbol ⟨ʔ⟩ in IPA. As we have focused our efforts on standard US English, which does not make formal use of the glottal stop, we have chosen to delay inclusion of the glottal stop for a future version of our Respelling system.
As we consider what respellings will look like for other dialects of English, such as British Englishes, Australian Englishes, and African American Vernacular English, among others, the glottal stop as well as other symbologies will be incorporated. Other compromises we made in this first Respelling version include approximating /ə/ to UH, as well as excluding nasal sounds.
Oxford Languages Partnership: Respellings v2
Our first version of Respellings used the CMU Pronouncing Dictionary as a source of word transcriptions. While this dictionary is a great tool and provided WellSaid with the means to get started, it also proved to have its limitations, including a restrictive number of transcriptions available and no additional information about each word beyond its transcription.
To significantly increase our dataset size and gain more context surrounding each word, we have partnered with Oxford Languages, the world’s leading dictionary publisher. Oxford provides us with their most up-to-date syllabified IPA transcriptions of words. Using these IPA transcriptions, WellSaid maps words in scripts to our respelling system.
From there, we train our model on both the regular spellings of words and their respelled counterparts. Using this approach, our model learns how both graphemes and respellings translate to phonemes. The result for our users: the ability to use Respellings to get expected phonemes.
Oxford’s dictionary also provides us with information about some of the more subtle differences we hear in pronunciation. Each word in the dictionary is ascribed to a pronunciation group identifier. If two transcriptions fall into the same pronunciation group, the difference in pronunciation is subtle and does not affect the meaning of the word. For example, the word barista can be “bah-REE-stuh” (/bɑˈri.stə/) or “buh-RIH-stuh” (/bəˈrɪ.stə/), both of which would fall into the same group.
When a word has a pronunciation group with multiple entries, without manually mapping audio to pronunciation, we have no way of knowing which pronunciation a voice actor used. Did they say “bah-REE-stuh” or “buh-RIH-stuh”? While there exists an authoritative pronunciation within each pronunciation group, we do not want to overwrite the pronunciation used by the voice actor.
For this reason, we stick to training only on words that have no ambiguity in pronunciation, ensuring the model learns the correct mapping between graphemes and phonemes and leaving room for flexibility surrounding words that have variation in pronunciation.
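As a rough illustration of this filtering step, the sketch below keeps only words that have a single transcription. The record layout (word, pronunciation group identifier, IPA), the group identifiers, and the function name are all hypothetical; the source does not describe the real Oxford data format or the exact ambiguity test, so treating "more than one distinct transcription" as ambiguous is an assumption. The transcriptions are the ones quoted in this post.

```python
from collections import defaultdict

# Hypothetical records: (word, pronunciation_group_id, IPA transcription).
ENTRIES = [
    ("apple", "apple-1", "ˈæ.pəl"),
    ("barista", "barista-1", "bɑˈri.stə"),
    ("barista", "barista-1", "bəˈrɪ.stə"),
]


def unambiguous_words(entries):
    """Keep words that have exactly one distinct transcription."""
    transcriptions = defaultdict(set)
    for word, _group_id, ipa in entries:
        transcriptions[word].add(ipa)
    return {
        word: next(iter(ipas))
        for word, ipas in transcriptions.items()
        if len(ipas) == 1
    }


training_words = unambiguous_words(ENTRIES)
```

Here “barista” is dropped because its pronunciation group has two entries, while “apple” is kept, so the voice actor's own choice for “barista” is never overwritten by a dictionary default.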
All in all, we are excited to have Respellings available as a tool to allow users to guide and handcraft pronunciations in their work. In the realm of sonic branding, the power of Respellings reveals itself as an avenue through which creators can authentically replicate brand voices. Imagine enabling your brand to communicate with a unique, consistent voice, tailor-made by the nuanced control of pronunciation using Respellings.
This not only improves the uniformity of verbal communication across different platforms but also ensures that the sonic signature remains true to your brand identity, giving your audience a more resonant and memorable auditory experience. In essence, Respellings empowers brands and creators alike to guide pronunciation with precision, ensuring every syllable mirrors the brand’s character and ethos and crafting a sonic identity that is unmistakably yours.
While this is only the first iteration of Respellings, based solely on US English graphemic-phonemic relations, we are dedicated to researching how Respellings can be applied to other dialects of English, as well as other languages, with every new iteration.
Moreover, as we continue to mature the system, we aim to include sounds that were not included in this iteration, such as the glottal stop, which we recognize is an important sound found not only in several dialects of English but also in other languages.
That being said, at WellSaid, Respellings is just one step in our effort to capture the intricacies that make speech unique.