Audio by Tilda C. using WellSaid Labs
This post is from the WellSaid Research team, exploring breakthroughs and thought leadership within audio foundation model technology.
The prospect of machines mimicking human speech so well that our minds are unable to discern the difference seems straight out of a sci-fi novel. But for us at WellSaid Labs, it's just another day at the office.
Over the past few years, we’ve grown in both our size and our capabilities. Since our last update, where we celebrated achieving human parity for naturalness based on Mean Opinion Score (MOS), we have been immersed in a whirlwind of progress and pioneering advancements. In this post, we briefly outline the process we follow at WellSaid to obtain MOS scores. More importantly, we take a step back to engage with recent scholarship on MOS: What are its shortcomings? What information does MOS provide, and where does it fall short?
Finally, we explore the nuances of evaluating synthetic speech quality, the challenges associated with quantifying "naturalness," and how our overarching commitment to human-like authenticity continues to shape our trailblazing work.
WellSaid Labs’ MOS Testing Protocols
At WellSaid, our MOS testing follows these protocols:
- We send a set of randomized recordings, created by both synthetic and human voices, to a third-party data labeler
- Native listeners are sourced for each regional voice. For instance, our US: Mountain Rural voice is evaluated by listeners from the US: Mountain Rural region
- For each file, participants are instructed to first listen to the clip and then answer the question, “How natural (i.e. human-sounding) is this recording?”
- The participant is presented with a 5-point Likert scale in half-point increments, using the following labels and descriptors:
5 - Excellent: Completely natural
4.5 - Very Good: Mostly natural with a minor issue
4 - Good: Mostly natural with multiple minor issues
3.5 - Satisfactory: Generally natural with minor issues
3 - Fair: Equally natural and unnatural
2.5 - Poor: Starts fairly natural
2 - Poor: Mostly unnatural
1.5 - Very Poor: Mostly unrecognizable speech
1 - Bad: Completely unnatural
- The participant selects a numerical value based on their interpretation of the performance
- WellSaid Labs compares the results at the actor level, so that each human voice’s scores are compared directly to the corresponding synthetic voice’s scores
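To make that last step concrete, here is a minimal sketch, in Python, of how an actor-level comparison could be computed. The data layout, names, and scores are hypothetical illustrations rather than our production tooling; the sketch assumes each rating row records the actor, whether the clip was human or synthetic, and the listener's Likert score.

```python
from statistics import mean

# Hypothetical ratings: one row per (clip, listener) judgment on the
# half-point Likert scale described above.
ratings = [
    # (actor, source, score)
    ("Tilda C.", "human", 4.5),
    ("Tilda C.", "human", 5.0),
    ("Tilda C.", "synthetic", 4.5),
    ("Tilda C.", "synthetic", 4.0),
    # ... more rows per actor ...
]

def mos_by_actor(rows):
    """Aggregate raw scores into a Mean Opinion Score per (actor, source)."""
    buckets = {}
    for actor, source, score in rows:
        buckets.setdefault((actor, source), []).append(score)
    return {key: mean(scores) for key, scores in buckets.items()}

scores = mos_by_actor(ratings)
for actor in {actor for actor, _ in scores}:
    human, synth = scores.get((actor, "human")), scores.get((actor, "synthetic"))
    if human is not None and synth is not None:
        # Parity check: how close does the synthetic voice come to its human counterpart?
        print(f"{actor}: human {human:.2f}, synthetic {synth:.2f}, gap {human - synth:+.2f}")
```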
Up until 2020, we used Amazon MTurk to evaluate several models. Since 2020, we have conducted our evaluations through a third-party labeling company that specializes in data quality. We felt we had outgrown MTurk and needed a dedicated, high-quality labeling partner to give us deeper insight into our TTS quality. This shift also let us expand the survey beyond the 5-point Likert scale and require listeners to identify specific issues that impeded the experience of natural speech, such as mispronunciation, gibberish, synthetic sounds, and background noise.
Since achieving human parity in June 2020, we have continued to receive human-parity MOS scores using this same methodology.
The Challenges of MOS as a Sole Metric for Synthetic Voice Quality
Within the TTS industry, MOS has allowed synthetic speech researchers and companies to provide a single metric to showcase improvements in their work. While the technology continues to make huge strides, the process for evaluation has not. MOS for naturalness is still ubiquitous within the industry. This shows that the primary marker of synthetic voice quality continues to be no more than a general feel of humanness for a speech utterance, regardless of any other rhetorical contexts that may affect how that utterance is received out in the wild.
In one sense, this indicates that we have collectively accomplished something significant (a synthetic utterance is indistinguishable from a human’s!); in another, it shows that we have far to go: the only marker of quality we can measure and attempt to compare is still unsituated from any kind of authentic context.
If we were to compare, in-context, (a) a human reacting to a given prompt to (b) a prediction provided by a synthetic voice model reacting to the same prompt, parity would surely be a higher bar to reach.
A growing number of research papers have pointed toward several shortcomings of MOS as the gold standard for synthetic speech quality, as well as a general lack of rigor within MOS for naturalness as it stands [1, 2, 3, 4]. For one, MOS results are not truly comparable between models, as each MOS evaluation relies on its own very different set of conditions [1]. One evaluation may have asked listeners to rate naturalness, while another asked them to rate quality. One may have used a 5-point Likert scale with each numerical value labeled, while another may have used an unlabeled 6-point scale. The number of testing variables, and the lack of transparency around testing conditions when publishing MOS results, diminish the comparability of scores.
The Complexities of MOS, Correctness, and Context
As synthetic speech researchers and providers, and perhaps more broadly, as English language speakers and listeners, we have no universally agreed-upon definition of “utterance correctness.” There are too many possible performances of correctness. This can make evaluation exceedingly challenging, and it is the primary reason why the Mean Opinion Score remains our go-to metric. Relying on human evaluation does not provide a wholly satisfactory answer to the problem of evaluating speech, however. While MOS captures a range of opinions about the naturalness (or quality, or correctness) of an utterance, it simultaneously captures a range of opinions about what constitutes “naturalness” in the first place. Thus, naturalness (or quality, or correctness) is being defined at the very moment it is evaluated.
Both “speech naturalness” and “speech quality” are complex ideas, with countless interpretations of the parameters that constitute either one. Within each potential parameter, there are countless additional ways of deeming correctness. To illustrate this problem in a concrete way, consider the following audio examples.
For instance, one component of naturalness for speech is the intonation pattern for a sentence ending in a question mark: Can you think of any other questions?
This sample sentence can follow any of these intonation patterns and be considered natural:
Example 1
Example 2
Example 3
Example 4
However, depending on the context of the question and the intended tone of the piece, only a few of these samples would earn a high score for correctness.
For instance, let’s say the question is part of an eLearning module. The tone of this slide is to be warm and inviting, and the final question should facilitate student engagement. Under those parameters, Example 4 would undoubtedly receive a higher percentage of incorrect scores than Example 3.
Thus, listeners who are evaluating any of these audio files within a broadly presented context – “How natural (i.e. human-sounding) is this recording?” – are each drawing on their own concepts of what naturalness entails, its relationship to correctness, and, lastly, how this specific performance works within those highly individualistic and varied evaluation parameters.
And, while some researchers have suggested shifting entirely away from a broad, sweeping metric like MOS in favor of a content-specific evaluation tool, the scope for building and carrying out this proposal is significant [2]. To measure contextual appropriateness as the primary driver of speech quality, one would first need to identify the specific context and content within which the synthetic voice will perform before it can be evaluated. We again run into the same problem as above, where quality is being evaluated at the same time as the qualities that define it. This becomes less problematic when a synthetic voice is intended for only a single use case. For a TTS provider looking to evaluate naturalness at scale across a variety of use cases, as WellSaid Labs’ voices are, this testing method does not suit the provider’s needs. In such cases, voice quality is measured alongside the testers’ correct selection of content, the delivery style of the original voice talent, and the users’ ability to “express an informed opinion about their expectation of a TTS voice” and make evaluations accordingly [2].
Still other suggestions for improving the state of the industry include:
- Transparency in what the MOS testing entailed (providing a copy of listener instructions, scale labels, screenshots of the testing environment, and other pertinent testing details) [1]
- Coupling MOS results with complementary tests such as MUSHRA, AB testing, or CMOS ranking [1, 3, 4]
These options would undoubtedly provide more rigor than a standalone MOS score, and they further indicate that the synthetic speech industry at large has outgrown a single quality metric like MOS. Moreover, “as systems progress, the focus has shifted from obvious artifacts (e.g., robotic voices) to more subtle errors, such as inappropriate prosody or mispronunciations” [5]. MOS was, if nothing else, a perfectly valid blunt tool for the first push in synthetic speech. As we begin to ask more of synthetic speech, more nuanced evaluation tools are required.
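As one example of what coupling MOS with a complementary test might look like, here is a minimal sketch of tallying a Comparison MOS (CMOS). It assumes the standard setup in which each listener hears the same script rendered by two systems and rates B relative to A on a 7-point scale from -3 (much worse) to +3 (much better); the response values here are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical CMOS responses: each value is one listener's rating of
# system B relative to system A on the -3..+3 comparison scale.
cmos_responses = [1, 0, 2, -1, 1, 0, 1, 2, 0, 1, -1, 1]

cmos = mean(cmos_responses)
# Rough 95% confidence interval under a normal approximation.
half_width = 1.96 * stdev(cmos_responses) / sqrt(len(cmos_responses))
print(f"CMOS (B vs. A): {cmos:+.2f} ± {half_width:.2f}")
# An interval that excludes zero suggests the preference is not just noise.
```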
Deconstructing Human Naturalness as an Evaluation Criterion
Taking one big step back, we want to conduct an in-depth exploration of the initial motivation that prompted WellSaid to measure human parity via MOS in 2020. Our number one measure of quality is, and has always been, human naturalness. We do not prioritize correctness or precision, as we do not believe these are the core defining qualities of human naturalness in speech. This guiding belief in the importance of capturing and recreating human naturalness in speech has shaped our technology at every stage, from the script libraries we’ve built for our voice talent to read from, to the instructions we give talent, and, more recently, to the ways we iterate on our core TTS algorithms.
To move, then, toward a quality metric that brings the nuance we see lacking in MOS, we first need to define and, where possible, quantify what is meant by human naturalness for synthetic speech. One way we have begun approaching this task at WellSaid Labs is to collect and analyze the moments where a listener experiences a disconnect in their listening experience. We can then quantify those moments of disconnect and develop metrics to track our progress.
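As a sketch of what quantifying those moments could look like, the example below tallies a per-category "disconnect rate" from listener comments that have been tagged by hand. The tags, data structure, and numbers are hypothetical, chosen to mirror the three categories discussed later in this post.

```python
from collections import Counter

# Hypothetical listener comments, each tagged with the category of
# disconnect it describes (None means no disconnect was reported).
feedback = [
    {"clip": "a.wav", "category": "prosody"},
    {"clip": "a.wav", "category": "pronunciation"},
    {"clip": "b.wav", "category": None},
    {"clip": "c.wav", "category": "audio_quality"},
    {"clip": "d.wav", "category": "prosody"},
]

counts = Counter(item["category"] for item in feedback if item["category"])
total_clips = len({item["clip"] for item in feedback})

# Disconnect rate per category: how often each category breaks the
# illusion, normalized by the number of clips evaluated.
for category, n in counts.most_common():
    print(f"{category}: {n / total_clips:.0%} of clips")
```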
Within our internal testing, we have begun this task by looking at comments that directly reference speech that is less natural, awkward, or robotic. Some of this anecdotal data is less central to our definition of naturalness, and we have set these pieces of feedback aside. For instance, any commentary around how conversational a voice sounds, versus sounding like it is being read aloud, is primarily a criticism of the performance provided by our original voice talent, or a delivery style mismatch for a particular production.
Instead, we think of naturalness as a synthetic voice likeness’ ability to accurately replicate the linguistic characteristics and mannerisms unique to a specific vocal dataset. Conversely, anything that is unnatural is a performance that lies outside of what we would reasonably expect the human behind that dataset to deliver, if provided with the same script.
In this work, we have found that the following components are directly related to a listener’s naturalness evaluation of WellSaid’s synthetic voices: prosody, pronunciation accuracy, and audio quality. Within each of these broader categories, we have begun identifying specific ways in which our current synthetic voices succeed, and also fall short of, naturalness.
Prosody
Also referred to as intonation, cadence, flow, rhythm, tone, pronunciation, and (incorrectly) inflection, an utterance’s prosody is the primary driver of naturalness for our current synthetic speech model. Our listeners’ feedback regarding WellSaid Labs’ speech naturalness with regard to prosody can be further broken down into these categories:
- The AI makes correct pitch and rhythm predictions throughout the utterance, based on expected patterns.
Sample feedback: There were some unnatural/unnecessary pauses within sentences
Sample feedback: Sometimes had odd inflections or upspeak where I wouldn't expect
Sample feedback: The tone includes modulations that help the voice sound more realistic
Example 5 (unnatural)
Example 6 (natural)
script: A lunar month is measured as the time the moon takes to fully rotate around the Earth, while a year is measured as the time Earth rotates around the Sun. [6] -- Voiced by Jack C. using WellSaid Labs
Example 7 (unnatural)
Example 8 (natural)
script: The framework of the story tells the tale of a flower that gives life to a conscious stone in the mythic world of the past; in the main narrative, the flower and the stone are both incarnated as humans and become part of a love triangle. [7] -- Voiced by Terra G. using WellSaid Labs
- Each sentence ends with the right pitch and rhythm predictions based on the final punctuation mark. This is particularly true for questions that anticipate a yes or no answer (see the pitch-direction sketch after these examples).
Sample feedback: The question at the end “can you enter your name please” turned down in a way a human wouldn’t
Example 9 (unnatural)
Example 10 (natural)
script: Let's get started. Can you enter your name please? -- Voiced by Jimmy J. using WellSaid Labs
Example 11 (unnatural)
Example 12 (natural)
script: Discuss the speaker's imagery of distance from the divine and mortality in "Contemplations." Does the vastness of nature contribute to the distance? [7] -- Voiced by Ava M. using WellSaid Labs
- Specific difficult phrases are delivered with correct rhythm and emphasis (often labeled pronunciation by listeners).
Sample feedback: The pronunciation of “off gas grid areas” sounds awkward without adjustments in both versions
Example 13 (unnatural)
Example 14 (natural)
script: With little research having been undertaken into how LCTTs are configured in a residential retro fit context -- particularly in rural, off-gas grid areas -- this study contributes to two under-researched areas. [8] -- Voiced by Fiona H using WellSaid Labs
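As a concrete companion to the question-intonation point above, here is a rough sketch of how the final pitch direction of an utterance could be checked automatically. It uses librosa's pyin pitch tracker and fits a line to the voiced portion of the last stretch of the F0 contour; the file name is hypothetical, and this heuristic is an illustration rather than part of our evaluation pipeline.

```python
import librosa
import numpy as np

def final_pitch_direction(path, tail_seconds=0.6):
    """Estimate whether an utterance ends with rising or falling pitch.

    Assumes the tail contains at least a couple of voiced frames.
    """
    y, sr = librosa.load(path, sr=None)
    # pyin returns NaN for F0 in unvoiced frames, plus a voiced-frame mask.
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    hop_duration = 512 / sr  # pyin's default hop_length is 512 samples
    tail = int(tail_seconds / hop_duration)
    tail_f0, tail_voiced = f0[-tail:], voiced[-tail:]
    times = np.arange(len(tail_f0)) * hop_duration
    # Fit a line through the voiced frames only; the slope's sign gives the direction.
    slope = np.polyfit(times[tail_voiced], tail_f0[tail_voiced], 1)[0]
    return "rising" if slope > 0 else "falling"

print(final_pitch_direction("example_10.wav"))  # hypothetical file name
```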
Pronunciation
Word correctness remains a relevant marker of quality for synthetic speech. Within the context of naturalness, however, pronunciation means more than simple word correctness. Slurring, for instance, is a type of unnatural pronunciation.
Sample feedback: The model says "address" in an unnatural way. The updated model fixes this pronunciation and in general sounds more natural.
Example 15 (unnatural)
Example 16 (natural)
script: Kant's theories of individualism and humanism set him at odds with the monarchy - and the church, of course. [7] -- Voiced by Kari N. using WellSaid Labs
Audio Quality
We will spend more time in future posts discussing audio quality – a topic easily worth its own focus – in greater depth. For the purpose of the conversation around naturalness, some specific audio quality artifacts can lead to a feeling of unnaturalness for listeners. Tinny reverb and pitch quantization are the most prominent audio artifacts associated with a robotic delivery; pitch quantization is sketched numerically after the examples below.
Example 17 (unnatural)
Example 18 (natural)
script: The specific gravity of soil solid is used in calculating the phase relationships of soils, such as the void ratio and the degree of saturation. [9] -- Voiced by Wade C. using WellSaid Labs
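For readers unfamiliar with the term, the short sketch below illustrates pitch quantization numerically: a smooth, human-like F0 contour snapped to a coarse grid. The contour and step size are invented for illustration, not measurements from our models; the resulting stair-step movement is what listeners tend to hear as robotic.

```python
import numpy as np

# A smooth, human-like F0 contour (in Hz) over one second of speech.
t = np.linspace(0, 1, 200)
f0_natural = 180 + 30 * np.sin(2 * np.pi * 1.5 * t)

# Pitch quantization: snap the contour to a coarse grid (hypothetical 10 Hz step).
step_hz = 10.0
f0_quantized = np.round(f0_natural / step_hz) * step_hz

# The quantized contour jumps between a handful of discrete values instead
# of gliding, which listeners tend to describe as "robotic."
print(np.unique(f0_quantized))
```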
WellSaid’s Continuous Pursuit of Voice Naturalness: From Replication to Refinement
Standing at the threshold of an era where synthetic meets authentic, we see that our journey at WellSaid Labs is far from complete. We're proud of our strides, but there's no resting on laurels here. You can anticipate more frequent publications from us, shedding light on the nuances of our journey and the breakthroughs we achieve. Our focus remains on honing the symbiotic relationship between user and model, which we believe is the linchpin in delivering a product that resonates with genuine naturalness.
As we move forward, our strategies are evolving too. Instead of merely offering post-processing on synthetic outputs, we are pouring our energy into enhancing our datasets to ensure the highest audio quality from the get-go. This is just one of the tangible ways we are actualizing our unwavering commitment to naturalness.
In the rapidly evolving realm of synthetic speech, WellSaid Labs remains dedicated to a singular vision: moving beyond replicating the human voice to refining and celebrating its nuances.
References
[1] Kirkland, Ambika, Shivam Mehta, Harm Lameris, Gustav Eje Henter, Eva Szekely, and Joakim Gustafson. "Stuck in the MOS pit: A critical analysis of MOS test methodology in TTS evaluation." In 12th Speech Synthesis Workshop (SSW12). 2023.
[2] Wagner, Petra, Jonas Beskow, Simon Betz, Jens Edlund, Joakim Gustafson, Gustav Eje Henter, Sébastien Le Maguer et al. "Speech synthesis evaluation—state-of-the-art assessment and suggestion for a novel research program." In Proceedings of the 10th Speech Synthesis Workshop (SSW10). 2019.
[3] Camp, Joshua, Tom Kenter, Lev Finkelstein, and Rob Clark. "MOS vs. AB: Evaluating Text-to-Speech Systems Reliably Using Clustered Standard Errors." In Proc. Interspeech. 2023.
[4] Cooper, Erica, and Junichi Yamagishi. "Investigating Range-Equalizing Bias in Mean Opinion Score Ratings of Synthesized Speech." arXiv preprint arXiv:2305.10608. 2023.
[5] Sellam, Thibault, Ankur Bapna, Joshua Camp, Diana Mackinnon, Ankur P. Parikh, and Jason Riesa. "SQuId: Measuring speech naturalness in many languages." In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5. 2023.
[6] Burger, Benjamin J. "The Essential Guide to Planet Earth." 2020. https://digitalcommons.usu.edu/oer_textbooks/7
[7] Turlington, Anita, Matthew Horton, Karen Dodson, Laura Getty, Kyounghye Kwon, and Laura Ng. Compact Anthology of World Literature II: Volumes 4, 5, and 6. University System of Georgia, 2022. https://open.umn.edu/opentextbooks/textbooks/compact-anthology-of-world-literature-ii-2022-1236
[8] Wrapson, Wendy, and Patrick Devine-Wright. "‘Domesticating’ low carbon thermal technologies: Diversity, multiplicity and variability in older person, off grid households." Energy Policy 67 (2014): 807-817.
[9] Hossain, M. D., M. D. Islam, Faria Fahim Badhon, and Tanvir Imtiaz. Properties and Behavior of Soil: Online Lab Manual. Mavs Open Press, 2022.