BASE TTS: Audio samples

Audio samples for the paper "BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100K hours of data".

Abstract: We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS.
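
The abstract mentions compressing speechcodes with byte-pair encoding. This page does not describe how BASE TTS implements that step, so the sketch below is only a generic illustration of byte-pair encoding applied to a sequence of integer codes. The function names, the example code values, and the choice of 100 as the first merged id are all hypothetical.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most common adjacent pair in seq, or None if there is none."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(seq, pair, new_id):
    """Replace non-overlapping occurrences of pair with new_id, left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def bpe_compress(seq, num_merges, first_new_id):
    """Apply up to num_merges merges; return (shorter sequence, merge table)."""
    merges = {}
    seq = list(seq)
    for step in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        new_id = first_new_id + step
        merges[new_id] = pair
        seq = merge_pair(seq, pair, new_id)
    return seq, merges


def bpe_expand(seq, merges):
    """Invert bpe_compress by recursively expanding merged ids."""
    out = []
    for tok in seq:
        if tok in merges:
            out.extend(bpe_expand(list(merges[tok]), merges))
        else:
            out.append(tok)
    return out


# Made-up discrete codes standing in for speechcodes (not real model output).
codes = [7, 7, 3, 7, 7, 3, 5, 7, 7, 3]
compressed, table = bpe_compress(codes, num_merges=2, first_new_id=100)
restored = bpe_expand(compressed, table)
```

In this toy setting the merged sequence is shorter than the input and expands back to it exactly, which is the property that makes the technique useful for shortening the sequences an autoregressive model has to generate.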

Below are selected samples produced by the model. There's no annotation on the input text and no post-processing on the audio.

English

At the conference, the professor, Mark Curtis, who researched the phenomena that the student who presented earlier had focused on made a surprising revelation that shocked the audience.

Voice: Speaker A

His latest invention (a device meant to assist in everyday chores (something he never seemed to run out of)), was nothing short of brilliant.

Voice: Speaker B

Overwhelmed with confusion and despair, David Darlan cried out, "What do you want from me? Why can't you just tell me what's wrong?"

Voice: Speaker B

After getting to his car he said, "Oh great, another Monday, I just can't wait to sit in traffic for an hour and spend the next 8 hours staring at a computer screen."

Voice: Speaker B (sarcastic)

With an ample supply of joie de vivre, Mary danced through the streets of Nice, stopping only to enjoy a nice café with a warm croissant. How French!

Voice: Speaker C

His face lit up with pure delight as he exclaimed, "We did it! We won the championship! I knew we could do it together!"

Voice: Speaker C

"I went through all of this trouble, buying flowers, chocolate, and even organizing a flash mob, and she's still rejecting me?"

Voice: Speaker C

A profound sense of realization washed over Matty as he whispered, "You've been there for me all along, haven't you? I never truly appreciated you until now."

Voice: Speaker D

Beth collapsed into his arms, sobbing uncontrollably, "I failed them, I failed them all. They’re all dead! Nothing we can do will ever bring them back. How can I ever live with myself again? How?"

Voice: Speaker D

"Uh, are you sure about this?" Tim asked nervously, looking at the steep slope before them. "Whoa, it's higher than I thought," he continued, his voice filled with trepidation. "Aha, but look at the view," Emily responded with excitement, "it's worth the climb!"

Voice: Speaker D

Spanish

Con los ojos muy abiertos de terror, gritó: «¡Los frenos no funcionan! ¿Qué hacemos ahora? ¡Estamos completamente atrapados!»

Voice: Speaker E

Beth cayó en sus brazos, sollozando incontrolablemente: «Les fallé, les fallé a todos. ¡Están todos muertos! Nada de lo que podamos hacer los devolverá jamás. ¿Cómo podré volver a vivir conmigo mismo? ¿Cómo?»

Voice: Speaker E

David le susurró a Emily mientras las luces se apagaban en el teatro: «Shh, ya está empezando».

Voice: Speaker E

Con las manos temblorosas de emoción, Alice Monroe tartamudeó: «Oh... ¡no puedo creerlo! ¿Es realmente mi carta de admisión a Harvard?» Marco tampoco puede creerlo: «¡Maldita sea! ¿Cómo lo lograste?»

Voice: Speaker F

Durante la reunión, en la que los ejecutivos de Coca-Cola debatieron sobre el futuro de la empresa, Thomas, un joven becario que había descubierto una solución, se armó de valor para hablar, cambiando el rumbo de la conversación que había precedido a su intervención.

Voice: Speaker F
