Kyutai TTS: The First Real-Time Text-to-Speech Model
Kyutai TTS is a text-to-speech model designed for real-time use. It was originally developed as an internal tool for Moshi, Kyutai's AI assistant, and has since been refined and open-sourced. The current version, kyutai/tts-1.6b-en_fr, has 1.6 billion parameters and introduces several innovations that make it particularly well suited to real-time applications.
Benefits
Kyutai TTS sets a new standard on two common text-to-speech metrics: word error rate (WER), which measures how closely the generated audio follows the script, and speaker similarity, which measures how closely a cloned voice matches the original sample.
Unlike other TTS models, Kyutai TTS does not need the entire text in advance. Its latency is 220 ms from receiving the first text token to generating the first chunk of audio, so it can start speaking while a large language model (LLM) is still generating the rest of the text. The benefit is largest when LLM generation itself is slow, such as in low-resource environments or when generating long chunks of text, because the listener no longer has to wait for the full text to be written.
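To make the latency argument concrete, here is a rough back-of-the-envelope sketch in Python. Only the 220 ms figure comes from the text above. The LLM speed, the reply length, and the simplification that a non-streaming TTS has the same 220 ms startup latency are assumptions chosen for illustration, and the function name is hypothetical.

```python
def time_to_first_audio(total_tokens, tokens_per_second, tts_latency_s, streaming):
    """Seconds from the LLM starting to write until the first audio chunk.

    A streaming TTS only needs the first text token before it can start;
    a non-streaming TTS has to wait for the whole reply to be written.
    """
    tokens_needed = 1 if streaming else total_tokens
    return tokens_needed / tokens_per_second + tts_latency_s


TTS_LATENCY_S = 0.220         # first text token -> first audio chunk (from the text)
LLM_TOKENS_PER_SECOND = 50.0  # assumption: speed of the upstream LLM
REPLY_TOKENS = 200            # assumption: length of the LLM reply

streaming_wait = time_to_first_audio(
    REPLY_TOKENS, LLM_TOKENS_PER_SECOND, TTS_LATENCY_S, streaming=True
)
batch_wait = time_to_first_audio(
    REPLY_TOKENS, LLM_TOKENS_PER_SECOND, TTS_LATENCY_S, streaming=False
)

print(f"streaming in text: {streaming_wait:.2f} s")  # 0.24 s
print(f"full text first:   {batch_wait:.2f} s")      # 4.22 s
```

Under these assumed numbers the wait in the streaming case stays constant as the reply gets longer, while the wait in the full-text case grows with the length of the reply.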
Use Cases
Kyutai TTS can clone voices using a 10-second audio sample, matching the voice, intonation, mannerisms, and recording quality of the source audio. To ensure ethical use, the voice embedding model is not directly released. Instead, a repository of voices based on samples from datasets like Expresso and VCTK is provided. Users can also help expand the voice library by anonymously donating their voice.
Most TTS models struggle to generate audio longer than 30 seconds, whereas Kyutai TTS can generate much longer audio. It comes with a Rust server that provides streaming access to the model over websockets, which makes it ready for production use. On an L40S GPU, the server can handle 16 simultaneous connections at a real-time factor of over 2x.
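As a quick illustration of what these serving numbers imply, the arithmetic can be sketched as follows. This assumes that "real-time factor" here means seconds of audio produced per second of wall-clock time on each connection. Since the text says "over 2x", the results are lower bounds, and the function names are hypothetical.

```python
def aggregate_audio_rate(connections, real_time_factor):
    """Seconds of audio produced per wall-clock second across all connections."""
    return connections * real_time_factor


def generation_time_s(audio_seconds, real_time_factor):
    """Wall-clock seconds needed to produce a clip on a single connection."""
    return audio_seconds / real_time_factor


CONNECTIONS = 16        # simultaneous connections on one L40S (from the text)
REAL_TIME_FACTOR = 2.0  # "over 2x" in the text, so treat this as a lower bound

total_rate = aggregate_audio_rate(CONNECTIONS, REAL_TIME_FACTOR)
one_minute_clip = generation_time_s(60.0, REAL_TIME_FACTOR)

print(f"at least {total_rate:.0f} s of audio per wall-clock second")  # 32
print(f"a 60 s clip takes at most {one_minute_clip:.0f} s to generate")  # 30
```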
Kyutai TTS outputs exact timestamps for each word, which can be used to display real-time subtitles. Unmute, an application built on Kyutai TTS, uses these timestamps to handle interruptions: knowing which words had already been spoken when the user cut in lets the conversation resume from the right point.
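A minimal sketch of how word timestamps can support interruption handling is shown below. The TimedWord structure, its field names, and the example timings are hypothetical, since the text does not describe the actual output format of the model or how Unmute implements this.

```python
from dataclasses import dataclass


@dataclass
class TimedWord:
    """Hypothetical record: one word with its start and end time in the audio."""
    text: str
    start_s: float
    end_s: float


def split_at_interruption(words, interrupted_at_s):
    """Split a reply into the part fully spoken before the interruption
    and the part that was cut off or never played."""
    spoken = [w.text for w in words if w.end_s <= interrupted_at_s]
    unspoken = [w.text for w in words if w.end_s > interrupted_at_s]
    return " ".join(spoken), " ".join(unspoken)


# Made-up timings for illustration only.
reply = [
    TimedWord("The", 0.00, 0.18),
    TimedWord("weather", 0.18, 0.55),
    TimedWord("today", 0.55, 0.90),
    TimedWord("is", 0.90, 1.02),
    TimedWord("sunny", 1.02, 1.50),
]

# The user starts talking 0.95 s into the audio.
heard, not_heard = split_at_interruption(reply, interrupted_at_s=0.95)
print(heard)      # The weather today
print(not_heard)  # is sunny
```

The same per-word start and end times could drive subtitles by showing each word once playback reaches its start time.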
Additional Information
Kyutai TTS, along with Kyutai STT and Unmute, was created by Alexandre Défossez, Edouard Grave, Eugene Kharitonov, Laurent Mazare, Gabriel de Marmiesse, Emmanuel Orsini, Patrick Perez, Václav Volhejn, and Neil Zeghidour, with support from the rest of the Kyutai team.
Kyutai TTS combines real-time streaming, voice cloning, and long-form generation in a single open-source model. Together with the safeguards around voice cloning, this makes it a strong choice for developers and researchers working on AI and conversational interfaces.