Updated 9 February 2026 at 18:07 IST

‘Pathbreaking AI’: Amitabh Kant Hails Sarvam AI’s Indigenous Text-to-Speech Model, Bulbul V3

Sarvam CEO Pratyush Kumar says Bulbul V3 is designed to generate natural, expressive speech for Indian languages and to hold up in production use cases.



Sarvam AI launched its flagship TTS AI model, called Bulbul V3. | Image: Republic

Amitabh Kant has praised Sarvam AI’s new text-to-speech (TTS) model, Bulbul V3, calling it “pathbreaking AI” in a post on X. The endorsement comes as the Indian startup races to build speech and language tools tuned for local accents, code-mixed inputs, and real-world phone-line quality, areas where global voice models can still stumble.

“Incredible achievement by @SarvamAI. They are developing pathbreaking AI that leads global benchmarks for Indic language vision models across India's 22+ languages. A Gr8 example of homegrown, sovereign AI attuned to our multilingual cultures and local contexts. This is key for unlocking our full potential,” said Kant, former CEO of NITI Aayog, in his post.

Sarvam CEO Pratyush Kumar says Bulbul V3 is designed to generate natural, expressive speech for Indian languages and to hold up in production use cases like voice agents, customer support, and education. In its announcement, the company argues that “Indian speech is complex by default,” pointing to frequent code-switching, region-specific accents, and the need to handle names, abbreviations, and numerics accurately.

What Bulbul V3 claims to improve

Sarvam says Bulbul V3 advances on three dimensions it considers crucial for deployable speech systems: naturalness, stability, and robustness.

  1. For naturalness, Sarvam says Bulbul V3 uses a large language model to infer prosody, pauses, emphasis, pacing and tone, rather than reading text as a flat sequence.
  2. For stability, it claims lower rates of word skips and mispronunciations, which matters in high-volume voice applications where small errors can break workflows.
  3. For robustness, Sarvam says the model performs well on code-mixed and messy inputs, including numerics, URLs, abbreviations and Romanised text, measured using character error rate (CER) on India-relevant domains.
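
For readers unfamiliar with the metric, character error rate is conventionally the character-level edit distance between a reference text and a hypothesis text, divided by the length of the reference. The sketch below illustrates that textbook definition only; the article does not describe Sarvam's actual evaluation pipeline, and the function names `levenshtein` and `cer` are illustrative, not part of any Sarvam API.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `ref` into `hyp`."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Textbook example: three edits against a six-character reference.
print(cer("kitten", "sitting"))
```

Note that this version counts Unicode code points. For Indic scripts, where one visible character can span several code points, an evaluation might instead normalise the text or count grapheme clusters; which convention Sarvam used is not stated in the announcement as reported here.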

How Sarvam tested it

Sarvam says Bulbul V3 was evaluated in a third-party, blind A/B human listening study across 11 languages, run by Josh Talks. The company says the study compared Bulbul V3 against models such as ElevenLabs (v3 alpha and v2.5 flash) and Cartesia Sonic‑3, across both full-band audio and 8 kHz telephony-grade audio. Sarvam claims Bulbul V3 performed best in the 8 kHz telephony evaluations, a format that better reflects real call-centre and voice-agent conditions.

Languages, voices, and voice cloning

Sarvam says Bulbul V3 offers 30+ voices across 11 Indian languages and plans to expand to 22 scheduled Indian languages. It also supports voice cloning for creating custom voices, with the company describing this as consent-based and aimed at enterprise use.

What this means

If Bulbul V3 holds up outside lab benchmarks, it could make it easier for Indian apps to ship voice-first experiences that sound local, stay accurate on numbers and names, and work reliably on phone networks.


Published By : Shubham Verma

Published On: 9 February 2026 at 18:07 IST