NVIDIA has released an open-source multilingual speech AI dataset called Granary, which features nearly one million hours of audio, alongside new AI models optimized for transcription and translation across 25 European languages. The launch, announced August 15, 2025, aims to bridge the gap in speech technologies for underrepresented languages. The research team will present the Granary paper at Interspeech 2025 in the Netherlands this week.
Of the world’s 7,000 languages, only a fraction are supported by AI systems. NVIDIA’s Granary dataset strives to address this imbalance by including approximately 650,000 hours of speech recognition data and over 350,000 hours of speech translation data. It covers widely spoken European languages as well as lesser-resourced ones, such as Croatian, Estonian, and Maltese.
The dataset was developed in collaboration with Carnegie Mellon University and Fondazione Bruno Kessler, using the NVIDIA NeMo Speech Data Processor to transform vast amounts of unlabeled audio into clean, structured data. The company claims that this process significantly reduces reliance on costly human annotation, making large-scale AI training more efficient and inclusive. According to NVIDIA’s official announcement, Granary achieves target accuracy for automatic speech recognition and translation with about half the training data required by other popular corpora.
Canary and Parakeet: AI Models Built on Granary
Two new models showcase Granary’s potential:
- NVIDIA Canary-1b-v2: A billion-parameter model that NVIDIA says tops Hugging Face’s leaderboard for multilingual speech recognition and runs inference up to 10 times faster than comparable large models. It supports transcription and translation between English and two dozen languages.
- NVIDIA Parakeet-tdt-0.6b-v3: A high-throughput model designed for real-time or large-scale transcription. NVIDIA states that it can transcribe 24-minute audio clips in a single pass, automatically detects the input language, and delivers results with low latency.
Both models produce outputs with punctuation, capitalization, and timestamps, making them ready for production environments such as customer service bots, multilingual chat tools, and near-real-time interpretation systems.
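For localization teams, timestamped transcripts are most immediately useful as subtitles. The following is a minimal Python sketch of that step, not NVIDIA's API: it assumes the transcript has already been reduced to a list of segments with hypothetical `start`, `end` (both in seconds), and `text` keys, and converts them to the SubRip (SRT) format. The actual shape of the models' output may differ, so the field names would need to be adapted.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp format HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render timestamped segments as an SRT subtitle document.

    Each segment is assumed (hypothetically) to be a dict with
    'start' and 'end' in seconds and the transcribed 'text'.
    """
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        # SRT cue: sequence number, time range, then the subtitle text.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text']}")
    # Cues are separated by a blank line; the file ends with a newline.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Illustrative data only, not real model output.
    example = [
        {"start": 0.0, "end": 1.25, "text": "Dobar dan."},
        {"start": 1.5, "end": 3.0, "text": "Kako ste?"},
    ]
    print(segments_to_srt(example))
```

Because the models already emit punctuation and capitalization, the segment text can be passed through unchanged. A production pipeline would probably still add line-length limits and reading-speed checks before publishing subtitles.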
Why It Matters for Localization
For the localization industry, Granary and its companion models signal progress toward more inclusive, scalable speech technologies. By providing open access to multilingual resources, NVIDIA is equipping developers to expand AI-powered services across Europe’s diverse linguistic landscape. The implications are significant: improved access to underrepresented languages, faster model development through NVIDIA NeMo, and a broader ecosystem of speech-enabled applications.
As demand grows for real-time, accurate voice technologies, Granary’s open-source foundation may inspire similar initiatives beyond Europe. For now, its release provides the localization community with a dataset designed not only to scale technology, but also to reflect linguistic diversity.

