VAANI is a large-scale multilingual speech dataset developed by the Indian Institute of Science (IISc), Bangalore. It contains around 21,500 hours of audio collected from about 110,000 speakers in 120 districts across 22 Indian states. It covers 86 languages and dialects spoken in India, including major scheduled languages and many regional and tribal varieties, and includes 835 hours of transcribed speech.
The dataset is designed to support a range of speech and language tasks, including automatic speech recognition, text-to-speech synthesis, speaker identification, language identification, speech enhancement, and multimodal language model development. It is released under a CC BY 4.0 license and is intended mainly for training and benchmarking AI systems, especially by researchers and developers working on low-resource and underrepresented Indian languages.
Background and Development
VAANI (meaning "voice" or "speech" in many Indian languages) was created by the Indian Institute of Science (IISc) in Bangalore as part of an effort to close the large gap in speech data resources for Indian languages. India's linguistic landscape is among the most diverse in the world, with hundreds of languages and thousands of dialects, yet most existing speech datasets focus on a small set of well-resourced languages. VAANI was developed to increase the availability of training data for this underserved linguistic space through large-scale field data collection across the country's geographical and linguistic regions.
Data was gathered from about 110,000 speakers spread across 120 districts in 22 Indian states, reflecting an effort to capture genuine regional and dialectal variation rather than relying on controlled studio recordings. The dataset covers 86 languages and dialects, from major scheduled languages such as Hindi, Tamil, Telugu, Bengali, Kannada, and Malayalam to many regional varieties and tribal languages such as Gondi, Santali, Kurukh, Wancho, and Tenyidie, among many others.
Dataset Composition and Key Features
VAANI contains about 21,500 hours of audio in total, making it one of the largest multilingual speech corpora focused on Indian languages. Of this, 835 hours have been transcribed, providing ground-truth text annotations for part of the collection. The dataset also includes multimodal elements, which extend its usefulness beyond conventional speech tasks.
Key characteristics of the dataset include:
- Coverage of 86 languages and dialects, including plenty low-resource and tribal varieties
- Contributions from around 110,000 speakers across different geographic and demographic backgrounds
- 21,500 total hours of audio with 835 hours of transcribed speech
- Field recordings spanning 120 districts in 22 Indian states
- Released under a CC BY 4.0 license, wey allow broad reuse with attribution
- Support for both unimodal and multimodal research applications
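To put the headline figures in proportion, the short Python sketch below derives a few ratios from them. It uses only the approximate numbers stated above; the function name `summarize` and the dictionary keys are illustrative choices, not part of any VAANI tooling, and the results are rough averages rather than official statistics.

```python
# Headline figures as stated in the text (all approximate).
TOTAL_HOURS = 21_500
TRANSCRIBED_HOURS = 835
SPEAKERS = 110_000
DISTRICTS = 120


def summarize(total_hours, transcribed_hours, speakers, districts):
    """Derive simple ratios from the headline dataset figures."""
    return {
        # Share of the audio that has transcriptions.
        "transcribed_fraction": transcribed_hours / total_hours,
        # Mean audio per speaker, converted from hours to minutes.
        "minutes_per_speaker": total_hours * 60 / speakers,
        # Mean number of speakers per district.
        "speakers_per_district": speakers / districts,
    }


stats = summarize(TOTAL_HOURS, TRANSCRIBED_HOURS, SPEAKERS, DISTRICTS)
print(f"Transcribed share:     {stats['transcribed_fraction']:.1%}")
print(f"Audio per speaker:     {stats['minutes_per_speaker']:.1f} min")
print(f"Speakers per district: {stats['speakers_per_district']:.0f}")
```

On these figures, roughly 3.9% of the audio is transcribed, and each speaker contributes a little under 12 minutes on average. Most of the corpus is therefore untranscribed audio, which matters when planning supervised tasks such as ASR training.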
Supported Tasks and Use Cases
VAANI is designed to serve as a resource for many speech and language processing tasks. Researchers and developers can use it to train and evaluate automatic speech recognition (ASR) systems, text-to-speech (TTS) synthesis, speaker identification, and language identification models. The dataset also supports work on speech enhancement and the development of multimodal large language models (LLMs). Its broad language coverage makes it particularly relevant for benchmarking systems intended to operate across India's diverse linguistic environment.
Because it includes many low-resource languages and dialects for which little or no prior speech data exists, VAANI is of particular value to researchers focused on building inclusive language technologies. It provides a foundation for developing ASR and natural language processing tools that can serve communities whose languages have been absent from mainstream AI development.
Significance for Indian Language Technology
The scale and linguistic diversity of VAANI make it a notable contribution to multilingual speech research, especially in the context of South Asian languages. By documenting spoken varieties from tribal, rural, and regional communities alongside more widely spoken languages, the dataset captures a dimension of India's linguistic heritage that is rarely represented in computational resources. Its open license facilitates use by academic institutions, government bodies, and industry researchers working toward more inclusive and representative speech technologies.