Name: VAANI
Creator: Indian Institute of Science
License: https://creativecommons.org/licenses/by/4.0/

Background and Development

VAANI (meaning "voice" or "speech" in many Indian languages) was created by the Indian Institute of Science (IISc) in Bangalore as part of an effort to address the large gap in speech data resources for Indian languages. India's language landscape is among the most diverse in the world, with hundreds of languages and thousands of dialects, yet most existing speech datasets focus on a small set of well-resourced languages. VAANI was developed to increase the availability of training data for this underserved linguistic space through large-scale field data collection across the country's geographical and linguistic regions.

Data was gathered from about 110,000 speakers spread across 120 districts in 22 Indian states, reflecting an effort to capture genuine regional and dialectal variation rather than rely on controlled studio recordings. The dataset covers 86 languages and dialects, ranging from major scheduled languages such as Hindi, Tamil, Telugu, Bengali, Kannada, and Malayalam to many regional varieties and tribal languages such as Gondi, Santali, Kurukh, Wancho, and Tenyidie, among many others.

Dataset Composition and Key Features

VAANI contains about 21,500 hours of audio in total, making it one of the largest multilingual speech corpora focused on Indian languages. Of this, 835 hours have been transcribed, providing ground-truth text annotations for part of the collection. The dataset also includes multimodal elements, which make it usable beyond conventional speech tasks.

Key characteristics of the dataset include:

Coverage of 86 languages and dialects, including many low-resource and tribal varieties
Contributions from around 110,000 speakers across different geographic and demographic backgrounds
About 21,500 total hours of audio, of which 835 hours are transcribed
Field recordings spanning 120 districts in 22 Indian states
Released under a CC BY 4.0 license, which allows broad reuse with attribution
Support for both unimodal and multimodal research applications
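
To put the headline figures in proportion, the short sketch below works out what share of the audio is transcribed and how much audio each speaker contributes on average. It uses only the approximate totals quoted above; the function names are illustrative and not part of any VAANI tooling, and the per-speaker value is a plain average that says nothing about how recordings are actually distributed.

```python
# Approximate totals quoted in this card (not per-language values).
TOTAL_HOURS = 21_500
TRANSCRIBED_HOURS = 835
SPEAKERS = 110_000


def transcribed_share(transcribed_hours: float, total_hours: float) -> float:
    """Return the fraction of audio hours that carry transcriptions."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return transcribed_hours / total_hours


def mean_minutes_per_speaker(total_hours: float, speakers: int) -> float:
    """Return the average audio per speaker in minutes (a plain mean)."""
    if speakers <= 0:
        raise ValueError("speakers must be positive")
    return total_hours * 60 / speakers


if __name__ == "__main__":
    share = transcribed_share(TRANSCRIBED_HOURS, TOTAL_HOURS)
    minutes = mean_minutes_per_speaker(TOTAL_HOURS, SPEAKERS)
    print(f"Transcribed share: {share:.1%}")          # Transcribed share: 3.9%
    print(f"Mean audio per speaker: {minutes:.1f} min")  # Mean audio per speaker: 11.7 min
```

Under these figures, roughly 4% of the audio has transcriptions, so supervised ASR training can draw on only a small part of the corpus, while the remainder is better suited to self-supervised or weakly supervised use.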

Supported Tasks and Use Cases

VAANI is designed to serve as a resource for a wide range of speech and language processing tasks. Researchers and developers can use it to train and evaluate automatic speech recognition (ASR) systems, text-to-speech (TTS) synthesis, speaker identification, and language identification models. The dataset also supports work on speech enhancement and the development of multimodal large language models (LLMs). Its broad language coverage makes it particularly relevant for benchmarking systems intended to operate across India's diverse linguistic environment.

Because it includes many low-resource languages and dialects for which little or no prior speech data exists, VAANI is of particular value to researchers focused on building inclusive language technologies. It provides a foundation for developing ASR and natural language processing tools that can serve communities whose languages have been absent from mainstream AI development.

Significance for Indian Language Technology

The scale and linguistic diversity of VAANI make it a notable contribution to multilingual speech research, especially in the context of South Asian languages. By documenting spoken varieties from tribal, rural, and regional communities alongside more widely spoken languages, the dataset captures a dimension of India's linguistic heritage that is rarely represented in computational resources. Its open license facilitates use by academic institutions, government bodies, and industry researchers working toward more inclusive and representative speech technologies.
