21,500 hours · CC-BY 4.0 · Training · All languages

Categories

voice, audio, nlp

VAANI is a large-scale multilingual speech dataset created by the Indian Institute of Science (IISc), Bangalore. It contains 21,500 hours of audio from 110,000 speakers across 120 districts in 22 Indian states. It covers 86 languages and dialects of India, including regional and tribal varieties, and 835 hours of the audio are transcribed.

The dataset is intended for a range of speech and language tasks: automatic speech recognition, text-to-speech synthesis, speaker identification, language identification, speech enhancement, and multimodal language model development. Released under the CC BY 4.0 license, it can be used to train and benchmark AI systems, particularly for low-resource and underrepresented Indian languages.

Background and Development

VAANI (meaning "voice" or "speech" in several Indian languages) was created by the Indian Institute of Science (IISc) in Bangalore as part of an effort to address the significant gap in speech data resources for Indian languages. India's linguistic landscape is among the most diverse in the world, encompassing hundreds of languages and thousands of dialects, yet the majority of existing speech datasets focus on a narrow set of well-resourced languages. VAANI was developed to expand the availability of training data for this underserved linguistic space by conducting large-scale field data collection across geographically and linguistically diverse regions of the country.

Data was gathered from approximately 110,000 speakers spread across 120 districts in 22 Indian states, reflecting a deliberate effort to capture genuine regional and dialectal variation rather than relying on controlled studio recordings. The dataset spans 86 languages and dialects, ranging from major scheduled languages such as Hindi, Tamil, Telugu, Bengali, Kannada, and Malayalam to numerous regional varieties and tribal languages including Gondi, Santali, Kurukh, Wancho, and Tenyidie, among many others.

Dataset Composition and Key Features

VAANI comprises approximately 21,500 hours of audio in total, making it one of the largest multilingual speech corpora focused on Indian languages. Of this, 835 hours have been transcribed, providing ground-truth text annotations for a subset of the collection. The dataset also incorporates multimodal elements, positioning it for use beyond conventional speech tasks.

Key characteristics of the dataset include:

  • Coverage of 86 languages and dialects, including many low-resource and tribal varieties
  • Contributions from around 110,000 speakers across diverse geographic and demographic backgrounds
  • 21,500 total hours of audio with 835 hours of transcribed speech
  • Field recordings spanning 120 districts in 22 Indian states
  • Released under a CC BY 4.0 license, permitting broad reuse with attribution
  • Support for both unimodal and multimodal research applications
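
To put the headline figures in proportion, the short sketch below derives a few ratios from the numbers quoted above. It uses only the approximate totals given in this description; the function and constant names are illustrative, and the per-speaker and per-district values are plain averages that say nothing about how the audio is actually distributed.

```python
# Approximate totals quoted in this description of VAANI.
TOTAL_HOURS = 21_500
TRANSCRIBED_HOURS = 835
SPEAKERS = 110_000
DISTRICTS = 120


def derived_stats() -> dict:
    """Return simple ratios computed from the quoted totals.

    These are plain means; the real per-speaker and per-district
    distribution is not stated in the description and may be uneven.
    """
    return {
        "transcribed_fraction": TRANSCRIBED_HOURS / TOTAL_HOURS,
        "untranscribed_hours": TOTAL_HOURS - TRANSCRIBED_HOURS,
        "avg_minutes_per_speaker": TOTAL_HOURS * 60 / SPEAKERS,
        "avg_hours_per_district": TOTAL_HOURS / DISTRICTS,
    }


if __name__ == "__main__":
    for name, value in derived_stats().items():
        print(f"{name}: {value:.3f}")
```

On these figures, roughly 3.9% of the audio is transcribed and the average contribution is a little under 12 minutes per speaker, so most of the corpus is untranscribed audio.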

Supported Tasks and Use Cases

VAANI is designed to serve as a resource for a wide range of speech and language processing tasks. Researchers and developers can use it for training and evaluating automatic speech recognition (ASR) systems, text-to-speech (TTS) synthesis, speaker identification, and language identification models. The dataset also supports work on speech enhancement and the development of multimodal large language models (LLMs). Its breadth of language coverage makes it particularly relevant for benchmarking systems intended to operate across India's diverse linguistic environment.
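
Because only part of the audio is transcribed, the usable subset differs by task. The sketch below illustrates that idea under stated assumptions: the `Clip` record, its field names, and the task labels are hypothetical and are not VAANI's actual schema. It assumes that supervised ASR and TTS training need paired text, while language and speaker identification need only the corresponding metadata label.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Clip:
    # Hypothetical record layout for illustration; not VAANI's real schema.
    clip_id: str
    language: str
    speaker_id: str
    duration_s: float
    transcript: Optional[str] = None  # None when the clip is untranscribed


def usable_for(task: str, clip: Clip) -> bool:
    """Decide whether a clip carries the label a given task needs."""
    if task in {"asr", "tts"}:
        # Supervised ASR/TTS training typically needs paired text.
        return bool(clip.transcript)
    if task == "language_id":
        return bool(clip.language)
    if task == "speaker_id":
        return bool(clip.speaker_id)
    raise ValueError(f"unknown task: {task}")


def hours_per_task(clips: Iterable[Clip], tasks: Iterable[str]) -> dict:
    """Sum the usable audio, in hours, for each requested task."""
    clips = list(clips)
    return {
        task: sum(c.duration_s for c in clips if usable_for(task, c)) / 3600
        for task in tasks
    }


if __name__ == "__main__":
    # Made-up sample records, not real VAANI data.
    sample = [
        Clip("a", "Hindi", "s1", 3600.0, transcript="placeholder text"),
        Clip("b", "Gondi", "s2", 1800.0),
        Clip("c", "Santali", "s3", 1800.0),
    ]
    print(hours_per_task(sample, ["asr", "language_id"]))
```

With the made-up sample, one hour is usable for ASR and two hours for language identification, which mirrors the gap between the transcribed and total hours at a much smaller scale.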

Given the inclusion of many low-resource languages and dialects for which little or no prior speech data exists, VAANI holds particular value for researchers focused on building inclusive language technologies. It provides a foundation for developing ASR and natural language processing tools that could serve communities whose languages have historically been absent from mainstream AI development.

Significance for Indian Language Technology

The scale and linguistic diversity of VAANI make it a notable contribution to the field of multilingual speech research, particularly within the context of South Asian languages. By documenting spoken varieties from tribal, rural, and regional communities alongside more widely spoken languages, the dataset captures a dimension of India's linguistic heritage that is rarely represented in computational resources. Its open license facilitates use by academic institutions, government bodies, and industry researchers working toward more inclusive and representative speech technologies.
