VAANI is a large multilingual speech dataset created by the Indian Institute of Science (IISc), Bangalore. It comprises about 21,500 hours of recordings collected from roughly 110,000 speakers in 120 districts across 22 Indian states. It covers 86 languages and language varieties spoken across India, from major languages to many regional and tribal languages, and includes 835 hours of transcribed speech.
The dataset is designed to support a wide range of speech and language tasks, such as automatic speech recognition, text-to-speech synthesis, speaker identification, language identification, speech enhancement, and the development of multimodal language models. It is released openly under the CC BY 4.0 license, and its primary purpose is the training and evaluation of AI systems, especially those that work on low-resource or underrepresented languages of India.
Background and Development
VAANI (a word meaning "speech" or "voice" in many Indian languages) was developed by the Indian Institute of Science (IISc) in Bangalore as an effort to address the shortage of comprehensive speech resources for Indian languages. India's linguistic landscape is among the most diverse in the world, with hundreds of languages and thousands of dialects, yet most existing speech datasets concentrate on a small number of high-resource languages. VAANI was built to broaden the availability of training data for underserved languages through large-scale data collection in geographically and linguistically diverse parts of the country.
Data was collected from roughly 110,000 speakers in 120 districts across 22 Indian states, reflecting a deliberate effort to capture natural regional and dialectal speech rather than to rely on studio recordings. The dataset covers 86 languages and dialects, ranging from major scheduled languages such as Hindi, Tamil, Telugu, Bengali, Kannada, and Malayalam to regional and tribal languages such as Gondi, Santali, Kurukh, Wancho, and Tenyidie, among many others.
Dataset Scope and Key Features
VAANI contains about 21,500 hours of audio in total, which makes it one of the largest multilingual speech corpora for Indian languages. Of this, 835 hours have been transcribed, providing text annotations for a subset of the collection. The dataset also includes multimodal resources, which makes it usable beyond speech-only tasks.
The key features of the dataset are:
- Coverage of 86 languages and dialects, including many low-resource and tribal varieties
- Contributions from roughly 110,000 speakers drawn from a wide range of regions and communities
- About 21,500 hours of audio in total, including 835 hours of transcribed speech
- Field recordings spanning 120 districts in 22 Indian states
- Open release under the CC BY 4.0 license, which permits broad reuse with attribution
- Support for both unimodal and multimodal research tasks
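To put the headline figures in proportion, the short sketch below derives a few rough ratios from them. It uses only the totals stated in this article; the function name `summarize` and the dictionary keys are illustrative choices and not part of the dataset. The results are plain averages and say nothing about how hours or speakers are actually distributed across languages and districts, which is likely uneven.

```python
# Headline figures as stated in this article (approximate totals).
TOTAL_HOURS = 21_500        # total audio, in hours
TRANSCRIBED_HOURS = 835     # transcribed subset, in hours
SPEAKERS = 110_000          # approximate number of speakers
LANGUAGES = 86              # languages and dialects covered
DISTRICTS = 120             # districts where recordings were made


def summarize(total_hours, transcribed_hours, speakers, languages, districts):
    """Return back-of-the-envelope averages derived from the totals.

    These are simple ratios, not measured per-language or per-speaker values.
    """
    return {
        "transcribed_share_pct": 100 * transcribed_hours / total_hours,
        "minutes_per_speaker": 60 * total_hours / speakers,
        "hours_per_language": total_hours / languages,
        "speakers_per_district": speakers / districts,
    }


stats = summarize(TOTAL_HOURS, TRANSCRIBED_HOURS, SPEAKERS, LANGUAGES, DISTRICTS)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

Under these figures, transcribed speech is a little under 4% of the audio, and the average contribution is under 12 minutes per speaker, so most of the corpus is untranscribed audio from a very large number of short contributions.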
Supported Tasks and Applications
VAANI is designed as a resource for a broad range of speech and language tasks. Researchers and developers can use it to train and evaluate automatic speech recognition (ASR) systems, text-to-speech (TTS) synthesis, speaker identification, and language identification models. The dataset also supports speech enhancement work and the development of multimodal large language models (LLMs). Its broad language coverage makes it especially valuable for systems built to operate across India's linguistic diversity.
Because it includes many low-resource languages and dialects for which little or no speech data previously existed, VAANI is particularly valuable to researchers working on inclusive language technologies. It provides a foundation for building ASR and natural language processing tools that can serve communities whose languages have been left out of mainstream AI development.
Significance for Indian Language Technology
VAANI's scale and linguistic diversity make it a significant contribution to multilingual speech research, particularly for South Asian languages. By documenting tribal, rural, and regional speech alongside widely spoken languages, the dataset captures a part of India's linguistic heritage that is rarely represented at scale in computational resources. Its open license makes it available to academic institutions, government bodies, and industry researchers who are working to make speech technologies more inclusive and accessible.