21,500 hours CC-BY 4.0 Training All languages

Topics

voice audio nlp

VAANI is a large-scale multimodal, multilingual speech dataset developed by the Indian Institute of Science (IISc), Bangalore. It contains approximately 21,500 hours of audio collected from 110,000 speakers in 120 districts across 22 states of India. It covers 86 languages and dialects spoken in India, including the major scheduled languages as well as many regional and tribal varieties, and 835 hours of the audio are transcribed.

The dataset is intended to support a range of speech and language tasks, including automatic speech recognition, text-to-speech synthesis, speaker identification, language identification, speech enhancement, and the development of multimodal language models. Released under the CC BY 4.0 license, it is aimed primarily at training and evaluating AI systems, especially for researchers and developers working on low-resource and underrepresented Indian languages.

Background and Development

VAANI (which translates to "voice" or "speech" in many Indian languages) was created by the Indian Institute of Science (IISc) in Bangalore as part of an effort to address a major gap in speech data resources for Indian languages. India's linguistic landscape is among the most diverse in the world, encompassing hundreds of languages and thousands of dialects, yet most existing speech data resources concentrate on a handful of dominant languages. VAANI was developed to expand the availability of training data for this underserved range of languages by conducting large-scale field data collection across geographically and culturally diverse parts of the country.

The data was collected from 110,000 speakers in 120 districts across 22 Indian states, reflecting a deliberate effort to capture demographic and linguistic variation rather than relying on controlled recordings. The dataset covers 86 languages and dialects, ranging from major scheduled languages such as Hindi, Tamil, Telugu, Bengali, Kannada, and Malayalam to many regional varieties and tribal languages such as Gondi, Santali, Kurukh, Wancho, and Tenyidie, among others.

Dataset Composition and Key Features

VAANI comprises approximately 21,500 hours of speech in total, making it one of the largest multilingual speech collections focused on Indian languages. Of this, 835 hours are transcribed, providing text-aligned annotations for a subset of the collection. The dataset also includes multimodal components, positioning it for use beyond conventional speech tasks.

Key features of the dataset include:

  • Coverage of 86 languages and dialects, including many low-resource varieties and tribal languages
  • Contributions from 110,000 speakers across varied geographic and demographic backgrounds
  • 21,500 hours of speech in total, of which 835 hours are transcribed
  • Field recordings spanning 120 districts across 22 Indian states
  • Released under the CC BY 4.0 license, which permits broad use with attribution
  • Support for both unimodal and multimodal research applications
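The headline figures above imply a few derived ratios that are useful when planning experiments, such as how much of the audio is transcribed and how much speech each speaker contributes on average. The following is a minimal sketch, assuming the approximate totals quoted in this card. The `DatasetStats` class and its method names are hypothetical and are not part of any VAANI tooling, and the averages assume a uniform spread that the card does not claim.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetStats:
    """Hypothetical container for the approximate figures quoted in this card."""

    total_hours: float
    transcribed_hours: float
    speakers: int
    districts: int

    def transcribed_share(self) -> float:
        # Fraction of all audio hours that have transcripts.
        return self.transcribed_hours / self.total_hours

    def minutes_per_speaker(self) -> float:
        # Mean contribution per speaker, assuming a uniform spread.
        return self.total_hours * 60 / self.speakers

    def speakers_per_district(self) -> float:
        # Mean number of speakers per district, assuming a uniform spread.
        return self.speakers / self.districts


# Figures taken from the card; all are approximate.
vaani = DatasetStats(
    total_hours=21_500,
    transcribed_hours=835,
    speakers=110_000,
    districts=120,
)

print(f"Transcribed share:     {vaani.transcribed_share():.1%}")
print(f"Minutes per speaker:   {vaani.minutes_per_speaker():.1f}")
print(f"Speakers per district: {vaani.speakers_per_district():.0f}")
```

Under these assumptions, roughly 4% of the audio is transcribed, so supervised tasks such as ASR training would draw on a small labeled subset, while the remaining audio is more naturally suited to self-supervised or unlabeled use.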

Supported Tasks and Use Cases

VAANI is designed as a resource for a wide range of speech and language processing tasks. Researchers and developers can use it to train and evaluate systems for automatic speech recognition (ASR), text-to-speech synthesis (TTS), speaker identification, and language identification. The dataset also supports work on speech enhancement and the development of multimodal large language models (LLMs). Its breadth of language coverage makes it especially valuable for evaluating systems intended to operate across India's varied linguistic settings.

Because it includes many low-resource languages and dialects for which little or no speech data exists, VAANI is of particular value to researchers focused on building inclusive language technologies. It offers a foundation for building ASR and natural language processing tools that could serve communities whose languages have been absent from mainstream AI development.

Significance for Indian Language Technology

The scale and linguistic diversity of VAANI make it a notable contribution to multilingual speech research, particularly in the context of South Asian languages. By documenting spoken varieties from tribal, rural, and regionally diverse communities alongside widely spoken languages, the dataset captures a picture of India's linguistic landscape that is rare in computational resources. Its open license permits use by academic institutions, government bodies, and industry researchers working to advance inclusive and accurate speech technology.
