21,500 hours | CC-BY 4.0 | Training | All languages

Topics

voice, audio, nlp

VAANI is a large multilingual dataset developed by the Indian Institute of Science (IISc), Bangalore, containing approximately 21,500 hours of audio collected from about 110,000 speakers in 120 districts across 22 Indian states. It covers 86 languages and dialects spoken in India, including the major scheduled languages as well as many regional and indigenous varieties, and includes 835 hours of transcribed speech.

The dataset is designed to support a range of speech and language tasks, including automatic speech recognition, text-to-speech synthesis, speaker identification, language identification, speech enhancement, and the development of multimodal language models. Released under the CC BY 4.0 license, it is intended primarily for training and evaluating AI systems, and it is especially relevant to researchers and developers working on low-resource and underrepresented Indian languages.

Origin and Development

VAANI (meaning "voice" or "speech" in many Indian languages) was created by the Indian Institute of Science (IISc) in Bangalore as part of an effort to address a major gap in speech data for Indian languages. India's linguistic landscape is among the most diverse in the world, spanning many languages and dialects, yet most of the available speech data is concentrated in a few well-resourced languages. VAANI was developed to expand the availability of training data for this underserved linguistic landscape through large-scale data collection across geographically and linguistically diverse parts of the country.

The data was collected from approximately 110,000 speakers living in 120 districts across 22 Indian states, reflecting a deliberate effort to capture real-world geographic and linguistic variation rather than relying on controlled studio recordings. The dataset covers 86 languages and dialects, ranging from major languages such as Hindi, Tamil, Telugu, Bengali, Kannada, and Malayalam to many regional varieties and indigenous languages including Gondi, Santali, Kurukh, Wancho, and Tenyidie, among many others.

Dataset Composition and Key Features

VAANI comprises approximately 21,500 hours of audio in total, making it one of the largest speech datasets focused on Indian languages. Of these, 835 hours have been transcribed, providing ground-truth labels for a subset of the collection. The dataset also includes multimodal components, positioning it for use beyond conventional speech tasks.

Key features of the dataset include:

  • Coverage of 86 languages and dialects, including many low-resource varieties and indigenous languages
  • Contributions from approximately 110,000 speakers across geographically and demographically diverse regions
  • 21,500 hours of audio in total, with 835 hours of transcribed speech
  • Recordings spanning 120 districts across 22 Indian states
  • Released under the CC BY 4.0 license, permitting broad use with attribution
  • Support for both unimodal and multimodal tasks
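
To put the headline figures in proportion, the short sketch below derives a few ratios from the approximate numbers quoted above. The function and key names are illustrative only and are not part of any VAANI tooling. The per-speaker and per-district values are simple averages that assume an even spread, which the source does not claim.

```python
# Approximate figures quoted in the dataset description.
TOTAL_HOURS = 21_500
TRANSCRIBED_HOURS = 835
SPEAKERS = 110_000
DISTRICTS = 120


def summarize(total_hours, transcribed_hours, speakers, districts):
    """Derive simple ratios from the headline figures (illustrative helper)."""
    return {
        # Share of the audio that comes with transcriptions.
        "transcribed_fraction": transcribed_hours / total_hours,
        # Hours of audio without transcriptions.
        "untranscribed_hours": total_hours - transcribed_hours,
        # Average audio per speaker, in minutes (assumes an even spread).
        "minutes_per_speaker": total_hours * 60 / speakers,
        # Average number of speakers per district (assumes an even spread).
        "speakers_per_district": speakers / districts,
    }


stats = summarize(TOTAL_HOURS, TRANSCRIBED_HOURS, SPEAKERS, DISTRICTS)

if __name__ == "__main__":
    print(f"Transcribed share: {stats['transcribed_fraction']:.1%}")
    print(f"Untranscribed hours: {stats['untranscribed_hours']:,}")
    print(f"Avg. minutes per speaker: {stats['minutes_per_speaker']:.1f}")
    print(f"Avg. speakers per district: {stats['speakers_per_district']:.0f}")
```

Under these figures, just under 4% of the audio is transcribed, so work that needs labels (such as supervised ASR training) draws on a small subset, while the remainder is better suited to approaches that do not require transcripts.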

Supported Use Cases and Tasks

VAANI is designed to serve as a resource for a variety of speech and language applications. Researchers and developers can use it to train and evaluate automatic speech recognition (ASR) systems, text-to-speech (TTS) synthesis, speaker identification, and language identification models. The dataset also supports speech enhancement work and the development of multimodal large language models (LLMs). Its broad language coverage makes it particularly valuable as a benchmark for systems intended to work across India's diverse linguistic settings.

Because it includes many low-resource languages and dialects for which little or no speech data previously existed, VAANI is especially valuable to researchers focused on building inclusive language technologies. It provides a foundation for developing ASR and natural language processing technologies that can serve communities whose languages are often absent from mainstream AI systems.

Significance for Indian Language Technology

VAANI's scale and linguistic diversity make it an important contribution to the multilingual speech research community, particularly in the context of South Asian languages. By recording spoken varieties from indigenous, rural, and regional communities alongside widely spoken languages, the dataset captures a breadth of India's linguistic landscape that is not represented in existing systems. Its open license makes it readily usable by academic institutions, government initiatives, and industry researchers advancing inclusive and representative speech technologies.
