Tucano
Tucano is a family of open-weights transformer language models developed in Brazil and trained exclusively on Portuguese-language text. The models were pretrained on GigaVerbo, a dataset of roughly 200 billion deduplicated Portuguese tokens, and are available in four sizes, from 160 million to 2.4 billion parameters.
Fine-tuned variants include instruction-following and preference-optimized versions, and related multimodal releases are published under the name ViTucano. Tucano is intended for researchers and developers working on natural language processing tasks in Portuguese, a language that has been underrepresented in the development of large language models.
The project is described in a 2025 paper published in the journal Patterns and is released under the Apache 2.0 license, with weights and code publicly available on GitHub. The model series is currently archived.
Background and Development
Tucano was developed in Brazil as a deliberate effort to address the shortage of large language models trained specifically on Portuguese text. Most prominent language models are trained mainly on English-language corpora, and Portuguese, although spoken by more than 250 million people in Brazil, Portugal, and other countries, has historically received little attention in foundation model research. The Tucano project aims to close that gap by building transformer-based models from scratch on large, high-quality Portuguese data.
The models were pretrained on GigaVerbo, a corpus of roughly 200 billion deduplicated Portuguese tokens, assembled to support robust language model training at scale. The project is documented in the paper Tucano: Advancing Neural Text Generation for Portuguese, published in the journal Patterns in 2025, and all weights and training code are publicly available on GitHub under the Apache 2.0 license.
Model Variants and Fine-Tuning
The Tucano family includes four base model sizes, allowing researchers and developers to choose a model that fits their compute constraints and task requirements:
- Tucano-160m – 160 million parameters
- Tucano-630m – 630 million parameters
- Tucano-1b1 – approximately 1.1 billion parameters
- Tucano-2b4 – approximately 2.4 billion parameters
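As a rough illustration of how the choice of size relates to compute constraints, the sketch below estimates the memory needed just to hold each model's weights at a given numeric precision. The parameter counts come from the list above (two of them are approximate). The helper names `weight_memory_gib` and `largest_model_that_fits` are hypothetical and not part of the project. The estimate ignores activations, the KV cache, and framework overhead, so real memory use will be higher.

```python
# Parameter counts as stated in the list above
# (the 1b1 and 2b4 figures are approximate).
TUCANO_PARAMS = {
    "Tucano-160m": 160_000_000,
    "Tucano-630m": 630_000_000,
    "Tucano-1b1": 1_100_000_000,
    "Tucano-2b4": 2_400_000_000,
}

# Bytes used to store one parameter at common precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(n_params, dtype="float16"):
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


def largest_model_that_fits(budget_gib, dtype="float16"):
    """Name of the largest listed model whose weights fit the budget, or None."""
    fitting = [
        (n_params, name)
        for name, n_params in TUCANO_PARAMS.items()
        if weight_memory_gib(n_params, dtype) <= budget_gib
    ]
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    for name, n_params in TUCANO_PARAMS.items():
        print(f"{name}: {weight_memory_gib(n_params):.2f} GiB in float16")
```

Under these assumptions, a 4 GiB weight budget in float16 selects Tucano-1b1, since the 2.4-billion-parameter model needs about 4.5 GiB for its weights alone.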
Beyond the pretrained base models, the project produced several fine-tuned variants. Tucano-SFT and Tucano-DPO were trained with supervised fine-tuning and direct preference optimization, respectively, while Tucano-2b4-Instruct is an instruction-following version of the largest base model. These fine-tuned variants extend the base models to conversational and task-oriented applications.
Related multimodal models, ViTucano-1b5-v1 and ViTucano-2b8-v1, were also released under the ViTucano name, reflecting follow-up work that combines visual modalities with Portuguese text understanding.
Use Cases and Intended Audience
Tucano is intended primarily for researchers and developers working on natural language processing tasks in Portuguese. Potential use cases include text generation, language modeling benchmarks, fine-tuning for domain-specific Portuguese applications, and serving as a research baseline for studying how models behave in lower-resource languages. The availability of several model sizes supports a range of settings, from academic experiments on constrained hardware to resource-intensive research.
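For the text generation use case, a minimal sketch using the Hugging Face `transformers` library might look like the following. The model identifier `TucanoBR/Tucano-160m` is an assumption about where the weights are hosted and is not confirmed by this article; substitute the path or identifier given in the project's own documentation. The function name `generate_portuguese` is likewise hypothetical.

```python
def generate_portuguese(prompt, model_id="TucanoBR/Tucano-160m", max_new_tokens=50):
    """Generate a continuation of a Portuguese prompt with a causal language model.

    model_id is an assumed Hugging Face Hub identifier; replace it with the
    location documented by the project.
    """
    # Imported here so that defining the function does not require the library.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Tokenize the prompt into PyTorch tensors and decode greedily.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_portuguese("A floresta amazônica é")` downloads the weights on first use, so it needs network access plus the `transformers` and `torch` packages. The smallest model is used as the default here only because it is the cheapest to try on constrained hardware.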
Because the models are released with open weights under the permissive Apache 2.0 license, they can be freely used, modified, and redistributed. This makes them accessible to a broad community, including those without access to proprietary model APIs.
Current Status
The Tucano model series is currently archived, meaning that active development has ended. The weights, code, and accompanying documentation remain publicly available through the project's GitHub repository for continued reference and use by the research community. The 2025 publication in Patterns provides a peer-reviewed record of the project's methodology, training data, and evaluation results, supporting reproducibility and further research.