Tucano
Tucano is a family of open-weight transformer language models developed in Brazil and trained exclusively on Portuguese text. The models were pretrained on GigaVerbo, a dataset of roughly 200 billion deduplicated Portuguese tokens, and are available in four sizes ranging from 160 million to 2.4 billion parameters.
Fine-tuned variants include instruction-following and preference-optimized models, and related multimodal models have been released under the ViTucano name. Tucano is intended for researchers and developers working on natural language processing tasks in Portuguese, a language that has historically been underrepresented in large language model development.
The work is documented in a 2025 paper published in the journal Patterns and is released under the Apache 2.0 license, with weights and code publicly available on GitHub. The model series is currently archived.
Background and Development
Tucano was created in Brazil as a dedicated effort to address the shortage of large language models trained specifically on Portuguese text. While most prominent language models are trained predominantly on English corpora, Portuguese, which is spoken by more than 250 million people in Brazil, Portugal, and other countries, has received comparatively little attention in foundation model research. The Tucano project set out to close this gap by building transformer models from scratch on a large, high-quality Portuguese dataset.
The models were pretrained on GigaVerbo, a corpus of roughly 200 billion deduplicated Portuguese tokens, assembled to support robust language modeling at scale. The work is documented in the paper Tucano: Advancing Neural Text Generation for Portuguese, published in Patterns in 2025, and all weights and training code are publicly available on GitHub under the Apache 2.0 license.
Model Sizes and Fine-Tuned Variants
The Tucano series includes four base model sizes, letting researchers and developers choose a model that fits their compute constraints and task requirements:
- Tucano-160m: 160 million parameters
- Tucano-630m: 630 million parameters
- Tucano-1b1: approximately 1.1 billion parameters
- Tucano-2b4: approximately 2.4 billion parameters
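As a rough illustration of choosing a size under a compute constraint, the sketch below estimates the memory needed for the weights alone and picks the largest listed model that fits a given budget. It is a back-of-the-envelope aid, not part of the Tucano project: the helper names are hypothetical, the default of 2 bytes per parameter assumes 16-bit weights, and the estimate ignores activations, caches, and optimizer state.

```python
from typing import Optional

# Parameter counts as listed above (the 1b1 and 2b4 figures are approximate).
TUCANO_PARAMS = {
    "Tucano-160m": 160_000_000,
    "Tucano-630m": 630_000_000,
    "Tucano-1b1": 1_100_000_000,
    "Tucano-2b4": 2_400_000_000,
}


def weight_memory_gb(n_params: int, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes).

    Assumption: 2 bytes per parameter (16-bit weights) unless overridden.
    """
    return n_params * bytes_per_param / 1e9


def largest_model_that_fits(budget_gb: float, bytes_per_param: int = 2) -> Optional[str]:
    """Return the largest listed model whose weights fit in budget_gb, or None."""
    fitting = [
        (n_params, name)
        for name, n_params in TUCANO_PARAMS.items()
        if weight_memory_gb(n_params, bytes_per_param) <= budget_gb
    ]
    # max() on (n_params, name) tuples selects the model with the most parameters.
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    for name, n_params in TUCANO_PARAMS.items():
        print(f"{name}: ~{weight_memory_gb(n_params):.2f} GB at 16-bit")
    print(largest_model_that_fits(4.0))
```

Actual memory use during inference or fine-tuning will be higher than this weights-only figure, so the result is best read as a lower bound when comparing sizes.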
Beyond the pretrained base models, the project produced several fine-tuned derivatives. Tucano-SFT and Tucano-DPO are the supervised fine-tuning and direct preference optimization variants, respectively, while Tucano-2b4-Instruct is an instruction-following version of the largest base model. These fine-tuned variants extend the base models to conversational and task-oriented applications.
Two related multimodal models, ViTucano-1b5-v1 and ViTucano-2b8-v1, were also released under the ViTucano name, pointing to work on combining visual modalities with Portuguese text understanding.
Use Cases and Intended Audience
Tucano is aimed primarily at researchers and developers working on natural language processing tasks in Portuguese. Possible uses include text generation, language modeling benchmarks, fine-tuning for domain-specific Portuguese applications, and serving as a research baseline for studying model behavior in lower-resource language settings. The range of model sizes supports a variety of deployments, from academic experiments on limited hardware to more resource-intensive research.
Because the models are released under the permissive Apache 2.0 license with open weights, they can be freely used, modified, and redistributed, which makes them accessible to a broad community, including those without access to proprietary model APIs.
Current Status
The Tucano model series is currently archived, meaning active development has concluded. The weights, code, and associated documentation remain publicly available through the project's GitHub repository for reference and continued use by the research community. The 2025 publication in Patterns provides a peer-reviewed account of the methodology, training data, and evaluation results behind the project, supporting reproducibility and further study.