Transformer Apache 2.0 Yes
Português

Input

text

Output

text

Themes

generative ai, nlp, research

Tucano is a family of open-weights transformer language models developed in Brazil and trained exclusively on Portuguese text. The models were pre-trained on GigaVerbo, a dataset of roughly 200 billion deduplicated Portuguese tokens, and are available in four sizes, ranging from 160 million to 2.4 billion parameters.

Fine-tuned versions include an instruction-following model and a preference-optimized model, along with related multimodal derivatives released under the name ViTucano. Tucano is intended for researchers and developers working on natural language processing tasks in Portuguese, a language that has historically been underrepresented in large language model development.

The project is documented in a 2025 paper published in the journal Patterns and is released under the Apache 2.0 license. The weights and code are publicly available on GitHub. The model series has since been archived.

Background and Development

Tucano was developed in Brazil as a dedicated effort to address the shortage of large-scale language models trained specifically on Portuguese text. Many prominent language models are trained mostly on English-language corpora, while Portuguese, spoken by over 250 million people in Brazil, Portugal, and other countries, has historically received less attention in foundation model research. The Tucano project aimed to close this gap by building transformer-based models from scratch on a large, high-quality Portuguese dataset.

The models were pre-trained on GigaVerbo, a corpus of around 200 billion deduplicated Portuguese tokens assembled to support robust language modeling at scale. The project is documented in the paper Tucano: Advancing Neural Text Generation for Portuguese, published in the journal Patterns in 2025, and all weights and training code are publicly available on GitHub under the Apache 2.0 license.

Model Variants and Fine-Tuned Versions

The Tucano family covers four base model sizes, so researchers and developers can choose a model that matches their computational constraints and task requirements:

  • Tucano-160m – 160 million parameters
  • Tucano-630m – 630 million parameters
  • Tucano-1b1 – around 1.1 billion parameters
  • Tucano-2b4 – around 2.4 billion parameters
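
As a rough illustration of matching a model size to a hardware budget, the sketch below estimates the memory needed just to hold each model's weights. It is a back-of-the-envelope calculation, not something taken from the Tucano paper: the parameter counts are the approximate figures listed above, the helper names are hypothetical, and the estimate ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
# Approximate parameter counts, taken from the model list above.
TUCANO_PARAMS = {
    "Tucano-160m": 160e6,
    "Tucano-630m": 630e6,
    "Tucano-1b1": 1.1e9,
    "Tucano-2b4": 2.4e9,
}

# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(n_params, dtype="float16"):
    """Memory (in GiB) needed to hold the weights alone (hypothetical helper)."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


def largest_model_that_fits(budget_gib, dtype="float16"):
    """Return the largest listed model whose weights fit the budget, or None."""
    fitting = [
        name
        for name, n_params in TUCANO_PARAMS.items()
        if weight_memory_gib(n_params, dtype) <= budget_gib
    ]
    return max(fitting, key=TUCANO_PARAMS.get) if fitting else None


if __name__ == "__main__":
    for name, n_params in TUCANO_PARAMS.items():
        print(f"{name}: ~{weight_memory_gib(n_params):.2f} GiB of weights in float16")
```

With a weights-only budget of 2 GiB in float16, for example, this estimate selects Tucano-630m, since the weights of Tucano-1b1 come to slightly over 2 GiB.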

In addition to the base pre-trained models, the project produced several fine-tuned derivatives. Tucano-SFT and Tucano-DPO are the supervised fine-tuning and direct preference optimization variants, respectively, while Tucano-2b4-Instruct is an instruction-following version of the largest base model. These fine-tuned versions extend the usefulness of the base models to conversational and task-oriented applications.

Related multimodal models, ViTucano-1b5-v1 and ViTucano-2b8-v1, were also released under the ViTucano name, reflecting follow-up work that adds visual modalities alongside Portuguese text understanding.

Use Cases and Intended Audience

Tucano primarily targets researchers and developers working on natural language processing tasks in Portuguese. Possible use cases include text generation, language modeling benchmarks, fine-tuning for domain-specific Portuguese applications, and serving as a research baseline for studying model behavior in low-resource language settings. The availability of multiple model sizes supports different deployment scenarios, from academic experiments on limited hardware to more resource-intensive applied research.

Because the models are released under the permissive Apache 2.0 license with open weights, they can be freely used, modified, and redistributed, which makes them accessible to a broad community, including people without access to proprietary model APIs.

Current Status

The Tucano model series is currently archived, meaning active development has stopped. The weights, code, and accompanying documentation remain publicly accessible through the project's GitHub repository for reference and continued use by the research community. The 2025 publication in Patterns provides a peer-reviewed record of the project's methodology, training data, and evaluation results, supporting reproducibility and further study.
