Tucano

Background and Development

Tucano was developed in Brazil as a dedicated effort to address the shortage of large-scale language models trained specifically on Portuguese text. Many prominent language models are trained mostly on English-language corpora, while Portuguese, spoken by more than 250 million people in Brazil, Portugal, and other countries, has historically received less attention in foundation model research. The Tucano project aims to close this gap by building transformer-based models from scratch using a large, high-quality Portuguese dataset.

The models were pre-trained on GigaVerbo, a corpus of roughly 200 billion deduplicated Portuguese tokens assembled to support robust language modeling at scale. The project is documented in the paper "Tucano: Advancing Neural Text Generation for Portuguese", published in the journal Patterns in 2025, and all weights and training code are publicly available on GitHub under the Apache 2.0 license.

Model Variants and Fine-Tuned Versions

The Tucano family covers four base model sizes, so researchers and developers can choose a model that matches their computational constraints and task requirements:

Tucano-160m – 160 million parameters
Tucano-630m – 630 million parameters
Tucano-1b1 – around 1.1 billion parameters
Tucano-2b4 – around 2.4 billion parameters
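
As a rough illustration of how the parameter counts listed above relate to hardware constraints, the following sketch estimates the memory needed to hold each model's weights and picks the largest variant that fits a given budget. It assumes 2 bytes per parameter (16-bit weights) and counts weights only, ignoring activations, caches, and optimizer state; the helper names `weight_memory_gb` and `largest_fitting` are hypothetical and not part of the Tucano project.

```python
# Approximate parameter counts, taken from the model list above.
TUCANO_PARAMS = {
    "Tucano-160m": 160_000_000,
    "Tucano-630m": 630_000_000,
    "Tucano-1b1": 1_100_000_000,
    "Tucano-2b4": 2_400_000_000,
}


def weight_memory_gb(n_params, bytes_per_param=2):
    """Estimate weight storage in gigabytes (1 GB = 1e9 bytes).

    Assumption: 2 bytes per parameter (16-bit weights) by default.
    Only the weights are counted, not activations or optimizer state.
    """
    return n_params * bytes_per_param / 1e9


def largest_fitting(budget_gb, bytes_per_param=2):
    """Return the name of the largest variant whose weights fit the budget.

    Returns None when even the smallest variant does not fit.
    """
    fitting = [
        (n_params, name)
        for name, n_params in TUCANO_PARAMS.items()
        if weight_memory_gb(n_params, bytes_per_param) <= budget_gb
    ]
    # max() compares the parameter count first, so the largest model wins.
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    for name, n_params in TUCANO_PARAMS.items():
        print(f"{name}: ~{weight_memory_gb(n_params):.2f} GB in 16-bit weights")
    print("Largest variant for a 3 GB budget:", largest_fitting(3.0))
```

Real memory use during inference or fine-tuning is higher than these figures, so the estimate is best read as a lower bound when comparing variants.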

In addition to the base pre-trained models, the project produced several fine-tuned derivatives. Tucano-SFT and Tucano-DPO are the supervised fine-tuning and direct preference optimization variants, respectively, while Tucano-2b4-Instruct is an instruction-following version of the largest base model. These fine-tuned versions extend the usefulness of the base models to conversational and task-oriented applications.

Related multimodal models, ViTucano-1b5-v1 and ViTucano-2b8-v1, were also released under the name ViTucano, reflecting follow-up work that adds visual modalities alongside Portuguese text understanding.

Use Cases and Intended Audience

Tucano primarily targets researchers and developers working on natural language processing tasks in Portuguese. Possible use cases include text generation, language modeling benchmarks, fine-tuning for domain-specific Portuguese applications, and serving as a research baseline for studying model behavior in low-resource language settings. The availability of multiple model sizes supports different deployment scenarios, from academic experiments on limited hardware to more resource-intensive applied research.

Because the models are released under the permissive Apache 2.0 license with open weights, they can be freely used, modified, and redistributed, which makes them accessible to a broad community, including people without access to proprietary model APIs.

Current Status

The Tucano model series is currently archived, meaning active development has stopped. The weights, code, and accompanying documentation remain publicly accessible through the project's GitHub repository for reference and continued use by the research community. The publication of the findings in Patterns in 2025 provides a peer-reviewed record of the project's methodology, training data, and evaluation results, supporting reproducibility and further study.
