Transformer CC-BY 4.0 Yes
Português

Input

text

Output

text

Base Model

RoBERTa

Topics

nlpresearch

PeLLE is an encoder-only transformer language model developed by the University of São Paulo and IBM Research, designed for Brazilian Portuguese natural language understanding tasks. Built on a RoBERTa architecture and trained on the Carolina Corpus v1.2 Ada, it supports tasks including natural language inference, hate speech detection, text classification, and masked language modeling.

Released in 2024 under the CC BY 4.0 license with open weights, PeLLE is intended for researchers and developers working on Portuguese-language NLP applications. Its training on a dedicated Brazilian Portuguese corpus makes it a targeted resource for the Brazilian research and technology community.

Background and Development

PeLLE was developed through a collaboration between the University of São Paulo and IBM Research Brazil, a joint academic and industry effort to advance natural language processing for Brazilian Portuguese. Published in 2024 and detailed in a preprint available on arXiv, the model addresses a recognized gap in the availability of high-quality, purpose-built language models for Brazilian Portuguese, a variety with distinct lexical and syntactic characteristics compared to European Portuguese.

The model is built on the RoBERTa architecture, an encoder-only transformer design that has proven effective for a range of natural language understanding tasks. Rather than adapting a multilingual model, PeLLE was trained on the Carolina Corpus v1.2 Ada, a corpus specifically compiled to reflect Brazilian Portuguese text, providing the model with domain-appropriate linguistic grounding.

Key Characteristics

As an encoder-only model, PeLLE is structured for natural language understanding rather than text generation. This design is well suited to tasks that require representing and classifying input text. The model supports the following tasks:

  • Natural language inference
  • Hate speech detection
  • Text classification
  • Masked language modeling
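The masked language modeling task listed above can be sketched as follows. This is a minimal illustration, not the authors' own usage code: it assumes the open weights can be loaded through the Hugging Face `transformers` library, which this page does not state. `MODEL_ID` is a hypothetical placeholder, and `mask_word` and `predict_masked` are helper names invented for this example. The mask token is read from the tokenizer rather than hard-coded, since `<mask>` is only the usual RoBERTa default.

```python
import re

# Hypothetical placeholder: replace with the identifier or local path
# of the PeLLE checkpoint you actually have.
MODEL_ID = "your-org/pelle-checkpoint"


def mask_word(sentence: str, word: str, mask_token: str = "<mask>") -> str:
    """Replace the first whole-word occurrence of `word` with `mask_token`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    # A function replacement keeps the mask token literal (no escape processing).
    masked, count = re.subn(pattern, lambda _m: mask_token, sentence, count=1)
    if count == 0:
        raise ValueError(f"{word!r} not found in sentence")
    return masked


def predict_masked(sentence: str, word: str, model_id: str = MODEL_ID, top_k: int = 5):
    """Return (token, score) pairs the model proposes for the masked word.

    Assumes the checkpoint is compatible with the `fill-mask` pipeline.
    """
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    masked = mask_word(sentence, word, fill.tokenizer.mask_token)
    return [(r["token_str"], r["score"]) for r in fill(masked, top_k=top_k)]
```

Calling `predict_masked("A capital do Brasil é Brasília.", "Brasília")` with a real checkpoint would return the model's top candidates for the hidden word. Comparing them with the original word is a quick sanity check of the encoder before any fine-tuning.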

PeLLE is released with open weights under the Creative Commons Attribution 4.0 (CC BY 4.0) license, allowing researchers and developers to use, adapt, and redistribute the model with attribution. This licensing approach makes it accessible to a broad range of users within the academic and technical communities.

Use Cases and Target Audience

PeLLE is intended primarily for researchers and practitioners working on Brazilian Portuguese NLP applications. Its support for hate speech detection makes it relevant for content moderation research and platform safety applications. The natural language inference capability enables tasks such as textual entailment and semantic similarity assessment, which underpin a variety of downstream applications including question answering and fact verification systems.

Text classification support broadens its applicability to sentiment analysis, topic categorization, and document organization tasks. Because the model is encoder-only and grounded in a Brazilian Portuguese corpus, it is most appropriate for understanding and classification tasks rather than open-ended text generation.
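For the classification-style tasks described above (natural language inference, hate speech detection, general text classification), a common pattern with encoder-only models is to attach a sequence-classification head and read a label from its logits. The sketch below illustrates that pattern under stated assumptions: the checkpoint is assumed to load through Hugging Face `transformers`, `MODEL_ID` is a hypothetical placeholder, the three-way label order in `NLI_LABELS` is an assumption, and `softmax`, `pick_label`, and `classify_pair` are names invented for this example.

```python
import math
from typing import Sequence, Tuple

# Hypothetical placeholder: replace with a PeLLE checkpoint that has been
# fine-tuned for the classification task of interest.
MODEL_ID = "your-org/pelle-checkpoint"

# Assumed label order; the real order depends on how the head was trained.
NLI_LABELS = ("entailment", "neutral", "contradiction")


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_label(logits: Sequence[float], labels: Sequence[str]) -> Tuple[str, float]:
    """Return the most probable label and its probability."""
    if len(logits) != len(labels):
        raise ValueError("number of logits must match number of labels")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


def classify_pair(premise: str, hypothesis: str,
                  model_id: str = MODEL_ID,
                  labels: Sequence[str] = NLI_LABELS) -> Tuple[str, float]:
    """Score a premise/hypothesis pair with a sequence-classification head.

    If `model_id` points at a base checkpoint, the head is randomly
    initialized and the output is meaningless until it has been fine-tuned.
    """
    # Imported lazily so the pure-Python helpers work without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=len(labels)
    )
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return pick_label(logits, labels)
```

The same structure should carry over to single-sentence tasks such as hate speech detection: pass one text instead of a pair and supply the label set used during fine-tuning.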

Context Within Brazilian Portuguese NLP

The development of PeLLE reflects a broader trend of building language-specific or regionally specific models to improve performance on underrepresented languages and language variants. While multilingual models provide general coverage, models trained on targeted corpora such as the Carolina Corpus v1.2 Ada can offer more precise linguistic representations for the specific variety of Portuguese spoken in Brazil. The collaboration between the University of São Paulo and IBM Research Brazil situates PeLLE within both academic research and applied technology contexts, contributing a resource to the Brazilian NLP ecosystem that is openly licensed and reproducible.
