PeLLE

Background and Development

PeLLE was developed through a collaboration between the University of São Paulo and IBM Research Brazil, a joint academic and industry effort to advance natural language processing for Brazilian Portuguese. The model was published in 2024 and is described in a preprint available on arXiv. It addresses a recognized gap in the availability of high-quality, purpose-built language models for Brazilian Portuguese, a variety whose lexical and syntactic characteristics differ from those of European Portuguese.

The model is built on the RoBERTa architecture, an encoder-only transformer design that has proven effective for a range of natural language understanding tasks. Rather than adapting a multilingual model, PeLLE was trained on the Carolina Corpus v1.2 Ada, a corpus specifically compiled to reflect Brazilian Portuguese text, providing the model with domain-appropriate linguistic grounding.

Key Characteristics

As an encoder-only model, PeLLE is structured for natural language understanding rather than text generation. This design is well suited to tasks that require representing and classifying input text. The model supports the following tasks, of which the first three typically require fine-tuning on labeled data:

Natural language inference
Hate speech detection
Text classification
Masked language modeling
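
Masked language modeling is the task a RoBERTa-style checkpoint can usually perform directly after pretraining. The following is a minimal sketch, assuming the checkpoint can be loaded with the Hugging Face transformers library. MODEL_ID is a placeholder rather than a confirmed identifier, and mask_word and predict_masked are hypothetical helpers introduced only for illustration.

```python
import re

# Placeholder: replace with the actual location of a PeLLE checkpoint.
# This value is an assumption and is not taken from the source.
MODEL_ID = "path/to/pelle-checkpoint"


def mask_word(sentence: str, word: str, mask_token: str = "<mask>") -> str:
    """Replace the first whole-word occurrence of `word` with `mask_token`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    # A function replacement avoids escape processing of the mask token.
    masked, count = re.subn(pattern, lambda _: mask_token, sentence, count=1)
    if count == 0:
        raise ValueError(f"{word!r} not found in sentence")
    return masked


def predict_masked(sentence: str, word: str, top_k: int = 5):
    """Mask `word` and return the model's top candidate fillers with scores."""
    # Imported lazily so mask_word works without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=MODEL_ID)
    # Use the tokenizer's own mask token rather than assuming "<mask>".
    masked = mask_word(sentence, word, fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(masked, top_k=top_k)]
```

Calling predict_masked("A capital do Brasil é Brasília.", "Brasília") would mask the final word and ask the model to rank replacements. The quality of the ranking depends entirely on the checkpoint that MODEL_ID points to.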

PeLLE is released with open weights under the Creative Commons Attribution 4.0 (CC BY 4.0) license, allowing researchers and developers to use, adapt, and redistribute the model with attribution. Because CC BY 4.0 also permits commercial use, the model is available to industry users as well as to the academic community.

Use Cases and Target Audience

PeLLE is intended primarily for researchers and practitioners working on Brazilian Portuguese NLP applications. Its support for hate speech detection makes it relevant for content moderation research and platform safety applications. Natural language inference, also known as recognizing textual entailment, asks whether a hypothesis follows from, contradicts, or is neutral with respect to a premise. This capability is related to semantic similarity assessment and underpins downstream applications such as question answering and fact verification systems.

Text classification support broadens its applicability to sentiment analysis, topic categorization, and document organization tasks. Because the model is encoder-only and grounded in a Brazilian Portuguese corpus, it is most appropriate for understanding and classification tasks rather than open-ended text generation.
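
The classification tasks described above share one pattern: the encoder produces a representation of the input, a classification head turns it into one logit per label, and a softmax converts the logits into probabilities. The sketch below illustrates that pattern for a premise/hypothesis pair, assuming a fine-tuned checkpoint loadable with the Hugging Face transformers library and PyTorch. FINETUNED_ID is a placeholder, and softmax, best_label, and classify_pair are hypothetical helpers, not part of any released PeLLE code.

```python
import math

# Placeholder: a PeLLE checkpoint fine-tuned for sequence classification.
# This value is an assumption and is not taken from the source.
FINETUNED_ID = "path/to/pelle-finetuned-classifier"


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_label(logits, id2label):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return id2label[idx], probs[idx]


def classify_pair(premise: str, hypothesis: str):
    """Classify a sentence pair with a fine-tuned encoder (e.g. for NLI)."""
    # Imported lazily so the helpers above work without these libraries.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(FINETUNED_ID)
    model = AutoModelForSequenceClassification.from_pretrained(FINETUNED_ID)
    model.eval()
    # Passing two texts encodes them as a single paired input.
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    # Label names come from the checkpoint's config, not from this sketch.
    return best_label(logits, model.config.id2label)
```

Single-text tasks such as hate speech detection or sentiment analysis follow the same shape, with one input text passed to the tokenizer instead of a pair.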

Context Within Brazilian Portuguese NLP

The development of PeLLE reflects a broader trend of building language-specific or regionally specific models to improve performance on underrepresented languages and language variants. While multilingual models provide general coverage, models trained on targeted corpora such as the Carolina Corpus v1.2 Ada can offer more precise linguistic representations for the variety of Portuguese spoken in Brazil. The collaboration between the University of São Paulo and IBM Research Brazil situates PeLLE within both academic research and applied technology contexts, contributing an openly licensed and reproducible resource to the Brazilian NLP ecosystem.
