Teuken-7B
Teuken-7B is a 7-billion-parameter transformer-based language model developed by OpenGPT-X and Fraunhofer IAIS in Germany, designed to support all 24 official languages of the European Union. It was trained from scratch on pretraining data that is approximately 50% non-English, so it targets multilingual European use cases rather than following the English-centric training approach common to many large language models.
The model is available in both base and instruction-tuned variants, with versions released under Apache 2.0 for commercial use, a research license for academic purposes, and CC BY-NC 4.0 for non-commercial applications. Intended use cases include retrieval-augmented generation, document summarization, text generation, information extraction, and conversational applications, making it relevant for organizations and developers operating across multilingual European contexts.
Teuken-7B: Europe's Sovereign Open-Source LLM Matures
In January 2026, Deutsche Telekom announced the SOOFI project, an initiative to build a 100-billion-parameter successor to Teuken-7B. Training for the new model begins this month, March 2026, on Nvidia DGX B200 systems. This is a fast progression for the OpenGPT-X consortium, which launched Teuken-7B in late 2024. Since then, the model has become a cornerstone of European digital sovereignty efforts. Major cloud providers, including IONOS, integrated the model into their AI hubs in 2025, and Deutsche Telekom launched the first commercial enterprise offerings based on it. For machine learning engineers, Teuken-7B remains a relevant, production-ready 7-billion-parameter model optimized for the 24 official languages of the European Union.
Architecture and Training Methodology
Teuken-7B is a transformer-based, decoder-only large language model. It features a 4096-token context window, 32 attention heads, and Grouped Query Attention (GQA). Researchers trained the model from scratch on the JUWELS Booster supercomputer at Forschungszentrum Jülich using Nvidia A100 GPUs.
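To make the practical effect of Grouped Query Attention concrete, the following sketch estimates the key/value cache size for one full-length sequence. Only the 4096-token context window and the 32 attention heads come from the text above. The layer count, head dimension, and number of key/value heads are assumptions chosen for illustration, not published Teuken-7B values.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence."""
    # The leading factor 2 accounts for storing both a key and a value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


SEQ_LEN = 4096      # context window stated in the text
NUM_HEADS = 32      # attention heads stated in the text
NUM_LAYERS = 32     # ASSUMPTION: typical for a 7B model, not confirmed by the source
HEAD_DIM = 128      # ASSUMPTION: not confirmed by the source
NUM_KV_HEADS = 8    # ASSUMPTION: hypothetical GQA grouping, not confirmed by the source

# Standard multi-head attention keeps one key/value head per query head.
mha_bytes = kv_cache_bytes(NUM_LAYERS, NUM_HEADS, HEAD_DIM, SEQ_LEN)
# GQA shares each key/value head across a group of query heads.
gqa_bytes = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"MHA cache: {mha_bytes / 2**20:.0f} MiB")
print(f"GQA cache: {gqa_bytes / 2**20:.0f} MiB")
print(f"Reduction factor: {mha_bytes / gqa_bytes:.1f}x")
```

Under these assumed values, the cache shrinks in proportion to the ratio of query heads to key/value heads. Substituting the real configuration values changes the absolute numbers but not that relationship.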
The training data sets Teuken-7B apart from American competitors. The base model (v0.4) was trained on 4 trillion tokens, and the updated v0.6 base model expanded this to 6 trillion tokens. Unlike models that bolt on multilingual capabilities through fine-tuning, Teuken-7B is natively multilingual. Its pre-training corpus contains approximately 50% non-English European text and 40% English text, alongside code. For comparison, the pre-training data for Meta's Llama 3.1 8B is only about 8% non-English.
To process this data efficiently, the OpenGPT-X team built a custom multilingual tokenizer. This tokenizer reduces the compute overhead for European languages. Processing German text with the Teuken tokenizer incurs only a 22% cost premium over English, a smaller penalty than standard English-centric tokenizers impose on non-English text.
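One way to reason about a tokenizer cost premium is to compare token counts on parallel text. The sketch below assumes the premium is defined as the relative increase in tokens for the same content in another language; the source does not state the exact definition behind its 22% figure. The `toy_tokenize` function is a hypothetical stand-in, not the Teuken tokenizer, so the number it produces is meaningless beyond showing the arithmetic.

```python
from typing import Callable, List, Sequence


def token_count(tokenize: Callable[[str], List[str]], texts: Sequence[str]) -> int:
    """Total number of tokens the tokenizer produces for all texts."""
    return sum(len(tokenize(text)) for text in texts)


def cost_premium(
    tokenize: Callable[[str], List[str]],
    english: Sequence[str],
    other: Sequence[str],
) -> float:
    """Relative extra tokens for `other` versus parallel `english` text.

    A return value of 0.22 would correspond to a 22% premium.
    """
    english_tokens = token_count(tokenize, english)
    if english_tokens == 0:
        raise ValueError("English reference text produced no tokens")
    return token_count(tokenize, other) / english_tokens - 1.0


def toy_tokenize(text: str) -> List[str]:
    """HYPOTHETICAL tokenizer: splits each word into chunks of up to 4 characters."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return pieces


# Tiny parallel sample, for illustration only.
premium = cost_premium(toy_tokenize, ["the cat sat"], ["die Katze sass"])
print(f"Toy premium: {premium:.1%}")
```

To measure a real tokenizer, a function that wraps its encode step could be passed in place of `toy_tokenize`, together with a larger parallel corpus.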
Benchmark Performance and Key Metrics
Teuken-7B delivers highly stable performance across linguistic borders. The OpenGPT-X team evaluated the model using the European LLM Leaderboard, which translates standard benchmarks like ARC, HellaSwag, TruthfulQA, and MMLU into 21 EU languages.
In instruction-tuned evaluations, Teuken-7B is competitive with Mistral-7B-Instruct-v0.3 and Meta-Llama-3.1-8B-Instruct. It frequently outperforms these peers on the HellaSwag and TruthfulQA benchmarks for European languages. More importantly, its scores show a low standard deviation across languages. In practice, this means a prompt written in Bulgarian or Romanian should yield accuracy and logical consistency comparable to a prompt written in English or German.
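The cross-language stability described above can be summarized by the mean and standard deviation of per-language benchmark scores. The following is a minimal sketch of that calculation; the scores are placeholder values invented for illustration and are not results for Teuken-7B or any other model.

```python
from statistics import mean, pstdev
from typing import Dict


def cross_language_summary(scores: Dict[str, float]) -> Dict[str, float]:
    """Mean, population standard deviation, and max-min spread of per-language scores."""
    values = list(scores.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),
        "spread": max(values) - min(values),
    }


# PLACEHOLDER accuracies keyed by language code; not real benchmark results.
placeholder_scores = {"en": 0.60, "de": 0.58, "bg": 0.56, "ro": 0.58}

summary = cross_language_summary(placeholder_scores)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

A model with a high mean but a large standard deviation performs well in some languages and poorly in others, so both numbers are worth reporting together.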
The mid-2025 release of Teuken-7B v0.6 brought a 7% average performance boost over the original v0.4 release. This update significantly improved robustness and reliability across all tested languages.
Supported Tasks and Modalities
Teuken-7B is a text-to-text model designed for enterprise integration. It excels at Retrieval-Augmented Generation (RAG), document summarization, information extraction, and conversational AI.
Because the model was trained natively on European-language data and reflects European cultural contexts and administrative nuances, it is well suited to the public sector and to highly regulated industries. Deutsche Telekom currently deploys Teuken-7B to automate medical plans and process citizen welfare applications. The model integrates directly with the Gaia-X infrastructure, so that sensitive corporate and government data can be processed in compliance with strict European data protection regulations.
Licensing and Access Options
The OpenGPT-X consortium distributes Teuken-7B through Hugging Face and Ollama under a tiered licensing structure to accommodate different use cases:
- Teuken-7B-instruct-commercial: Released under the permissive Apache 2.0 license. This version is fully cleared for enterprise deployment and commercial applications.
- Teuken-7B-instruct-research: Restricted to academic and research use. It achieves slightly higher benchmark scores (1-2% better) because its instruction tuning includes datasets whose licenses prohibit commercial use.
- Teuken-7B-base: The foundational pre-trained model, available under a CC BY-NC 4.0 license for non-commercial fine-tuning.
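The licensing tiers above can be encoded as a small lookup so that a deployment script refuses a variant whose license does not fit the intended use. This is only a sketch: the variant names and license terms are copied from the list above, while the `Variant` class and `pick_variants` helper are hypothetical names introduced for illustration. It is not legal advice, and the actual license texts remain authoritative.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Variant:
    name: str
    license: str
    commercial_use: bool
    instruction_tuned: bool


# Entries mirror the list above.
VARIANTS = (
    Variant("Teuken-7B-instruct-commercial", "Apache 2.0", True, True),
    Variant("Teuken-7B-instruct-research", "research license", False, True),
    Variant("Teuken-7B-base", "CC BY-NC 4.0", False, False),
)


def pick_variants(commercial: bool, instruction_tuned: bool) -> List[str]:
    """Names of variants whose license and tuning match the intended use."""
    return [
        v.name
        for v in VARIANTS
        # A commercial project may only use variants cleared for commercial use.
        if (v.commercial_use or not commercial)
        and v.instruction_tuned == instruction_tuned
    ]


print(pick_variants(commercial=True, instruction_tuned=True))
print(pick_variants(commercial=False, instruction_tuned=False))
```

An empty result, for example when asking for a commercially usable base model, signals that none of the listed variants fits and that the licensing terms need a closer look.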
The Sovereign AI Alternative
For ML teams building applications for the European market, Teuken-7B offers a transparent, open-source alternative to black-box commercial APIs. Its multilingual training is designed to reduce the cultural and linguistic biases of English-first models. As the OpenGPT-X consortium shifts focus to the 100-billion-parameter SOOFI model this year, the 7B variant remains an efficient, cost-effective foundation for local deployment and specialized fine-tuning.