Successful training on JUWELS Booster: OpenGPT-X releases multilingual AI language model
JSC scientists closely involved in research project
The large AI language model from the OpenGPT-X research project, called ‘Teuken-7B’, has been released and is now available for download from Hugging Face. It comprises seven billion parameters and was trained on the 24 official languages of the EU – by JSC scientists, among others, on the JUWELS supercomputer.
Teuken-7B is currently one of the few AI language models that have been developed multilingually from the ground up. It contains about 50 per cent non-English pretraining data and delivers stable and reliable performance across languages. Researchers and companies can use the language model for their own artificial intelligence (AI) applications, and because it is provided as an open-source model, they can run customised versions of it in real applications. International companies with multilingual communication needs in particular stand to benefit. In addition to the JUWELS system at JSC, Teuken-7B was developed using the ZIH HPC systems at TU Dresden.
"This release is a great success," says Dr. Stefan Kesselheim, who is leading the project at JSC together with Dr. Andreas Herten. "It is the first model of its kind that we have trained on our computer." The leading supercomputer system in Jülich is still JUWELS, but the scientists are already looking forward to JUPITER: "With our first European exascale computer, we will continue to expand projects of this kind and have the opportunity to focus more on AI research topics that require particularly high computing power." JSC's OpenGPT-X project team also includes Chelsea Maria John, Dr Carolin Penke and Jan Ebert. "We are very happy that our training on the JUWELS booster has resulted in such a promising model," says Chelsea, who just a few days ago presented her corresponding paper at the Supercomputing Conference (SC24) in Atlanta. "I am particularly pleased that the model is open source and can now be further trained for individual needs. This allows for a huge range of possibilities – very exciting from a scientific point of view."
In addition to JSC at Forschungszentrum Jülich, the partners in OpenGPT-X were the two Fraunhofer Institutes IAIS and IIS, the KI Bundesverband, TU Dresden, the German Research Centre for Artificial Intelligence (DFKI), IONOS, Aleph Alpha, ControlExpert, and Westdeutscher Rundfunk (WDR).
More sustainable with the newly developed ‘tokenizer’
In addition to model training, the OpenGPT-X team also addressed numerous research questions, such as how multilingual AI language models can be trained and operated more efficiently in terms of energy and cost. To this end, project staff developed a multilingual ‘tokenizer’, whose task is to break words down into individual word components: the fewer tokens a text is split into, the more (energy-)efficiently and the faster a language model can generate its answer. The newly developed tokenizer reduced training costs compared with the tokenizers of other multilingual models such as Llama 3 or Mistral. This is particularly beneficial for European languages with long words, such as German, Finnish, or Hungarian, and it can also be used to increase the efficiency of multilingual AI applications.
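The effect of tokenizer choice can be checked directly: the minimal Python sketch below counts how many tokens different tokenizers need for the same German sentence. It assumes the tokenizers can be loaded via the Hugging Face transformers library; the Teuken repository name is an assumption based on the openGPT-X organisation page, so check the model cards for the exact identifiers.

    # Minimal sketch: compare token counts for the same German sentence
    # across tokenizers. Fewer tokens generally means faster and more
    # energy-efficient generation. Repository names are assumptions; see
    # the model cards on Hugging Face for the exact identifiers.
    from transformers import AutoTokenizer

    text = "Die Donaudampfschifffahrtsgesellschaft sucht einen Kapitän."

    for name in [
        "openGPT-X/Teuken-7B-instruct-research-v0.4",  # assumed repo name
        "mistralai/Mistral-7B-v0.1",
    ]:
        tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
        print(f"{name}: {len(tok.encode(text))} tokens")

For long German compound words like the one above, a tokenizer trained on multilingual European data will typically need noticeably fewer tokens than one trained predominantly on English text.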
Final stretch for OpenGPT-X
The development of the Teuken-7B model incorporates important research results from the OpenGPT-X project, such as tools and technologies for processing very large amounts of data, for using powerful European HPC infrastructures, and for conducting efficient model training. In future, the technology developed in OpenGPT-X will also provide all partners with the basis for training further models of their own. The research project itself, which started at the beginning of 2022, is nearing completion: it will run until 31 March 2025 so that further optimisations and evaluations of the models can be carried out.
Access to Teuken-7B – two versions available
Developers from the scientific community and companies can download Teuken-7B free of charge from Hugging Face and work with it in their own development environments. The model is already optimised for chat through ‘instruction tuning’. Teuken-7B is available in two versions: one that can be used for research purposes, and one under the ‘Apache 2.0’ licence, which companies can use not only for research but also for commercial purposes and integrate into their own AI applications.
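For orientation, a minimal usage sketch with the Hugging Face transformers library follows. The repository name is an assumption (consult the model cards linked below for the exact identifiers), and the chat formatting uses the generic transformers chat-template API rather than any Teuken-specific convention documented in the model card.

    # Minimal sketch: load the instruction-tuned model and ask a question.
    # The repository name is an assumption; check the official model card
    # for exact identifiers and any model-specific chat template.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "openGPT-X/Teuken-7B-instruct-commercial-v0.4"  # assumed name

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # halves memory use compared to float32
        device_map="auto",           # place weights on the available GPU(s)
        trust_remote_code=True,
    )

    messages = [{"role": "user", "content": "Wie hoch ist der Eiffelturm?"}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    outputs = model.generate(inputs, max_new_tokens=100)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))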
Download options and model cards can be found at the following link: https://huggingface.co/openGPT-X
See the press release from Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS: https://www.iais.fraunhofer.de/de/presse/presseinformationen/presseinformationen-2024/presseinformation-241126.html
Contact: Stefan Kesselheim (JSC), Andreas Herten (JSC)