AI Research Association
LAION e.V. is a non-profit organisation with members from all over the world that aims to make comprehensive machine learning models, datasets and associated code available to the general public.
27 March 2023
For the computationally intensive training of network models, the LAION researchers used Germany's most powerful machine for deep learning and AI: the JUWELS supercomputer at Forschungszentrum Jülich. In an interview, Dr Jenia Jitsev and Dr Mehdi Cherti explain how far the open-source development of artificial intelligence has come and what secret lies behind the advanced functionalities of the latest AI models.
With the LAION initiative, you have been pushing open-source and open science for large-scale machine learning for a few years now. What is your motivation?
Jenia Jitsev: The problem is that these new, strong functionalities only emerge at very large scales - this is what we see in current AI models. You need enormous computing resources, you need to collect enough data, and you need people who are specialised in this kind of large-scale training. For a long time, this resource-intensive approach was therefore reserved for large companies like Google, Meta, or OpenAI.
However, the development in the companies takes place behind closed doors. The results are not published in such a way that they can be reproduced by other scientists. You cannot test them and experiment with them on your own. This makes it hard to validate whether the reported findings and functions are indeed there, and also creates various safety issues, as nobody can check independently how the model was created and which data was used for training.
When the US company OpenAI presented the image generator DALL-E a few years ago, it was immediately clear to us that we wanted such a model to be freely available so the scientific community could study it properly. Via the internet, we very quickly came across fellow enthusiasts pursuing the same goal, such as Christoph Schumann, one of the other main organisers of LAION, a high school teacher in Hamburg who campaigns for AI in school education. Other members joined the association, which is based in Germany, from Paris, Bucharest, and Cologne, and later from Seattle, Montreal, and Frankfurt.
The images were generated with the open-source text-to-image generator Stable Diffusion, which was trained with the free LAION datasets.
What role did the JUWELS supercomputer at Forschungszentrum Jülich play in the LAION project?
Mehdi Cherti: We used our expertise in executing large-scale deep learning to get the LAION community experts onto the supercomputers at Jülich, which have the computing power needed for the experiments and validation of the datasets and various learning algorithms. The JUWELS Booster in particular, with its powerful graphics accelerators, NVIDIA A100 GPUs, is ideally designed for training such AI models. This involves pre-training of the models. Although this only has to be done once, it is extremely computationally intensive. Pre-training on even comparatively small scales in this context quickly requires 300,000 GPU hours or more. On a single GPU, this alone would take over 34 years to finish.
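The arithmetic behind that estimate is easy to check: 300,000 GPU hours divided across the hours in a year gives the runtime on a single GPU running around the clock.

```python
# Back-of-the-envelope check: how long do 300,000 GPU hours
# take on a single GPU running 24/7?
gpu_hours = 300_000
hours_per_year = 24 * 365  # 8,760 hours in a (non-leap) year

years_on_one_gpu = gpu_hours / hours_per_year
print(f"{years_on_one_gpu:.1f} years")  # roughly 34.2 years
```

This is also why such pre-training is spread across hundreds of GPUs on a machine like the JUWELS Booster: with, say, 256 GPUs in parallel, the same work takes on the order of weeks rather than decades.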
The datasets published within LAION can be used to train new AI models. What are these AI models capable of?
Jenia Jitsev: One impressive feature is image generation. The open-source Stable Diffusion text-to-image generator based on our LAION-5B dataset is able to produce images from text prompts. Until now, something like this was only known from closed commercial models like DALL-E 2. The creativity that these new image models enable is simply overwhelming. This is a hard scientific problem in computer science and machine learning that researchers had been trying to solve for decades.
You can, for example, freely design the appearance of buildings and their surroundings by providing natural language descriptions. These generated images are not just graphical garbage - they possess a consistent gist and layout across different scales and properly respect lighting conditions, reflections and shadows, just as natural images do. The image information is well aligned; these images clearly show a basic understanding of the world and how things are arranged in it.
The new chatbot ChatGPT can formulate scientific papers and poems, but also write computer code and recently passed its first academic exams. How useful, in contrast, are current text-to-image models? Can they do more than just generate pretty pictures?
Mehdi Cherti: Yes, absolutely, there are almost endless application opportunities, ranging from accelerating materials science and helping to create new energy-efficient battery components, to analysing and predicting the activity of our sun from space satellite mission imaging data. Another important area is medicine. Even older, simpler models already outperformed dermatologists, for example, in detecting skin cancer from medical images. A corresponding practical application has already been developed, which runs even on a smartphone. The recent development in language-vision learning opens perspectives far beyond that – for instance, various forms of medical imaging, such as X-ray, ultrasound, MRI and so on, can be used to create language-vision AI models that will support doctors and even laypeople in complex diagnostics.
Another example is generic robot navigation. In a recent work, a pre-trained language-vision model that we also use to validate our dataset was re-used to enable generic control of a robot without requiring further training on that task. The robot independently finds its way around its environment and follows a route based on free-form natural language instructions. These capabilities are also important for autonomous driving when it comes to identifying objects on and off the road or guiding vehicles to their destination.
Besides images, the models can also process sounds, for example, to compose music, identify voices, or translate audio into text. A recent work takes advantage of the fact that sound can be represented as an image-like object via so-called spectrograms, and uses a model similar to those used in text-to-image generation to compose music from a text description. You can also teach the neural networks to create 3D models, for example of buildings, from natural text and a few rough 2D sketches. This can enormously boost designers' work, allowing them to focus on creative aspects and avoid technical routine.
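The spectrogram representation mentioned above can be sketched in a few lines of plain NumPy: slide a window over the signal, take the Fourier transform of each frame, and stack the spectra into a 2D array that a vision-style model can treat like an image. The window length and hop size below are illustrative choices, not the ones used in any particular music model.

```python
import numpy as np

def spectrogram(signal, n_fft=256, hop=128):
    """Magnitude spectrogram via a simple short-time Fourier transform.

    Each column is the spectrum of one windowed frame, so the result
    is a 2D (frequency x time) array, i.e. an image-like object.
    """
    window = np.hanning(n_fft)
    frames = [
        signal[start:start + n_fft] * window
        for start in range(0, len(signal) - n_fft + 1, hop)
    ]
    # rfft keeps only the non-negative frequencies: n_fft // 2 + 1 bins
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T

# A 440 Hz tone sampled at 8 kHz for one second
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (frequency bins, time frames)
```

A pure tone shows up as a bright horizontal line in this array; a model generating music in spectrogram form essentially paints such patterns and then inverts them back into audio.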
What is the "secret" of these astonishing new capabilities? How do the new AI models differ from earlier approaches?
Jenia Jitsev: One important aspect is training on both language and vision together in a so-called self-supervised manner – previous approaches used either natural images or natural text, but not both at the same time, and they required large amounts of well-curated training data. Now, on the other hand, both language and vision data can be inconsistent and sometimes inaccurate. At the end of the day, it's a question of scale.
If you take a machine learning model and keep increasing the scale, two things happen. First, capabilities that you can already measure at smaller scales keep improving. And second, at larger scales, entirely new functionalities can suddenly appear that are not there at all at the smaller and intermediate scales. So, by increasing the model size, training time and the size of the dataset, the performance and accuracy of the model improve, and at the larger scales certain new functionalities emerge all of a sudden.
The big breakthrough a few years ago was when people realised that the more extensively the network models are pre-trained on generic data at sufficiently large scales, the more robust and efficient they become. This goes so far that completely new tasks are satisfactorily executed after very few repetitions or straight away the first time, allowing for zero-shot transfer without any new training on new examples at all.
How does this zero-shot learning work in practice?
Mehdi Cherti: Here is the big difference to traditional machine learning. In the past, if you wanted to teach a model to distinguish between 100,000 different plant species, it was necessary to collect hundreds or even thousands of correctly labeled examples for each plant, yet it is often too hard or even impossible to gather such data. Exotic plant species, of which there are only one or two pictures, cannot be learned by such a system.
The new generalist self-supervised models, on the other hand, are trained in advance for a longer period of time on large volumes of generic data, usually using a rather simple, generic task. These are also known as “foundation models”. For a foundation language-vision model like CLIP, which we have trained, such a task is to tell whether an image and a text caption are likely to belong together or not. This can be done with image-text pairs that are available in datasets like LAION-5B.
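The matching task behind CLIP can be illustrated in plain NumPy. The embeddings below are random stand-ins for what the real image and text encoders would produce (running actual CLIP requires the trained model weights); only the scoring mechanics are shown: embeddings are normalized, so the dot product between an image and a caption becomes their cosine similarity, and matching pairs score highest.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    # Project embeddings onto the unit sphere so that the dot product
    # equals cosine similarity, as in CLIP-style contrastive training.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for encoder outputs: 4 images and 4 captions, 64-dim each.
image_emb = normalize(rng.normal(size=(4, 64)))
# Make each caption embedding close to its matching image embedding.
text_emb = normalize(image_emb + 0.1 * rng.normal(size=(4, 64)))

# Similarity matrix: entry [i, j] scores image i against caption j.
sims = image_emb @ text_emb.T

# For each image, the highest-scoring caption should be its own.
best_caption = sims.argmax(axis=1)
print(best_caption)
```

During training the model is pushed to make exactly this diagonal dominate the similarity matrix; at inference time the same scoring lets it pick, for a new image, whichever text description fits best.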
After such a pre-training phase, these generic models are so robust that they become able to recognize and classify new visual entities on their own. This means that for learning 100,000 new plant species, it might be enough to show only very few examples per plant, and the model will be able to handle recognition of those. The new language-vision foundation models, trained on our open public dataset, thus learn in a very data-efficient way: they need only a few images per class ("few-shot learning"), or in the extreme case none at all, just a text description of the class ("zero-shot learning").
How far has your free approach come in the meantime, also in comparison to the commercial alternatives?
Jenia Jitsev: We are still at the very beginning in terms of working with datasets on the very large scales. In zero-shot learning tests, models trained with our free data already perform similarly well as models from closed commercial vendors like OpenAI. The demonstration that it works so well with freely available data gathered from the public Internet was also one of the main reasons why we received this prestigious award at the NeurIPS conference late last year. Before that, no one knew what happens when you build models on such a large uncurated dataset.
The LAION-5B dataset we created consists of 5.8 billion text-image pairs. The data was obtained automatically with a relatively modest amount of human effort – any research group can now repeat the same procedure. This is done with so-called crawlers like Common Crawl, which scan the entire Internet. The collected data was further processed with a publicly available pre-trained CLIP model from OpenAI to filter out image and text pairs that do not match. This was certainly one of the reasons for the good results. Recent tests showed that the approach also works without this filter, albeit with reduced performance, which might be compensated by collecting more data. We have thus shown that it is in principle possible to build well-functioning self-supervised models simply with data scraped from the Internet. Such models are no longer locked behind closed doors: the broad research community can study their strengths and weaknesses together, allowing for further collaborative development.
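The filtering step can be sketched in the same spirit: score each candidate image-text pair with a CLIP-style cosine similarity and keep only pairs above a threshold. The random embeddings and the 0.28 cutoff here are illustrative stand-ins, not LAION's actual encoders or exact threshold.

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for CLIP embeddings of crawled (image, caption) pairs:
# half of them roughly matching, half with unrelated captions.
pairs = []
for i in range(1000):
    img = rng.normal(size=64)
    if i % 2 == 0:
        txt = img + 0.5 * rng.normal(size=64)   # roughly matching caption
    else:
        txt = rng.normal(size=64)               # unrelated caption
    pairs.append((img, txt))

THRESHOLD = 0.28  # illustrative cutoff, not LAION's exact value
kept = [p for p in pairs if cosine(*p) >= THRESHOLD]
print(len(kept), "of", len(pairs), "pairs kept")
```

Since unrelated high-dimensional embeddings have near-zero cosine similarity, the threshold discards most mismatched pairs while retaining the matching ones; this is what makes cheap, fully automatic curation of billions of crawled pairs feasible.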
The image generator Stable Diffusion 2.0, which is based on the LAION-5B dataset, can be tested on various websites: