Is it safe?

AI is continuing to establish itself as a tool in science. If you want to use it, you should also be aware of the limitations of these methods: to what extent can the results of an algorithm be trusted? How much control is possible?

Omniscient-seeming algorithms that write essays on any topic, compose poems or summarize books in an instant; competent chatbots that talk like a human conversation partner; graphics programmes that, based on brief descriptions, create images appearing both photorealistic and surreal – the tools of artificial intelligence (AI) seem to have arrived at the centre of society.

AI has already been playing an important role in science for years – for example, as a diagnostic aid in medicine. Pattern recognition is a prime example: on computed tomography images of the brain, for instance, intelligent algorithms can detect critical aneurysms, which are dangerous dilations of blood vessels, or they can classify skin tumours as benign or malignant.

Bert Heinrichs

Still, algorithms can also deliver incorrect results. “They might be right in 99 out of 100 cases and wrong once. Which makes it difficult: we often don’t even understand why they are wrong in this one case,” says Prof. Bert Heinrichs, who heads the “Neuroethics and Ethics in AI” research group at the Institute of Neurosciences and Medicine (INM-7).

The experts’ lack of understanding is due to the fact that the algorithms usually play their cards close to the chest. They do not make their decisions according to predetermined rules, but learn on their own. In the case of skin cancer detection, the AI is fed countless photos of malignant melanomas. Using this training data, it searches for the features that characterize this form of cancer. After each round, the AI receives feedback as to whether it was right or wrong.

On the basis of this feedback, the AI adjusts its search patterns. In this way, the AI’s results keep getting better over the course of the training. However, the specific characteristics of an image on which the AI bases its decision cannot easily be determined. It resembles a black box. For users, therefore, the recurrent question is: how reliable are AI statements? Can I trust the algorithms?
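The feedback loop described above can be sketched in a few lines of code. This is a deliberately minimal illustration using a linear model on invented numeric features, not the deep networks actually used for melanoma detection; the data and labels are made up:

```python
import math

def train_step(weights, features, label, lr=0.1):
    """One feedback round: predict, compare with the true label, adjust."""
    score = sum(w * x for w, x in zip(weights, features))
    prob = 1.0 / (1.0 + math.exp(-score))   # predicted probability of "malignant"
    error = label - prob                    # the feedback: how wrong was the guess?
    # shift the search pattern (the weights) in the direction that reduces the error
    return [w + lr * error * x for w, x in zip(weights, features)]

# toy training data: two numeric features standing in for image characteristics,
# label 1 = malignant, 0 = benign (entirely invented)
training_data = [([1.0, 0.2], 1), ([0.1, 1.0], 0),
                 ([0.9, 0.3], 1), ([0.2, 0.8], 0)]

weights = [0.0, 0.0]
for _ in range(200):                        # repeated rounds of feedback
    for features, label in training_data:
        weights = train_step(weights, features, label)

predictions = [int(sum(w * x for w, x in zip(weights, f)) > 0)
               for f, _ in training_data]
print(predictions)   # after training, the toy examples are separated: [1, 0, 1, 0]
```

Note that the trained weights say nothing about *why* a feature matters – inspecting two numbers is easy, but in a deep network with millions of weights, this is exactly where the black box begins.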

Think before calculating

Stefan Kesselheim

“It’s often not so easy to answer these questions, particularly in research. In the case of a medical diagnosis, for example, I can have a doctor check whether a result is correct. But if the AI suggests a new material, I can’t be sure whether it will really do what it’s supposed to without testing,” says Dr. Stefan Kesselheim. He heads the Simulation and Data Lab “Applied Machine Learning” at the Jülich Supercomputing Centre (JSC), which supports research groups in implementing new AI applications. “An AI is always a complicated arithmetic formula that receives an input and produces a result. Before I get started, I need to think carefully about what I can use the result for. We also help with this assessment,” says Kesselheim.

The reliability of the delivered result depends on many factors: “Because of the black box problem, I don’t know exactly how the result came about.” The selection and type of data used to train an AI plays an important role in this. In general, according to the physicist, “The more extensive and varied the training data, the better the predictions.” Users need to be aware of these limitations in order to judge the validity of AI results. “AI is a powerful tool, but it’s not omnipotent – and it’s not really intelligent either,” cautions Kesselheim. “The methods often fail when it comes to generalizing the things they have learned to data that they have not seen during training.”

Like Kesselheim, Bert Heinrichs also recommends that AI results always be viewed with a certain degree of scepticism and the algorithm not blindly be trusted. As a philosopher, however, he struggles with the concept of trust. “Trust is a concept that originates in the interpersonal sphere: we trust other people, such as a doctor or a colleague,” he explains.

"We must learn to assess the reliability of AI – that is, to learn to deal appropriately with the fact that we can’t look into the black box."

Bert Heinrichs

In colloquial language, however, we also use the term to refer to objects, he admits. “A mountaineer would probably say that they can trust their rope. This means they assume that the rope will not break the next moment,” explains Heinrichs. He prefers the term reliability to trust. Referring to AI, this means: “We must learn to assess its reliability – that is, to learn to deal appropriately with the fact that we can’t look into the black box – and develop methods which allow us to judge whether a result is meaningful,” says the Jülich researcher.

Alexander Schug

Various criteria can be used to check whether a climbing rope is reliable: when purchasing, for example, by means of the European standard EN 892 or, in the event of wear, through visual, tactile and load tests. Comparable control criteria are required – and in some cases possible – for the results of AI as well. One example is the research of Prof. Alexander Schug, who heads the NIC research group “Computational Structural Biology”. The physicist is interested in complex protein molecules. Fulfilling a variety of tasks in the organism, they offer an interesting target for drug development. For this, it is important to understand the detailed function of a protein in the body. The key to this is hidden in its three-dimensional structure.

Closing in on the Holy Grail

AI-generated illustration: AI and the Holy Grail

Determining this structure experimentally is complex. For this reason, efforts have long been made to derive the molecular structure from the sequence of the individual building blocks. “This is the Holy Grail of structural biology,” says Alexander Schug. AI programmes can now predict the three-dimensional structure of any protein with astonishingly high quality – and in just a few minutes. Experimental determination, on the other hand, takes weeks, if not months. Another advantage of these AI methods: many provide not only a prediction, but also an assessment of the extent to which the result should be trusted. All the same, scepticism is also essential for Schug: “You must take a critical look at every calculated structure and ask yourself whether this result is plausible. And if there really are doubts, you simply need to test the whole thing experimentally,” says the Jülich researcher.
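Structure-prediction tools such as AlphaFold attach a per-residue confidence score (pLDDT, on a 0–100 scale) to each prediction. A sketch of how a user might apply Schug’s advice in practice – flagging low-confidence stretches for a closer look or experimental follow-up; the scores below are invented, and the cutoff of 70 is a commonly used rule of thumb, not a universal standard:

```python
def low_confidence_regions(plddt, cutoff=70.0):
    """Return (start, end) index ranges of consecutive residues below the cutoff."""
    regions, start = [], None
    for i, score in enumerate(plddt):
        if score < cutoff and start is None:
            start = i                         # a doubtful stretch begins
        elif score >= cutoff and start is not None:
            regions.append((start, i - 1))    # ... and ends
            start = None
    if start is not None:
        regions.append((start, len(plddt) - 1))
    return regions

# invented per-residue confidence scores for a short protein
scores = [92, 88, 95, 61, 55, 58, 90, 93, 40, 35]
print(low_confidence_regions(scores))   # [(3, 5), (8, 9)] - check these regions
```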

Timo Dickscheid

The results of AI processes cannot always be controlled by experts, however, as is the case with Prof. Timo Dickscheid’s research. He uses AI at the Institute of Neurosciences and Medicine (INM-1) to analyze huge amounts of image data from ultra-thin slices of the brain. Among other things, the AI helps to recognize neuronal cells or to reassemble two-dimensional images into three-dimensional tissue on the computer. “Such algorithms are a widespread and very helpful tool in neuroscience today. Due to the large amount of data involved, however, we have to proceed differently than purely experimentally,” says Dickscheid.

A second AI programme can be used instead, working independently of the first AI and automatically taking over quality control. It compares, for example, how many cells the first AI programme marked when evaluating the image data from tissue sections and focuses specifically on detecting “surprises” in the data stream. If, for example, unexpectedly few cells are marked in an image, the control AI raises the alarm. “It then reports, ‘The result looks somewhat strange, you need to look at it again manually’,” explains Dickscheid.
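A simple statistical stand-in for such a control step can illustrate the idea. The real control model is itself learned; here, an outlier test on the reported cell counts plays its role, and all numbers are invented:

```python
import statistics

def flag_surprises(cell_counts, threshold=3.0):
    """Flag sections whose cell count deviates strongly from the overall norm.

    Like the control AI, this never checks whether the first model is *right* -
    only whether its output looks unexpectedly different from the rest.
    """
    mean = statistics.mean(cell_counts)
    std = statistics.stdev(cell_counts)
    flagged = []
    for i, count in enumerate(cell_counts):
        z = abs(count - mean) / std if std > 0 else 0.0
        if z > threshold:
            flagged.append(i)   # "looks somewhat strange - check manually"
    return flagged

# cell counts reported by the first AI for a series of tissue sections;
# section 5 has unexpectedly few marked cells
counts = [1040, 990, 1012, 1005, 985, 120, 1001, 1023]
print(flag_surprises(counts, threshold=2.0))   # [5]
```

The design choice mirrors the article: the alarm does not replace the human, it only directs human attention to the few results worth re-examining.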

Avoiding bias

Another control option is to shed light into the black box – by designing the algorithm in such a way that the results it delivers can be explained. “This is called Explainable AI, which in simple terms gives us an insight into the way AI thinks,” says Dickscheid. In medicine, for example, an AI of this type could explain in a short text why it suggests a certain diagnosis based on an X-ray image. To do this, the analysis tool is coupled with a language model that formulates sentences in natural language and explains, for example, that an aneurysm has formed here based on the thickening of an artery. “This model must of course be adapted to the respective application, that is, it should be trained using specialist literature so that it can draw on a suitable pool of facts for its answers,” says the computer scientist.

“This is called Explainable AI, which in simple terms gives us an insight into the way AI thinks.”

Timo Dickscheid

Comprehensive training is generally the basis for reliable statements. There are pitfalls in this too, however, warns Bert Heinrichs: for some applications, there is simply not enough data available, for example when it comes to rare diseases, or it is extremely time-consuming to produce any data at all. In materials science, for example, it can take half a day to produce an image of ceramic coatings. In cases like these, it may not be possible to produce enough data in a reasonable amount of time.

But distortions can occur even if sufficient data is available. This happens when the training is biased, that is, when the training data is not balanced. “Such a bias can be quite subtle,” says Heinrichs. If, for example, a certain skin colour is disproportionately represented in the training images for skin cancer detection, this could lead to the AI learning an incorrect pattern and, thus, overlooking some tumours.
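One basic check against such a bias is simply to count how the training set is composed before training begins. A minimal sketch, with entirely hypothetical metadata for a skin-cancer training set:

```python
from collections import Counter

def representation_report(samples, attribute):
    """Report the share of each value of a (possibly sensitive) attribute."""
    counts = Counter(sample[attribute] for sample in samples)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# hypothetical metadata: nine images of light skin for every one of dark skin
training_samples = (
    [{"skin_tone": "light"}] * 9 + [{"skin_tone": "dark"}] * 1
)

shares = representation_report(training_samples, "skin_tone")
print(shares)   # {'light': 0.9, 'dark': 0.1} - a 9:1 skew the model may learn
```

A count like this only catches imbalances one thinks to measure – which is precisely Heinrichs’ caveat below: anticipating which attributes matter is itself the hard part.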

Another example is voice modules for chatbots such as ChatGPT, for smartphones or household appliances, where voice assistants respond to verbal questions and commands. One important point to keep in mind here is that language can be racist. Training data should therefore exhibit a certain degree of heterogeneity so that they do not distort the result in a particular direction.

AI-generated illustration: AI controls AI

The human factor

However, Heinrichs believes that researchers should always bear in mind that even when selecting data carefully, there is a risk of bias: “For one thing, everyone in the world has a special perspective. This means that our value assumptions are always to some extent involved in the selection of training data, for example,” he points out. “And for another: we don’t always know exactly what a heterogeneous training data set looks like. It’s therefore not easy to anticipate possible bias.”

The extent to which scientific experts can rely on an AI algorithm therefore depends on several factors. How well tested is the method? What are the controls? What initial data was available? The consequences of an incorrect result also play a major role. If the structure of a protein is incorrectly proposed in the development of a drug, this would become apparent in preclinical studies at the latest. That would also be a kind of quality control.

The same applies to weather forecasts: on the one hand, results can be checked promptly, while on the other hand, errors are considered less dramatic, as today’s weather forecasts are not always perfectly reliable either. The situation is somewhat different with climate models, the predictions of which relate to much longer time periods. “It would be highly problematic, for example, if an AI were to come to the conclusion that the rise in temperature could be stopped with a certain measure, and after implementing this measure, it turned out 20 years later that it doesn’t work at all,” Stefan Kesselheim points out. AI results should not be seen as the one single truth, but only as an indication, says the physicist.

“We need to maintain a kind of holistic common sense and keep reminding ourselves that these algorithms only see a snippet. AI can undoubtedly do certain tasks for us very efficiently, but when it comes to reliability, we must remain very vigilant and always critically scrutinize individual results,” summarizes Heinrichs. So we will probably not be in a position to do without the human understanding of contexts and relations in the future.

AI with responsibility

Whether discrimination in a job application process or disinformation on the Internet: artificial intelligence needs an ethical and legal framework. In Europe, the AI Act is intended to ensure that intelligent algorithms are regulated. As a consultant for several research projects on AI in neurosciences and medicine, Bert Heinrichs has closely followed the legislative process: “The AI Act provides for AI systems to be categorized according to certain risk classes. The higher the risk, the stricter the manufacturers’ duties of documenting and testing become. Applications in medicine, for example, always belong in this high-risk category.” The philosopher’s expertise is also in demand for the funding programme of the renowned European Research Council (ERC) – as a reviewer for the ethical evaluation of project proposals: “Among other things, we are also talking about aspects of possible discrimination here,” explains the researcher. “If we know, for example, that an algorithm in medicine is not working properly for a certain minority due to a bias in the training data, then it is open to question whether public money should be spent on it.”

Text: Arndt Reuning | illustrations (created with the help of artificial intelligence): SeitenPlan with Stable Diffusion and Adobe Firefly | images: Forschungszentrum Jülich/Ralf-Uwe Limbach; Mareen Fischinger


Prof. Dr. Timo Dickscheid

Working Group Leader "Big Data Analytics"

  • Institute of Neurosciences and Medicine (INM)
  • Structural and Functional Organisation of the Brain (INM-1)
Building 15.9 /
Room 4009
+49 2461/61-1763

Prof. Dr. Bert Heinrichs


  • Institute of Neurosciences and Medicine (INM)
  • Brain and Behaviour (INM-7)
Building 14.6 /
Room 301/302
+49 2461/61-96431

Dr. Stefan Kesselheim

Head of SDL Applied Machine Learning & AI Consultant team

  • Institute for Advanced Simulation (IAS)
  • Jülich Supercomputing Centre (JSC)
Building 14.14 /
Room 3023
+49 2461/61-85927

Prof. Dr. Alexander Schug

Head of the NIC research group Computational Structural Biology

  • Institute for Advanced Simulation (IAS)
  • Jülich Supercomputing Centre (JSC)
Building 16.3 /
Room 228
+49 2461/61-9095
Issue 2-2023
Last Modified: 29.02.2024