Modhurita Mitra

The deRSE25 conference, held in Karlsruhe, Germany, from 25 to 27 February 2025, was my first deRSE conference, and in fact my first Research Software Engineering (RSE) conference ever. I am a Research Software Engineer at Utrecht University in the Netherlands, and the German RSE conference is the one closest to me. Thanks to a JuRSE travel grant from Forschungszentrum Jülich, I was able to attend the conference.

I use generative AI for research data processing, and I gave a talk on this subject in the session "ML-assisted and more general data workflows". Later in the day, during the coffee break and the poster session, I had thought-provoking conversations on this topic with several people who had attended the talk. I hope I was able to demonstrate the broad utility of generative AI as a general-purpose tool in research data processing. It is useful not just for well-known use cases such as coding and building chatbots, but also for a wide range of natural language processing tasks, such as information extraction, natural language understanding, text classification and sentiment analysis.

There were many interesting talks, but since the sessions ran in parallel, I couldn't attend all the ones I was interested in. However, the conference organizers gave speakers the option of uploading their slides to the conference webpage, so I could look up the presentations for the talks I couldn't attend.

I was interested in generative AI-focused talks, so I tried to attend most of the talks on this topic. At times none of the parallel sessions particularly appealed to me, but even the sessions I chose at random turned out to be surprisingly useful. For example, I attended a session about "Metadata in Research Software" and learnt about a user-friendly GUI tool, created by one of the speakers, that lets one create a JSON object from a JSON schema. I have already used it myself, and I plan to introduce it to the less technical domain researchers I work with. Domain researchers often find JSON schema challenging, and this tool enables them to create complex, nested data in JSON format through an intuitive GUI with dropdown lists and fill-in-the-fields options.
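To make the relationship between a JSON schema and a JSON object concrete, here is a minimal sketch in Python. It does not show the tool from the talk. The schema, its field names and the small `validate` helper are all made up for illustration, and the helper covers only a tiny subset of JSON Schema (`type`, `enum`, `required`, `properties`, `items`). An `enum` is the kind of constraint a GUI could present as a dropdown list, and `properties` correspond to the fields one fills in.

```python
import json

# Hypothetical schema for software metadata; the field names are illustrative
# assumptions, not taken from the tool or the talk.
SCHEMA = {
    "type": "object",
    "required": ["name", "license", "authors"],
    "properties": {
        "name": {"type": "string"},
        # An enum is a natural fit for a dropdown list in a GUI.
        "license": {"type": "string", "enum": ["MIT", "Apache-2.0", "GPL-3.0"]},
        "authors": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["family_name"],
                "properties": {
                    "family_name": {"type": "string"},
                    "orcid": {"type": "string"},
                },
            },
        },
    },
}

# Mapping from the JSON Schema type names used above to Python types.
TYPES = {"object": dict, "array": list, "string": str}


def validate(instance, schema, path="$"):
    """Return a list of error messages (empty if the instance conforms).

    Handles only type, enum, required, properties and items.
    """
    py_type = TYPES.get(schema.get("type"))
    if py_type and not isinstance(instance, py_type):
        return [f"{path}: expected {schema['type']}"]
    errors = []
    if "enum" in schema and instance not in schema["enum"]:
        errors.append(f"{path}: {instance!r} is not one of {schema['enum']}")
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"{path}: missing required field '{key}'")
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(validate(instance[key], subschema, f"{path}.{key}"))
    if isinstance(instance, list) and "items" in schema:
        for i, item in enumerate(instance):
            errors.extend(validate(item, schema["items"], f"{path}[{i}]"))
    return errors


# A nested JSON object of the kind one would otherwise write by hand.
record = {
    "name": "example-tool",
    "license": "MIT",
    "authors": [{"family_name": "Example", "orcid": "0000-0000-0000-0000"}],
}

print(json.dumps(record, indent=2))
print(validate(record, SCHEMA))
```

Writing such nested objects by hand is where mistakes creep in, for instance a licence string outside the allowed list or a missing required field. A form generated from the schema rules these out at entry time. For real projects, a full validator library would be a better choice than this hand-rolled helper.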

I also used this opportunity to attend a workshop at the co-located SE25 conference: the Workshop on Generative and Neurosymbolic AI in Software Engineering (GenSE). This gave me insight into the ways generative AI is being used in coding, the successes achieved and the challenges encountered.

I did a fair amount of networking, and found myself giving advice on career paths and progression, especially to younger women. I learnt about the struggles of RSEs, and about the ambiguity and versatility of this position within the academic and research landscape. I discovered that few of the conference attendees actually held the title of Research Software Engineer, but most self-identified as one.

To sum up, I enjoyed attending the deRSE25 conference in Karlsruhe and found it useful technically, for networking, and for learning about areas of RSE that I don't usually work on or think about. In that sense it broadened my perspective and outlook as an RSE. I would definitely recommend this conference to other RSEs, and I would like to attend future deRSE conferences myself if I have the opportunity.

Accepted Submission

Generative AI for Research Data Processing: Lessons Learnt From Three Use Cases

Generative AI has attracted enormous interest since ChatGPT was launched in 2022. However, adoption of this new technology in research has been limited due to concerns about the accuracy and consistency of its outputs. In an exploratory study on the application of generative AI in research data processing, we identified tasks for which rule-based or traditional machine learning approaches were difficult to apply, and then performed these tasks using generative AI.

We demonstrate the feasibility of using the generative AI model Claude 3 Opus in three research engineering projects involving complex data processing tasks:

1) Information extraction: Extraction of plant species names from historical seedlists (catalogues of seeds) published by botanical gardens.

2) Natural language understanding: Extraction of certain data points (name of drug, name of health indication, relative effectiveness, cost-effectiveness, etc.) from documents published by different Health Technology Assessment organisations in the EU.

3) Text classification: Assignment of industry codes to projects on the crowdfunding website Kickstarter.

We present the lessons learnt from this study: (1) how to assess whether generative AI is a suitable tool for a particular use case, and (2) strategies for enhancing the accuracy and consistency of the outputs produced by generative AI.

Last Modified: 14.03.2025