Data science and research data management
About
Data Science
Data Science is a multidisciplinary field that combines mathematics, statistics, computer science, and domain expertise. It has become a cornerstone of scientific research: analyzing large and complex data sets reveals new patterns, trends, and insights, leading to new findings and advancements.
Research in the atmospheric and climate sciences generates vast amounts of data, from global remote sensing products to long-term, highly resolved climate simulations. Data Science techniques, including machine learning algorithms, data mining, predictive modeling, statistical analysis, and data visualization, help researchers make sense of these data and uncover hidden patterns and processes.
In addition to conventional data science tools such as ensemble-based statistics and machine learning, we use approaches such as explainable machine learning, storylines, and causal networks and pathways.
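To illustrate one of the explainable machine learning techniques referred to above, the sketch below computes permutation feature importance for a simple regression model. The data, predictor names, and model choice are purely illustrative assumptions, not an actual ICE-4 analysis.

```python
# Minimal sketch: permutation feature importance as a simple form of
# explainable machine learning (synthetic data; predictor names are hypothetical).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# Hypothetical predictors: temperature, humidity, wind speed
X = rng.normal(size=(n, 3))
# Synthetic target: depends mainly on the first two predictors plus noise
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Permutation importance: how much the test score drops when one predictor
# is shuffled, i.e. when its information is destroyed.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, imp in zip(["temperature", "humidity", "wind_speed"], result.importances_mean):
    print(f"{name}: {imp:.3f}")
```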
Research Data Management
Research Data Management (RDM) refers to the practices and processes for creating, organizing, preserving, and sharing research data in a manner that supports efficient and effective research, promotes reproducibility, and maximizes the potential for data reuse. In science, RDM plays a crucial role in ensuring the validity, accessibility, and usability of research data. Effective RDM practices help researchers manage atmospheric observational and climate simulation data throughout the research lifecycle, from data collection and processing to analysis, publication, and long-term preservation.
Data Organization and Metadata are essential aspects of RDM. Data that are properly organized and described with metadata according to the FAIR principles and community conventions such as the Climate and Forecast (CF) conventions are easier for atmospheric researchers to locate, access, and understand, reducing the time and effort required for data discovery and reuse.
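As a minimal sketch of what such CF-style metadata can look like in practice, the following example attaches CF attributes (standard_name, units, Conventions) to a small gridded data set using xarray. The variable, coordinate values, and file name are hypothetical.

```python
# Minimal sketch: attaching CF-style metadata to a gridded variable with xarray.
import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
    {
        "ta": xr.DataArray(
            data=np.random.rand(2, 3, 4).astype("float32"),
            dims=("time", "lat", "lon"),
            attrs={"standard_name": "air_temperature",
                   "units": "K",
                   "long_name": "near-surface air temperature"},
        )
    },
    coords={
        "time": pd.date_range("2024-01-01", periods=2, freq="6h"),
        "lat": ("lat", [-30.0, 0.0, 30.0],
                {"standard_name": "latitude", "units": "degrees_north"}),
        "lon": ("lon", [0.0, 90.0, 180.0, 270.0],
                {"standard_name": "longitude", "units": "degrees_east"}),
    },
    attrs={"Conventions": "CF-1.8",
           "title": "Hypothetical example data set with CF metadata"},
)
ds.to_netcdf("example_cf.nc")
```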
Data Sharing and Access are important for maximizing the impact and value of research data. Open data sharing, where data is made available to the public, can lead to new discoveries, collaborations, and innovations. Controlled data sharing, where access is restricted to specific individuals or groups, is necessary for protecting sensitive or confidential data.
Data Security and Privacy are critical for protecting research data and ensuring the confidentiality and privacy of research subjects. Effective RDM practices include implementing access controls, encryption, and secure data transfer protocols.
Data Preservation and Long-term Access are essential for ensuring that research data remains accessible and usable over the long term. Properly preserving research data involves using reliable storage media, creating backups, and implementing data migration strategies as technology evolves.
In recent years, international efforts have been made to advance the FAIR principles, which address the main RDM topics. FAIR data are data that meet the principles of findability, accessibility, interoperability, and reusability.
Research Topics
The above-mentioned topics are addressed within the institute using the data management software DataLad. In addition to the institute's own decentralised internal disk space and the MeteoCloud hosted at JSC, scientists have access to internal and external services based on the Gitea platform that can be used to version, edit, and distribute scientific data in a reproducible manner. The Atmospheric Data Research Information System (ATRIS) is the institute's external Gitea server and is used to disseminate data and facilitate scientific research and data-centered development in an access-controlled environment. For internal exchange between scientific groups, we provide a similar platform, the Atmospheric Research Resources And Knowledge InfraStructure (ARRAKIS).
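A minimal sketch of how such a DataLad-based workflow might look is given below, using DataLad's Python API to create, save, and publish a data set to a Gitea-hosted sibling. The paths, sibling name, and URL are hypothetical placeholders, not the actual ATRIS or ARRAKIS configuration.

```python
# Minimal sketch: versioning and publishing a data set with DataLad's Python API
# (paths, sibling name, and URL are hypothetical examples).
import datalad.api as dl

# Create a new DataLad dataset and save a first version of its contents
ds = dl.create(path="my_campaign_data")
# ... copy raw data files into my_campaign_data/ here ...
ds.save(message="Add raw measurement files from campaign X")

# Register a Gitea-hosted sibling and publish the data set to it
ds.siblings(action="add", name="gitea",
            url="https://gitea.example.org/group/my_campaign_data.git")
ds.push(to="gitea")
```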
All published data sets are registered with Jülich Data to obtain a DOI, making them referenceable in scientific contexts.
Data sets at ICE-4 derived from observations and laboratory experiments that are uploaded to external data platforms such as MOSES, IAGOS, or the HALO DB are enriched with metadata to make them as FAIR-compliant as possible. The same applies to data produced by numerical simulations, which are at least CF-compliant.
Furthermore, efforts have been started to integrate electronic laboratory notebooks (ELNs) into laboratory workflows in order to digitize the whole process of data acquisition, processing, and publication. The aim is to create interfaces between the ELNs and the Gitea server platforms, making the digital workflows available to collaborators.
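A minimal sketch of such an interface is shown below: it uploads an exported ELN record to a Gitea repository through Gitea's REST API. The server URL, repository, access token, and file names are hypothetical placeholders.

```python
# Minimal sketch: pushing an exported ELN record to a Gitea repository via the
# Gitea REST API (server, repository, token, and file names are hypothetical).
import base64
import requests

GITEA_URL = "https://gitea.example.org"   # hypothetical server
REPO = "lab-group/eln-exports"            # hypothetical repository
TOKEN = "..."                             # personal access token (placeholder)

# Read the ELN export and encode it as required by the contents endpoint
with open("experiment_run3.json", "rb") as fh:
    content_b64 = base64.b64encode(fh.read()).decode("ascii")

# Create the file in the repository in a single commit
resp = requests.post(
    f"{GITEA_URL}/api/v1/repos/{REPO}/contents/experiment_run3.json",
    headers={"Authorization": f"token {TOKEN}"},
    json={"content": content_b64,
          "message": "Add ELN export for experiment run 3"},
    timeout=30,
)
resp.raise_for_status()
print("Uploaded:", resp.json()["content"]["path"])
```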