Foundation Models for Multimodal Scientific Data

From a machine learning perspective, scientific data are inherently multimodal. Taking materials science as an example: a single material can be characterized through multiple measurements, each producing a different modality. Processing histories typically appear in text format, powder X-ray diffraction (pXRD) patterns and density of states (DOS) are expressed as curves, and transmission electron microscopy (TEM) data may come as images. This heterogeneous set of modalities requires a model capable of handling diverse data types. Furthermore, we want such a model to support multiple downstream tasks, such as anomaly detection and inverse design. These needs motivate the development of a foundation model for materials science — one that can seamlessly integrate multiple modalities and adapt to varied tasks.

To this end, we propose MatBind, a multimodal foundation model for materials pretrained with contrastive learning. MatBind adopts a hub-and-spoke architecture that anchors several modalities to crystal structures: it aligns crystal structures with DOS, pXRD patterns, and free-text descriptions, supporting queries such as “find structures matching this pXRD pattern” or “retrieve the pXRD pattern/DOS consistent with this structure.” Inspired by ImageBind, the hub-and-spoke design centralizes the structure encoder to maximize transfer between non-adjacent modality pairs and to support zero-shot cross-modal retrieval. This design enables robust retrieval even between modalities that are never directly paired during training — an emergent alignment that validates the SOL-AI principle of modular encoders aligned within a shared embedding space.
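The hub-and-spoke training objective can be sketched as follows. This is an illustrative simplification, not MatBind's actual implementation: a symmetric InfoNCE contrastive loss is computed between the structure (hub) embeddings and each spoke modality's embeddings, with matching rows in a batch treated as positives; the function names and the temperature value are assumptions for the sketch.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along an axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(anchor, other, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    Row i of `anchor` and row i of `other` describe the same material
    (a positive pair); all other rows in the batch act as negatives.
    """
    # L2-normalize so dot products are cosine similarities
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    b = other / np.linalg.norm(other, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    n = logits.shape[0]
    diag = np.arange(n)
    # cross-entropy with targets on the diagonal, in both directions
    loss_ab = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_ba = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_ab + loss_ba)

def hub_and_spoke_loss(structure_emb, spoke_embs):
    """Average contrastive loss of the hub against every spoke.

    Spokes (DOS, pXRD, text) are only ever paired with the structure
    hub, never with each other -- alignment between spokes emerges
    indirectly through the shared hub.
    """
    return sum(info_nce(structure_emb, s) for s in spoke_embs) / len(spoke_embs)
```

In this setup, reducing `hub_and_spoke_loss` pulls each modality's embedding of a material toward that material's structure embedding, which is what allows two spokes that were never trained together to end up near each other in the shared space.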

Illustration of the model architecture of MatBind.
Evaluation metrics for MatBind. The higher the value, the stronger the binding between two modalities.

Beyond retrieval, MatBind has the potential to support inverse design. As an initial exploration, we investigate whether simple vector arithmetic in MatBind’s embedding space can facilitate inverse design workflows. Inspired by analogies in word embeddings, we examine whether shifting a base material’s embedding along directions associated with compositional or microstructural changes produces semantically consistent transformations across modalities (structures, pXRD, DOS, text). Preliminary results show encouraging qualitative trends, though this capability remains exploratory; ongoing work aims to refine training strategies and evaluation methods to better support such arithmetic. These findings point toward new possibilities for inverse design, such as defining target embeddings and navigating the latent space to identify candidate structures or processing protocols.
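The vector-arithmetic workflow above can be illustrated with a minimal sketch, assuming (hypothetically) that a semantic direction has been estimated as the mean embedding difference over known pairs of materials (e.g., doped minus undoped), and that candidates are retrieved by cosine similarity to the shifted embedding. The function names and the nearest-neighbor scheme are illustrative assumptions, not MatBind's actual inverse-design pipeline.

```python
import numpy as np

def semantic_direction(pairs_after, pairs_before):
    """Estimate a direction in embedding space as the mean difference
    over paired examples, e.g. doped vs. undoped variants."""
    return (pairs_after - pairs_before).mean(axis=0)

def analogy_query(base, direction, candidates, k=3):
    """Shift a base material embedding along a semantic direction and
    return indices of the k nearest candidates by cosine similarity."""
    target = base + direction
    target = target / np.linalg.norm(target)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = c @ target
    return np.argsort(-sims)[:k]
```

Because all modalities share one embedding space, the `candidates` matrix here could hold structure embeddings while `base` comes from a pXRD pattern or a text description, which is what makes this kind of arithmetic a plausible entry point for cross-modal inverse design.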

Contact:

Last Modified: 28.01.2026