Institute of Neurosciences and Medicine (INM)

Computational Biomedicine (INM-9)

Seminar by Dr. Jens Glaser

Oak Ridge National Laboratory (USA)

Start

10th January 2025 02:00 PM

End

10th January 2025 03:00 PM

Location

Room 2009, Building 16.15

Finding the Feeble Signal in a FASTA Stack: Contrastive Learning in Drug Discovery

Contrastive learning, a self-supervised machine learning technique, offers a promising approach to uncover drug potential by modeling the correlation between molecular sequences and biological activity. A central question, however, is: do contrastive models generalize, and if so, is it because they learn mutual information encoded in the features, or simply because they memorize training data?

We analyze the mathematical underpinnings of the method’s success in terms of a minimal statistical model based on kernel canonical correlation analysis. While this model offers flexibility and ease of use due to its parameter-free nature, it shares essential similarities with deep neural networks (e.g., models based on contrastive language-image pretraining, CLIP). As a baseline model, it could also inform more sophisticated approaches based on large language models (LLMs). We thus obtain a measure of mutual information between two datasets, e.g., protein (FASTA) and ligand (SMILES) sequences. Our method does not require any biological affinity data to train, and its predictions are interpretable in terms of protein active site residues and relevant functional groups on the ligand. The method could therefore become a valuable tool for computer-aided rational drug design.
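As an illustration of the kernel canonical correlation idea, the following minimal JAX sketch estimates the leading canonical correlation between two small numeric datasets. It is not the speaker's implementation: the Gaussian kernel, the toy data, and all function names are assumptions made for illustration, and a real application would need kernels defined on FASTA and SMILES strings. The sketch also uses a commonly used regularized formulation, so it introduces a regularization strength `reg` that the parameter-free method described in the abstract may not need.

```python
import jax
import jax.numpy as jnp


def centered_rbf_gram(z, bandwidth=1.0):
    """Centered Gaussian (RBF) Gram matrix for the rows of z (hypothetical helper)."""
    sq_dists = jnp.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    gram = jnp.exp(-sq_dists / (2.0 * bandwidth**2))
    n = z.shape[0]
    centering = jnp.eye(n) - jnp.ones((n, n)) / n
    return centering @ gram @ centering


def top_kernel_canonical_correlation(x, y, reg):
    """Leading canonical correlation of regularized kernel CCA.

    The generalized eigenvalue problem
        Kx Ky b = rho (Kx + reg I)^2 a,   Ky Kx a = rho (Ky + reg I)^2 b
    is equivalent to finding the singular values of
        M = (Kx + reg I)^-1 Kx Ky (Ky + reg I)^-1.
    """
    kx, ky = centered_rbf_gram(x), centered_rbf_gram(y)
    eye = jnp.eye(x.shape[0])
    ax = jnp.linalg.solve(kx + reg * eye, kx)  # (Kx + reg I)^-1 Kx
    ay = jnp.linalg.solve(ky + reg * eye, ky)  # (Ky + reg I)^-1 Ky
    # Singular values are returned in descending order.
    return jnp.linalg.svd(ax @ ay, compute_uv=False)[0]


# Toy data (an assumption): y depends nonlinearly on x, plus a little noise.
key_x, key_noise, key_perm = jax.random.split(jax.random.PRNGKey(0), 3)
x = jax.random.normal(key_x, (200, 1))
y = jnp.sin(x) + 0.1 * jax.random.normal(key_noise, (200, 1))
# Shuffling the rows of y destroys the pairing between x and y.
y_shuffled = jax.random.permutation(key_perm, y, axis=0)

rho_paired = top_kernel_canonical_correlation(x, y, reg=20.0)
rho_shuffled = top_kernel_canonical_correlation(x, y_shuffled, reg=20.0)
print("paired:", rho_paired, "shuffled:", rho_shuffled)
```

Comparing the paired value against the shuffled one gives a simple sanity check: if the two were similar, the score would reflect overfitting to the samples rather than a relationship between the datasets.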

From a computational perspective, our implementation uses Google’s JAX library to solve the generalized eigenvalue problem and to make optimal use of GPUs through its LLVM backend, automated sharding and vectorization. Because the computation is matrix-free, it is also memory-efficient and remains efficient on 256 nodes of the Frontier supercomputer, allowing one to analyze large datasets containing 10⁵ to 10⁶ sequence pairs.
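To make the term "matrix-free" concrete, here is a small JAX sketch, under stated assumptions, of an eigenvalue computation that only ever evaluates matrix-vector products and never stores the full Gram matrix. It swaps in a simpler problem than the one in the abstract: plain power iteration for the largest eigenvalue of an ordinary symmetric Gaussian Gram matrix, not the generalized eigenvalue problem, and without any multi-GPU sharding. The function names, kernel, and data are hypothetical.

```python
import jax
import jax.numpy as jnp


def gram_matvec(points, v, bandwidth=1.0):
    """Compute K @ v for an RBF Gram matrix K one row at a time.

    Only a single row of K exists in memory at any moment, so storage is
    O(n) instead of O(n^2).
    """
    def row_dot(p):
        row = jnp.exp(-jnp.sum((points - p) ** 2, axis=-1) / (2.0 * bandwidth**2))
        return row @ v

    return jax.lax.map(row_dot, points)


@jax.jit
def top_eigenvalue(points, v0):
    """Largest eigenvalue of K via power iteration using only matvecs."""
    def step(v, _):
        w = gram_matvec(points, v)
        return w / jnp.linalg.norm(w), None

    v, _ = jax.lax.scan(step, v0 / jnp.linalg.norm(v0), None, length=200)
    # Rayleigh quotient of the converged unit vector.
    return v @ gram_matvec(points, v)


points = jax.random.normal(jax.random.PRNGKey(1), (300, 3))
matrix_free = top_eigenvalue(points, jnp.ones(points.shape[0]))

# Dense reference, feasible only because this example is tiny.
sq_dists = jnp.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
dense = jnp.linalg.eigvalsh(jnp.exp(-sq_dists / 2.0))[-1]
print("matrix-free:", matrix_free, "dense:", dense)
```

The dense reference is included only to check the result on a small example; at the dataset sizes mentioned above, the n × n matrix it builds would be the memory bottleneck that a matrix-free formulation avoids.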

Datasets from various drug discovery applications highlight the potential merits of canonical correlation analysis in identifying small-molecule binders to enzyme targets. We focus on examples of SARS-CoV-2 main protease inhibitors, for which there is a wealth of experimental results, including some of our own. Finally, we discuss how these insights can be extended to other bio- and cheminformatics tasks, such as protein functional annotation and protein-protein interface prediction, where identifying meaningful sequence relationships is crucial for understanding molecular interactions and guiding experimental design.


Last Modified: 14.06.2025