Multi-modal LLMs have transformed AI’s ability to learn from rich inputs spanning images, text, structured records, and more. Yet real-world domains such as healthcare and scientific discovery demand not only accuracy, but also scalability, robustness, and efficiency. This talk will present recent advances in scalable multi-modal learning, with a focus on hybrid discrete–continuous representations that bridge structured knowledge and high-dimensional signals. We will discuss algorithmic designs that integrate combinatorial structures into neural networks via differentiable relaxations, enabling end-to-end training across heterogeneous modalities.

On the efficiency side, we will cover parameter-efficient fine-tuning and model/data compression strategies (e.g., token merging) that adapt large multi-modal foundation models to new domains at minimal computational cost. Applications will include medical vision–language models and the optimization and acceleration of large language models, illustrating how these techniques advance both predictive performance and interpretability while enabling deployment in resource-constrained settings.

Looking ahead, these ideas point toward Physical AI, where models must move beyond perception to generate actions consistent with the dynamics of the physical world. Emerging Vision–Language–Action (VLA) models aim to unify perception, reasoning, and control for embodied agents. We will discuss the key challenge of developing generative models that incorporate physical constraints and structured representations, enabling reliable planning, manipulation, and interaction in complex real-world environments.

Bio

Duy Nguyen is a final-year PhD candidate at the Max Planck Research School for Intelligent Systems (IMPRS-IS) and the University of Stuttgart, Germany. His research focuses on multi-modal learning and hybrid discrete–continuous methods, combining optimal transport, graph-based algorithms, and efficient deep learning. His work spans applications in low-resource domains such as healthcare, AI for science, and model optimization on the edge.

Starting in Summer 2026, he will join Stanford University as a Postdoctoral Scholar, where he will further explore learning-based methods for physical intelligence and multi-agent systems. His research has been published at top-tier venues in AI, vision, and robotics, including NeurIPS, ICML, ICLR, CVPR, AAAI, ICRA, and TMLR. He obtained his Master’s degree in Computer Science from Saarland University and the Max Planck Institute for Informatics (MPI-INF). During his studies, he was a visiting researcher at the University of California, San Diego (UCSD) and the ETH AI Center at ETH Zurich. He was also selected for the AI Newcomers program by the Federal Ministry of Education and Research (BMBF), Germany, in 2023.
