Course Number
705.746
Course Format
Online - Asynchronous

This course explores the foundations, methodologies, and applications of self-supervised and multimodal representation learning, two of the most transformative paradigms in modern deep learning. We provide comprehensive coverage of how self-supervised learning techniques leverage unlabeled data through pretext tasks, contrastive objectives, positive/negative sample design, representation learning, and generative modeling, enabling state-of-the-art performance on a variety of downstream tasks and applications with minimal supervision. The course also systematically covers the classic theory and methods of multimodal learning, in which information from multiple data modalities – such as images, videos, text, audio, and sensor streams – is integrated to build powerful and generalizable machine perception systems. Students will study key principles of multimodal representations, fusion strategies, and alignment methods that enable effective cross-modal reasoning, as well as pretraining paradigms, multimodal neural architectures, and the design of large-scale foundation models.
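To give a flavor of the contrastive objectives covered in the course, below is a minimal NumPy sketch of an InfoNCE-style loss, one common contrastive formulation: each embedding is pulled toward its paired "positive" view and pushed away from the other samples in the batch, which act as negatives. The function name and the batch setup are illustrative, not drawn from any specific paper's implementation.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """Illustrative InfoNCE contrastive loss: row i of z_a is paired with
    row i of z_b (positive); all other rows of z_b serve as negatives."""
    # L2-normalize embeddings so dot products become cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature           # pairwise similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives sit on the diagonal; maximize their log-probability
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8))
# Two slightly perturbed views of the same samples vs. unrelated samples:
aligned = info_nce_loss(z, z + 0.01 * rng.standard_normal((4, 8)))
unrelated = info_nce_loss(z, rng.standard_normal((4, 8)))
print(aligned < unrelated)  # aligned pairs yield the lower loss
```

In practice such objectives are computed on learned encoder outputs and backpropagated through the encoder; the sketch only shows the loss geometry on fixed vectors.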