Deep Learning Using Transformers

Course Number

705.744

Next Offered

Fall 2025

Primary Program

Artificial Intelligence

Location

Online

Course Format

Online - Synchronous

Transformer networks are a recent trend in deep learning. Over the last decade, transformer models have come to dominate natural language processing (NLP) and have become the conventional model for almost all NLP tasks. Their development in computer vision lagged at first, but applications of transformers there have accelerated in recent years. This course introduces the attention mechanism and transformer networks, examining the pros and cons of the architecture. It covers the importance of unsupervised or semi-supervised pre-training for transformer architectures, as well as its impact on the development of large-scale foundation models. This paves the way for introducing transformers in computer vision. The course extends the attention idea into the 2D spatial domain for image datasets; investigates how convolution can be generalized using self-attention within the encoder-decoder meta-architecture; analyzes how this generic architecture is almost the same for images as for text in NLP, which makes transformers a generic function approximator; and discusses channel and spatial attention and local vs. global attention, among other topics. We will also study neural architectures designed for several fundamental tasks in computer vision, namely classification, object detection, and semantic and instance segmentation. In particular, the Vision Transformer, Pyramid Vision Transformer, shifted window transformer (Swin), Detection Transformer (DETR), Segmentation Transformer (SETR), and many others will be explored. The course also examines the application of transformers to video understanding, with a focus on action recognition and instance segmentation, and emphasizes recent developments of transformers in large-scale pre-training and multimodal learning, covering self-supervised learning, contrastive learning with masked image modeling, multimodal learning, and vision foundation models.

Course Prerequisite(s)

EN.705.643 or equivalent PyTorch experience.

Course Offerings

Waitlist Only