Course Number
705.744
Location
Online
Course Format
Synchronous Online

Transformer networks are a major recent development in deep learning. In the last decade, transformer models have come to dominate natural language processing (NLP) and are now the standard model for almost all NLP tasks. Their adoption in computer vision initially lagged, but in recent years applications of transformers to vision have accelerated. This course introduces the attention mechanism and transformer networks, examining the pros and cons of the architecture. It also discusses the importance of unsupervised or semi-supervised pre-training for transformer architectures, as well as their role in foundation models. This paves the way for introducing transformers in computer vision. The course extends the attention idea into the 2D spatial domain for image datasets; investigates how convolution can be generalized using self-attention within the encoder-decoder meta-architecture; analyzes how this generic architecture is almost the same for images as for text, which makes transformers a generic function approximator; and discusses channel and spatial attention and local vs. global attention, among other topics. Further time is dedicated to studying specific networks designed for mainstream computer vision tasks: classification, object detection, and segmentation. In particular, the Vision Transformer (ViT), shifted window transformer (Swin), Detection Transformer (DETR), segmentation transformer (SETR), and many others will be explored. The course concludes with applications of transformers in video understanding, with a focus on action recognition and instance segmentation, and emphasizes recent developments in large-scale pre-training and multimodal learning, covering self-supervised learning, contrastive learning with masked image modeling, and foundation CV models.

Course Prerequisite(s)

EN.705.643 or equivalent PyTorch experience.

Course Offerings

New
Open

Computer Vision Using Transformers

705.744.8VL
08/27/2024 - 12/10/2024
Tues 4:30 p.m. - 7:10 p.m.
Semester
Fall 2024
Course Format
Synchronous Online
Location
Online
Cost
$5,270.00