Course Number
705.771
Next Offered
Spring 2026
Location
Online
Course Format
Online - Asynchronous, Online - Synchronous

This course delves into Mechanistic Interpretability, a field that seeks causal explanations for how modern neural networks compute. By analyzing the outputs of neural networks as mathematical functions of their inputs, the discipline moves beyond attribution maps to reverse engineer the circuits, features, and algorithms learned by large language models. Students will gain hands-on experience with foundational methods in mechanistic interpretability, focusing on techniques for extracting features, detecting misalignment in LLMs, and reverse engineering neural circuits. The course provides a rigorous, practical understanding of this critical area at the forefront of artificial intelligence. Topics include classical AI interpretability techniques such as decision trees, saliency maps, feature inversion, and linear probes; foundational concepts such as features, QK and OV circuits, and induction heads; advanced techniques involving sparse autoencoders, alignment strategies, and AI safety methodologies; and seminar discussions on existential risk, exploring contemporary thinkers on AI safety and long-term societal impact.

Course Prerequisite(s)

EN.705.651 Large Language Models: Theory and Practice

Course Offerings


Introduction to Mechanistic Interpretability

705.771.8VL
01/22/2026 - 04/30/2026
Thur 7:20 p.m. - 10:00 p.m.
Semester
Spring 2026
Course Format
Online - Synchronous
Location
Online
Cost
$5,455.00
Course Materials