This course delves into mechanistic interpretability, a field that seeks causal explanations for how modern neural networks compute. By analyzing network outputs as mathematical functions of their inputs, the discipline moves beyond attribution maps to reverse engineer the circuits, features, and algorithms learned by large language models. Students will gain hands-on experience with foundational methods in mechanistic interpretability, focusing on techniques for extracting features, detecting misalignment in LLMs, and reverse engineering neural circuits. The course provides a rigorous, practical understanding of this critical area at the forefront of artificial intelligence. Topics covered include classical AI interpretability techniques such as decision trees, saliency maps, feature inversion, and linear probes; foundational concepts such as features, QK and OV circuits, and induction heads; advanced techniques including sparse autoencoders, alignment strategies, and AI safety methodologies; and seminar discussions on existential risk that explore contemporary thinkers in AI safety and long-term societal impact.
Course Prerequisite(s)
EN.705.651 Large Language Models: Theory and Practice
Course Offerings
Introduction to Mechanistic Interpretability (New; Canceled)
01/22/2026 - 04/30/2026
Thursday 7:20 p.m. - 10:00 p.m.