The field of data science is emerging to make sense of data sets that are increasingly available and growing exponentially in size. Central to this unfolding field is data mining, an interdisciplinary subject incorporating elements of statistics, machine learning, artificial intelligence, and data processing. In this course, we will explore methods for preprocessing, visualizing, and making sense of data, focusing not only on the methods themselves but also on the mathematical foundations of many of the algorithms of statistics and machine learning. We will learn about approaches to classification, including traditional methods such as Bayes Decision Theory and more modern approaches such as Support Vector Machines, as well as unsupervised learning techniques, including clustering algorithms, which apply when labels for the training data are not provided or are unknown. We will introduce and use open-source statistics and data-mining software such as R. Students will have an opportunity to see how data mining algorithms work together by reviewing case studies and applying the techniques they learn in hands-on projects.
Prerequisites: multivariate calculus; linear algebra and matrix theory (e.g., EN.625.609 Matrix Theory); and a course in probability and statistics (such as EN.625.603 Statistical Methods and Data Analysis). This course also assumes familiarity with multiple linear regression and a basic ability to program.