Data mining has become very important in corporate decision making and is increasingly important in government. With the advent of large data warehouses, organizations have access to huge quantities of potentially valuable data that they would like to mine to produce business intelligence. This course provides an advanced introduction to the theory and practice of data mining. The course emphasizes the following topics: opportunity identification, estimating the value of a data mining solution, process standards for data mining, mathematical problem formulation, complexity control and Vapnik-Chervonenkis theory, optimization algorithms, data and dimensionality reduction techniques, regression methods, and predictive classification. Techniques referenced will include classical statistical approaches, neural networks, decision trees, and local smoothing methods. These concepts will be introduced through lectures, readings, applied problem solving, and a major project. Most of the illustrative examples will come from banking, insurance, and direct marketing.
Prerequisites: Multivariate calculus, familiarity with linear algebra and matrix theory (e.g., 625.409), and a course in statistics (e.g., 625.403). This course also assumes basic familiarity with multiple linear regression and a basic ability to program in MATLAB, FORTRAN, or another programming language. Computer-based homework assignments will be given. Students are encouraged to contact the instructor for additional information.