Data analytics is a field of study that uses computational statistics, data mining and machine learning to explore data sets, explain phenomena and build models for inference and prediction. The course begins with an overview of traditional analysis approaches, including ordinary least squares regression and related topics, notably diagnostic testing, outlier detection and methods for imputing missing data. Next come nonlinear regression and regularization models, including ridge regression. Generalized linear models follow, with an emphasis on logistic regression and coverage of models for polytomous data. Variable subsetting is addressed through stepwise procedures and the LASSO. Supervised machine learning topics include the basic concepts of resampling, boosting and bagging, along with several techniques: decision trees, classification and regression trees, random forests, conditional random forests, adaptive boosting, support vector machines and neural networks. Unsupervised approaches are addressed through applications of principal component analysis, k-means clustering, partitioning around medoids and association rule mining. Methods for assessing model predictive performance are introduced, including confusion matrices, k-fold cross-validation and receiver operating characteristic curves. Environmental and public health applications are emphasized, with modeling techniques and analysis tools implemented in R.
Course Offerings
There are no sections currently offered; however, you can view a sample syllabus from a prior section of this course.