This course investigates the theory and practice of modern large-scale database systems. Large-scale approaches include distributed relational databases; data warehouses; and non-relational databases including HDFS, Hadoop, Accumulo for query and graph algorithms, and Mahout bound to Spark for machine learning algorithms. Topics discussed include data design and architecture; database security, integrity, query processing, query optimization, transaction management, concurrency control, and fault tolerance; and query formulation, graph algorithms, and machine learning algorithms on large-scale distributed data systems. At the end of the course, students will understand the principles of several common large-scale data systems including their architectures, performance, and costs. Students will also gain a sense of which approach is recommended for different requirements and circumstances.
EN.605.202 Data Structures; EN.605.641 Principles of Database Systems or equivalent. Familiarity with “big-O” concepts and notation is recommended.