Organizations today are generating massive amounts of data that are too large and too unwieldy to fit in relational databases. Therefore, organizations and enterprises are turning to massively parallel computing solutions such as Hadoop for help. The Apache Hadoop platform, with Hadoop Distributed File System (HDFS) and MapReduce (M/R) framework at its core, allows for distributed processing of large data sets across clusters of computers using the map and reduce programming model. It is designed to scale up from a single server to thousands of machines, offering local computation and storage. The Hadoop ecosystem is sizable in nature and includes many subprojects such as Hive and Pig for big data analytics, HBase for real-time access to big data, Zookeeper for distributed transaction process management, and Oozie for workflow. This course breaks down the walls of complexity of distributed processing of big data by providing a practical approach to developing applications on top of the Hadoop platform. By completing this course, students will gain an in-depth understanding of how MapReduce and Distributed File Systems work. In addition, they will be able to author Hadoop-based MapReduce applications in Java and also leverage Hadoop subprojects to build powerful data processing applications.
605.202 Data Structures; 605.681 Principles of Enterprise Web Development or equivalent Java experience.
This course may be counted toward a three-course track in Data Science and Cloud Computing.