This course provides a survey of concepts and techniques used in data engineering. With the emergence of data science as a new field of study, data engineering has gained prominence as a discipline in its own right. Designing and deploying data-intensive applications for production environments require skills and experience beyond data science. The data engineer needs to understand the reliability, scalability, and performance aspects of the data stores and pipelines required for data analysis. Experience with relational databases and with the extract, transform, and load (ETL) techniques used in data warehousing is a necessary foundation. The engineer also needs knowledge of publish-subscribe architectures as well as the large-scale computational clusters used for big data. This class will use data analysis examples to illustrate the concepts of data models, query languages, and data encoding. The concepts of replication and partitioning will be studied as they relate to transactional, batch, and streaming systems.

Course prerequisite(s): 

605.202 Data Structures and 635.601 Foundations of Information Systems Engineering. Prior experience with databases, SQL, and Python is recommended.

Course note(s): 

Students taking this class will need access to a computer with a modern operating system that is capable of running Docker in order to complete lab exercises and homework.

Course instructor(s):