This course provides students with a solid understanding of the data engineering concepts needed to implement reliable data-intensive systems. With the emergence of data science as a new field of study, data engineering has gained prominence as a discipline in its own right. Designing and deploying data-intensive applications for production environments require skills and experience beyond data science. We start with the basic building blocks: data models, query languages, storage, retrieval, encoding, and schema evolution. We then move on to distributed data, examining the unique challenges of implementing distributed data systems and some approaches for mitigating them. Throughout the course, we consider the reliability, scalability, and performance of data stores, batch processing systems, and streaming systems. To deepen their understanding of these concepts, students will implement data systems on their own personal computers using Docker. The technologies used include Jupyter Notebook, SQL engines, Apache Avro, Elasticsearch (and Kibana), Apache Spark, and Apache Kafka.
Prerequisite: EN.635.601 Foundations of Information Systems Engineering. Prior experience with databases, SQL, and Python is recommended.