SC23 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Workshops Archive

Program Your Favorite Data Science Pipeline in Spark


Workshop: EduHPC-23: Workshop on Education for High Performance Computing

Authors: H. Martin Bücker, Marieke Plesske, Johannes Schoder, and Wolf Weber (Friedrich Schiller University Jena, Germany)


Abstract: The Faculty of Mathematics and Computer Science at Friedrich Schiller University Jena, Germany, offers a course, designed for the master's degree program in "Computational and Data Science," that introduces students to distributed processing of massive datasets. The course includes a three-week programming project in which students learn to design, construct, and improve data analysis and machine learning pipelines using Hadoop, MapReduce, and Spark on the university’s central compute cluster. This short note sketches the main idea of the programming project, gives an example of a project instance, and reports on classroom experiences.




