Program Your Favorite Data Science Pipeline in Spark

SC23 Proceedings

Workshop: EduHPC-23: Workshop on Education for High Performance Computing

Authors: H. Martin Bücker, Marieke Plesske, Johannes Schoder, and Wolf Weber (Friedrich Schiller University Jena, Germany)

Abstract: The Faculty of Mathematics and Computer Science at Friedrich Schiller University Jena, Germany, offers a course, designed for the master's degree program in "Computational and Data Science," that introduces students to distributed processing of massive datasets. The course includes a three-week programming project in which students learn to design, construct, and improve data analysis and machine learning pipelines using Hadoop, MapReduce, and Spark on the university’s central compute cluster. This short note sketches the main idea of the programming project, gives an example of a project instance, and reports on classroom experiences.
