In this tutorial, we will study the basics of cluster computing using the Scala and Python programming languages
and the Spark engine. Spark has steadily become the state of the art in cluster computing and big-data processing and analytics,
thanks to the excellent support it provides for several domains: SQL processing, streaming, machine learning, and graphs.
In addition, Spark supports four programming languages: Scala, Java, Python, and R.
|
To gain as much as possible from this lecture, you are advised to install the required software components on your machine.
The provided notes contain guidelines for installing Spark and Scala/Python so that you will be able to
create Spark applications. If you do not feel comfortable with this, you should arrange access
to a machine where Spark, Scala, and Python are already installed and properly configured.
|
In this lecture, we are going to discuss general issues related to Spark and its basic architecture, the supported libraries, and
the most important topics in designing and implementing efficient applications. Knowledge of Scala/Python is not required,
but it is certainly helpful, and programming experience in any language is a plus. In any case, this lecture is about application development,
so you WILL get your hands dirty with code.
|