SIS code: 
Semester: summer
E-credits: 3
Examination: 0/2 C
Guarantor: 

Data Intensive Computing

The aim of the course is to introduce methods for processing very large data sets in a distributed environment. The technical difficulties that arise in such environments are explained in the introductory sessions, followed by a presentation of the Sun Grid Engine (now Son of Grid Engine or Oracle Grid Engine). The MapReduce paradigm is explained next. The main part of the course is devoted to the Apache Spark framework, a second-generation framework for distributed execution of MapReduce and more complex paradigms, with APIs for Python, Scala, Java and R. Unlike Hadoop MapReduce, Spark is also well suited to in-memory iterative computations such as machine learning, which we cover in the final sessions.
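
As a rough illustration of the MapReduce paradigm mentioned above, the following sketch counts words in plain Python on a single machine. It is not taken from the course materials and does not use Spark or Hadoop; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    """Run the three phases in sequence (hypothetical single-machine driver)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Sample input invented for this illustration.
counts = word_count(["big data big clusters", "data in memory"])
print(counts)
```

In a distributed framework the map and reduce phases would run in parallel on many machines and the shuffle would move data across the network. In Spark's Python API the same idea is usually expressed with the RDD operations flatMap, map and reduceByKey.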
