Job Scheduling Performance Issues and Solutions of Big Data
Applications in Apache Spark: A Review
Hasmila Amirah Omar*, Shahnorbanun Sahran, Nor Samsiah Sani and Azizi Abdullah
Faculty of Information Science and Technology, The National University of Malaysia, UKM Bangi, 43600, Malaysia
*Email: hasmilaomar@gmail.com
Abstract
Big data analytics has recently been an active area of research, driven by demand from industry. It has a significant impact on
processing large datasets to extract hidden patterns and information that support decision making. Many big data processing
frameworks have been designed for this purpose. Among the most widely adopted is Apache Spark, which has become one of
the most popular open-source processing frameworks for large-scale datasets. Its efficiency stems from its use of in-memory
computation, which enables fast processing. A key factor behind this efficient processing is the job scheduling management
system, the crucial component for managing the resources that influence the execution of any big data application.
Scheduling jobs is a challenging task, especially when the deployed cluster has machines of different hardware capacities
and incoming jobs are heterogeneous. Several important factors must be considered to further improve the performance of
this framework. Hence, this paper provides a review of Apache Spark in terms of its scheduling performance challenges and
the solutions proposed in previous work. The analysis in this paper may provide better insights and a roadmap for further
enhancement of job scheduling in Apache Spark.
Keywords: Big data analytics; Apache Spark; Job scheduling; Performance issues
1. Introduction
Big data is a term that refers to large volumes of data that grow exponentially over time (Khalil et al.,
2020). The data, which can be structured, semi-structured or unstructured, are collected by organisations
and analysed to find informative insights that lead to better decisions. However, such massive and complex
data exceed the ability of traditional computing systems to manage them and capture their hidden potential.
This has motivated big data processing frameworks such as Apache Spark, one of the most widely used today: an
open-source framework that can quickly process large-scale datasets in parallel (Zaharia et al., 2010). Spark
utilises in-memory computing and supports batch, iterative, interactive and streaming applications, allowing
complex computations to combine different computation modes in one platform. An important feature of Spark is
the RDD (Resilient Distributed Dataset). An RDD is a distributed set of read-only elements, which can only be
generated by deterministic operations.
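
To make the RDD abstraction concrete, the following minimal Scala sketch (assuming a standard local Spark installation; the object and variable names are illustrative, not from the original paper) builds an RDD from a local collection, applies deterministic transformations, and caches the result in memory:

import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // parallelize() distributes a local collection across the cluster,
    // producing a read-only RDD.
    val numbers = sc.parallelize(1 to 1000)

    // Transformations such as filter and map are deterministic and lazy:
    // they record a lineage graph rather than computing immediately,
    // which is how lost partitions can be recomputed after a failure.
    val evens = numbers.filter(_ % 2 == 0).map(_ * 2)

    // persist() keeps the computed partitions in memory, the feature
    // that makes iterative and interactive workloads fast in Spark.
    evens.persist()

    // Actions such as count and reduce trigger the actual computation.
    println(s"count = ${evens.count()}")
    println(s"sum   = ${evens.reduce(_ + _)}")

    sc.stop()
  }
}

Because transformations only extend the lineage graph, Spark can schedule the whole chain as a single job when an action is finally invoked.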
Although Apache Spark performs well compared to other frameworks, it still suffers a performance
bottleneck in job scheduling (Islam et al., 2020); a brief sketch of Spark's scheduler configuration is given at
the end of this section for orientation. It is difficult for beginners to quickly comprehend the scheduling
performance issues and the research solutions behind them in order to further improve performance. Therefore,
this paper aims to provide a concise source of information on Apache Spark, specifically on its scheduling
performance bottleneck. We highlight previous and some recent works on enhancing Apache Spark based on the
identified issues, thereby providing development directions for framework optimization. The rest of this paper
is organized as follows. Section 2 describes Apache Spark's background, including its features and job
processing flow. Section 3 reviews Spark scheduling performance issues and the solutions proposed in past
research. Finally, Section 4 presents the conclusions of this review.
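
As a brief illustration of the scheduler component reviewed in this paper, the following sketch uses two documented Spark configuration properties (spark.scheduler.mode and spark.scheduler.pool) to switch an application from the default FIFO scheduling to FAIR scheduling; the pool name and object name are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}

object FairSchedulingExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("FairSchedulingExample")
      .setMaster("local[*]")
      // The default mode is FIFO: concurrent jobs queue behind one
      // another. FAIR interleaves tasks from concurrent jobs instead.
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // Jobs submitted from this thread are assigned to a named pool;
    // per-pool weights and minimum shares can be defined in an
    // optional fairscheduler.xml allocation file.
    sc.setLocalProperty("spark.scheduler.pool", "interactive")

    val result = sc.parallelize(1 to 1000).map(_ * 2).reduce(_ + _)
    println(s"result = $result")

    sc.stop()
  }
}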