
        Job Scheduling Performance Issues and Solutions of Big Data Applications in Apache Spark: A Review


               Hasmila Amirah Omar a,*, Shahnorbanun Sahran b, Nor Samsiah Sani c and Azizi Abdullah d
              a,b,c,d Faculty of Information Science and Technology, The National University of Malaysia, UKM Bangi, 43600, Malaysia
                                          *Email: hasmilaomar@gmail.com


        Abstract
        Big data analytics has recently been an active area of research, driven by strong industrial demand. It has a significant
        impact on processing large datasets to extract hidden patterns and information that support decision making. Many big data
        processing frameworks have been designed to achieve these aims, and among the most widely adopted is Apache Spark, one
        of the most popular open-source processing frameworks for large-scale datasets. Its efficiency comes from in-memory
        computation, which enables fast processing. A key factor behind this efficient processing is the job scheduling management
        system, the crucial component for managing the resources that influence the execution of any big data application.
        Scheduling jobs is a challenging task, especially when the deployed cluster has heterogeneous hardware capacities and
        incoming jobs vary in their characteristics. Several important factors need to be considered to further improve the
        performance of this framework. Hence, this paper provides a review of Apache Spark in terms of its scheduling performance
        challenges and the solutions proposed in previous work. The analysis in this paper may provide better insights and a
        roadmap for further enhancement of job scheduling in Apache Spark.

        Keywords: Big data analytics; Apache Spark; Job scheduling; Performance issues

        1.  Introduction

           Big data is a term that refers to large volumes of data that grow exponentially with time (Khalil et al.,
        2020). The data can be structured, semi-structured or unstructured, and are collected by organisations to be
        analysed for informative insights that lead to better decisions. However, such massive and complex data
        exceed the ability of traditional computing systems to manage them and to capture their hidden potential.
        One of the top big data processing frameworks in use today is Apache Spark, an open-source framework
        that can quickly process large-scale datasets in parallel (Zaharia et al., 2010). It utilises in-memory computing
        features that support batch, iterative, interactive and streaming applications, so that complex computations
        can combine different computation modes in one platform. An important feature of Spark is the RDD
        (Resilient Distributed Dataset), a distributed collection of read-only elements that can only be generated
        through deterministic operations.
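           To make the RDD abstraction concrete, the following sketch (in Scala, Spark's native API) shows how a
        base RDD is created from a local collection and how new RDDs are derived only through deterministic
        transformations such as map and filter. The application name, master URL and data used here are
        illustrative assumptions for a local run, not details taken from the reviewed works.

            import org.apache.spark.{SparkConf, SparkContext}

            object RddSketch {
              def main(args: Array[String]): Unit = {
                // Illustrative local context; on a real cluster the master URL would differ.
                val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
                val sc   = new SparkContext(conf)

                // A base RDD built from an existing collection (it could equally come from HDFS).
                val numbers = sc.parallelize(1 to 10)

                // New RDDs are produced only by deterministic transformations on existing ones.
                val squares = numbers.map(n => n * n)
                val evens   = squares.filter(_ % 2 == 0)

                // Transformations are lazy; an action such as collect() triggers the actual job.
                println(evens.collect().mkString(", "))

                sc.stop()
              }
            }

        Because every RDD records the deterministic operations that produced it, a lost partition can be recomputed
        from its lineage rather than restored from a replica, which is what makes the abstraction fault tolerant.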
           Although Apache Spark performs relatively well compared to other frameworks, it also has a performance
        bottleneck in job scheduling (Islam et al., 2020); a brief sketch of how Spark's scheduler is configured is
        given at the end of this section. It is difficult for beginners to quickly comprehend these scheduling
        performance issues and the research solutions behind them in order to further improve performance.
        Therefore, this paper aims to provide a concise source of information on Apache Spark, specifically on its
        scheduling performance bottleneck. We highlight previous and some recent works on enhancing Apache
        Spark with respect to the identified issues, thereby providing some development directions for framework
        optimization. The rest of this paper is organized as follows. Section 2 describes Apache Spark's background,
        which includes its features and job processing flow. Section 3 reviews Spark's scheduling performance issues
        and the solutions proposed in past research. Finally, Section 4 presents the conclusion of this review.
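           As a minimal sketch of what job scheduling means in practice, the snippet below (Scala) switches the
        scheduler from the default FIFO mode to FAIR mode through the spark.scheduler.mode property and assigns
        jobs to a named fair-scheduler pool. The application settings and the pool name "highPriority" are
        hypothetical; pool weights and minimum shares would normally be defined in an external fairscheduler.xml
        allocation file.

            import org.apache.spark.{SparkConf, SparkContext}

            object SchedulerModeSketch {
              def main(args: Array[String]): Unit = {
                // spark.scheduler.mode accepts FIFO (the default) or FAIR.
                val conf = new SparkConf()
                  .setAppName("scheduler-mode-sketch")
                  .setMaster("local[*]")
                  .set("spark.scheduler.mode", "FAIR")

                val sc = new SparkContext(conf)

                // Jobs submitted from this thread are placed in the named pool
                // ("highPriority" is a hypothetical pool configured in fairscheduler.xml).
                sc.setLocalProperty("spark.scheduler.pool", "highPriority")

                // A simple job whose execution order is governed by the settings above.
                val total = sc.parallelize(1 to 1000).map(_ * 2).reduce(_ + _)
                println(total)

                sc.stop()
              }
            }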







