
Liang et al. (2018) proposed WSMC (workload-specific memory capacity), a methodology that addresses the problem that different workload characteristics may have very different memory capacity requirements. They use the data expansion ratio as the input metric for workload classification. They also established a memory-requirement prediction model for each workload and achieved a performance improvement of 40% compared to manual configuration. This work, however, focuses on managing memory space more efficiently rather than on managing executors, which is important for improving the cluster's time performance.
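To make the classification idea concrete, the following is a minimal sketch of bucketing workloads by data expansion ratio; the threshold values and the memory-estimation rule are our own illustrative assumptions, not the model from Liang et al. (2018).

```python
# Hypothetical sketch of workload classification by data expansion ratio,
# in the spirit of Liang et al. (2018); thresholds and the memory rule
# below are illustrative assumptions, not the paper's actual model.

def data_expansion_ratio(input_bytes: float, peak_in_memory_bytes: float) -> float:
    """Ratio of a workload's peak in-memory footprint to its input size."""
    return peak_in_memory_bytes / input_bytes

def classify_workload(ratio: float) -> str:
    """Bucket workloads by how aggressively they expand their input data."""
    if ratio < 1.0:
        return "compressing"   # e.g., aggregation-style jobs
    elif ratio < 3.0:
        return "moderate"      # e.g., simple ETL transforms
    else:
        return "expanding"     # e.g., iterative or join-heavy jobs

def estimate_executor_memory(input_gb: float, ratio: float, headroom: float = 1.2) -> float:
    """Predict per-job memory (GB) as expanded input size plus safety headroom."""
    return input_gb * ratio * headroom

# Example: a 10 GB input that peaks at 35 GB in memory.
ratio = data_expansion_ratio(10e9, 35e9)   # 3.5 -> "expanding"
print(classify_workload(ratio), estimate_executor_memory(10, ratio))
```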
   More recent work by Zaouk et al. (2021) used deep neural networks to develop a performance prediction model by embedding the workload characteristics. Given the diversity of workloads, they encoded collected runtime traces via representation learning. The goal of the modeling is to derive a job-specific prediction model from previous observations. They built an optimizer that automatically recommends configurations for subsequent executions to improve performance. Based on the results, they achieved a performance improvement of 52.4% on Spark streaming workloads compared to the baseline study.
   To summarize, users often underestimate the impact of workload requirements in their scheduling decisions. However, previous works show that this can cause low resource utilization. Although there are advanced methods that use deep neural networks and reinforcement learning to estimate resources based on the workload, their performance still needs improvement. For future work, in-depth analysis and research are needed to explore how a scheduler could handle dynamic workloads whose resource requirements vary significantly during execution in multi-tenant environments.

        2.3 Partition Size

   Partition size determines the level of parallelism in Spark schedulers. Scheduling in Spark allows multiple tasks to run in parallel across a cluster of machines. By default, the number of partitions is determined by the number of HDFS blocks of the input file. If there are too few partitions, the scheduler cannot utilize all the cores available in the cluster; if there are too many, there is excessive overhead in managing many small tasks. The goal is therefore to find a reasonable partition size that utilizes all available cores while avoiding this overhead. In this section, we summarize the solutions gathered from the literature.
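The following is a minimal PySpark sketch of this trade-off; the input path and the multiple-of-cores heuristic are illustrative assumptions, not Spark defaults or recommendations from the works surveyed here.

```python
# Minimal PySpark sketch of the partition-size trade-off; the file path
# and the 3x-cores heuristic below are assumptions for illustration only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.textFile("hdfs:///data/input.txt")   # hypothetical input path
print(rdd.getNumPartitions())                 # defaults follow the HDFS block count

# Too few partitions leave cores idle; too many add per-task overhead.
# A common rule of thumb is a small multiple of the total core count.
target = sc.defaultParallelism * 3            # heuristic, not a Spark default
balanced = rdd.repartition(target)            # full shuffle to the target count
fewer = balanced.coalesce(target // 2)        # shrink without a full shuffle
```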
   A study by Hernandez et al. (2018) provides a solution for finding an optimal partition-size configuration using machine learning methods. The aim is to optimize the task parallelism of the application by predicting the optimal number of tasks per executor and tasks per machine. The authors used regression methods and, based on the results, achieved a 51% gain in performance when using the recommended settings. However, the approach does not consider clusters whose machines have different specifications, known as heterogeneous environments, which are common in distributed processing systems.
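As a rough illustration of this regression-based approach, the sketch below trains a regressor on hypothetical runtime observations and picks the tasks-per-executor value with the lowest predicted runtime; the feature set, training data, and model choice are all our own assumptions, not those of Hernandez et al. (2018).

```python
# Illustrative regression-based tuning in the spirit of Hernandez et al.
# (2018); features, data, and model choice are assumptions for this sketch.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Features: [input size (GB), cores per machine, executors, tasks per executor]
# Target: observed application runtime in seconds (hypothetical numbers).
X = np.array([
    [10, 8, 4, 2], [10, 8, 4, 4], [10, 8, 4, 8],
    [50, 8, 4, 2], [50, 8, 4, 4], [50, 8, 4, 8],
])
y = np.array([420.0, 310.0, 365.0, 1350.0, 980.0, 1020.0])

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Recommend the tasks-per-executor value with the lowest predicted runtime
# for a new 50 GB job on the same cluster shape.
candidates = [2, 4, 8, 16]
best = min(candidates, key=lambda t: model.predict([[50, 8, 4, t]])[0])
print("recommended tasks per executor:", best)
```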
   Wang et al. (2018) proposed a speculative mechanism to achieve optimal parallelism for the Spark scheduler, using a simulator known as STLS (Software Thread-Level Speculation) together with a simulated annealing algorithm. The idea is to partition the input data in a fine-grained way and assign threads in the cluster to the small-scale data. Based on the results, they achieved a 15-23% speedup compared to the baseline methods. However, this approach may be time-consuming, as the design and implementation of Spark's execution model are far more complex.
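To convey the search strategy, the sketch below runs simulated annealing over candidate partition counts against a toy cost model; the cost function merely stands in for their STLS simulator and is an invented assumption, not the model from Wang et al. (2018).

```python
# Simulated-annealing search over partition counts; cost() is a made-up
# stand-in for the STLS simulator of Wang et al. (2018), not their model.
import math
import random

def cost(partitions: int, cores: int = 32) -> float:
    """Toy runtime model: idle-core penalty when partitions < cores,
    plus per-task scheduling overhead that grows with partition count."""
    work = 1000.0 / min(partitions, cores)   # useful parallel work
    overhead = 0.05 * partitions             # per-task management cost
    return work + overhead

def anneal(start: int, steps: int = 500, temp: float = 50.0) -> int:
    current = start
    for i in range(steps):
        t = temp * (1 - i / steps) + 1e-9
        candidate = max(1, current + random.randint(-8, 8))
        delta = cost(candidate) - cost(current)
        # Accept improvements always; accept regressions with Boltzmann probability.
        if delta < 0 or random.random() < math.exp(-delta / t):
            current = candidate
    return current

print("suggested partitions:", anneal(start=4))
```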
   In Wang et al. (2019), the authors proposed a performance model that estimates the application's execution time for a given partition size. The paper aims to find an optimal partition size that reduces the application's execution time. The idea is to predict possible straggler tasks or a skewed task distribution by running the application on a fraction of the input data. If the model predicts straggler tasks, it repartitions the input data by adjusting the partition size; if it predicts skewed tasks, it tunes the locality setting. Based on the results, the paper demonstrates a performance improvement of 71%. However, this method can be costly when repartitioning a large amount of data, as the data must be reshuffled to keep it balanced across the partitions.
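A minimal sketch of this pilot-run idea follows; the straggler threshold and the doubling rule are illustrative assumptions rather than the actual performance model of Wang et al. (2019).

```python
# Illustrative straggler-driven repartitioning in the spirit of Wang et al.
# (2019); the 1.5x-median threshold and doubling rule are assumptions.
from statistics import median

def has_stragglers(task_durations: list[float], factor: float = 1.5) -> bool:
    """Flag a run as straggler-prone if any task exceeds factor x median."""
    m = median(task_durations)
    return any(d > factor * m for d in task_durations)

def next_partition_count(current: int, task_durations: list[float]) -> int:
    # Pilot run on a fraction of the input; if stragglers appear, split the
    # data more finely so slow work can be spread across more workers.
    if has_stragglers(task_durations):
        return current * 2
    return current

# Pilot-run measurements (seconds) for 8 tasks; one clear straggler.
pilot = [12.0, 11.5, 12.3, 11.8, 30.2, 12.1, 11.9, 12.4]
print(next_partition_count(64, pilot))   # -> 128
```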
   Gounaris et al. (2017) proposed a novel algorithm for configuring dynamic partitioning using greedy and randomized approaches. The idea is to modify the degree