Liang et al. (2018) proposed WSMC (workload-specific memory capacity), a methodology that addresses the problem of configuring memory capacity according to workload characteristics, since different workloads may have very different memory capacity requirements. They use the data expansion ratio as the input metric for workload classification (a brief illustration of this metric is given after this paragraph). They also built a memory requirement prediction model for each workload and achieved a 40% performance improvement compared to manual configuration. This work, however, focuses on managing memory space more efficiently rather than managing the executors, which is important for improving the time performance of the cluster. More recent work by Zaouk et al. (2021) used deep neural networks to develop a performance prediction model that embeds workload characteristics. Given the diversity of workloads, they learned workload encodings from collected run-time traces via representation learning. The goal of the modeling is to derive a job-specific prediction model from previous observations. They built an optimizer that automatically recommends configurations for subsequent executions to improve performance. With this approach, they achieved a performance improvement of 52.4% on Spark streaming workloads compared to the baseline study.
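To make the data expansion ratio concrete, the following minimal sketch is our own illustration; the sizes and classification thresholds are assumptions, not values from Liang et al. (2018). It computes the ratio from profiled data sizes and uses it to bin a workload by memory demand.

```python
# Hypothetical sizes (in GB) observed from a profiling run of one workload.
input_size_gb = 32.0
peak_intermediate_gb = 96.0   # assumed peak in-memory/intermediate data

# Data expansion ratio: how much the data grows during execution
# relative to the original input.
data_expansion_ratio = peak_intermediate_gb / input_size_gb   # -> 3.0

# Coarse, illustrative classification rule (thresholds are assumptions).
if data_expansion_ratio < 1.5:
    memory_class = "memory-light"
elif data_expansion_ratio < 4.0:
    memory_class = "memory-moderate"
else:
    memory_class = "memory-heavy"

print(memory_class, data_expansion_ratio)
```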
To summarize, users often underestimate the impact of workload requirements in their scheduling decisions. However, previous works show that this can cause low resource utilization. Although there are advanced methods that use deep neural networks and reinforcement learning to estimate resources based on the workload, their performance still needs to be improved. For future work, in-depth analysis and research are needed to explore how a scheduler could handle dynamic workloads, where resource requirements vary significantly during execution, in a multi-tenant environment.
2.3 Partition Size
Partition size determines the level of parallelism in the Spark scheduler. Scheduling in Spark allows multiple tasks to run in parallel across a cluster of machines. By default, the number of partitions is set according to the number of HDFS blocks of the input file. If there are too few partitions, the scheduler cannot utilize all the cores available in the cluster; if there are too many, there is excessive overhead in managing many small tasks. The optimal solution is to find a reasonable partition size that utilizes all the available cores while avoiding this overhead. In this section, we summarize the solutions gathered from the literature.
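For orientation, the following minimal PySpark sketch is our own illustration, not drawn from the surveyed works; the input path, partition counts, and configuration values are hypothetical. It shows the main knobs that control the number of partitions and, hence, the level of parallelism.

```python
from pyspark.sql import SparkSession

# Hypothetical settings for illustration only; suitable values depend on
# the cluster size and the workload being scheduled.
spark = (
    SparkSession.builder
    .appName("partition-size-example")
    .config("spark.default.parallelism", "48")      # default partitions for RDD shuffles
    .config("spark.sql.shuffle.partitions", "48")   # partitions produced by DataFrame shuffles
    .getOrCreate()
)
sc = spark.sparkContext

# Reading a file: by default one partition per HDFS block, but a minimum
# number of partitions can be requested explicitly.
rdd = sc.textFile("hdfs:///data/input.txt", minPartitions=48)
print("partitions after load:", rdd.getNumPartitions())

# Too few partitions leave cores idle; repartition() redistributes the data
# (at the cost of a full shuffle) to raise the level of parallelism.
rdd = rdd.repartition(96)

# coalesce() reduces the partition count without a full shuffle, which helps
# when many tiny partitions create excessive task-management overhead.
rdd = rdd.coalesce(24)
```

The trade-off between repartition() (full shuffle, more parallelism) and coalesce() (no full shuffle, fewer tasks) mirrors the balance between idle cores and task-management overhead discussed above.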
A study by Hernandez et al. (2018) provides a solution for finding an optimal partition size configuration using machine learning methods. The aim is to optimize the task parallelism of the application by predicting the optimal number of tasks per executor and tasks per machine. The authors used regression methods and, based on the results, achieved a 51% performance gain when using the recommended settings. However, the approach does not consider clusters whose machines have different specifications, known as a heterogeneous environment, which is common in distributed processing systems.
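To convey the flavour of such a regression-based approach, the following sketch is our own minimal example using scikit-learn, not Hernandez et al.'s actual model; the features, training data, and predicted values are hypothetical. It fits a regressor that maps workload features to a recommended tasks-per-executor setting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data: one row per previously observed run.
# Features: [input size (GB), executor cores, executor memory (GB)]
X = np.array([
    [ 10,  4,  8],
    [ 50,  4,  8],
    [100,  8, 16],
    [200,  8, 16],
    [400, 16, 32],
])
# Target: tasks-per-executor setting that gave the best runtime for that run.
y = np.array([4, 6, 8, 10, 12])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Recommend a tasks-per-executor value for a new, unseen workload.
new_workload = np.array([[150, 8, 16]])
recommended = int(round(model.predict(new_workload)[0]))
print("recommended tasks per executor:", recommended)
```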
Wang et al. (2018) proposed a speculative mechanism to achieve optimal parallelism for the Spark scheduler using a simulator known as STLS (Software Thread-Level Speculation) together with a simulated annealing algorithm. The idea is to partition the input data in a fine-grained way and assign a number of threads in the cluster to the small-scale data. Based on the results, they achieved a 15-23% speedup compared to the baseline methods. However, this approach may be time-consuming, as the design and implementation of Spark's execution model are far more complex. In Wang et al. (2019), the authors proposed a performance model to estimate the application's execution time for a given partition size, with the aim of finding an optimal partition size that reduces the application execution time. The idea is to predict possible straggler tasks or a skewed task distribution by running with a fraction of the input data. If the model predicts straggler tasks, it repartitions the input data by adjusting the partition size; if it predicts skewed tasks, it tunes the locality setting. Based on the results, the paper demonstrates a performance improvement of 71%. However, this method can be costly when repartitioning a large amount of data, as the data must be reshuffled to keep it balanced across the partitions.
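A minimal PySpark sketch of this repartition-on-skew idea follows. It is our own illustration, not Wang et al.'s performance model; the input path, sampling fraction, and skew threshold are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skew-probe-example").getOrCreate()
sc = spark.sparkContext

# Hypothetical input path, for illustration only.
rdd = sc.textFile("hdfs:///data/input.txt")

# Probe with a small fraction of the data, mirroring the idea of predicting
# stragglers from a partial run rather than the full workload.
sample = rdd.sample(withReplacement=False, fraction=0.05, seed=42)

# Measure how many records each partition of the sample holds.
counts = sample.glom().map(len).collect()
mean = sum(counts) / max(len(counts), 1)
skewed = any(c > 2 * mean for c in counts)   # crude skew test (assumed threshold)

if skewed:
    # Repartitioning rebalances the data but triggers a full shuffle,
    # which is exactly the cost noted above for large inputs.
    rdd = rdd.repartition(rdd.getNumPartitions() * 2)
```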
Gounaris et al. (2017) proposed a novel algorithm for configuring dynamic partitioning using greedy and randomized approaches. The idea is to modify the degree