applied the LHS technique to minimize the number of training samples and used a recursive random search
algorithm to tune the parameter configurations. The results demonstrated that their proposed method reduces
the execution time by 22.8% to 40% on nine different applications compared to the default settings.
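The essence of this sampling step can be sketched in a few lines. The Python snippet below draws a small Latin hypercube sample over three Spark parameters using SciPy's quasi-Monte Carlo module; the parameter names, value ranges, and sample size are illustrative assumptions rather than the settings used in the cited work.

    # Latin Hypercube Sampling over a small, hypothetical Spark parameter space.
    from scipy.stats import qmc

    params = ["spark.executor.memory", "spark.executor.cores", "spark.sql.shuffle.partitions"]
    lower = [1, 1, 50]      # e.g. 1 GB, 1 core, 50 partitions
    upper = [16, 8, 400]    # e.g. 16 GB, 8 cores, 400 partitions

    sampler = qmc.LatinHypercube(d=len(params))
    unit_samples = sampler.random(n=20)              # 20 candidate configurations
    configs = qmc.scale(unit_samples, lower, upper)  # map [0, 1) samples to parameter ranges

    for row in configs:
        # Each row is one candidate configuration to benchmark as a training sample.
        print({p: int(round(v)) for p, v in zip(params, row)})

Each sampled configuration is then executed once to obtain a training example, which is what keeps the number of required experiments small.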
In Wang et al. (2017), the idea is to use binary and multi-class classification algorithms to predict the
execution time under a given set of parameters. Data from actual executions of each workload is collected
using random sampling to train the model. Their proposed method achieved running times that were, on average,
36% lower than those of the default settings. However, this technique requires intensive training for every
specific workload to achieve the optimal model. A study by Gu et al. (2018) proposed tuning Spark parameter
configurations for streaming applications using neural networks. They generated the training data set randomly
and used a random forest algorithm to build a model that predicts the execution time; a neural network is then
used to search for the optimal configuration based on this prediction model. The experimental results show that
the proposed approach improves the performance of Spark Streaming by up to 42.8% when compared with the default
parameter configuration. A recent study by Li et al. (2020) proposed ATCS, an automated tuning approach based on
a Generative Adversarial Network (GAN). The GAN is used to build a performance prediction model of reduced
complexity that needs less training data. They implemented a Random Parameter Generator (RPG) to produce random
configurations for each workload as training data for the prediction model. The results show that Spark's
performance can be improved by an average of 3.5 to 6.9 times compared to the performance of the default
parameters.
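Although these studies differ in model choice (classification, random forest with a neural-network search, and a GAN), they share the same workflow: sample random configurations, measure execution times, fit a prediction model, and search the model for a good configuration. The sketch below illustrates that shared pattern with a random forest regressor; run_workload() is a synthetic stand-in for a real Spark benchmark, and none of the details correspond to any one of the cited systems.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)

    def random_configs(n):
        # Columns: executor memory (GB), executor cores, shuffle partitions (illustrative).
        return np.column_stack([
            rng.integers(1, 17, n),
            rng.integers(1, 9, n),
            rng.integers(50, 401, n),
        ])

    def run_workload(config):
        # Stand-in for a real Spark execution; returns a synthetic time in seconds
        # so that the sketch is self-contained.
        mem, cores, parts = config
        return 600 / (mem * cores) ** 0.5 + abs(parts - 200) * 0.2 + rng.normal(0, 5)

    # 1. Collect training data from (here, simulated) executions.
    X_train = random_configs(50)
    y_train = np.array([run_workload(c) for c in X_train])

    # 2. Build the execution-time prediction model.
    model = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)

    # 3. Search the model (simple random search) for the best predicted configuration.
    candidates = random_configs(10_000)
    best = candidates[np.argmin(model.predict(candidates))]
    print("predicted-best configuration:", best)

The cited systems replace step 3 with more sophisticated search strategies (e.g., a neural network or a GAN-guided generator), but the overall structure is the same.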
Based on the previous works above, we can conclude that the experimental or trial-and-error approach is less
effective because of the high-dimensional parameter space, and it is time-consuming as it requires intensive
repetition of experiments to test each combination of parameters. On the other hand, the simulation approach is a
faster way to test all parameter combinations; however, it is challenging to simulate the real Spark environment,
as building one requires in-depth knowledge of Spark's internal systems. Machine learning methods are gaining
popularity among researchers as a means of finding optimal configurations, and more efficient machine learning
methods should be explored to tackle this issue. Besides improving prediction accuracy and time performance, it is
also important to estimate the usage of cores, memory, disk, and network before launching the application to ensure
that all the resources are fully utilized.
2.2 Workload Characteristics
Workload characteristics refer to the characteristics of a job, i.e., the size of the data, the type of task (e.g.,
SQL or machine learning), or the resources required to process the data. Default Spark schedulers such as FIFO do
not consider workload characteristics in their scheduling decisions, in favour of an easy and straightforward
implementation. With this approach, it is hard to achieve efficient utilization of resources. In this section, we
examine different methods and techniques that enhance Spark scheduling performance by making it aware of workload
characteristics.
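As a point of reference, the scheduling mode itself is a single configuration switch in Spark: the default FIFO scheduler runs jobs strictly in submission order, while FAIR scheduling lets concurrent jobs share executors. The minimal PySpark example below shows the switch; pool weights and minimum shares would normally be defined in a separate fair scheduler allocation file, and the application name here is illustrative.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("scheduler-demo")
        .config("spark.scheduler.mode", "FAIR")  # default is FIFO
        .getOrCreate()
    )

Neither mode, however, inspects the characteristics of the submitted jobs, which is the limitation the following works address.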
Mao et al. (2019) proposed Decima, which aims to improve existing heuristic approaches to task scheduling by
considering workload characteristics. It uses reinforcement learning (RL) and neural networks to learn a scheduling
policy through experience. It represents the scheduler as an agent that learns from workload and cluster conditions
without relying on incorrect assumptions. Decima encodes its scheduling strategy by observing the environment,
taking actions, and improving its policy over time to make better decisions. The agent is rewarded after each
action, and the reward is set based on the scheduling objective (e.g., minimizing average job completion time). The
results show that Decima improves average job completion time by 21% over default schedulers. However, the authors
do not mention whether it supports multi-tenancy, which is important for high-performance computing workloads.
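To make the reward design concrete, the snippet below shows one common way to reward an RL scheduler for minimizing average job completion time: the agent is penalized for every unit of time that each unfinished job spends in the system, so the episode return equals minus the total completion time. This is an illustrative sketch of the general idea, not Decima's actual implementation.

    def step_reward(active_jobs, dt):
        # Penalty accumulated over an interval of length dt with `active_jobs` unfinished jobs.
        return -active_jobs * dt

    # Toy episode: three jobs finish at t = 2, 5 and 9. The return sums to
    # -(2 + 5 + 9) = -16, i.e. minus the total completion time, so a policy that
    # maximizes return also minimizes the average job completion time.
    finish_times = [2, 5, 9]
    total, t = 0.0, 0
    for t_next in finish_times:
        total += step_reward(active_jobs=sum(ft > t for ft in finish_times), dt=t_next - t)
        t = t_next
    print(total)  # -16.0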