Table 1. (Continued)
Author (Year) | Issue | Solution Approach | Methodology | Limitation
Wang et al. (2018) | Partition size | Simulated annealing | Partition the input data in a fine-grained way and assign the number of threads in the cluster with small-scale data | Time-consuming, as the design and implementation are far more complex
Gounaris et al. (2017) | Partition size | Greedy and randomized | Performed profiling to modify the partition size during execution | Challenging to obtain accurate profiles with only a few test runs
3. Conclusion
Job scheduling is among the most crucial elements of any data processing framework and plays a vital role in
achieving efficient utilization of resources. Existing scheduling solutions must keep evolving to properly
support the new challenges that keep arising; this evolution is needed to achieve higher performance for
data-intensive workloads. In this paper, we present a review of Spark scheduling performance issues and compile
the corresponding solutions available in the literature. We provide an analysis of the related work to date and
suggestions for research directions. We hope that our effort provides an entry point for researchers to build a
roadmap for future work on improving Spark scheduling performance.
Acknowledgements
This work is supported by the Ministry of Higher Education Malaysia under the Fundamental Research
Grant Scheme (FRGS/1/2018/ICT02/UKM/02/6).
References
Bao, L., Liu, X., & Chen, W. (2019). Learning-based Automatic Parameter Tuning for Big Data Analytics
Frameworks. In Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018.
https://doi.org/10.1109/BigData.2018.8622018
Bian, Z., Wang, K., Wang, Z., Munce, G., Cremer, I., Zhou, W., … Xu, G. (2014). Simulating big data clusters
for system planning, evaluation, and optimization. In Proceedings of the International Conference on Parallel
Processing. https://doi.org/10.1109/ICPP.2014.48
Gounaris, A., Kougka, G., Tous, R., Montes, C. T., & Torres, J. (2017). Dynamic configuration of partitioning
in Spark applications. IEEE Transactions on Parallel and Distributed Systems.
https://doi.org/10.1109/TPDS.2017.2647939
Gounaris, A., & Torres, J. (2018). A Methodology for Spark Parameter Tuning. Big Data Research.
https://doi.org/10.1016/j.bdr.2017.05.001
Gu, J., Li, Y., Tang, H., & Wu, Z. (2018). Auto-Tuning Spark Configurations Based on Neural Network. In
IEEE International Conference on Communications. https://doi.org/10.1109/ICC.2018.8422658
Hernández, Á. B., Perez, M. S., Gupta, S., & Muntés-Mulero, V. (2018). Using machine learning to optimize
parallelism in big data applications. Future Generation Computer Systems.
https://doi.org/10.1016/j.future.2017.07.003
Islam, M. T., Srirama, S. N., Karunasekera, S., & Buyya, R. (2020). Cost-efficient dynamic scheduling of big
data applications in Apache Spark on cloud. Journal of Systems and Software, 162, 110515.
https://doi.org/10.1016/j.jss.2019.110515
Khalil, W. A., Torkey, H., & Attiya, G. (2020). Survey of Apache Spark optimized job scheduling in big data.
International Journal of Industry and Sustainable Development (IJISD) (Vol. 1). Retrieved from
http://ijisd.journals.ekb.eg39
Li, M., Liu, Z., Shi, X., & Jin, H. (2020). ATCS: Auto-Tuning Configurations of Big Data Frameworks Based
E-Proceedings of The 5th International Multi-Conference on Artificial Intelligence Technology (MCAIT 2021)
Artificial Intelligence in the 4th Industrial Revolution