and age as important features. Ajibade et al. (2020) predicted student academic performance using demographic,
academic background, parental participation in the learning process and behavioral features in a web-based
education system.
This paper aims to introduce the preprocessing applied to all important features in the development of a
predictive model of student performance.
2. Methodology
One of the most important phases in building a student performance model is data preprocessing. The steps
involved in the preprocessing part were as follows:
Step 1: In data collection, the data used for this study were requested and obtained from the Policy Planning
and Research Division (BPPD), Ministry of Higher Education (MOHE). The raw data were received in comma-
separated values (CSV) format and contain records of 248,568 public university graduates who completed a
bachelor’s degree between 2015 and 2019. Table 1 lists the datasets included in this study.
Table 1. List of datasets and the number of attributes in each.

Dataset               Number of attributes
Students              29
Activities            7
Awards                5
Industrial Training   4
MPP                   3
Employment            5
Total                 53
Step 2: In data integration, the six datasets are combined into a single dataset linked by student ID.
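As an illustration of the integration step, the following is a minimal sketch using pandas, not the authors' actual code. The column names (`student_id`, `cgpa`, `employment_status`, `training_weeks`), the sample values and the helper `integrate` are all hypothetical. The sketch assumes each table holds at most one row per student; tables with several rows per student would need to be aggregated first.

```python
import pandas as pd

# Hypothetical miniature versions of three of the six datasets.
students = pd.DataFrame({
    "student_id": [1, 2, 3],
    "cgpa": [3.80, 3.10, 2.40],
})
employment = pd.DataFrame({
    "student_id": [1, 2, 3],
    "employment_status": ["employed", "unemployed", "employed"],
})
industrial_training = pd.DataFrame({
    "student_id": [1, 3],
    "training_weeks": [12, 10],
})


def integrate(base, others, key="student_id"):
    """Left-join each dataset onto the base table using the shared key."""
    merged = base
    for other in others:
        # validate="one_to_one" raises an error if either side repeats a key,
        # which guards against silently duplicating students.
        merged = merged.merge(other, on=key, how="left", validate="one_to_one")
    return merged


combined = integrate(students, [employment, industrial_training])
print(combined)
```

A left join keeps every student from the base table; students without a matching row in another dataset receive missing values, which are then handled during cleaning.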
Step 3: In data cleaning, a review of the data statistics found several records with missing values. Missing
values in the affected attributes are replaced with the most frequent value. Meanwhile, the Dewan
Undangan Negeri (DUN), parliament and postal code attributes are removed from the dataset entirely because
their number of missing values was too high. Lastly, redundant attributes that list similar records are also
removed from the dataset.
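The cleaning operations described above could be sketched as follows. This is an illustrative example under stated assumptions: the 50% missing-value threshold `MAX_MISSING`, the column names and the sample values are hypothetical (the paper does not report a threshold), and "redundant attributes" is interpreted here as columns whose values exactly duplicate another column.

```python
import pandas as pd

# Hypothetical sample; None marks a missing value.
df = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "gender": ["F", None, "F", "M"],
    "postal_code": [None, None, None, "43600"],
    "state": ["Selangor", "Johor", "Johor", "Kedah"],
    "state_copy": ["Selangor", "Johor", "Johor", "Kedah"],
})

MAX_MISSING = 0.5  # assumed cut-off, not taken from the paper


def clean(frame, max_missing=MAX_MISSING):
    # 1. Drop attributes whose share of missing values exceeds the threshold.
    missing_share = frame.isna().mean()
    frame = frame.loc[:, missing_share <= max_missing]

    # 2. Drop attributes that exactly repeat an earlier attribute.
    keep = []
    for col in frame.columns:
        if not any(frame[col].equals(frame[kept]) for kept in keep):
            keep.append(col)
    frame = frame[keep]

    # 3. Replace remaining missing values with each attribute's most frequent value.
    modes = frame.mode(dropna=True).iloc[0]
    return frame.fillna(modes)


cleaned = clean(df)
print(cleaned)
```

Dropping sparse attributes before imputing avoids filling a mostly empty column with a single repeated value.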
In addition, discretization into interval labels and conceptual labels was carried out. Numerical
attributes such as CGPA have been discretized into four labels, namely first class, second class upper, second class
lower and third class. The purposes of discretization are to handle noise, simplify the original
data, improve data processing efficiency, produce a more easily described data representation and make the
data mining results easier to understand later on.
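A small sketch of discretizing CGPA into the four class labels is shown below. The cut-off values in `CGPA_BINS` are assumptions chosen for illustration; the paper does not state the boundaries that were used, and the function name `discretize_cgpa` is hypothetical.

```python
import pandas as pd

# Assumed class boundaries; each bin includes its lower edge and excludes its upper edge.
CGPA_BINS = [0.0, 2.0, 3.0, 3.67, float("inf")]
CGPA_LABELS = ["third class", "second class lower", "second class upper", "first class"]


def discretize_cgpa(cgpa):
    """Map numeric CGPA values to ordered class labels."""
    return pd.cut(cgpa, bins=CGPA_BINS, labels=CGPA_LABELS, right=False)


cgpa = pd.Series([3.80, 3.10, 2.40, 1.90, 4.00, 3.67])
cgpa_class = discretize_cgpa(cgpa)
print(cgpa_class.tolist())
```

Using `right=False` makes each interval closed on the left, so a CGPA exactly at a boundary falls into the higher class under these assumed cut-offs.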
Step 4: In data transformation, attribute construction can help improve accuracy and understanding of the data.
To leverage the date attributes available in the dataset, new attributes such as age and duration of study were
calculated. A feature engineering technique is then applied to avoid duplicate student IDs, as students
have several records in the activity and awards datasets.
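The attribute construction in this step might look like the sketch below. Several points are assumptions rather than details from the paper: the column names, the choice to measure age at graduation, expressing duration of study in years, and summarizing a multi-record dataset as a per-student count. The helpers `add_date_attributes` and `add_record_count` are hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
students = pd.DataFrame({
    "student_id": [1, 2],
    "date_of_birth": pd.to_datetime(["1995-03-15", "1996-11-02"]),
    "enrolment_date": pd.to_datetime(["2014-09-01", "2015-09-01"]),
    "graduation_date": pd.to_datetime(["2018-08-31", "2019-08-31"]),
})
activities = pd.DataFrame({
    "student_id": [1, 1, 1],
    "activity": ["debate", "football", "volunteering"],
})


def add_date_attributes(frame):
    """Derive age (whole years at graduation) and duration of study (years)."""
    out = frame.copy()
    born, grad = out["date_of_birth"], out["graduation_date"]
    birthday_passed = (grad.dt.month > born.dt.month) | (
        (grad.dt.month == born.dt.month) & (grad.dt.day >= born.dt.day)
    )
    out["age"] = grad.dt.year - born.dt.year - (~birthday_passed).astype(int)
    out["duration_of_study"] = (grad - out["enrolment_date"]).dt.days / 365.25
    return out


def add_record_count(frame, records, name, key="student_id"):
    """Collapse a many-rows-per-student table into one count per student."""
    counts = records.groupby(key).size().rename(name).reset_index()
    out = frame.merge(counts, on=key, how="left")
    out[name] = out[name].fillna(0).astype(int)  # students with no records get 0
    return out


result = add_record_count(add_date_attributes(students), activities, "activity_count")
print(result[["student_id", "age", "duration_of_study", "activity_count"]])
```

Aggregating before merging keeps one row per student ID, which is one way to avoid the duplication that a direct join with the activity or awards records would cause.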
Step 5: Correlation analysis. Before developing the predictive models, we conducted a statistical analysis using
Spearman correlation to discern how significant the relationship between the study’s class label (employment
status) and the other variables is.
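For illustration, Spearman's coefficient can be computed as the Pearson correlation of the ranked values, with tied values given their average rank. The sketch below makes several assumptions: employment status is encoded as 0/1, the column names and sample values are hypothetical, and only the coefficient is computed (no significance test). The helper `spearman_with_target` is not from the paper.

```python
import pandas as pd

# Hypothetical sample; "employed" is the class label encoded as 0/1.
df = pd.DataFrame({
    "cgpa": [2.1, 2.8, 3.0, 3.4, 3.9],
    "age": [26, 24, 25, 23, 22],
    "employed": [0, 0, 1, 1, 1],
})


def spearman_with_target(frame, target):
    """Spearman correlation of every attribute with the target attribute."""
    ranked = frame.rank(method="average")  # ties share their average rank
    corr = ranked.corr(method="pearson")[target].drop(target)
    # Order attributes by the strength of the relationship, strongest first.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


rho = spearman_with_target(df, "employed")
print(rho)
```

Because the coefficient depends only on ranks, it captures monotonic relationships and is less sensitive to outliers than Pearson correlation on the raw values.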
E-Proceedings of The 5th International Multi-Conference on Artificial Intelligence Technology (MCAIT 2021) [52]
Artificial Intelligence in the 4th Industrial Revolution