
and age as important features. Ajibade et al. (2020) predicted student academic performance using demographic,
        academic background, parental participation in the learning process and behavioral features in a web-based
        education system.
           This paper presents the preprocessing of all important features in the development of a student
        performance predictive model.

        2. Methodology

           One of the most important phases in building a student performance model is data preprocessing. The steps
        involved in the preprocessing were as follows:
        Step 1: In data collection, the data used for this study were requested and obtained from the Policy Planning
        and Research Division (BPPD), Ministry of Higher Education (MOHE). The raw data were received in
        comma-separated values (CSV) format and contain the records of 248,568 public university graduates who
        completed a bachelor's degree between 2015 and 2019. Table 1 lists the datasets included in this study.

        Table 1. The datasets and their number of attributes.

                                     Dataset                Number of Attributes
                                     Students                       29
                                     Activities                      7
                                     Awards                          5
                                     Industrial Training             4
                                     MPP                             3
                                     Employment                      5
                                     Total                          53

        Step 2: In data integration, the six datasets are combined into a single dataset, linked by student ID.
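
        The integration step can be illustrated with a minimal sketch that uses only the Python standard library. It assumes each dataset is loaded as a list of dictionaries, that the linking column is named `student_id`, and that each joined dataset holds at most one row per student. The function name, column names and values are hypothetical and are not taken from the MOHE data.

```python
def integrate(base_rows, *other_datasets, key="student_id"):
    """Left-join every other dataset onto the base rows by student ID."""
    merged = {row[key]: dict(row) for row in base_rows}
    for rows in other_datasets:
        for row in rows:
            target = merged.get(row[key])
            if target is not None:  # skip rows with no matching student
                target.update({k: v for k, v in row.items() if k != key})
    return list(merged.values())


# Hypothetical miniature datasets (illustrative values only).
students = [
    {"student_id": "S1", "gender": "F"},
    {"student_id": "S2", "gender": "M"},
]
employment = [{"student_id": "S1", "employment_status": "employed"}]
training = [{"student_id": "S2", "training_sector": "private"}]

combined = integrate(students, employment, training)
```

        In this sketch a student without a row in one of the joined datasets simply lacks those attributes, which is one way missing values can arise before cleaning.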
        Step 3: In data cleaning, a review of the data statistics found several records with missing values. For
        attributes with only some missing values, the missing values are replaced with the most frequent value.
        Meanwhile, the Dewan Undangan Negeri (DUN), parliament and postal code attributes are removed entirely from
        the dataset because their number of missing values was too high. Lastly, recurring attributes that hold
        similar records are also removed from the dataset.
        Besides that, discretization into interval labels and conceptual labels has been carried out. Numerical
        attributes such as CGPA have been discretized into four labels, namely first class, second class upper,
        second class lower and third class. The purposes of the discretization are to handle noise, simplify the
        original data, improve data processing efficiency, produce a more easily described data representation and
        make the data mining results easier to understand later on.
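
        The two cleaning operations described above, most-frequent-value imputation and CGPA discretization, can be sketched as follows. The CGPA cut-offs are assumptions, because the paper does not state the boundaries it used. The `sponsor` attribute and all sample values are likewise hypothetical.

```python
from collections import Counter


def impute_most_frequent(rows, attribute):
    """Replace missing (None) values of an attribute with its mode."""
    observed = [r[attribute] for r in rows if r[attribute] is not None]
    if not observed:  # nothing to learn the mode from
        return None
    mode = Counter(observed).most_common(1)[0][0]
    for r in rows:
        if r[attribute] is None:
            r[attribute] = mode
    return mode


# ASSUMED cut-offs: the source does not give the CGPA class boundaries.
CGPA_CLASSES = [
    (3.67, "first class"),
    (3.00, "second class upper"),
    (2.50, "second class lower"),
]


def discretize_cgpa(cgpa):
    """Map a numeric CGPA to one of four class labels."""
    for lower_bound, label in CGPA_CLASSES:
        if cgpa >= lower_bound:
            return label
    return "third class"


# Hypothetical records (illustrative values only).
rows = [
    {"sponsor": "loan", "cgpa": 3.80},
    {"sponsor": None, "cgpa": 3.10},
    {"sponsor": "loan", "cgpa": 2.60},
    {"sponsor": "scholarship", "cgpa": 2.10},
]
impute_most_frequent(rows, "sponsor")
for r in rows:
    r["cgpa_class"] = discretize_cgpa(r["cgpa"])
```

        Keeping the cut-offs in a single table makes it easy to swap in the boundaries of a particular university's grading scheme.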
        Step 4: In data transformation, attribute construction can help improve accuracy and understanding of the data.
        To leverage the date attributes available in the dataset, new attributes such as age and duration of study
        have been calculated. A feature engineering technique is then applied to avoid duplicate student IDs, since
        a student may have several records in the activities and awards datasets.
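
        A minimal sketch of the attribute construction in Step 4 is given below. The paper does not say which reference date is used for age or which technique collapses the repeated records, so the sketch assumes age at graduation and a simple per-student count of activities. All attribute names and dates are hypothetical.

```python
from collections import Counter
from datetime import date


def years_between(start, end):
    """Whole years elapsed from start to end."""
    years = end.year - start.year
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


# Hypothetical student record with assumed date attribute names.
student = {
    "student_id": "S1",
    "birth_date": date(1993, 5, 20),
    "enrol_date": date(2012, 9, 3),
    "graduation_date": date(2016, 8, 15),
}

# Constructed attribute 1: age (ASSUMED to be age at graduation).
student["age_at_graduation"] = years_between(
    student["birth_date"], student["graduation_date"]
)

# Constructed attribute 2: duration of study in years, one decimal place.
study_days = (student["graduation_date"] - student["enrol_date"]).days
student["study_duration_years"] = round(study_days / 365.25, 1)

# Several activity rows per student are collapsed into one count per ID,
# so that joining on student ID does not duplicate the student record.
activities = [
    {"student_id": "S1", "activity": "debate"},
    {"student_id": "S1", "activity": "football"},
    {"student_id": "S2", "activity": "choir"},
]
activity_count = Counter(row["student_id"] for row in activities)
student["activity_count"] = activity_count[student["student_id"]]
```

        Counting is only one possible aggregation. Other summaries, such as the highest award level per student, would follow the same one-row-per-ID pattern.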
        Step 5: Correlation analysis. Before developing the predictive models, we conducted a statistical analysis using
        Spearman correlation to determine how significant the relationship is between the study's class label
        (employment status) and the other variables.
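
        For reference, the Spearman coefficient is the Pearson correlation computed on ranks, with tied values sharing the mean of their ranks. The sketch below implements that definition with the standard library only. It computes the coefficient but not a significance test, it is not defined when either variable is constant, and the encoded sample values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical encoded data: CGPA class (0 = third ... 3 = first) against
# employment status (1 = employed, 0 = not employed).
cgpa_class = [3, 2, 2, 1, 0, 1]
employed = [1, 1, 0, 1, 0, 0]
rho = spearman(cgpa_class, employed)
```

        A coefficient near +1 or -1 indicates a strong monotonic relationship with the class label, while a value near 0 indicates a weak one.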










        E- Proceedings of The 5th International Multi-Conference on Artificial Intelligence Technology (MCAIT 2021)   [52]
        Artificial Intelligence in the 4th Industrial Revolution