
results and be able to answer the research questions and meet all the criteria of the research objectives. In
        this study, the research methods used to conduct the testing process are as follows.
           The research proceeds in six phases: (1) dataset collection and preparation, (2) preprocessing,
        (3) feature selection and extraction, (4) feature embedding, (5) deep learning algorithm implementation,
        and (6) the enhanced classification model. In the third phase, the email body text and header are analyzed
        and processed according to the stylometric feature categories. Each category (lexical, structural, etc.) is
        converted into a vector in preparation for the next phase. The values of the features depend on the content
        of the selected features. For example, lexical features are extracted as character-level or word-level counts
        (e.g. number of words, number of characters, number of capital letters, etc.) (Kumar et al., 2018).
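
           As an illustration of the lexical counts mentioned above, the following is a minimal sketch in Python.
        The function name lexical_features and the sample text are hypothetical and are not taken from the cited
        work; whitespace tokenization is an assumption made for brevity.

```python
def lexical_features(text):
    """Return [word count, character count, capital-letter count] for one email body.

    Hypothetical helper for illustration only; words are split on whitespace.
    """
    num_words = len(text.split())
    num_chars = len(text)
    num_capitals = sum(1 for ch in text if ch.isupper())
    return [num_words, num_chars, num_capitals]


# Made-up sample text, not drawn from any dataset.
sample = "URGENT: Verify your account NOW"
print(lexical_features(sample))  # [5, 31, 10]
```

           Each email then contributes one fixed-length list of counts, which can serve as the lexical part of a
        larger feature vector.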
           Depending on their context, some features need to be converted into numerical values that a machine
        learning or deep learning process can accept. One way is to create a dictionary and apply one-hot encoding,
        which converts text into binary vectors. Another method is to use n-grams: by counting the frequency of
        each n-gram, the counts can be used to represent the document as a vector. N-grams can also help to find
        misspellings, language differences, and the presence of other symbols in the text. The values extracted
        from each selected feature will then be combined into one feature vector/embedding that can be used as
        the training data for the deep learning approach. Each selected feature will have its own value for
        determining the outcome of the classification process. By conducting several experiments, it is hoped that
        the behavior features with the best results can be identified.
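
           The encoding steps described above can be sketched as follows. This is a minimal illustration under
        stated assumptions, not the implementation used in this study: the helper names (build_vocabulary, one_hot,
        char_ngram_counts), the toy vocabulary, and the choice of character bigrams are all hypothetical.

```python
from collections import Counter


def build_vocabulary(tokens):
    """Map each distinct token to an index (the 'dictionary'), in sorted order."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}


def one_hot(token, vocab):
    """Binary vector with a single 1 at the token's index; all zeros if unknown."""
    vec = [0] * len(vocab)
    if token in vocab:
        vec[vocab[token]] = 1
    return vec


def char_ngram_counts(text, n, ngram_vocab):
    """Count how often each n-gram in ngram_vocab occurs in text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    return [counts[g] for g in ngram_vocab]


# Toy inputs, assumed for illustration only.
vocab = build_vocabulary(["bank", "login", "verify"])
categorical_part = one_hot("login", vocab)                       # [0, 1, 0]
ngram_part = char_ngram_counts("banana", 2, ["an", "na", "ba", "xx"])  # [2, 2, 1, 0]

# Concatenate the per-feature values into one feature vector.
feature_vector = categorical_part + ngram_part
print(feature_vector)  # [0, 1, 0, 2, 2, 1, 0]
```

           Plain list concatenation is used here only to show how the per-feature values end up in a single
        vector; how the features are weighted or embedded afterwards is left open.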

        4. Future Direction

           Based on related work, previous research has several drawbacks regarding technique and feature
        extraction and selection. Some studies used behavior features as the main features for phishing email
        classification. Xiujuan et al. (2019) used three human behavior features, namely stylometric, gender, and
        personality; Kumar et al. (2020) used linguistic features and URL features for the detection of phishing email.
        Based on this research, more behavior features can be examined to improve phishing email detection. For
        example, grammar and typographical errors in the email content could also be categorized as human
        behavior features.
           In the past few years, several studies have used deep learning methods for email classification and
        reported better results than machine learning. Fang et al. (2019) developed a phishing email classification
        model based on RCNN and reported an accuracy of 99.84% using an unbalanced dataset from the Nazario and
        Enron email corpora. As for the machine learning approach, Kumar et al. (2018) used k-NN and obtained a
        highest accuracy of 95.48%, and Kumar et al. (2020) used RF as the classifier and reported an accuracy of
        97.75%. From these results, and setting aside the features used in each study, the deep learning approach
        has the highest accuracy for phishing email classification. In the area of phishing email classification,
        feature selection has become an important part of determining whether an email is phishing or not. A
        combination of several features can be implemented to produce good results and accuracy. Because feature
        selection is so influential on the outcome of email classification experiments, it remains a fairly advanced
        challenge.

        5. Conclusion
           The main scope of this paper covers the proposed method for classification and detection of phishing
        email, with a combination of behavior features used as the main parameter for classification. The novelty
        of this research is threefold: identifying the best combination of human and email behavior features for
        phishing email classification; combining the selected features into feature embeddings for data
        representation; and, finally, an improved phishing email classification model based on the selected
        behavior features.









        E-Proceedings of The 5th International Multi-Conference on Artificial Intelligence Technology (MCAIT 2021)
        Artificial Intelligence in the 4th Industrial Revolution