results and be able to answer the research questions and meet all the criteria of the research objectives. In the
design of this study, the research methods used in the testing process are as follows.
The following is a brief description of the phases of this research. The first phase is dataset collection and
preparation, the second phase is preprocessing, the third phase is feature selection and extraction, the fourth
phase is feature embedding, the fifth phase is deep learning algorithm implementation, and the last phase is the
enhanced classification model. In the third phase, the email body text and the header will be analyzed and
processed according to the stylometric feature categories. Each category (lexical, structural, etc.) will be
converted into a vector in preparation for the next phase. The values of the features depend on the content of
the selected features. For example, lexical features will be extracted as character-level or word-level counts
(e.g. number of words, number of characters, number of capital letters, etc.) (Kumar et al., 2018).
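The count-based lexical features named above can be sketched as follows. This is a minimal illustration, not the pipeline used in this study: the function names, the tokenization rule, and the sample text are assumptions made for the example.

```python
import re

def lexical_features(text):
    """Return count-based lexical features for one email body (illustrative only)."""
    # Assumption: a "word" is a run of letters, digits, or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    return {
        "num_chars": len(text),                                 # number of characters
        "num_words": len(words),                                # number of words
        "num_capitals": sum(1 for ch in text if ch.isupper()),  # number of capital letters
    }

def to_vector(features, order=("num_chars", "num_words", "num_capitals")):
    """Fix the feature order so every email maps to a vector of the same layout."""
    return [features[name] for name in order]

# Hypothetical sample text, not taken from any dataset used in the paper.
sample = "URGENT: Verify your account NOW"
vector = to_vector(lexical_features(sample))
```

Keeping a fixed feature order matters because the later phases expect every email to be represented by a vector with the same layout.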
Depending on their context, some features need to be converted into numerical values that can be accepted by
the machine learning/deep learning process. One approach is to create a dictionary and apply one-hot
encoding, which converts text into binary vectors. Another approach is to use n-grams: by counting the
frequency of each n-gram, the counts can be used to represent the document as a vector. N-grams can also help
in finding misspellings, language differences, and the presence of other symbols in the text. The extracted
values from each of the selected features will then be combined into one feature vector/embedding that can be
used as the training data for the deep learning approach. Each of the selected features will have its own value
for determining the outcome of the classification process. By conducting several experiments, it is hoped that
the behavior features with the best results can be identified.
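The encoding steps described above (n-gram frequency counts, one-hot encoding from a dictionary, and combination into a single feature vector) can be sketched as follows. This is a simplified illustration under stated assumptions: character n-grams are used, the vocabulary is built from a toy corpus, and the categorical "content type" feature is hypothetical rather than one of the features selected in this study.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_vocab(documents, n=3):
    """Map every n-gram seen in the corpus to a fixed vector index."""
    grams = set()
    for doc in documents:
        grams.update(char_ngrams(doc, n))
    return {gram: index for index, gram in enumerate(sorted(grams))}

def ngram_count_vector(text, vocab, n=3):
    """Represent a document as n-gram frequencies; unseen n-grams are ignored."""
    vec = [0] * len(vocab)
    for gram, count in Counter(char_ngrams(text, n)).items():
        if gram in vocab:
            vec[vocab[gram]] = count
    return vec

def one_hot(value, categories):
    """Encode a categorical value as a binary vector over a fixed dictionary."""
    return [1 if value == category else 0 for category in categories]

def combine(*parts):
    """Concatenate per-feature vectors into one feature vector."""
    combined = []
    for part in parts:
        combined.extend(part)
    return combined

# Toy corpus and a hypothetical categorical feature, for illustration only.
vocab = build_vocab(["abab", "abc"], n=2)          # {"ab": 0, "ba": 1, "bc": 2}
counts = ngram_count_vector("abab", vocab, n=2)    # [2, 1, 0]
content_type = one_hot("html", ["plain", "html"])  # [0, 1]
feature_vector = combine(counts, content_type)     # [2, 1, 0, 0, 1]
```

In practice the n-gram vocabulary grows quickly with corpus size, so a real implementation would likely cap the vocabulary or use a learned embedding instead of raw counts.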
4. Future Direction
Based on related work, previous research has several drawbacks regarding technique, feature extraction, and
feature selection. Some studies have used behavior features as the main features for phishing email
classification. Xiujuan et al. (2019) used three human behavior features, namely stylometric, gender, and
personality, while Kumar et al. (2020) used linguistic features and URL features for the detection of phishing
email. Based on the research above, there are more behavior features that can be examined to improve
phishing email detection. For example, the grammar and typos in the email content could also be categorized
as human behavior features.
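As one possible way to turn typos into a numerical behavior feature, the share of words not found in a reference word list can be computed. The sketch below is only an assumption about how such a feature might be built: the function name and the tiny vocabulary are hypothetical, and a real system would rely on a full dictionary or a spell checker.

```python
import re

def misspelling_ratio(text, vocabulary):
    """Fraction of alphabetic tokens that are not in the reference vocabulary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0  # an empty email has no measurable typo rate
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return unknown / len(tokens)

# Hypothetical reference word list, far smaller than any real dictionary.
vocabulary = {"please", "verify", "your", "account"}
ratio = misspelling_ratio("Please verfy your acount", vocabulary)
```

A ratio like this can be appended to the other behavior features as a single extra dimension of the feature vector.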
In the past few years, several studies have used deep learning methods for email classification and reported
better results than machine learning. Fang et al. (2019) developed a phishing email classification model
based on RCNN and reported an accuracy of 99.84% using an unbalanced dataset from the Nazario and Enron
email corpora. As for the machine learning approach, Kumar et al. (2018) used k-NN and obtained a highest
accuracy of 95.48%, and Kumar et al. (2020) used RF as the classifier and obtained an accuracy of 97.75%.
From the results above, and setting aside the features used in each study, the deep learning approach has the
highest accuracy for phishing email classification. In the area of phishing email classification, feature
selection has become an important part of determining whether an email is phishing or not. A combination of
several features can be implemented to produce good results and accuracy. Because feature selection strongly
influences the outcome of experiments in email classification, it remains a fairly advanced challenge.
5. Conclusion
The main scope of this paper covers the proposed method for classification and detection of phishing email,
with a combination of behavior features used as the main parameters for classification. The novelty of this
research is identifying the best combination of human and email behavior features for phishing email
classification. The selected features are combined into feature embeddings for data representation in phishing
email classification. Finally, the expected outcome is an improved phishing email classification model with
selected behavior features.
E-Proceedings of The 5th International Multi-Conference on Artificial Intelligence Technology (MCAIT 2021) [118]
Artificial Intelligence in the 4th Industrial Revolution