2. Literature Review
2.1 Literature Review on Phishing Email Classification Features
There are several types of email phishing attacks: deceptive phishing, spear phishing, and whaling
(Birlea, M.C., 2020). Deceptive phishing attacks are most commonly carried out by email. Emails differ in
their content in terms of writing, word choice, and language style, that is, in terms of stylometric features.
In authorship identification, stylometric features are used to identify the author of an object that consists
mostly of text (Kumar et al., 2018). It remains uncertain how well stylometric features combined with email
features perform in deceptive phishing email classification and detection. In other words, there are still
human behavior features, e.g. word choice, grammar, and the emotion of the context, that can be examined
further for deceptive phishing email classification and combined with email behavior features (header, body,
and URL) (Xiujuan et al., 2019).
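As an illustration of the email behavior features named above (header, body, and URL), the following minimal sketch separates a raw message into those three parts using Python's standard library. The function name `email_parts`, the choice of header fields, and the URL pattern are assumptions made for illustration only; they are not taken from the cited works.

```python
import re
from email import message_from_string

# Deliberately simple URL pattern (an assumption; real URLs can be more varied).
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def email_parts(raw: str) -> dict:
    """Split a raw email into header fields, body text, and URLs found in the body."""
    msg = message_from_string(raw)
    body = msg.get_payload()
    if not isinstance(body, str):
        # Multipart messages return a list of parts; this sketch ignores them.
        body = ""
    return {
        "header": {
            "From": msg.get("From", ""),
            "Subject": msg.get("Subject", ""),
        },
        "body": body,
        "urls": URL_PATTERN.findall(body),
    }


if __name__ == "__main__":
    sample = (
        "From: support@example.test\n"
        "Subject: Verify your account\n"
        "\n"
        "Dear user,\n"
        "Please verify at http://example.test/login now.\n"
    )
    print(email_parts(sample))
```

Each of the three parts could then be passed to its own feature extractor before the features are combined.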
In current classification research, feature selection and extraction are the most important factors in
determining how good the classification result will be. In text classification, several types of feature
representation are used; one of them is the linguistic aspect, which covers the writing, grammar, word choice,
and tone of the text part of the email. Email classification commonly uses a machine learning or deep
learning approach. In the machine learning approach, features are engineered and extracted manually from the
dataset and fed into the classifier, whereas in the deep learning approach, embedding is one of the methods
used to convert the features into a vector space model. For the document representation, the selected
behavior features are transformed into an embedding or vector form, which deep learning requires in order to
process the selected features (Gomez Adorno et al., 2018). In general, the combination of selected behavior
features becomes the main basis for determining whether an email is categorized as phishing or not phishing,
in either the machine learning or the deep learning approach.
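To make the idea of a vector space representation concrete, the sketch below maps each document to a fixed-length vector of term counts. This is a simple count-based stand-in chosen for illustration, not a learned embedding and not the representation used in the cited works; the function names `build_vocabulary` and `to_vector` are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign one vector index to every distinct lowercase word in the corpus."""
    words = sorted({w for doc in documents for w in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def to_vector(document, vocabulary):
    """Represent a document as term counts over the vocabulary (unknown words are skipped)."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector


if __name__ == "__main__":
    corpus = ["verify your account now", "meeting notes attached"]
    vocab = build_vocabulary(corpus)
    print(to_vector("verify verify account", vocab))
```

A deep learning model would typically replace these counts with dense vectors learned during training, but the interface is the same: text goes in and a numeric vector comes out.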
2.2 Literature Review on Human Behavior Features
Several previous studies on classification have used behavior features. The human behavior features used
generally belong to the stylometric class. Stylometry is the analysis of quantifiable features such as
sentence length, vocabulary, and frequencies. Therefore, any text-related property that can be measured is
classified as a stylometric feature (Gomez Adorno et al., 2018).
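The quantities mentioned above (sentence length, vocabulary, and frequencies) can be measured with a few lines of code. The following is a minimal sketch under simple assumptions: sentences end at ".", "!" or "?", and words are runs of letters and apostrophes. The function name `basic_stylometry` is hypothetical.

```python
import re
from collections import Counter


def basic_stylometry(text):
    """Compute average sentence length, vocabulary size, and word frequencies."""
    # Naive sentence split on terminal punctuation (an assumption of this sketch).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "vocabulary_size": len(set(words)),
        "word_frequencies": Counter(words),
    }


if __name__ == "__main__":
    print(basic_stylometry("Verify your account now. Your account is locked!"))
```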
Stylometric features are divided into two categories: low-level and high-level features. Low-level features
cover the number of words, characters, n-grams, etc., and high-level features are linguistic features
(rhythmic, grammatical, tonal). Each category has subcategories, word-based and character-based, depending
on the context provided (Lagutina et al., 2020). Typically, the set of stylometric features used is divided
into five categories (Sharon Belvisi et al., 2020):
Lexical: Set of characters and words (e.g. character count, word count, vocabulary richness).
Structural: The way the writer organizes the elements of the text (e.g. line count, paragraph count, etc.).
Content-specific: Frequency of particular keywords in the text.
Syntactic: Syntax of the text (e.g. punctuation, function words).
Idiosyncratic: Elements unique to the author (e.g. misspelt words).
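The five categories above can be sketched as one feature extractor that returns a small group of measurements per category. The keyword list, function-word list, and misspelling list below are tiny hypothetical examples chosen for illustration; a real study would need much larger, validated resources, and the cited works may define these features differently.

```python
import re
import string

# Hypothetical resources, for illustration only.
KEYWORDS = ("verify", "password", "urgent")
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "your"}
COMMON_MISSPELLINGS = {"recieve", "acount", "verfy"}


def stylometric_features(text):
    """Return one group of simple measurements for each of the five categories."""
    words = re.findall(r"[a-z']+", text.lower())
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return {
        "lexical": {
            "char_count": len(text),
            "word_count": len(words),
            # Type-token ratio as a simple proxy for vocabulary richness.
            "vocabulary_richness": len(set(words)) / len(words) if words else 0.0,
        },
        "structural": {
            "line_count": len(text.splitlines()),
            "paragraph_count": len(paragraphs),
        },
        "content_specific": {k: words.count(k) for k in KEYWORDS},
        "syntactic": {
            "punctuation_count": sum(ch in string.punctuation for ch in text),
            "function_word_count": sum(w in FUNCTION_WORDS for w in words),
        },
        "idiosyncratic": {
            "misspelt_count": sum(w in COMMON_MISSPELLINGS for w in words),
        },
    }


if __name__ == "__main__":
    sample = "Urgent: verify your acount.\n\nEnter your password to recieve access."
    for category, values in stylometric_features(sample).items():
        print(category, values)
```

Grouping the output by category keeps it easy to test each category on its own or in combination with the others.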
3. Proposed Method
From the explanation in the preceding sections, it can be seen that stylometric features, which are generally
used for authorship identification, have not yet had a significant impact on identifying deceptive phishing
emails. The research methodology in this study is based on experimental research and focuses on
determining the best behavior features for detecting phishing emails. After the dataset is processed using
measurable and observable features, various experiments are conducted to obtain satisfying
E- Proceedings of The 5th International Multi-Conference on Artificial Intelligence Technology (MCAIT 2021) [117]
Artificial Intelligence in the 4th Industrial Revolution