Page 130 - The-5th-MCAIT2021-eProceeding

2. Literature Review

        2.1  Literature Review on Phishing Email Classification Features

           There are several types of email phishing attacks, namely Deceptive Phishing, Spear Phishing, and
        Whaling (Birlea, M.C., 2020). Deceptive phishing attacks are most commonly carried out by email. Such
        emails differ in content in terms of writing, word choice, and language style, that is, in terms of
        stylometric features. In authorship identification, stylometric features are used to identify the author
        of an object that consists mostly of text (Kumar et al., 2018). It is still uncertain how stylometric features
        combined with email features perform in deceptive phishing email classification and detection. In other
        words, there are still human behavior features, e.g. word choices, grammar, and emotion of context, that
        can be examined further for deceptive phishing email classification and combined with email behavior
        features (header, body, and URL) (Xiujuan et al., 2019).
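           As a rough illustration of the email behavior features (header, body, and URL) mentioned above, the
        following sketch uses only the Python standard library. The sample message, the function name
        extract_email_features, and the particular features chosen are assumptions made for illustration; they
        are not taken from the cited works.

```python
import re
from email import message_from_string

# Hypothetical sample message, invented purely for illustration.
RAW_EMAIL = """\
From: Support Team <support@example.com>
Reply-To: helpdesk@example.net
Subject: Urgent: verify your account
Content-Type: text/plain

Dear customer, your account will be suspended.
Verify now at http://example.org/login or http://192.0.2.10/verify
"""

URL_PATTERN = re.compile(r"https?://\S+")
# URLs whose host is a raw IPv4 address (a simplified check, not a validator).
IP_URL_PATTERN = re.compile(r"https?://\d{1,3}(\.\d{1,3}){3}")


def extract_email_features(raw: str) -> dict:
    """Return a few header, body, and URL features of a non-multipart email."""
    msg = message_from_string(raw)
    body = msg.get_payload()  # a plain string for a non-multipart message
    urls = URL_PATTERN.findall(body)
    return {
        # Header features
        "has_reply_to": msg["Reply-To"] is not None,
        "subject_word_count": len((msg["Subject"] or "").split()),
        # Body feature
        "body_word_count": len(body.split()),
        # URL features
        "url_count": len(urls),
        "ip_url_count": sum(1 for u in urls if IP_URL_PATTERN.match(u)),
    }


features = extract_email_features(RAW_EMAIL)
print(features)
```

           Real messages are often multipart or HTML, so a practical extractor would need to walk the message
        parts; this sketch deliberately ignores that case.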
           In current classification research, feature selection and extraction are the most important factors
        in determining how good the classification result will be. In text classification, several types of feature
        representation are used, one of which is the linguistic aspect, covering the writing, grammar, word
        choices, and tone of the text part of the email. Email classification commonly uses a machine learning
        or deep learning approach. In machine learning, features are engineered and extracted manually from
        the dataset and passed to the classifier, whereas in deep learning, embedding is one of the methods
        used to convert the features into a vector space model. As for the representation of the document, the
        selected behavior features are transformed into an embedding or vector form, which deep learning
        requires in order to process them (Gomez Adorno et al., 2018). In general, the combination of selected
        behavior features becomes the main basis for determining whether an email is categorized as phishing
        or not phishing, in either a machine learning or a deep learning approach.
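           To make the idea of converting text into a vector form concrete, the sketch below maps tokens to
        indices, looks each index up in an embedding table, and averages the rows into one fixed-length vector.
        It is a toy under stated assumptions: the table is filled with seeded random numbers, whereas in a deep
        learning model the embedding values would be learned during training, and all function names here
        are hypothetical.

```python
import random


def build_vocab(texts):
    """Assign an integer index to every token; index 0 is reserved for unknown tokens."""
    vocab = {"<unk>": 0}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def make_embedding_table(vocab_size, dim, seed=0):
    """Create an untrained embedding table (random values stand in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def embed_text(text, vocab, table):
    """Represent a text as the mean of its token embeddings."""
    dim = len(table[0])
    ids = [vocab.get(token, 0) for token in text.lower().split()]
    if not ids:
        return [0.0] * dim
    return [sum(table[i][d] for i in ids) / len(ids) for d in range(dim)]


corpus = ["verify your account", "your invoice"]
vocab = build_vocab(corpus)
table = make_embedding_table(len(vocab), dim=4)
vector = embed_text("verify your password", vocab, table)
print(len(vector))
```

           In the machine learning setting described above, the manually extracted feature values would instead
        be placed directly into such a fixed-length vector before being passed to the classifier.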

        2.2  Literature Review on Human Behavior Features

           Several previous studies in the area of classification have used behavior features. The human
        behavior features used so far generally belong to the stylometric class. Stylometry is the analysis of
        quantifiable features of a text, such as sentence length, vocabulary, and frequencies. Therefore, any
        property of a text that can be measured is classified as a stylometric feature (Gomez Adorno et al., 2018).
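           The quantities named above (sentence length, vocabulary, and frequencies) can be measured with a few
        lines of code. The sketch below is a minimal example, assuming a naive sentence split on ".", "!" and
        "?" and a simple letters-only word pattern; the function name basic_stylometry and the sample text are
        invented for illustration.

```python
import re
from collections import Counter


def basic_stylometry(text: str) -> dict:
    """Measure sentence length, vocabulary size, and word frequencies of a text."""
    # Naive sentence split (an assumption; real text needs a proper tokenizer).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    frequencies = Counter(words)
    return {
        "sentence_count": len(sentences),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "vocabulary_size": len(frequencies),
        "word_frequencies": frequencies,
    }


sample = "Your account is locked. Verify your account now! Act now."
stats = basic_stylometry(sample)
print(stats["sentence_count"], stats["vocabulary_size"])
```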
           Stylometric features are divided into two categories: low-level and high-level features. Low-level
        features cover counts of words, characters, n-grams, etc., and high-level features are linguistic features
        (rhythmic, grammatical, tonal). Each category has subcategories, word-based and character-based,
        depending on the context provided (Lagutina et al., 2020). Typically, the set of stylometric features
        used is divided into five categories (Sharon Belvisi et al., 2020):
           Lexical: set of characters and words (e.g. character count, word count, vocabulary richness).
           Structural: the way the writer organizes the elements of the text (e.g. line count, paragraph count).
           Content-specific: frequency of particular keywords in the text.
           Syntactic: syntax of the text (e.g. punctuation, function words).
           Idiosyncratic: elements unique to the author (e.g. misspelt words).
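           The five categories above can be sketched as one feature per line of code. The keyword list, the
        function-word list, and the misspelling list below are tiny hypothetical examples, not resources from
        the cited works, and the function name stylometric_features is likewise an assumption.

```python
import re
import string

# Tiny illustrative word lists (hypothetical; a real study would use fuller resources).
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "is", "your"}
KEYWORDS = {"verify", "account", "password", "urgent"}
KNOWN_MISSPELLINGS = {"recieve", "acount", "verfy"}


def stylometric_features(text: str) -> dict:
    """Compute one or two simple features for each of the five categories."""
    words = re.findall(r"[a-z']+", text.lower())
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return {
        # Lexical
        "char_count": len(text),
        "word_count": len(words),
        "vocabulary_richness": len(set(words)) / len(words) if words else 0.0,
        # Structural
        "line_count": len(text.splitlines()),
        "paragraph_count": len(paragraphs),
        # Content-specific
        "keyword_count": sum(1 for w in words if w in KEYWORDS),
        # Syntactic
        "punctuation_count": sum(1 for ch in text if ch in string.punctuation),
        "function_word_count": sum(1 for w in words if w in FUNCTION_WORDS),
        # Idiosyncratic
        "misspelling_count": sum(1 for w in words if w in KNOWN_MISSPELLINGS),
    }


sample = "Dear user,\nplease verfy your acount.\n\nUrgent: you will recieve a new password."
print(stylometric_features(sample))
```

           Vocabulary richness is computed here as the ratio of distinct words to total words, which is one
        common choice among several possible measures.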

        3. Proposed Method

           From the review in Section 2 above, it can be seen that stylometric features, which are generally
        used for authorship identification, have so far had no significant impact on identifying deceptive
        phishing emails. The research methodology in this study is experimental and focuses on determining
        the best behavior features for detecting phishing email. After the dataset is processed using
        measurable and observable features, various experiments are conducted to obtain satisfying






        E- Proceedings of The 5th International Multi-Conference on Artificial Intelligence Technology (MCAIT 2021)   [117]
        Artificial Intelligence in the 4th Industrial Revolution