
and the comparison showed that LSTM remained a competitor, achieving an accuracy of 74.63%. The
        authors attributed this shortfall of BERT to a problem known as 'catastrophic forgetting', whereby
        the BERT architecture quickly forgets what it had previously learnt. Similarly, Mayfield and Black (2020)
        have proposed a BERT architecture for the AES task. The authors utilized the pretrained BERT embedding
        and then applied fine-tuning. Using the ASAP dataset, the proposed BERT achieved an average accuracy
        of 64.6%.

        3. Proposed Adjusted BERT

   BERT is an advanced deep neural network architecture based on the transformer, intended to
        encode text and learn its deep linguistic context (Devlin, Chang, Lee, & Toutanova, 2018). The BERT
        architecture consists of two main models, pretraining and fine-tuning, as shown in Fig. 2. The
        pretraining contains a masked language model, where some tokens within the text are masked and the target
        is to predict these masked tokens. In addition, the pretraining contains a next sentence prediction task, where
        pairs of sentences from a text document are processed as input and the output is a binary classification of
        whether the sentences are consecutive or not. On the other hand, the fine-tuning model in BERT aims at
        accommodating a specific task such as question answering, document classification or document ranking. In this
        study, the aim of fine-tuning is to predict the score of an answer; therefore, it corresponds to document ranking.
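
        To make the fine-tuning objective concrete, the following is a minimal sketch (not the authors' exact setup),
        assuming the HuggingFace transformers library and treating score prediction as a single-output regression
        head placed on top of the pretrained bert-base-uncased encoder; the model name and head design are
        illustrative assumptions.

        # Minimal sketch: a pretrained BERT encoder with a regression head for answer scoring.
        # Model name and head design are assumptions, not the authors' exact configuration.
        import torch
        from torch import nn
        from transformers import BertModel, BertTokenizer

        class BertScorer(nn.Module):
            def __init__(self, model_name="bert-base-uncased"):
                super().__init__()
                self.bert = BertModel.from_pretrained(model_name)        # pretrained BERT weights
                self.head = nn.Linear(self.bert.config.hidden_size, 1)   # single score output

            def forward(self, input_ids, attention_mask):
                out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
                cls = out.last_hidden_state[:, 0]        # [CLS] representation of the answer
                return self.head(cls).squeeze(-1)        # predicted score

        tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        model = BertScorer()
        batch = tokenizer(["An example student answer."], return_tensors="pt",
                          padding=True, truncation=True)
        score = model(batch["input_ids"], batch["attention_mask"])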
        3.1. Unfreezing Adjustment

   In fact, the fine-tuning part of BERT has a remarkable drawback of forgetting contextual information. An
        attempt to solve this problem was depicted in the study of Howard and Ruder (2018), where an unfreezing
        mechanism is used to adjust the latter hidden layers to fit a particular task. To understand the unfreezing
        mechanism, let us assume multiple hidden layers within the fine-tuning architecture, as shown in Fig. 1. The
        earliest layers learn general features such as relationships between embedding vectors, whereas the later
        hidden layers learn very specific characteristics of the particular task. Therefore, rather than using the
        fine-tuning BERT architecture as it is, as in the studies of Rodriguez et al. (2019) and Mayfield and Black
        (2020), it is necessary to examine a suitable adjustment. To this end, the learning rates applied to the
        gradients of the latter hidden layers undergo gradual unfreezing, which can be seen as a gradual increment
        of the learning rate across the latter hidden layers.

        [Fig. 1 schematic: hidden layers H1 … Hk, where the earliest layers keep the base learning rate Lr
        (freezing) and the latter layers receive gradually increased rates Lr + i, Lr + 2i, …, Lr + xi (unfreezing).]


        Fig. 1. Arbitrary hidden layers within BERT fine-tuning
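
        A minimal sketch of the gradual-unfreezing idea in Fig. 1 is given below, assuming a HuggingFace BertModel
        and illustrative hyper-parameter values (base_lr, increment, n_unfrozen are assumptions): the earliest
        encoder layers keep the base learning rate Lr, while each of the latter layers receives an increasingly
        larger rate Lr + i, Lr + 2i, and so on.

        # Sketch of layer-wise learning rates mirroring Fig. 1 (assumed hyper-parameters).
        # Only the encoder blocks are covered here; embeddings/pooler are omitted for brevity.
        import torch
        from transformers import BertModel

        base_lr, increment, n_unfrozen = 2e-5, 1e-5, 4   # Lr, i, and number of unfrozen layers (assumed)
        bert = BertModel.from_pretrained("bert-base-uncased")
        layers = bert.encoder.layer                      # 12 transformer blocks in bert-base
        n_layers = len(layers)

        param_groups = []
        for idx, layer in enumerate(layers):
            # Latter layers get Lr + k*i (k = 1..n_unfrozen); earliest layers keep the base rate Lr.
            steps_into_unfrozen = idx - (n_layers - n_unfrozen) + 1
            lr = base_lr + max(0, steps_into_unfrozen) * increment
            param_groups.append({"params": layer.parameters(), "lr": lr})

        optimizer = torch.optim.AdamW(param_groups)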





