and the comparison showed that LSTM remained a strong competitor, achieving an accuracy of 74.63%. The
authors attributed this shortfall of BERT to a problem known as 'catastrophic forgetting', in which the
BERT architecture quickly forgets what it had previously learnt. Similarly, Mayfield and Black (2020)
proposed a BERT architecture for the AES task. The authors utilized the pretrained BERT embedding
and then applied fine-tuning. Using the ASAP dataset, the proposed BERT achieved an average accuracy of
64.6%.
3. Proposed Adjusted BERT
BERT is an advanced deep neural network architecture based on the transformer, intended to
encode text and learn its deep linguistic context (Devlin, Chang, Lee, & Toutanova, 2018). The BERT
architecture consists of two main models, pretraining and fine-tuning, as shown in Fig. 2. The
pretraining contains a masked language model, where some tokens within the text are masked and the target
is to predict these masked tokens. In addition, the pretraining contains next-sentence prediction, where pairs of
sentences from a text document are processed as input and the output is a binary classification of whether these
sentences are consecutive or not. On the other hand, the fine-tuning model in BERT aims at accommodating a specific
task such as question answering, document classification or document ranking. In this study, the aim of fine-tuning
is to predict the score of an answer; therefore, the task corresponds to document ranking.
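The sketch below illustrates this kind of fine-tuning setup for score prediction, assuming a pretrained BERT encoder with a small regression head on top. The model name, head dimension and example input are illustrative assumptions, not the exact configuration used in this study.

import torch
from torch import nn
from transformers import BertModel, BertTokenizer

class BertScorer(nn.Module):
    """Pretrained BERT encoder with a regression head predicting one score."""
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)        # pretrained encoder
        self.head = nn.Linear(self.bert.config.hidden_size, 1)   # score regressor

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = outputs.last_hidden_state[:, 0]    # [CLS] token representation
        return self.head(cls).squeeze(-1)        # predicted score per answer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["An example student answer to be scored."],
                  padding=True, truncation=True, return_tensors="pt")
model = BertScorer()
with torch.no_grad():
    score = model(batch["input_ids"], batch["attention_mask"])

During fine-tuning, the head's output would typically be compared against the human-assigned score with a regression loss such as mean squared error.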
3.1. Unfreezing Adjustment
In fact, the fine-tuning part of BERT has a remarkable drawback of forgetting contextual information. An
attempt to solve this problem was presented in the study of Howard and Ruder (2018), where an unfreezing
mechanism is used to adjust the latter hidden layers to fit a particular task. To understand the unfreezing mechanism,
let us assume multiple hidden layers within the fine-tuning architecture, as shown in Fig. 1. The earliest layers
learn general features, such as relationships between embedding vectors, whereas the later hidden
layers must learn characteristics that are specific to the particular task.
Therefore, rather than using the fine-tuning BERT architecture as it is, as in the studies of Rodriguez et al. (2019)
and Mayfield and Black (2020), it is necessary to examine a suitable adjustment. To this end, the latter hidden
layers are gradually unfrozen, which can be seen as a gradual increment of the learning rates applied to the
gradients of these latter hidden layers.
[Fig. 1 depicts hidden layers H1, H2, ..., Hk: the earliest layers remain frozen at the base learning rate Lr, while the latter layers are gradually unfrozen with incremented learning rates Lr + i, Lr + 2i, ..., Lr + xi.]
Fig. 1. Arbitrary hidden layers within BERT fine-tuning
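A minimal sketch of this per-layer learning-rate increment is given below, applying the same idea to the encoder of the scoring model sketched earlier. The base rate Lr, the increment i and the number of frozen layers are illustrative assumptions, not the values tuned in this work.

import torch
from torch import nn
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
head = nn.Linear(model.config.hidden_size, 1)     # task-specific score head

def layerwise_param_groups(base_lr=1e-5, increment=5e-6, freeze_up_to=4):
    """Freeze the earliest encoder layers; give later layers rates Lr + x*i."""
    groups = []
    layers = model.encoder.layer                  # 12 transformer blocks in bert-base
    for idx, layer in enumerate(layers):
        if idx < freeze_up_to:
            for p in layer.parameters():
                p.requires_grad = False           # frozen: no gradient update
            continue
        lr = base_lr + (idx - freeze_up_to) * increment   # gradually raised rate
        groups.append({"params": layer.parameters(), "lr": lr})
    # The task-specific head trains with the largest learning rate.
    groups.append({"params": head.parameters(),
                   "lr": base_lr + len(layers) * increment})
    return groups

optimizer = torch.optim.AdamW(layerwise_param_groups())

In this sketch, layers closest to the input keep their pretrained weights, while layers closer to the output, which carry the most task-specific information, receive progressively larger updates.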