“Exclude Length”, “Exclude PoS”, “Exclude BoW” and “Exclude Prompt”. A set of pre-processed data with every feature
group retained will be kept for comparison, referred to as “All features”. The 5 training datasets will be used to
train the three models separately. Then, we predict the scores for the test sets. The predicted scores will be passed
to the QWK evaluation metric to compute the agreement between the human raters’ scores and the AES predicted scores.
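The agreement computation described above can be sketched in a few lines. This is a minimal pure-Python illustration of quadratic weighted kappa, assuming integer scores; the function name `quadratic_weighted_kappa` and the sample scores are hypothetical, and this is not necessarily the implementation used in the experiments.

```python
from collections import Counter


def quadratic_weighted_kappa(human, predicted, min_score=None, max_score=None):
    """Quadratic weighted kappa between two equal-length lists of integer scores."""
    if len(human) != len(predicted):
        raise ValueError("score lists must have the same length")
    if not human:
        raise ValueError("score lists must not be empty")

    # Score range: taken from the data unless given explicitly.
    lo = min(min(human), min(predicted)) if min_score is None else min_score
    hi = max(max(human), max(predicted)) if max_score is None else max_score
    n_levels = hi - lo + 1
    if n_levels == 1:
        return 1.0  # only one possible score, agreement is trivially perfect

    n = len(human)
    observed = Counter(zip(human, predicted))  # joint counts O[i, j]
    hist_human = Counter(human)                # marginal counts of human scores
    hist_pred = Counter(predicted)             # marginal counts of predicted scores

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(lo, hi + 1):
        for j in range(lo, hi + 1):
            # Quadratic penalty: grows with the squared distance between scores.
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            weighted_observed += weight * observed[(i, j)]
            # Expected count if the two raters were independent.
            weighted_expected += weight * hist_human[i] * hist_pred[j] / n

    if weighted_expected == 0.0:
        return 1.0  # both raters gave one identical constant score
    return 1.0 - weighted_observed / weighted_expected


# Illustrative scores only, not data from the experiments.
human_scores = [1, 2, 3, 4, 4]
aes_scores = [1, 2, 3, 3, 4]
print(quadratic_weighted_kappa(human_scores, aes_scores))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.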
4. Results and Discussion
4.1. QWK scores result for comparison
The QWK scores for the “All features” dataset and the 4 “Exclude one feature group” datasets were computed
for the three trained models. The results are summarized in Table 1. With all features, the BLRR model
outperforms the other two models, which is in agreement with Phandi et al. (2015). For
“Exclude one feature group”, the most influential sets are bold-faced, and the least influential sets are
underlined. Excluding the length features lowers the QWK score the most in all three models, so the length
feature is the most influential feature group. The prompt feature, however, appears to be weak. For SVM and BLRR,
the QWK score of “Exclude Prompt” is higher than that of “All features” (0.636 vs. 0.601 and 0.657 vs. 0.626),
which suggests that the prompt feature causes the trained models to overfit. By overfitting, we mean that
including the prompt feature has worsened the models.
Table 1. QWK scores for all EASE features and for each “Exclude one feature group” dataset.

Feature Group               Features Used     QWK Score
                                              NB      SVM     BLRR
All feature group           All Features      0.517   0.601   0.626
Exclude one feature group   Exclude Length    0.444   0.565   0.601
                            Exclude PoS       0.511   0.583   0.617
                            Exclude BoW       0.546   0.599   0.604
                            Exclude Prompt    0.494   0.636   0.657
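The comparison in Table 1 can be made explicit by subtracting the “All features” score from each “Exclude one feature group” score. The sketch below uses only the values reported in Table 1; the names `QWK` and `qwk_change` are hypothetical helpers introduced for illustration.

```python
# QWK scores copied from Table 1.
QWK = {
    "All Features":   {"NB": 0.517, "SVM": 0.601, "BLRR": 0.626},
    "Exclude Length": {"NB": 0.444, "SVM": 0.565, "BLRR": 0.601},
    "Exclude PoS":    {"NB": 0.511, "SVM": 0.583, "BLRR": 0.617},
    "Exclude BoW":    {"NB": 0.546, "SVM": 0.599, "BLRR": 0.604},
    "Exclude Prompt": {"NB": 0.494, "SVM": 0.636, "BLRR": 0.657},
}


def qwk_change(table, baseline="All Features"):
    """Return {dataset: {model: QWK(dataset) - QWK(baseline)}} for non-baseline rows."""
    base = table[baseline]
    return {
        name: {model: round(score - base[model], 3) for model, score in scores.items()}
        for name, scores in table.items()
        if name != baseline
    }


changes = qwk_change(QWK)
for name, per_model in changes.items():
    print(f"{name:15s}", per_model)

# Most influential group per model: the exclusion with the largest drop in QWK.
for model in ("NB", "SVM", "BLRR"):
    most_influential = min(changes, key=lambda name: changes[name][model])
    print(model, "->", most_influential)
```

A negative change means the model lost agreement when the group was removed, and a positive change means the model scored higher without it, as happens for “Exclude Prompt” under SVM and BLRR.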
The EASE function uses the Natural Language ToolKit (NLTK) to tokenize the essay topic into prompt
words. Subsequently, it finds the synonyms of the prompt words through the WordNet corpus in NLTK. Then, it
counts the occurrences of the prompt words and of their synonyms in the essay. We postulate that the prompt
features are the least influential and overfit because this approach is weak at extracting semantic attributes.
Semantic attributes correspond to the contextual meaning of words or a set of words (Janda et al., 2019). It is
crucial for essay evaluation that the essay is semantically written around a prompt or essay topic (Norton, 1990).
Hence, we believe that the EASE engine takes all PoS into consideration, which caused the prompt feature to
overfit. PoS such as conjunctions and adpositions do not carry any contextual meaning, so they could add noise to
the dataset. Also, the method EASE applies to extract the semantic attributes is too simple and can be further
improved in the future. It only takes into consideration separate words instead of pairs of words or sentences,
which makes it unable to detect where a sentence or essay starts to digress. As reported by
Miltsakaki & Kukich (2000), coherence between a pair of words or sentences is the key to making text semantically
meaningful.
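The word-counting behaviour described above, and the noise that function words can introduce, can be illustrated with a small sketch. This is not the EASE source code: a toy synonym table (`TOY_SYNONYMS`) stands in for WordNet so that the example runs without NLTK data, and the function names, the `ignore` parameter and the sample texts are all assumptions made for illustration.

```python
import re

# Toy synonym table standing in for WordNet; entries are illustrative only.
TOY_SYNONYMS = {
    "computer": {"machine", "pc"},
    "effect": {"impact", "consequence"},
}


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def prompt_overlap_counts(prompt, essay, synonyms=TOY_SYNONYMS, ignore=frozenset()):
    """Count essay tokens that are prompt words, and tokens that are their synonyms.

    `ignore` is a hypothetical filter for words to drop from the prompt words,
    for example conjunctions and adpositions.
    """
    prompt_words = set(tokenize(prompt)) - set(ignore)
    synonym_words = set()
    for word in prompt_words:
        synonym_words |= synonyms.get(word, set())
    synonym_words -= prompt_words  # do not count a word in both categories

    essay_tokens = tokenize(essay)
    n_prompt = sum(token in prompt_words for token in essay_tokens)
    n_synonym = sum(token in synonym_words for token in essay_tokens)
    return n_prompt, n_synonym


prompt = "The effect of the computer on people"
essay = "The machine has an impact on the people and the computer helps"

# Every prompt token counts, including "the" and "on".
print(prompt_overlap_counts(prompt, essay))
# Same texts with a few function words filtered out of the prompt words.
print(prompt_overlap_counts(prompt, essay, ignore={"the", "of", "on"}))
```

In this toy case, most prompt-word matches in the unfiltered count come from “the” and “on”, which say nothing about whether the essay stays on topic. The counts are also taken word by word, so they cannot reflect coherence between pairs of words or sentences.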
5. Conclusion
We conducted experiments to investigate the weak point of the generic approach of feature engineering in AES.
We compared the four types of features extracted from EASE by using the “Exclude one feature
group” datasets and comparing their QWK scores with those of the “All features” set. From the comparison between the
sets, our work has shown that the prompt feature is the weakest feature among the four types of features. The
E- Proceedings of The 5th International Multi-Conference on Artificial Intelligence Technology (MCAIT 2021) [38]
Artificial Intelligence in the 4th Industrial Revolution