Review of Malay Named Entity Recognition
Hafsah a*, Saidah Saad b, Lailatul Qadri Zakaria c
a,b,c Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi,
Selangor, Malaysia
*Email: p93826@siswa.ukm.edu.my
Abstract
Named Entity Recognition (NER) is a technique used extensively to extract useful information from unstructured natural
language document collections. NER is an important information extraction task that should be developed for all
languages and almost all domains. Most NER research has been done for English. Malay NER cannot simply reuse an
English corpus because English and Malay differ in speech structure and morphology. The discussion in this review
shows that research on Malay named entity recognition is still at an early stage.
Keywords: Named Entity Recognition; Natural Language Processing; Malay language; NER approach
1. Introduction
Named Entity Recognition (NER) is a subfield of Natural Language Processing (NLP) research, which in turn
belongs to the field of Artificial Intelligence (AI). NER is the initial step in
information extraction: it seeks to find entities mentioned in text and classify them into predetermined
categories, such as person names, organizations, locations, time expressions, amounts, monetary values,
percentages, etc. (Saad & Mansor, 2018).
NER research has been carried out for various purposes, and the methods used also vary, from rule-based
approaches to Machine Learning (ML) (Saad & Mansor, 2018). The
rule-based approach applies hand-defined rules based on linguistic knowledge, with analysis carried out at the
syntactic and semantic levels (Goyal et al., 2018). This method is limited because as many rules as possible must
be defined to get optimal results (Nadia & Omar, 2019). To overcome this limitation, the ML
approach learns patterns from the data, provided that sufficient data sets are available (Salini et al., 2017).
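The rule-based approach described above can be illustrated with a minimal sketch. It is not taken from any of the reviewed systems: the cue lists (`PERSON_TITLES`, `LOCATION_CUES`), the function name `tag_entities`, and the example sentences are hypothetical and chosen only to show how hand-written rules work and why many of them are needed.

```python
import re

# Hypothetical cue lists for illustration only; real systems use far larger ones.
PERSON_TITLES = ("Encik", "Puan", "Datuk", "Dr.")
LOCATION_CUES = ("di", "ke", "dari")  # Malay prepositions: at/in, to, from

# One or more consecutive capitalized tokens, e.g. "Kuala Lumpur".
CAP_SEQ = r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*"


def build_rule(cues):
    """Compile a rule: a cue word followed by a capitalized token sequence."""
    alternatives = "|".join(re.escape(cue) for cue in cues)
    return re.compile(r"\b(?:%s)\s+(%s)" % (alternatives, CAP_SEQ))


RULES = [
    ("PERSON", build_rule(PERSON_TITLES)),
    ("LOCATION", build_rule(LOCATION_CUES)),
]


def tag_entities(text):
    """Return (entity_text, label) pairs in order of appearance."""
    found = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            found.append((match.start(1), match.group(1), label))
    found.sort()
    return [(entity, label) for _, entity, label in found]


print(tag_entities("Encik Ahmad bekerja di Kuala Lumpur."))
# A name without a title cue triggers no rule, so nothing is tagged.
print(tag_entities("Ahmad bekerja keras."))
```

The second call returns an empty list: the person name is missed because no rule covers names that appear without a title. Each such gap requires another hand-written rule, which is the limitation that motivates learning patterns from annotated data instead.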
Another approach that has recently become widely used is deep learning (DL), which recognizes patterns of entities
in sentences (Li et al., 2020). NER is an important information extraction task that should
be developed for all languages and almost all domains. However, the task differs according to
language, domain, and system development approach (Patil et al., 2019).
Most NER research focuses on English and other European languages. However,
as research in this field has developed, more and more languages have been studied.
English and Japanese are well explored in MUC-6 [5] and earlier works. German, Dutch, and Spanish are
discussed at the CoNLL conference. Chinese has been studied in an abundant literature, as have French,
Greek, and Italian. Arabic has started to receive a lot of attention in large-scale projects such as Global
Autonomous Language Exploitation (GALE). In time, Asian and several other languages were also considered
(Goyal et al., 2018).
E-Proceedings of The 5th International Multi-Conference on Artificial Intelligence Technology (MCAIT 2021) [200]
Artificial Intelligence in the 4th Industrial Revolution