to do entity labeling on unlabeled data with data from DBpedia. For testing, they use the SVM algorithm for
labeled data modeling. The results obtained in this study are a precision value of 73.6%, a recall value of 80.1%,
and an F1 of 76.5%.
Bhasuran et al. (2016) have proposed a biomedical NER based on a stacked ensemble approach. The authors
applied several domain-specific, morphological, orthographical, and contextual features, Conditional Random
Fields (CRF) based modeling, and two fuzzy matching algorithms for extracting disease-named entities. Some
post-processing measures are also applied to enhance the performance of the model.
Salleh et al. (2017) proposed a Malay-language NER built with Python CRFsuite and several features.
Features such as capitalization, lowercase, previous and closest words, digits, word forms, and word POS
tags, among others, show potential for increasing the accuracy of named entity recognition in Malay.
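The feature types listed above can be illustrated with a minimal sketch. It assumes each sentence is a list of (word, POS tag) pairs; the function names, feature keys, example sentence, and tag labels are hypothetical and are not taken from Salleh et al. (2017).

```python
def word_shape(token):
    """Map characters to a coarse shape, e.g. 'Kuala' -> 'Xxxxx'."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


def token_features(sentence, i):
    """Build a feature dict for the i-th (word, pos_tag) pair of a sentence."""
    word, pos = sentence[i]
    features = {
        "word.lower": word.lower(),       # lowercase form
        "word.istitle": word.istitle(),   # capitalization
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),   # digits
        "word.shape": word_shape(word),   # word form
        "pos": pos,                       # POS tag of the word
    }
    if i > 0:
        prev_word, prev_pos = sentence[i - 1]
        features["prev.word.lower"] = prev_word.lower()
        features["prev.pos"] = prev_pos
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sentence) - 1:
        next_word, next_pos = sentence[i + 1]
        features["next.word.lower"] = next_word.lower()
        features["next.pos"] = next_pos
    else:
        features["EOS"] = True  # end of sentence
    return features


def sentence_features(sentence):
    """Feature dicts for every token in the sentence."""
    return [token_features(sentence, i) for i in range(len(sentence))]


# Toy sentence; the POS labels are illustrative assumptions only.
sentence = [("Ali", "NNP"), ("tinggal", "VB"), ("di", "IN"),
            ("Kuala", "NNP"), ("Lumpur", "NNP")]
features = sentence_features(sentence)
```

A sequence of per-token feature dictionaries like this is the general kind of input a CRF toolkit is trained on, paired with one entity label per token.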
Salleh et al. (2018) proposed a Malay NER using the fuzzy c-means method with the RapidMiner software and
a dataset from Bernama Malay news. The types of named entities analyzed are person, location, organization,
and facility. Overall, the clustering-based matching gave markedly good results, with an accuracy of
98.57%.
2.3. Hybrid Named Entity Recognizers
A hybrid named entity recognition system combines both rule-based and machine learning techniques.
Such systems combine the strongest points of each method: the adaptability and flexibility of
machine learning approaches, and rules to improve efficiency. Keretna et al. (2014) present a hybrid model
comprising rule-based and lexicon-based techniques for extracting drug named entities from informal
and unstructured medical text. The experimental outcome indicates that integrating many valuable rules into a
lexicon-based technique can enhance performance on the BioNER problem. The proposed model
achieves an F-score of 66.97%.
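As a rough illustration of how rules can extend a lexicon-based tagger, the sketch below first checks a small lexicon and then falls back to a suffix rule. The lexicon entries, suffix list, tag names, and function name are all assumptions made for this example; they are not the rules or resources of Keretna et al. (2014).

```python
# Toy lexicon and suffix rules; both are illustrative assumptions.
DRUG_LEXICON = {"paracetamol", "ibuprofen", "amoxicillin"}
DRUG_SUFFIXES = ("cillin", "mycin", "statin")


def tag_drugs(tokens):
    """Tag each token as a drug via lexicon lookup, then via a suffix rule."""
    tags = []
    for token in tokens:
        word = token.lower().strip(".,;:!?")
        if word in DRUG_LEXICON:
            tags.append("DRUG-lexicon")   # exact lexicon match
        elif word.endswith(DRUG_SUFFIXES):
            tags.append("DRUG-rule")      # not in lexicon, caught by a rule
        else:
            tags.append("O")              # outside any entity
    return tags


tokens = "took amoxicillin and erythromycin with paracetamol".split()
tags = tag_drugs(tokens)
```

Here "erythromycin" is missing from the toy lexicon but is still tagged by the suffix rule, which is the kind of coverage gain a rule layer is meant to add.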
Munkhjargal et al. (2015) have introduced a Mongolian named entity recognizer. The authors used statistical
techniques, namely Maximum Entropy, SVM, CRF, gazetteers, and string matching patterns, to handle the
vocabulary words. The optimal ensemble reached 90.59% precision, 85.88% recall, and 88.17% F1 score.
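The F1 scores quoted in this section are the harmonic mean of precision and recall. A minimal check, using the figures reported above for the Mongolian recognizer (the function name is ours):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall in percent, as reported for the optimal ensemble.
f1 = f1_score(90.59, 85.88)
print(round(f1, 2))  # 88.17
```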
3. Issues and Challenges in Malay Named Entity Recognition
Most documents on websites are unstructured, making it difficult to obtain the relevant information in
structured form. Information extraction is the process of converting unstructured data into structured data. Thus,
the extraction of named entities is a challenging task. Apart from the techniques used, several factors affect
the performance of NER tasks, such as language factors, domain factors, and entity type factors. Several researchers
have studied Malay-language NER. Most research on NER in Malay uses a rule-based approach
and a supervised approach (Nadia & Omar, 2019).
An NER system's performance is highly dependent on language resources such as a POS tagger,
morphological analyzer, chunker, and parser. The Malay language shares some features with English,
such as capitalization and POS tags (for example, proper noun), that help to recognize entities (Morsidi et al., 2016). A
supervised named entity recognition system requires large annotated corpora to classify named entities
in the test data. This is a challenge because the Malay language corpus is still limited compared to the English
corpus. Malay-language NER cannot use the English corpus because there are differences in speech
structure and morphology between English and Malay (Nadia & Omar, 2019).
Domain factors have a significant influence on the named entity recognition task. Various domains have been
explored for NER tasks, such as news articles, crime, and medicine.
E- Proceedings of The 5th International Multi-Conference on Artificial Intelligence Technology (MCAIT 2021) [202]
Artificial Intelligence in the 4th Industrial Revolution