Analyzing Iraqi Dialects Unique Features for Dialect
Identification
Ali Abdulraheem a,*, Lailatul Qadri Zakaria b, Nazlia Omar c
a,b,c Center for Artificial Intelligence Technology (CAIT), Faculty of Information Science and Technology, Universiti Kebangsaan
Malaysia, 43600 Bangi, Selangor Darul Ehsan, Malaysia
a,* Email: aaj8068@gmail.com
Abstract
With the dramatic expansion of textual information, language identification has emerged as a task for analyzing such huge
amounts of text. Dialect identification is a sub-task of language identification that addresses a particular language and its
sub-dialects. This paper presents a set of features for improving the classification of Iraqi Arabic sub-dialects. It
addresses sentence-level, fine-grained Iraqi Arabic dialect identification across three distinct sub-dialects (Baghdadi,
Maslawi, and Basrawi). Iraqi Arabic dialect recognition is a complex process because related varieties share common
traits, such as the same characters and vocabulary. The paper investigates an extensive space of features for identifying
Iraqi Arabic sub-dialects by exploring a variety of feature extraction techniques (special character, POS, grammatical
individual, case, gender, and number features), together with a machine learning model based on Multinomial Naive
Bayes (MNB). This is the first preliminary analysis of Iraqi Arabic sub-dialects, which have not yet received attention in
computational linguistics.
Keywords: Iraqi Arabic; Arabic morphology; Dialectal Arabic
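To make the classification setup named in the abstract concrete, the following is a minimal sketch of a Multinomial Naive Bayes classifier over sentence-level feature counts. It is not the authors' implementation. The `extract_features` function is a hypothetical stand-in that covers only word unigrams and special characters; the POS, case, gender, and number features mentioned above would require a morphological analyzer and are not reproduced here. The training sentences use placeholder tokens, not real dialect data, and add-one smoothing is an assumption.

```python
import math
from collections import Counter, defaultdict


def extract_features(sentence):
    """Hypothetical extractor: word unigrams plus one feature per special character."""
    features = ["w=" + token for token in sentence.split()]
    features += [
        "special=" + ch
        for ch in sentence
        if not ch.isalnum() and not ch.isspace()
    ]
    return features


class MultinomialNB:
    """Multinomial Naive Bayes over feature counts with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        # docs: list of feature lists; labels: one sub-dialect label per document
        self.n_docs = len(docs)
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for features, label in zip(docs, labels):
            self.feature_counts[label].update(features)
        self.vocab = {f for c in self.feature_counts.values() for f in c}
        return self

    def log_scores(self, features):
        # log P(class) + sum of log P(feature | class) for each class
        scores = {}
        vocab_size = len(self.vocab)
        for label, n_label in self.class_counts.items():
            total = sum(self.feature_counts[label].values())
            score = math.log(n_label / self.n_docs)
            for f in features:
                if f not in self.vocab:
                    continue  # skip features never seen in training
                count = self.feature_counts[label][f]
                score += math.log((count + 1) / (total + vocab_size))
            scores[label] = score
        return scores

    def predict(self, features):
        scores = self.log_scores(features)
        return max(scores, key=scores.get)


# Placeholder training sentences (synthetic tokens, NOT real dialect data).
train = [
    ("aa bb cc", "Baghdadi"),
    ("aa bb dd", "Baghdadi"),
    ("ee ff cc", "Maslawi"),
    ("ee ff gg", "Maslawi"),
    ("hh ii cc", "Basrawi"),
    ("hh ii jj", "Basrawi"),
]

model = MultinomialNB().fit(
    [extract_features(sentence) for sentence, _ in train],
    [label for _, label in train],
)

print(model.predict(extract_features("ee ff")))
```

Each additional feature family would simply add more entries to the feature list returned by `extract_features`; the classifier itself only sees counts, so it does not need to change.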
1. Introduction
Arabic is one of the world's oldest languages, and it has been evolving over the decades. The Arabic language can be
classified into three categories: Modern Standard Arabic (MSA), Classical Arabic (CA), and Arabic dialects
(AD). MSA is used formally on official platforms, including educational institutions, television broadcasts, and
newspapers. CA is the language of the Holy Quran and Hadiths. It can also be viewed as the language of
pre-Islamic poets. AD comprises the different Arabic dialects spoken across Arab countries. Such
dialects have no written tradition, and they are shaped by the varying accents used
in different cultures (Belkredim and Sebai 2009). Arab people use AD more than MSA in their everyday lives.
AD differs from CA and MSA in terms of morphology, phonology, lexicon, and syntax (Janet 2007).
The different varieties of AD pose significant challenges for natural language processing tasks such as
sentiment analysis, opinion mining, author profiling, and machine translation.
2. Related Work and Background
Arabic is known as a morphologically rich and complex language, which presents significant challenges for
dialect identification. Arabic dialect identification is a crucial topic for most Arabic NLP research because of
the diversity of the Arabic dialects. Some ADs within the same country share features such as characters,
vocabulary, and basic language sets, which amplifies the complexity of the dialect identification task.
Some studies have used different methods, such as game-based theory (Alshutayri & Atwell 2018a; Osman
et al. 2016), to automatically identify dialects in Arabic text. Bouamor et al. (2019) proposed a simple
E- Proceedings of The 5th International Multi-Conference on Artificial Intelligence Technology (MCAIT 2021) [59]
Artificial Intelligence in the 4th Industrial Revolution