
3.2. Lexicon

           The Iraqi dialects are rich in lexicon, as many words are borrowed from the Mesopotamian civilizations and
        from the languages of neighboring countries, such as Turkish, Persian, and Assyrian, as well as from other
        Arabic dialects. For example, the word "روزنامه, rwznAmh / calendar" is borrowed from Persian, while the word
        "خاشوكه, ḫʾašwkh / spoon" is borrowed from Turkish.

        4. The Development of Iraqi Corpus

           Iraqi dialects are increasingly used on social media, which provides a good source of data for NLP
        tasks. We have developed an annotated morphological Iraqi corpus called Al-Rafidayn, which contains 3,000
        sentences. The morphological corpus was developed in four stages. In the first stage, annotation labeling
        was done through an online web survey. In the second stage, Arabic Buckwalter transliteration and lemmas
        were used to recognize the Iraqi Indo-Iranian letters. In the third stage, online morphological tools were
        used to annotate grammatical properties. Finally, in the fourth stage, agreement was reached with the
        annotators to correct errors and verify the quality of the previous stages.
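
        The second stage can be illustrated with a minimal sketch of Buckwalter transliteration. The symbol table
        below is the standard Buckwalter mapping for the basic Arabic letters and diacritics. The function name
        transliterate and the idea of reporting Arabic-script characters that fall outside the standard table (such
        as Indo-Iranian letters) are assumptions made for illustration, not the tooling used to build the corpus.

```python
# Standard Buckwalter symbols for two contiguous Unicode ranges:
# U+0621..U+063A (hamza .. ghain) and U+0640..U+0652 (tatweel .. sukun).
_LETTERS = "'|>&<}AbptvjHxd*rzs$SDTZEg"
_REST = "_fqklmnhwYyFNKaui~o"

BUCKWALTER = {chr(0x0621 + i): s for i, s in enumerate(_LETTERS)}
BUCKWALTER.update({chr(0x0640 + i): s for i, s in enumerate(_REST)})


def transliterate(text):
    """Return (transliteration, unmapped Arabic-block characters).

    Hypothetical helper: characters in the Arabic Unicode block that are
    not in the standard table are kept as-is and reported separately.
    """
    out, unmapped = [], []
    for ch in text:
        if ch in BUCKWALTER:
            out.append(BUCKWALTER[ch])
        else:
            out.append(ch)
            if "\u0600" <= ch <= "\u06FF":
                unmapped.append(ch)
    return "".join(out), unmapped


# The Persian loanword for "calendar" cited in Section 3.2.
word = "\u0631\u0648\u0632\u0646\u0627\u0645\u0647"
print(transliterate(word))
```

        A character such as U+06AF (gaf) is not part of the standard table, so it would appear in the second
        element of the returned pair and could then be handled by a dialect-specific rule.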

        5. Methodology

           This work aims to identify a set of features that improve the classification of Iraqi dialects. The study
        combines a variety of feature extraction techniques with a machine learning model, Multinomial Naive Bayes
        (MNB), which is used for both training and testing.
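
        As a rough illustration of how an MNB classifier works on count features, the following is a minimal
        pure-Python sketch with Laplace smoothing. The class name TinyMultinomialNB, the feature names, and the
        toy training rows are invented for this example. It is not the implementation or the data used in this
        study.

```python
import math
from collections import Counter


class TinyMultinomialNB:
    """Minimal Multinomial Naive Bayes over dicts of feature counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace (add-alpha) smoothing constant

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        vocab = sorted({f for row in rows for f in row})
        n = len(labels)
        label_counts = Counter(labels)
        # log P(class)
        self.log_prior = {c: math.log(label_counts[c] / n) for c in self.classes}
        # Total count of each feature per class
        counts = {c: Counter() for c in self.classes}
        for row, label in zip(rows, labels):
            counts[label].update(row)
        # log P(feature | class) with smoothing
        self.log_lik = {}
        for c in self.classes:
            total = sum(counts[c].values()) + self.alpha * len(vocab)
            self.log_lik[c] = {
                f: math.log((counts[c][f] + self.alpha) / total) for f in vocab
            }
        return self

    def predict(self, row):
        scores = {}
        for c in self.classes:
            score = self.log_prior[c]
            for f, k in row.items():
                if f in self.log_lik[c]:  # ignore features unseen in training
                    score += k * self.log_lik[c][f]
            scores[c] = score
        return max(scores, key=scores.get)


# Toy, invented data: two hypothetical count features and two dialect labels.
rows = [
    {"feature_a": 3, "feature_b": 0},
    {"feature_a": 2, "feature_b": 1},
    {"feature_a": 0, "feature_b": 3},
    {"feature_a": 1, "feature_b": 2},
]
labels = ["MOS", "MOS", "BAG", "BAG"]
model = TinyMultinomialNB().fit(rows, labels)
print(model.predict({"feature_a": 2}))
```

        In practice a library implementation would normally be preferred. The sketch only makes the scoring rule
        explicit: the log prior plus the count-weighted sum of smoothed log likelihoods.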

        5.1. Features Extraction

           Extracting a set of discriminative features from the data helps in distinguishing the different classes. This
        study aims to extract handcrafted Iraqi morphological and lexical features from the annotated Iraqi
        morphological corpus in order to distinguish the Iraqi sub-dialects. The corpus was developed to include
        inflections and diacritics because the Iraqi dialects use them as characters. For example, in the word
        “لكِ, laki”, which means "is yours", the diacritic kasra is replaced by the letter "ya, ي" to become
        “الكي, laky” in the MOS dialect to express the feminine pronoun, while BAG and BAS tend to convert the
        letter “kaf” and the kasra to the letter “Jim, ج”, so that the word becomes “الج, laǧ”. This may help the
        feature extraction process make use of a variety of linguistically motivated feature sets, namely
        morphological content. The features considered are shown in Table 1.

        Table 1. Features extracted for the Iraqi Arabic dialects

          Features                      Description
          Special character             The five special Iraqi letters (Indo-Iranian letters) along with possible
                                        derivational inflections.
          POS features                  Nouns, number of words, proper nouns, adjectives, adverbs, number of pronouns,
                                        verbs, particles, prepositions, abbreviations, punctuation, conjunctions,
                                        interjections, foreign letters.
          Case features                 Nominative, accusative, and genitive.
          Gender features               Feminine and masculine.
          Number features               Singular words, plural words, and dual words.
          Grammatical person features   1st person, 2nd person, 3rd person.
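
        The feature groups in Table 1 could be turned into count features along the following lines. This is a
        sketch under stated assumptions: the token schema (the keys form, pos, case, gender, number, person) and
        the function extract_features are hypothetical, and the set of five special letters is a guess, since the
        letters are not listed in this section.

```python
from collections import Counter

# ASSUMPTION: this section does not name the five special letters; the set
# below (pe, che, zhe, ve, gaf) is a plausible placeholder only.
SPECIAL_LETTERS = {"\u067E", "\u0686", "\u0698", "\u06A4", "\u06AF"}

# Annotation layers corresponding to the feature groups in Table 1.
CATEGORIES = ("pos", "case", "gender", "number", "person")


def extract_features(tokens):
    """Count Table 1 style features for one annotated sentence.

    `tokens` is a list of dicts in a hypothetical schema, e.g.
    {"form": "...", "pos": "noun", "gender": "feminine"}.
    """
    feats = Counter()
    feats["num_words"] = len(tokens)
    for tok in tokens:
        # Special-character feature: occurrences of the assumed letters.
        feats["special_char"] += sum(
            ch in SPECIAL_LETTERS for ch in tok.get("form", "")
        )
        # One count per (layer, value) pair that is annotated on the token.
        for cat in CATEGORIES:
            value = tok.get(cat)
            if value:
                feats[f"{cat}={value}"] += 1
    return feats


# Invented example sentence with two annotated tokens.
sentence = [
    {"form": "\u06AF\u0627\u0644", "pos": "verb", "person": "3",
     "gender": "masculine", "number": "singular"},
    {"form": "\u0644\u0643", "pos": "pronoun", "person": "2",
     "gender": "feminine", "number": "singular"},
]
print(extract_features(sentence))
```

        The resulting counts are non-negative integers, which is the kind of input a Multinomial Naive Bayes
        model expects.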

        E- Proceedings of The 5th International Multi-Conference on Artificial Intelligence Technology (MCAIT 2021)   [61]
        Artificial Intelligence in the 4th Industrial Revolution