
objective is to collect only text written in informal Malay. We disregard other characteristics to concentrate
        entirely on the textual characteristics and the frequency with which informal Malay was used in social media
        texts.
        4. Analysis


           According to our findings, the most frequently occurring words in this corpus are informal terms: aku (I/me;
        887), ni (this; 548), tu (that; 537), nak (want; 514), dia (he/she/him; 389), tak (no; 374), la (358), dah (done;
        281), yg (which; 274), and nk (want; 229). This suggests that Malaysians tend to use informal Malay when
        writing social media texts. Neunerdt and Zesch (2016) identify conversational language, dialogue styles,
        social media language, and informal writing as the primary characteristics of social media texts. The language
        and structure of our corpus conform to all of these characteristics. Thus, although our corpus is restricted
        to the words contained in tweets because we deliberately omitted additional tweet features, it still
        accomplishes our study’s objective.
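As an illustration of how a frequency ranking like the one above could be produced, the following is a minimal Python sketch. It assumes lowercasing and whitespace tokenization, which the paper does not specify, and the function name `top_words` and the three sample tweets are hypothetical, not taken from the corpus.

```python
from collections import Counter


def top_words(tweets, k=10):
    """Return the k most frequent tokens across all tweets.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace; the original study may have tokenized differently.
    """
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.lower().split())
    return counts.most_common(k)


# Hypothetical toy input, not drawn from the actual corpus.
sample_tweets = [
    "aku nak makan ni",
    "dia tak nak tu",
    "aku dah makan la",
]

top = top_words(sample_tweets, k=3)
for word, freq in top:
    print(word, freq)
```

Applied to the full set of collected tweets, the same counting step would yield a ranked list of (word, frequency) pairs comparable in form to the one reported above.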
           To aid researchers in better comprehending this corpus’s fundamental properties, we provide several
        potentially useful statistics. The Twitter Advanced Search feature was used to retrieve 1,796 tweets written in
        informal Malay. The final dataset includes 38,714 tokens and 5,387 different word types. Additionally, the
        token and type counts for the basic n-grams are as follows: unigrams (tokens: 37,382; types: 8,150), bigrams
        (tokens: 37,381; types: 32,280), trigrams (tokens: 37,380; types: 36,940), and 4-grams (tokens: 37,379;
        types: 37,207).
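The n-gram token counts above decrease by exactly one for each increase in n, which would be consistent with n-grams extracted from a single continuous sequence of N tokens (yielding N - n + 1 n-grams), although the paper does not state how they were computed. The sketch below shows one way to obtain such token and type counts in Python under that assumption; `ngram_stats` and the toy token list are hypothetical.

```python
def ngram_stats(tokens, n):
    """Return (number of n-gram tokens, number of distinct n-gram types).

    Assumption: n-grams are taken over one continuous token sequence,
    so a sequence of N tokens yields N - n + 1 n-grams.
    """
    # zip over n shifted views of the sequence to form sliding windows
    grams = list(zip(*(tokens[i:] for i in range(n))))
    return len(grams), len(set(grams))


# Hypothetical toy sequence, not drawn from the actual corpus.
toy_tokens = "aku nak makan aku nak tidur".split()

stats = {n: ngram_stats(toy_tokens, n) for n in range(1, 5)}
for n, (n_tokens, n_types) in stats.items():
    print(f"{n}-grams: tokens={n_tokens}, types={n_types}")
```

As n grows, repeated sequences become rarer, so the number of types approaches the number of tokens, the same pattern visible in the reported trigram and 4-gram figures.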

        5. Conclusion

           Overall, this work gathered tweets written in informal Malay from Twitter. The data were minimally
        pre-processed to eliminate duplicate tweets. The analysis of the corpus reveals that informal terms are the
        most frequently occurring words. It also shows that, although the corpus is restricted to the words contained
        in tweets without any additional tweet features, it achieves the study’s objective. Several aspects of this
        work can be improved in the future. For instance, the corpus size can be increased by extending the date
        ranges over which tweets are collected, and the keywords used to extract relevant tweets can be improved by
        adopting new keywords from other research or by learning the variations of each word. This work should
        benefit those specializing in informal Malay and related fields.


        Acknowledgements

           This research is partially funded by Universiti Kebangsaan Malaysia under the research grant code
        FRGS/1/2020/ICT02/UKM/02/1.


        References

        Alshalabi, H., Tiun, S., Omar, N., & Albared, M. (2013). Experiments on the use of feature selection and
        machine learning methods in automatic Malay text categorization. Procedia Technology, 11, 748-754.
        Anbananthen, K. S. M., Krishnan, J. K., Sayeed, M. S., & Muniapan, P. (2017). Comparison of stochastic and
        rule-based POS tagging on Malay online text. American Journal of Applied Sciences, 14(9), 843-851.
        Ariffin, S. N. A. N., & Tiun, S. (2018). Part-of-Speech Tagger for Malay Social Media Texts. GEMA






        E- Proceedings of The 5th International Multi-Conference on Artificial Intelligence Technology (MCAIT 2021)   [48]
        Artificial Intelligence in the 4th Industrial Revolution