Page 61 - The-5th-MCAIT2021-eProceeding

objective is to collect only text written in informal Malay. We disregard other characteristics to concentrate
entirely on the textual characteristics and the frequency with which informal Malay was used in social media
texts.

4. Analysis

According to our findings, the most frequently occurring words in this corpus are informal terms: aku (me/I;
887), ni (this; 548), tu (that; 537), nak (want; 514), dia (he/she/him; 389), tak (no; 374), la (358), dah (done;
281), yg (which; 274), and nk (want; 229). These counts suggest that Malaysians tend to use informal
Malay when writing social media texts. Neunerdt and Zesch (2016) identify conversational language,
dialogue styles, social media language, and informal writing as the primary characteristics of social media texts.
Our findings indicate that the language and structure of our corpus conform to all of these characteristics.
This demonstrates that, although our corpus is restricted to the words contained in tweets because we
deliberately omitted additional tweet features, it still accomplishes our study’s objective.
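
A frequency ranking of this kind can be reproduced with a simple word count. The following is a minimal sketch, not the procedure used in this study: it assumes lowercase, whitespace-based tokenization, and the function name top_words and the sample tweets are hypothetical illustrations rather than corpus data.

```python
from collections import Counter


def top_words(tweets, k=10):
    """Return the k most frequent tokens as (word, count) pairs.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace; the tokenization used for the corpus may differ.
    """
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.lower().split())
    return counts.most_common(k)


# Hypothetical toy tweets, not taken from the corpus.
sample = ["aku nak makan ni", "dia tak nak tu", "aku dah makan la"]
print(top_words(sample, k=3))
```

On a real corpus, punctuation, user mentions, and URLs would likely need handling before counting, depending on which tweet features are retained.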
To help researchers understand the fundamental properties of this corpus, we provide several
potentially useful statistics. The Twitter Advanced Search feature was used to retrieve 1,796 tweets written in
informal Malay. The final dataset includes 38,714 tokens and 5,387 different word types. Additionally, the
token and type counts for the basic n-grams are as follows: unigrams
(tokens: 37,382; types: 8,150), bigrams (tokens: 37,381; types: 32,280), trigrams (tokens: 37,380; types: 36,940),
and 4-grams (tokens: 37,379; types: 37,207).
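
The n-gram token counts above decrease by exactly one with each increase in n, which is consistent with n-grams extracted from a single continuous token sequence, where a sequence of N tokens yields N - n + 1 n-grams. This is an inference from the reported figures, not a stated method. The sketch below illustrates how token and type counts could be computed under that assumption; the function name ngram_stats and the sample tokens are hypothetical.

```python
def ngram_stats(tokens, n):
    """Return (token_count, type_count) for the n-grams of a token list.

    Assumption: the whole corpus is treated as one continuous sequence,
    so a list of N tokens yields N - n + 1 n-grams (0 if N < n).
    """
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(grams), len(set(grams))


# Hypothetical toy sequence, not taken from the corpus.
tokens = "aku nak makan ni aku nak tidur".split()
for n in range(1, 5):
    total, distinct = ngram_stats(tokens, n)
    print(f"{n}-grams: tokens={total}, types={distinct}")
```

If n-grams were instead extracted per tweet, so that no n-gram crosses a tweet boundary, the token counts would drop by roughly the number of tweets at each step rather than by one.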

5. Conclusion

Overall, this work gathered tweets written in informal Malay from Twitter. The data were
minimally pre-processed to eliminate duplicate tweets. The analysis of the corpus data reveals that informal terms
are the most frequently occurring words in this corpus. This analysis demonstrates that, despite the corpus data
being restricted to the words contained in tweets without any additional tweet features, the corpus achieves the
study’s objective. Several aspects of this work can be enhanced in the future. For instance, the corpus size
can be increased by extending the date ranges over which tweets are collected, and the keywords used to extract
relevant tweets can be improved by learning new keywords from other recent research or by learning
the variants of each word. This work should benefit researchers specializing in informal Malay and
related fields.

Acknowledgements

This research is partially funded by Universiti Kebangsaan Malaysia under research grant code
FRGS/1/2020/ICT02/UKM/02/1.

References

Alshalabi, H., Tiun, S., Omar, N., & Albared, M. (2013). Experiments on the use of feature selection and
machine learning methods in automatic Malay text categorization. Procedia Technology, 11, 748-754.
Anbananthen, K. S. M., Krishnan, J. K., Sayeed, M. S., & Muniapan, P. (2017). Comparison of stochastic and
rule-based POS tagging on Malay online text. American Journal of Applied Sciences, 14(9), 843-851.
Ariffin, S. N. A. N., & Tiun, S. (2018). Part-of-Speech Tagger for Malay Social Media Texts. GEMA

E-Proceedings of The 5th International Multi-Conference on Artificial Intelligence Technology (MCAIT 2021) [48]
Artificial Intelligence in the 4th Industrial Revolution
