
Informal Malay Language Twitter Corpus


                             Siti Noor Allia Noor Ariffin a,*, Sabrina Tiun b
               a,b  Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia (UKM), Bangi, 43600, Malaysia
                                          * Email: sitinoorallia@gmail.com


        Abstract

        In Malaysia, Twitter is a popular social media platform. This platform enables microblogging with a maximum of 280
        characters per tweet. Users tweet almost everything that occurs during a single day. Due to its popularity, most Malaysians
        use Twitter daily, providing researchers and developers with abundant data on Malaysian users. This paper discusses how
        this study constructed a new Malay Twitter corpus and analyzed the data collected. The purpose of this paper is to compile
        tweets written in the informal Malay language. The data were extracted via Twitter’s search function using keywords
        associated with the informal Malay language. The data were only minimally pre-processed, as this study imposed
        several constraints on the collected tweets. The corpus data analysis reveals that most of the words in this corpus are
        informal, implying that Malaysians are most likely to write social media texts in informal Malay. This paper will benefit
        social media researchers and developers, particularly those with expertise in informal Malay and related fields.

        Keywords: Informal Malay language; Malay Twitter corpus; Malay tweets


        1. Introduction

           Twitter is a social networking site that provides an online microblogging service that enables users of all
        backgrounds to send and read 280-character (Rosen & Ihara, 2017; Twitter, 2021) microblogs known as tweets.
        Tweets can be about anything, from jokes to current events to dinner plans (Britannica, 2020). According to
        Statista (2021), Twitter now has the most daily active users globally, surpassing the million-user mark in the
        fourth quarter of 2020 and remaining above that mark. Additionally, Statista (2021) reported that Malaysia was
        one of the top nations in the world by number of Twitter users in 2021, with approximately 3.35 million users. This
        figure indicates that most Malaysians use Twitter by scrolling through feeds, retweeting, or saving content to
        retweet, which generates a wealth of data about Malaysian users. Twitter data can be easily collected using the
        application programming interface (API) provided by Twitter (Twitter, Inc., n.d.-b). The API is, however, limited to
        the most recent seven days of data (Feizollah, Ainin, Anuar, Abdullah, & Hazim, 2019). To view tweets older
        than seven days, a premium account, which costs hundreds of dollars, is required (Twitter, Inc., n.d.-b).
        Furthermore, Twitter offers an Advanced Search feature that enables users to filter search results by date ranges,
        people, and more (Twitter Help Center, 2021). As a result, this study uses the Advanced Search feature to collect
        tweets containing informal Malay language and restricts the date range to February 2019.
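
        As a minimal sketch of how such a date-restricted search could be expressed, the snippet below builds a query
        string with Twitter's since: and until: search operators. The function name build_search_query and the two
        keywords are placeholders for illustration only; the keywords actually used in this study are not listed here.
        The upper bound is set to 1 March on the assumption that until: excludes the given date.

```python
from datetime import date


def build_search_query(keywords, start, end):
    """Join keywords with OR and restrict results with since:/until: operators.

    Multi-word keywords are wrapped in double quotes so they match as phrases.
    """
    terms = " OR ".join(f'"{k}"' if " " in k else k for k in keywords)
    return f"({terms}) since:{start.isoformat()} until:{end.isoformat()}"


# Placeholder keywords (assumption): two informal forms of address.
keywords = ["hang", "mek"]

# February 2019; the end date assumes until: is exclusive.
query = build_search_query(keywords, date(2019, 2, 1), date(2019, 3, 1))
print(query)
```

        The resulting string can be pasted into the Twitter search box, which is equivalent to filling in the keyword
        and date fields of the Advanced Search form.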
           The informal Malay language is a dialect of Malay used in everyday conversations by Malaysians. The
        language contains a large number of informal terms, including accent (or dialect) words, slang, titles (e.g., hang,
        mek), sound words (words written to imitate sounds such as laughter, a cat's meow, or knocking), and
        mixed languages. A regional dialect is the variety of the language spoken by people from a particular state of
        the country, which results in word variation. Slang, by contrast, is understood by only a tiny percentage of the
        population. The term “mixed language” refers to the use of a foreign language in conjunction with Malay. When users
        write text on social media to provide reviews or opinions or tell a story, they frequently use everyday
        conversational language to project a friendly, casual, and easy-going image to other users.







        E- Proceedings of The 5th International Multi-Conference on Artificial Intelligence Technology (MCAIT 2021)   [45]
        Artificial Intelligence in the 4th Industrial Revolution