
Thus, this study aims to create a corpus of tweets written in the informal Malay language, encompassing
        various dialects, conversational slang, and mixed languages. According to previous research, a Malay
        Twitter corpus has existed since 2014. While poring over the existing Malay Twitter corpus data, we
        identified the need for a corpus that incorporates dialect, informal, colloquial, and mixed-language text. The tweets
        were gathered using Twitter’s Advanced Search function (Supian, Razak, & Bakar, 2017; Ariffin & Tiun, 2018;
        Feizollah et al., 2019; Izazi & Tengku-Sepora, 2020) rather than the expensive, limited API (Feizollah et al.,
        2019). This data collection, however, is limited to the words contained in tweets. We purposefully
        omitted additional tweet features such as user information (full name and username), hashtags, URLs, and
        timestamps because we deemed them superfluous. We contributed to data collection and extraction by using
        terms from various informal Malay language varieties across multiple Malaysian regions as keywords.
        Nevertheless, this dataset is not available for public use or future research because we still enforce the
        dataset’s copyright.
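
        The restriction to tweet words can be illustrated with a minimal Python sketch. It assumes the
        omitted features also appear inline in the tweet text; the paper does not state how they were removed,
        so the regular expressions and the function name keep_words_only are hypothetical.

```python
import re

# Hypothetical patterns: the paper does not specify how these features were removed.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def keep_words_only(tweet: str) -> str:
    """Drop URLs, @usernames, and hashtags, then collapse runs of whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        tweet = pattern.sub(" ", tweet)
    return " ".join(tweet.split())
```

        URLs are removed first so that a "#" or "@" inside a link is not mistaken for a hashtag or
        username. Timestamps and full names are tweet metadata rather than text, so under this
        assumption they are dropped simply by not storing them.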
           The rest of this paper is structured as follows. Section 2 summarises pertinent works. Section
        3 details the data collection and pre-processing procedures, Section 4 presents the corpus analysis, and
        Section 5 summarises this work.

        2. Literature Review

           Social media has established itself as a valuable resource for researchers seeking to collect and curate massive
        amounts of data on a specific language or subject. Neunerdt and Zesch (2016) found that the primary
        characteristics of social media texts could be broadly classified as conversational language, dialogue styles,
        social media language, and informal writing. Conversational language is written language that transcribes
        spoken or everyday language. We were surprised to find many slang terms and other colloquial expressions
        in social media text, such as tweets. According to Jamali (2018), Malaysian teenagers are
        incredibly inventive when it comes to coining new expressions and creative spellings. The manner of speaking
        (or dialogue style) is also reflected in the writing style; for instance, the writer may wish to recount an
        event to the reader (or other users). Social media language uses interaction signs such as emoticons,
        interaction words, leetspeak, word transformations, and conversational language. Meanwhile, informal
        writing contains errors in spelling, abbreviation, sentence structure, and grammar.
           The Twitter application programming interface (API) is widely regarded as the de facto standard method for
        researchers and developers to extract data from Twitter. This API enables researchers to locate, retrieve, interact
        with, and create various resources, such as tweets, users, direct messages, lists, trends, media, and locations
        (Twitter, Inc., n.d.-a). Numerous previous studies have collected tweets from Twitter using this API. For
        instance, Maskat et al. (2020) retrieved and analyzed tweets about cyberbullying in Malaysia using the Twitter
        API. They analyzed cyberbullying text and devised a method for automatically classifying tweets as “bully” or
        “not bully”. Bakar, Rahmat, and Othman (2019) published a similar study in which they used the Twitter API
        to collect Malay tweets to conduct sentiment analysis and develop a polarity classification tool. In another work,
        Xu and Zhang (2018) analyzed tweets about the #MH370 tragedy to develop a model of crisis information sharing
        based on sentiment, richness, authority, and relevance. They retrieved related tweets daily for 24 days using the
        Twitter API. In contrast to the previously mentioned works, we collected tweets in this study by configuring the
        Twitter Advanced Search feature (Twitter Help Center, 2021) to display only tweets containing the specified
        keyword within the specified date ranges (Supian et al., 2017; Ariffin & Tiun, 2018; Feizollah et al., 2019; Izazi
        & Tengku-Sepora, 2020). We chose this technique to avoid the API’s monthly fee, which ranges from $149 to
        $2,499, and its restrictions on data requests, most notably the number of requests (Feizollah et al., 2019).
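
        As an illustration only, a keyword-and-date-range search of this kind can be written as a search URL.
        The Python sketch below assumes the since: and until: search operators and the twitter.com/search
        endpoint; the function name build_search_url and the example keyword and dates are hypothetical and are
        not taken from the study.

```python
from datetime import date
from urllib.parse import urlencode


def build_search_url(keyword: str, since: date, until: date) -> str:
    """Build a search URL for tweets containing `keyword` within a date range."""
    # Quoting the keyword asks for an exact match; the date operators bound the range.
    query = f'"{keyword}" since:{since.isoformat()} until:{until.isoformat()}'
    return "https://twitter.com/search?" + urlencode({"q": query})


# Hypothetical keyword and dates, for illustration only.
url = build_search_url("hang", date(2020, 1, 1), date(2020, 1, 31))
```

        Opening such a URL in a browser shows the matching tweets, which is the manual, API-free route
        described above; the sketch only constructs the query and does not retrieve anything.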
           Since 2010, local researchers have created numerous Malay corpora. The majority of early work on the
        Malay corpus (Don, 2010; Sidi et al., 2011; Mohamed, Omar, & Ab Aziz, 2011; Chung, 2011; Darwis,
        Abdullah, & Idris, 2012; Alshalabi, Tiun, Omar, & Albared, 2013; Bukhari, Anuar, Khazin, & Abdul, 2015;







        E- Proceedings of The 5th International Multi-Conference on Artificial Intelligence Technology (MCAIT 2021)   [46]
        Artificial Intelligence in the 4th Industrial Revolution