
Hoogervost, 2015; Hijazi et al., 2016; Yeo & Ting, 2017) concentrated on developing corpora containing formal
        Malay language content such as newspapers, speech texts, academic texts, and more. Nonetheless, as previously
        stated, the study and development of the Malay Twitter corpus (Arshi Saloot, Idris, Aw, & Thorleuchter, 2014;
        Omar et al., 2017; Anbananthen, Krishnan, Sayeed, & Muniapan, 2017; Ariffin & Tiun, 2018; Raja, Lay-Ki, &
        Su-Cheng, 2019) began in 2014, with the corpus consisting primarily of tweets written in the informal Malay
        language. However, the corpora of several previous studies (Omar et al., 2017; Raja et al., 2019) also included
        additional social media content, such as Facebook posts. As mentioned earlier, we identify a need
        for a new Malay Twitter corpus that includes dialect, informal, colloquial, and mixed-language usage.
        According to our findings, the existing Malay Twitter corpora do not possess the characteristics required
        to accomplish this study’s objective. Thus, we retrieved tweets using
        previously identified keywords (Hoogervost, 2015; Hashim et al., 2016; Jamali, 2018) related to
        the informal Malay language. Furthermore, to ensure that we collected appropriate tweets, we
        used the findings of several previous studies as a guide for identifying the structure and forms of the informal
        Malay language (Kob, 2008; Hasrah & Aman, 2010; Sharum & Hamzah, 2011; Mansor, Mansor, & Rahim, 2013;
        Harun & Yusof, 2015; Jalaluddin, 2015; Jamil & Yusof, 2015; Sahril, 2016; Subet & Daud, 2016; Omar et al.,
        2017; Yeo & Ting, 2017; Choi & Chong, 2017; Jaafar, Aman, & Awal, 2017; Bakar & Mazzalan, 2018; Yusof,
        2018; Wahab, 2018; Shafiee et al., 2019; Bakar & Tarmizi, 2019).


        3. Methodology
           This study aims to build a corpus of tweets written in the informal Malay language, including various
        dialects, conversational slang, and mixed languages. The proposed methodology comprises data
        collection and pre-processing. As previously stated, the tweets were gathered using Twitter’s Advanced Search
        feature, queried with the selected keywords. Pre-processing consists of removing duplicate
        tweets from the corpus.
           We manually collected data for this study, using keywords related to the informal Malay language
        and limiting the date range to February 2019. The keywords were chosen after conducting a literature review
        on the informal Malay language and its structure. As mentioned previously, the standard method of collecting
        tweets is via Twitter’s application programming interface (API), which enables developers and researchers to
        collect data. However, the API has numerous limitations, including a seven-day limit on tweets and a cap on the
        number of requests to the Twitter server (Feizollah et al., 2019). Hence, we chose to gather data manually using
        Twitter’s Advanced Search feature, so that these limitations did not apply.
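           As an illustration only, the keyword and date restrictions described above could be written as search
        strings of the kind the Advanced Search form produces. This is a minimal sketch, not the procedure used in
        this study, which was carried out by hand. The function name build_search_query and the two keywords are
        hypothetical placeholders and are not taken from the study's keyword list. The sketch relies on the since:
        and until: search operators and assumes that until: excludes the given date, so 1 March 2019 is used as the
        upper bound for February 2019.

```python
from datetime import date


def build_search_query(keyword, since, until):
    """Return a search string for one keyword within a date range.

    Multi-word keywords are quoted so they are matched as an exact phrase.
    `until` is assumed to be exclusive (tweets sent before that date).
    """
    phrase = f'"{keyword}"' if " " in keyword else keyword
    return f"{phrase} since:{since.isoformat()} until:{until.isoformat()}"


# Hypothetical placeholder keywords, NOT the study's actual keyword list.
example_keywords = ["korang", "tak pe"]

# February 2019, with the upper bound set to the first day of March.
queries = [
    build_search_query(k, date(2019, 2, 1), date(2019, 3, 1))
    for k in example_keywords
]

for q in queries:
    print(q)
```

        Each resulting string can be pasted into the search box, one keyword at a time, which mirrors the manual,
        keyword-by-keyword collection described in this section.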
           Moreover, as previously stated, this study’s data were pre-processed by removing duplicate tweets from
        the corpus. Pre-processing was minimal because the collected data consisted solely of the words
        in the tweets and lacked additional tweet features such as user information (full name and username), hashtags,
        URLs, and timestamps. Pre-processing began with the identification and removal of duplicate
        tweets from the corpus. First, we sorted the data in ascending lexicographic order so that lines
        with repeated tweets became adjacent and easy to identify. These lines, referred to as duplicate tweets, were
        then manually deleted from the corpus.
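           The sort-then-remove step can be sketched in a few lines of code. This is a minimal sketch under stated
        assumptions, not the study's actual procedure, in which duplicates were deleted by hand. The function name
        remove_duplicate_tweets is hypothetical, the corpus is assumed to be a list with one tweet per entry, and
        two tweets are treated as duplicates only when their text matches exactly.

```python
def remove_duplicate_tweets(tweets):
    """Sort tweets in ascending lexicographic order and drop repeated lines.

    After sorting, identical tweets sit next to each other, so a tweet is a
    duplicate exactly when it equals the previously kept tweet.
    Assumption: duplicates are exact string matches.
    """
    unique = []
    for tweet in sorted(tweets):
        if not unique or tweet != unique[-1]:
            unique.append(tweet)
    return unique


# Tiny made-up example; these strings are placeholders, not corpus data.
sample = ["tweet b", "tweet a", "tweet b", "tweet c", "tweet a"]
deduplicated = remove_duplicate_tweets(sample)
print(deduplicated)
```

        Because the comparison is an exact match, tweets that differ only in spacing or capitalisation would be kept
        as separate entries; whether such near-duplicates should be merged is a design choice this section does not
        specify.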
           This study applies a single criterion for tweet inclusion: tweets must be written in informal Malay.
        As explained earlier, informal Malay is a language rich in informal terms such as dialect, slang, titles, sounds,
        and mixed languages. Therefore, to ensure that the tweets we chose were appropriate and accurately reflected
        our study’s objective, we used only keywords derived from previous research findings. Conversely, this
        study currently has no exclusion criteria for tweets. We compiled a list of all tweets that were
        returned in response to the applied keywords. However, as mentioned earlier, we deliberately
        omitted several additional tweet features. Other tweet features were deemed superfluous, as the study’s







        E- Proceedings of The 5th International Multi-Conference on Artificial Intelligence Technology (MCAIT 2021)   [47]
        Artificial Intelligence in the 4th Industrial Revolution