Hoogervost, 2015; Hijazi et al., 2016; Yeo & Ting, 2017) concentrated on developing corpora containing formal
Malay language content such as newspapers, speech texts, and academic texts. Nonetheless, as previously
stated, the study and development of the Malay Twitter corpus (Arshi Saloot, Idris, Aw, & Thorleuchter, 2014;
Omar et al., 2017; Anbananthen, Krishnan, Sayeed, & Muniapan, 2017; Ariffin & Tiun, 2018; Raja, Lay-Ki, &
Su-Cheng, 2019) began in 2014, with the corpus consisting primarily of tweets written in informal Malay
language. However, several previous studies’ corpora (Omar et al., 2017; Raja et al., 2019) also included
additional social media content, such as Facebook posts. Additionally, as mentioned earlier, we identify a need
for a new Malay Twitter corpus that includes dialect, informal, colloquial, and mixed-language usage.
According to our findings, the existing Malay Twitter corpora do not possess the characteristics required
to accomplish this study’s objective. Thus, we retrieved tweets using
previously identified keywords (Hoogervost, 2015; Hashim et al., 2016; Jamali, 2018) associated with
the informal Malay language. Furthermore, to ensure that we collected appropriate tweets, we
used the findings of several previous studies as a guide for identifying the structure and varieties of informal
Malay (Kob, 2008; Hasrah & Aman, 2010; Sharum & Hamzah, 2011; Mansor, Mansor, & Rahim, 2013;
Harun & Yusof, 2015; Jalaluddin, 2015; Jamil & Yusof, 2015; Sahril, 2016; Subet & Daud, 2016; Omar et al.,
2017; Yeo & Ting, 2017; Choi & Chong, 2017; Jaafar, Aman, & Awal, 2017; Bakar & Mazzalan, 2018; Yusof,
2018; Wahab, 2018; Shafiee et al., 2019; Bakar & Tarmizi, 2019).
3. Methodology
This study aims to build a corpus of tweets written in informal Malay, including various
dialects, conversational slang, and mixed-language usage. The proposed methodology comprises data
collection and pre-processing. As previously stated, the tweets were gathered using Twitter’s Advanced Search
feature, which we queried with the selected keywords. Pre-processing consists of removing duplicate
tweets from the corpus.
We manually collected data for this study, using keywords related to informal Malay
and restricting the date range to February 2019. The keywords were chosen after a literature review
of informal Malay language and structure. As mentioned previously, the standard way to collect
tweets is through Twitter’s application programming interface (API), which enables developers and researchers to collect data.
However, the API has several limitations, including a seven-day limit on how far back tweets can be retrieved and a cap on the
number of requests to the Twitter server (Feizollah et al., 2019). Hence, we chose to gather data manually using Twitter’s
Advanced Search feature, to which these limitations do not apply.
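As an illustration only, a keyword search restricted to February 2019 can be written as a search string using Twitter's `since:` and `until:` search operators. The sketch below is not the procedure used in this study, which was carried out manually through the Advanced Search form; the function name and the two example keywords are hypothetical and are not taken from the study's keyword list.

```python
from datetime import date


def build_search_query(keyword: str, start: date, end: date) -> str:
    """Compose a search string for one keyword within a date window.

    The keyword is quoted so that multi-word phrases are matched as a phrase.
    """
    return f'"{keyword}" since:{start.isoformat()} until:{end.isoformat()}'


# Hypothetical informal Malay keywords, for illustration only;
# they are NOT the keywords used in this study.
EXAMPLE_KEYWORDS = ["korang", "takpe"]

# One query per keyword, covering February 2019 (assumed window boundaries).
queries = [
    build_search_query(keyword, date(2019, 2, 1), date(2019, 3, 1))
    for keyword in EXAMPLE_KEYWORDS
]

for query in queries:
    print(query)
```

Each resulting string could be pasted into the search box, one keyword at a time, which mirrors the one-keyword-per-search collection described above.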
Moreover, as previously stated, the data were pre-processed by removing duplicate tweets from
the corpus. Pre-processing was minimal because the collected data consisted solely of the words
in the tweets and lacked additional tweet features such as user information (full name and username), hashtags,
URLs, and timestamps. Pre-processing began with the identification and removal of duplicate
tweets. First, we sorted the data in ascending lexicographic order so that identical tweets fell on adjacent lines
and could be identified easily. The lines containing repeated tweets, that is, duplicate tweets, were
then manually deleted from the corpus.
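The sort-then-delete procedure described above was performed manually, but it can be sketched programmatically. The following is a minimal sketch, assuming one tweet per line and treating two tweets as duplicates only when their text is identical after surrounding whitespace is stripped (the whitespace handling is our assumption). The function name and the sample tweets are hypothetical.

```python
def remove_duplicate_tweets(tweets: list[str]) -> list[str]:
    """Return the tweets in ascending lexicographic order without repeats."""
    # Assumption: surrounding whitespace is not significant, and empty
    # lines are not tweets.
    cleaned = [tweet.strip() for tweet in tweets if tweet.strip()]

    # Sorting places identical tweets on adjacent positions.
    cleaned.sort()

    unique: list[str] = []
    for tweet in cleaned:
        # Keep a tweet only if it differs from the last one kept.
        if not unique or tweet != unique[-1]:
            unique.append(tweet)
    return unique


# Hypothetical sample lines, for illustration only.
sample = ["takpe la", "korang buat apa", "takpe la ", ""]
print(remove_duplicate_tweets(sample))
```

Because identical strings become neighbours after sorting, a single pass that compares each tweet with the previously kept one is enough to drop every repeat.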
This study applies a single inclusion criterion: tweets must be written in informal Malay.
As explained earlier, informal Malay is rich in informal features such as dialect, slang, titles, sounds,
and mixed languages. Therefore, to ensure that the tweets we chose were appropriate and accurately reflected
our study’s objective, we used only keywords derived from previous research findings. Conversely, this
study currently has no exclusion criteria for tweets: we compiled all tweets that were
returned in response to the applied keywords. However, as mentioned earlier, we deliberately
omitted several additional tweet features. These features were deemed superfluous, as the study’s
E- Proceedings of The 5th International Multi-Conference on Artificial Intelligence Technology (MCAIT 2021) [47]
Artificial Intelligence in the 4th Industrial Revolution