Page 59 - The-5th-MCAIT2021-eProceeding
Thus, this study aims to create a corpus of tweets written in informal Malay, encompassing
various dialects, conversational slang, and mixed languages. According to previous research, Malay
Twitter corpora have existed since 2014. While reviewing the existing Malay Twitter corpora, we identified
the need for a corpus that incorporates dialectal, informal, colloquial, and mixed-language text. The tweets
were gathered using Twitter’s Advanced Search function (Supian, Razak, & Bakar, 2017; Ariffin & Tiun, 2018;
Feizollah et al., 2019; Izazi & Tengku-Sepora, 2020) rather than the expensive, limited API (Feizollah et al.,
2019). This data collection, however, is limited to the words contained in tweets. We deliberately
omitted additional tweet features such as user information (full name and username), hashtags, URLs, and
timestamps because we deemed them superfluous. Our contribution to data collection and extraction is the use of
various informal Malay terms from multiple regions of Malaysia as keywords. Nevertheless, this dataset is not
available for public use or future research because we still enforce the dataset’s copyright.
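As an illustration of restricting a collected tweet to its words, the following is a minimal Python sketch. The record fields (`full_name`, `username`, `timestamp`, `text`) and the regular expressions are assumptions made for this example only; they are not the procedure used in this study.

```python
import re

# Illustrative patterns only (assumed, not taken from the study).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def keep_words_only(record: dict) -> str:
    """Return only the words of a tweet.

    `record` is a hypothetical dict with keys such as 'full_name',
    'username', 'timestamp', and 'text'. Everything except 'text'
    is ignored; URLs, @usernames, and hashtags inside the text are
    stripped, and whitespace is collapsed.
    """
    text = record.get("text", "")
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    return " ".join(text.split())


# Hypothetical example record.
sample = {
    "full_name": "Ali Bin Abu",
    "username": "ali",
    "timestamp": "2020-01-15T08:30:00",
    "text": "@abu hang nak pi mana tu? #kedah https://t.co/abc123",
}
print(keep_words_only(sample))
```

Removing hashtags entirely, rather than keeping the word after the `#`, is one possible reading of "omitted"; a different corpus design could retain the hashtag word as ordinary text.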
The rest of this paper is structured in the following manner. Section 2 summarises pertinent works. Section
3 details the data collection and pre-processing procedures, while Section 4 presents the corpus analysis and
Section 5 summarises this work.
2. Literature Review
Social media has established itself as a valuable resource for researchers seeking to collect and curate massive
amounts of data on a specific language or subject. Neunerdt and Zesch (2016) discovered that the primary
characteristics of social media texts could be broadly classified as conversational language, dialogue styles,
social media language, and informal writing. Conversational language is written language that transcribes
spoken or everyday language. Many slang terms and other colloquial expressions can be found
in social media text, such as tweets. According to Jamali (2018), Malaysian teenagers are
highly inventive in coining new expressions and creative spellings. The manner of speaking
(or dialogue style) is also indicative of the writing style. For instance, the writer may wish to recount an event
to the reader (or other users). Social media language uses interaction signs such as emoticons,
interaction words, leetspeak, word transformations, and conversational language. Informal
writing, meanwhile, contains errors in spelling, abbreviation, sentence structure, and grammar.
The Twitter application programming interface (API) is widely regarded as the de facto standard method for
researchers and developers to extract data from Twitter. This API enables researchers to locate, retrieve, interact
with, and create various resources, such as tweets, users, direct messages, lists, trends, media, and locations
(Twitter, Inc., n.d.-a). Numerous previous studies have collected tweets from Twitter using this API. For
instance, Maskat et al. (2020) retrieved and analyzed tweets about cyberbullying in Malaysia using the Twitter
API. They analyzed cyberbullying text and devised a method for automatically classifying tweets as “bully” or
“not bully”. Bakar, Rahmat, and Othman (2019) published a similar study in which they used the Twitter API
to collect Malay tweets for sentiment analysis and to develop a polarity classification tool. In another work,
Xu and Zhang (2018) analyzed tweets about the #MH370 tragedy to develop a model of crisis information sharing
based on sentiment, richness, authority, and relevance. They retrieved related tweets daily for 24 days using the
Twitter API. In contrast to the previously mentioned works, we collected tweets in this study by configuring the
Twitter Advanced Search feature (Twitter Help Center, 2021) to display only tweets containing the specified
keyword within the specified date ranges (Supian et al., 2017; Ariffin & Tiun, 2018; Feizollah et al., 2019; Izazi
& Tengku-Sepora, 2020). We chose this technique to avoid the API’s monthly fee, which ranges from $149 to $2,499,
and its restrictions on data requests, most notably the number of requests (Feizollah et al., 2019).
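To make the keyword and date-range configuration concrete, the sketch below builds a search query using Twitter's `since:` and `until:` search operators. The function names, the example keyword, and the `https://twitter.com/search` URL layout are assumptions for illustration; the study describes configuring the Advanced Search form and does not specify any script.

```python
from datetime import date
from urllib.parse import urlencode


def build_query(keyword: str, since: date, until: date) -> str:
    """Compose a search query for an exact keyword within a date range.

    Uses the `since:` and `until:` operators with YYYY-MM-DD dates.
    """
    return f'"{keyword}" since:{since.isoformat()} until:{until.isoformat()}'


def build_search_url(keyword: str, since: date, until: date) -> str:
    """Wrap the query in a search URL (the URL layout is an assumption)."""
    query = build_query(keyword, since, until)
    return "https://twitter.com/search?" + urlencode({"q": query})


# Hypothetical keyword and date range.
print(build_query("hang", date(2020, 1, 1), date(2020, 1, 31)))
print(build_search_url("hang", date(2020, 1, 1), date(2020, 1, 31)))
```

Repeating this for each dialect keyword and each date window would yield one result page per keyword and period, which mirrors the keyword-driven collection described above.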
Since 2010, local researchers have created numerous Malay corpora. The majority of early work on the
Malay corpus (Don, 2010; Sidi et al., 2011; Mohamed, Omar, & Ab Aziz, 2011; Chung, 2011; Darwis,
Abdullah, & Idris, 2012; Alshalabi, Tiun, Omar, & Albared, 2013; Bukhari, Anuar, Khazin, & Abdul, 2015;
E- Proceedings of The 5th International Multi-Conference on Artificial Intelligence Technology (MCAIT 2021) [46]
Artificial Intelligence in the 4th Industrial Revolution