Informal Malay Language Twitter Corpus
Siti Noor Allia Noor Ariffin a*, Sabrina Tiun b
a,b Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia (UKM), Bangi, 43600, Malaysia
* Email: sitinoorallia@gmail.com
Abstract
In Malaysia, Twitter is a popular social media platform. This platform enables microblogging with a maximum of 280
characters per tweet. Users tweet almost everything that occurs during a single day. Due to its popularity, most Malaysians
use Twitter daily, providing researchers and developers with abundant data on Malaysian users. This paper discusses how
this study constructed a new Malay Twitter corpus and analyzed the data collected. The purpose of this paper is to compile
tweets written in the informal Malay language. The data were extracted via Twitter’s search function using keywords
associated with the informal Malay language. The data were minimally pre-processed, as this study imposed
several constraints on the collected tweets. The corpus data analysis reveals that most of the words in this corpus are
informal, implying that Malaysians are most likely to write social media texts in informal Malay. This paper will benefit
social media researchers and developers, particularly those with expertise in informal Malay and related fields.
Keywords: Informal Malay language; Malay Twitter corpus; Malay tweets
1. Introduction
Twitter is a social networking site whose online microblogging service enables users of all
backgrounds to send and read microblogs of up to 280 characters, known as tweets (Rosen & Ihara, 2017; Twitter, 2021).
Tweets can be about anything, from jokes to current events to dinner plans (Britannica, 2020). According to
Statista (2021), Twitter now has the most daily active users globally, surpassing the million-user mark in the
fourth quarter of 2020 and remaining above that mark. Additionally, Statista (2021) reported that Malaysia was
one of the leading countries by number of Twitter users in 2021, with approximately 3.35 million users. This figure
suggests that many Malaysians use Twitter, whether by scrolling through feeds, retweeting, or saving content to
retweet, which generates a wealth of data about Malaysian users. Twitter data can be easily collected using the
application programming interface (API) provided by Twitter (Twitter, Inc., n.d.-b). It is, however, limited to
the most recent seven days of data (Feizollah, Ainin, Anuar, Abdullah, & Hazim, 2019). To view tweets older
than seven days, a premium account, which costs hundreds of dollars, is required (Twitter, Inc., n.d.-b).
Furthermore, Twitter offers an Advanced Search feature that enables users to filter search results by date ranges,
people, and more (Twitter Help Center, 2021). As a result, this study uses the Advanced Search feature to collect
tweets containing informal Malay language and restricts the date range to February 2019.
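The date-restricted keyword search described above can be illustrated with a minimal sketch. It assumes that Twitter search accepts the since: and until: date operators and that until: excludes the given day; the function name and the keyword list are hypothetical, with "hang" and "mek" borrowed only as placeholders, and this is not necessarily the exact procedure followed in this study.

```python
from datetime import date, timedelta


def build_search_query(keyword: str, start: date, end_inclusive: date) -> str:
    """Combine a keyword with since:/until: date operators (hypothetical helper)."""
    # Assumption: until: is exclusive, so add one day to include the last day.
    until = end_inclusive + timedelta(days=1)
    return f"{keyword} since:{start.isoformat()} until:{until.isoformat()}"


# Placeholder keywords; the actual keyword list is not given in this section.
keywords = ["hang", "mek"]

# One query per keyword, restricted to February 2019.
queries = [
    build_search_query(k, date(2019, 2, 1), date(2019, 2, 28)) for k in keywords
]

for q in queries:
    print(q)
```

Each resulting string could be pasted into the search box, which keeps the collection step reproducible for a fixed date range.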
The informal Malay language is a variety of Malay used in everyday conversations by Malaysians. The
language contains a large number of informal terms, including accent (or dialect) words, slang, titles (e.g. hang,
mek), sounds (words written to express sounds such as laughter, cat sounds, and knocking), and
mixed languages. The term “regional dialect or language” refers to the language spoken by the people of a
particular state in the country, which gives rise to word variation. By contrast, slang is understood by only a
small percentage of the population. The term “mixed language” refers to the use of a foreign language in conjunction with Malay. When users
write text on social media to provide reviews or opinions or tell a story, they frequently use everyday
conversational language to project a friendly, casual, and easy-going image to other users.
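The categories of informal terms described above can be made concrete with a small sketch that tags each token in a tweet. The lexicon, the laughter pattern, and the function names are hypothetical and chosen only for illustration; only "hang" and "mek" come from the text, and a real lexicon would be far larger and would also cover dialect words and slang.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon, for illustration only.
LEXICON = {
    "title": {"hang", "mek"},             # examples given in the text
    "mixed_language": {"okay", "sorry"},  # assumed English words used with Malay
}

# Assumed pattern for written laughter such as "haha" or "hehehe".
LAUGHTER = re.compile(r"(ha){2,}|(he){2,}")


def categorize(token: str) -> str:
    """Return the informal-term category of a token, or "other"."""
    word = token.lower()
    if LAUGHTER.fullmatch(word):
        return "sound"
    for category, words in LEXICON.items():
        if word in words:
            return category
    return "other"


def category_counts(text: str) -> Counter:
    """Count tokens per category in a piece of text."""
    tokens = re.findall(r"[a-zA-Z]+", text)
    return Counter(categorize(t) for t in tokens)


print(category_counts("Hang okay ke haha"))
```

A lookup of this kind only recognizes words already listed in the lexicon, so unseen tokens fall into "other" and would need manual review.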
E- Proceedings of The 5th International Multi-Conference on Artificial Intelligence Technology (MCAIT 2021) [45]
Artificial Intelligence in the 4th Industrial Revolution