Hate Words Detection Among Sri Lankan Social Media Text Messages

Shalinda, J. A. D. U.; Munasinghe, Lankeshwara

Please use this identifier to cite or link to this item: http://repository.kln.ac.lk/handle/123456789/25401

Title:	Hate Words Detection Among Sri Lankan Social Media Text Messages
Authors:	Shalinda, J. A. D. U.; Munasinghe, Lankeshwara
Keywords:	English, hate speech detection, NLP, Romanized Sinhala, Sinhala
Issue Date:	2022
Publisher:	Department of Industrial Management, Faculty of Science, University of Kelaniya Sri Lanka
Citation:	Shalinda J. A. D. U.; Munasinghe Lankeshwara (2022), Hate Words Detection Among Sri Lankan Social Media Text Messages, International Research Conference on Smart Computing and Systems Engineering (SCSE 2022), Department of Industrial Management, Faculty of Science, University of Kelaniya Sri Lanka. 55-60.
Abstract:	The number of Sri Lankan social media users grew by 23% between 2020 and 2021, reaching 7.9 million in January 2021. Social media platforms became more popular once they started supporting native languages, and the problems with social media evolved as their popularity grew. In 2019, social media platforms were banned for Sri Lankan users to prevent the spread of hate messages and incorrect information among citizens. The lack of tools for automatically recognizing hate messages in Sinhala and Romanized Sinhala was reported as the reason for the ban, and identifying such messages manually is costly in both time and money. Many studies have been conducted to identify hate messages in English and in Sinhala separately. However, users in Sri Lanka tend to combine Sinhala, Romanized Sinhala, and English phrases when expressing their opinions, for example, "Mama job ekakata apply kara." Research that identifies hate speech across all of these languages will therefore be more effective in reaching Sri Lankan users. For training, an open-source data set of 2500 comments was used, with each comment categorized as either hateful or non-hateful. To pre-process the data set, open-source Sinhala stop word and stem word corpora were used, and the two corpora were manually converted into Romanized Sinhala stop word and stem word corpora to identify stop words and stem words in Romanized Sinhala. All English words were recognized using an open-source English word corpus, and a library was used to obtain the English stop word corpus and to stem English words. Bag of words and term frequency-inverse document frequency (TF-IDF) were compared for feature engineering. Linear support vector classifier, random forest, SGD classifier, logistic regression, XGBoost classifier, and multinomial Naive Bayes classifier were evaluated as classification algorithms. The highest accuracy, 74.2%, was obtained with the SGD classifier using TF-IDF with unigrams and bigrams.
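The feature representation named in the abstract, TF-IDF over unigrams and bigrams, can be illustrated with a minimal standard-library sketch. This is not the authors' code: the function names `ngrams` and `tfidf`, the smoothed IDF formula, the whitespace tokenizer, and the second sample comment are all assumptions made for illustration; the abstract does not state which TF-IDF variant or library was used.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf(docs, n_max=2):
    """Compute unnormalized TF-IDF weights over unigram and bigram features.

    Assumption: raw term counts and the smoothed IDF
    ln((1 + N) / (1 + df)) + 1. The abstract does not specify a variant.
    """
    # Lowercase and split on whitespace (a deliberately naive tokenizer).
    counts = [Counter(ngrams(doc.lower().split(), n_max)) for doc in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


# The first comment is the example from the abstract; the second is a
# made-up English sentence, not taken from the paper's data set.
comments = [
    "Mama job ekakata apply kara",
    "I will apply for a job",
]
vectors = tfidf(comments)
```

Terms that occur in both comments, such as "job", receive a lower weight than terms unique to one comment, such as the bigram "mama job". In practice these sparse vectors would be passed to a linear classifier; a library such as scikit-learn offers `TfidfVectorizer(ngram_range=(1, 2))` and `SGDClassifier` for this, though the abstract does not say which tooling the authors used.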
URI:	http://repository.kln.ac.lk/handle/123456789/25401
Appears in Collections:	Smart Computing and Systems Engineering - 2022 (SCSE 2022)

Files in This Item:

File	Description	Size	Format
SCSE 2022 09.pdf		118.62 kB	Adobe PDF
