Language identification at word level in Sinhala-English code-mixed social media text

Shanmugalingam, K.; Sumathipala, S.

Please use this identifier to cite or link to this item: http://repository.kln.ac.lk/handle/123456789/20164

Full metadata record

DC Field	Value	Language
dc.contributor.author	Shanmugalingam, K.	-
dc.contributor.author	Sumathipala, S.	-
dc.date.accessioned	2019-05-13T04:24:34Z	-
dc.date.available	2019-05-13T04:24:34Z	-
dc.date.issued	2019	-
dc.identifier.citation	Shanmugalingam, K., Sumathipala, S. (2019). Language identification at word level in Sinhala-English code-mixed social media text. IEEE International Research Conference on Smart computing & Systems Engineering (SCSE) 2019, Department of Industrial Management, Faculty of Science, University of Kelaniya, Sri Lanka. p. 113	en_US
dc.identifier.uri	http://repository.kln.ac.lk/handle/123456789/20164	-
dc.description.abstract	Automatically analyzing and extracting useful information from noisy social media content is currently receiving attention from the research community. People commonly mix their native language with English to express their thoughts on social media, writing either in Unicode characters or with those characters transliterated into Roman script. Such noisy code-mixed text is characterized by a high percentage of spelling mistakes, with phonetic typing, wordplay, creative spelling, abbreviations, meta tags, and so on. Identifying languages at the word level therefore becomes a necessary step in analyzing noisy social media content. It could also serve as an intermediate language identifier for chatbot applications that use native languages. For this study we used Sinhala-English code-mixed text from social media. Natural Language Processing (NLP) and Machine Learning (ML) techniques are used to identify the language tag of each word. The novel approach proposed and implemented for this system is a machine learning classifier based on features such as Sinhala Unicode characters written in Roman script, dictionaries, and term frequency. Several machine learning classifiers, namely Support Vector Machines (SVM), Naive Bayes, Logistic Regression, Random Forest and Decision Trees, were used in the evaluation. Among them, the Random Forest classifier achieved the highest accuracy, 90.5%.	en_US
dc.language.iso	en	en_US
dc.publisher	IEEE International Research Conference on Smart computing & Systems Engineering (SCSE) 2019, Department of Industrial Management, Faculty of Science, University of Kelaniya, Sri Lanka	en_US
dc.subject	Code-mixing	en_US
dc.subject	Language identification	en_US
dc.subject	Machine learning	en_US
dc.subject	Natural Language Processing (NLP)	en_US
dc.title	Language identification at word level in Sinhala-English code-mixed social media text	en_US
dc.type	Article	en_US
Appears in Collections:	Smart computing & Systems Engineering - (SCSE - 2019)
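
The abstract names three kinds of word-level features (script information, dictionary lookups, and term frequency) without giving details. The following is only a minimal sketch of how such features might be computed per token; it is not the authors' implementation. The word lists, the `word_features` and `sentence_features` helpers, the feature names, and the toy term counts are all hypothetical. The only external fact relied on is that the Unicode Sinhala block spans U+0D80 to U+0DFF.

```python
from collections import Counter

# Hypothetical toy dictionaries; the resources used in the paper are not listed here.
ENGLISH_WORDS = {"the", "is", "good", "movie", "very"}
ROMANIZED_SINHALA_WORDS = {"mama", "oyata", "hari", "lassanai", "kohomada"}

# Unicode Sinhala block: U+0D80 through U+0DFF.
SINHALA_BLOCK = range(0x0D80, 0x0E00)


def has_sinhala_script(word):
    """Return True if any character of the word lies in the Sinhala block."""
    return any(ord(ch) in SINHALA_BLOCK for ch in word)


def word_features(word, term_counts):
    """Build an illustrative feature dict for one token.

    term_counts is assumed to be a Counter of lowercased tokens from
    some reference corpus; how the paper computes term frequency is
    not stated in the abstract.
    """
    lower = word.lower()
    return {
        "has_sinhala_script": has_sinhala_script(word),
        "in_english_dict": lower in ENGLISH_WORDS,
        "in_romanized_sinhala_dict": lower in ROMANIZED_SINHALA_WORDS,
        "term_frequency": term_counts.get(lower, 0),
    }


def sentence_features(sentence, term_counts):
    """Split on whitespace and return (token, features) pairs."""
    return [(tok, word_features(tok, term_counts)) for tok in sentence.split()]


if __name__ == "__main__":
    # Toy corpus counts, purely for demonstration.
    corpus_counts = Counter(["mama", "mama", "good", "movie", "hari"])
    for token, feats in sentence_features("mama good movie", corpus_counts):
        print(token, feats)
```

Feature dictionaries of this shape could then be vectorized and passed to any of the classifiers the abstract lists, such as a Random Forest, with one language tag predicted per token.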

Files in This Item:

File	Description	Size	Format
SC-1 (18).pdf		2.01 MB	Adobe PDF
