Capturing sentence-level positional data into N-gram profiles for document classification

Gunasekara, L. M. S.; Premadasa, H. K. S.

Please use this identifier to cite or link to this item: http://repository.kln.ac.lk/handle/123456789/26954

Full metadata record

DC Field	Value	Language
dc.contributor.author	Gunasekara, L. M. S.	-
dc.contributor.author	Premadasa, H. K. S.	-
dc.date.accessioned	2023-11-08T05:27:17Z	-
dc.date.available	2023-11-08T05:27:17Z	-
dc.date.issued	2023	-
dc.identifier.citation	Gunasekara L. M. S.; Premadasa H. K. S. (2023) Capturing sentence-level positional data into N-gram profiles for document classification. Proceedings of the International Conference on Applied and Pure Sciences (ICAPS 2023-Kelaniya) Volume 3, Faculty of Science, University of Kelaniya Sri Lanka. Page 119	en_US
dc.identifier.uri	http://repository.kln.ac.lk/handle/123456789/26954	-
dc.description.abstract	Document classification is a crucial task in natural language processing, with applications in domains such as email spam filtering, hate speech detection, and political bias assessment. While modern transformer-based classification approaches have shown promising results in this area, they rely on expensive parallel processing hardware, which leaves them out of reach for simpler applications. There is therefore still room for approaches with lower computational complexity. N-grams are a simple and efficient way of representing text data as features based on the distribution of contiguous tokens within the text. This approach is widely used in text analysis and research due to its language independence and minimal pre-processing requirements. However, most n-gram models do not include sentence-level positional information in their n-gram profiles. Hence, in this study, we propose a revised algorithm for generating n-gram profiles for the document categories in a classification task. We combine this new algorithm with the Euclidean distance metric to assign class labels to raw documents. The algorithm was evaluated on two main tasks: language classification and subject classification (in English). Our results show that this approach achieves accuracy levels comparable to state-of-the-art models. For the language classification task, the approach reached an accuracy of 91% on the WiLI Benchmark Dataset, which covers 235 languages, with an average prediction time of 1.88 × 10⁻² seconds. Furthermore, we investigated several configurations of n-gram range and n-gram cutoff length for the subject classification task. The best performing configuration, a fixed n-gram length of 5 and a cutoff length of 5000, achieves an accuracy of 50% with an average inference time of 3.29 × 10⁻² seconds on the 20 Newsgroups Dataset, which spans 20 newsgroup categories. Overall, our findings suggest that including sentence-level positional data in n-gram profiles can support an algorithm of minimal complexity, and that this algorithm, combined with a suitable n-gram range and cutoff level, can perform well for document classification, particularly when dealing with noisy data with similar categorical labels.	en_US
dc.publisher	Faculty of Science, University of Kelaniya Sri Lanka	en_US
dc.subject	Document Classification, N-Grams, Natural Language Processing, Language Classification, Subject Classification	en_US
dc.title	Capturing sentence-level positional data into N-gram profiles for document classification	en_US
Appears in Collections:	ICAPS 2023
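
The abstract describes building one n-gram profile per category and assigning a document to the category whose profile is nearest in Euclidean distance. The Python sketch below illustrates that general idea only; it is not the authors' algorithm. The abstract does not say how sentence-level position is encoded, so the three-way position bucket used here is purely an assumption, as are the function names (sentence_ngrams, build_profile, euclidean, classify), the use of character n-grams, the default n of 3, and the toy training sentences.

```python
import math
import re
from collections import Counter


def sentence_ngrams(text, n=3):
    """Yield (ngram, bucket) pairs for character n-grams of each sentence.

    ASSUMPTION: position is encoded as a bucket (0 = first third of the
    sentence, 1 = middle third, 2 = last third). The source abstract does
    not specify the encoding; this is a hypothetical stand-in.
    """
    for sentence in re.split(r"[.!?]+", text.lower()):
        sentence = sentence.strip()
        if len(sentence) < n:
            continue
        last = len(sentence) - n  # start index of the final n-gram
        for i in range(last + 1):
            bucket = 3 * i // (last + 1)  # always 0, 1 or 2
            yield sentence[i:i + n], bucket


def build_profile(texts, n=3, cutoff=5000):
    """Keep the `cutoff` most frequent features as relative frequencies."""
    counts = Counter()
    for text in texts:
        counts.update(sentence_ngrams(text, n))
    top = counts.most_common(cutoff)
    total = sum(count for _, count in top)
    return {feature: count / total for feature, count in top}


def euclidean(p, q):
    """Euclidean distance between two sparse profiles (missing = 0)."""
    keys = p.keys() | q.keys()
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def classify(document, profiles, n=3, cutoff=5000):
    """Return the label of the category profile closest to the document."""
    doc_profile = build_profile([document], n, cutoff)
    return min(profiles, key=lambda label: euclidean(doc_profile, profiles[label]))


# Toy usage with made-up training sentences (not data from the paper).
profiles = {
    "en": build_profile(["The cat sat on the mat. The dog ran to the park."]),
    "es": build_profile(["El gato se sentó en la alfombra. El perro corrió al parque."]),
}
print(classify("The bird sat on the tree.", profiles))
```

The n and cutoff arguments correspond loosely to the n-gram length and cutoff length that the abstract reports tuning (5 and 5000 in the best subject-classification configuration); how the authors apply them may differ from this sketch.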

Files in This Item:

File	Description	Size	Format
ICAPS 2023 119.pdf		210.22 kB	Adobe PDF
