Capturing sentence-level positional data into N-gram profiles for document classification

Gunasekara, L. M. S.; Premadasa, H. K. S.

dc.contributor.author	Gunasekara, L. M. S.
dc.contributor.author	Premadasa, H. K. S.
dc.date.accessioned	2023-11-08T05:27:17Z
dc.date.available	2023-11-08T05:27:17Z
dc.date.issued	2023
dc.identifier.citation	Gunasekara, L. M. S.; Premadasa, H. K. S. (2023). Capturing sentence-level positional data into N-gram profiles for document classification. Proceedings of the International Conference on Applied and Pure Sciences (ICAPS 2023-Kelaniya), Volume 3, Faculty of Science, University of Kelaniya, Sri Lanka. Page 119	en_US
dc.identifier.uri	http://repository.kln.ac.lk/handle/123456789/26954
dc.description.abstract	Document classification is a crucial aspect of natural language processing, with applications in domains such as email spam filtering, hate speech detection, and political bias assessment. While modern transformer-based classification approaches have shown promising results in this area, they rely on expensive parallel processing hardware, which puts them out of reach for simpler applications. There is therefore still room for approaches with lower computational complexity. N-grams are a simple and efficient way of representing text data as features based on the distribution of contiguous tokens within the text. This approach is widely used in text analysis and research due to its language independence and minimal pre-processing requirements. However, most n-gram models do not include sentence-level positional information in their n-gram profiles. Hence, in this study, we propose a revised algorithm for generating n-gram profiles for the document categories in a classification task. We combine this new algorithm with the Euclidean distance metric to assign class labels to raw documents. The algorithm was evaluated on two main tasks: language classification and subject classification (in English). Our results show that this approach achieves accuracy levels comparable to state-of-the-art models. For the language classification task, we achieved an accuracy of 91% on the WiLI Benchmark Dataset, which covers 235 languages, with an average prediction time of 1.88 × 10⁻² seconds. Furthermore, we investigated several configurations of n-gram range and n-gram cutoff length for the subject classification task. The best-performing configuration, a fixed n-gram length of 5 and a cutoff length of 5000, achieves an accuracy of 50% with an average inference time of 3.29 × 10⁻² seconds on the 20 Newsgroups Dataset, which spans 20 newsgroup categories. Overall, our findings suggest that including sentence-level positional data in n-gram profiles can support an algorithm of minimal complexity, and that this algorithm, combined with a suitable n-gram range and cutoff level, can perform well for document classification, particularly when dealing with noisy data with similar categorical labels.	en_US
dc.publisher	Faculty of Science, University of Kelaniya Sri Lanka	en_US
dc.subject	Document Classification, N-Grams, Natural Language Processing, Language Classification, Subject Classification	en_US
dc.title	Capturing sentence-level positional data into N-gram profiles for document classification	en_US
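
The abstract describes the method only at a high level: per-category n-gram profiles that carry sentence-level positional information, compared using Euclidean distance. The following is a minimal sketch of that general idea, not the authors' algorithm. It rests on several assumptions the record does not confirm: character n-grams, sentence boundaries marked with hypothetical "^" and "$" padding characters, relative-frequency profiles, and a cutoff that keeps only the most frequent n-grams. All function names are illustrative.

```python
import math
import re
from collections import Counter


def sentence_ngrams(text, n):
    """Yield character n-grams, one sentence at a time.

    ASSUMPTION: padding each sentence with "^" and "$" is one hypothetical
    way to encode sentence-level position; the abstract does not say how
    the authors encode it.
    """
    for sentence in re.split(r"[.!?]+", text.lower()):
        sentence = sentence.strip()
        if not sentence:
            continue
        padded = "^" + sentence + "$"
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]


def build_profile(texts, n=5, cutoff=5000):
    """Map the `cutoff` most frequent n-grams to their relative frequency."""
    counts = Counter()
    for text in texts:
        counts.update(sentence_ngrams(text, n))
    top = counts.most_common(cutoff)
    total = sum(count for _, count in top) or 1  # avoid division by zero
    return {gram: count / total for gram, count in top}


def euclidean(p, q):
    """Euclidean distance between two sparse profiles (missing n-gram = 0)."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def classify(text, profiles, n=5, cutoff=5000):
    """Return the label whose category profile is nearest to the document."""
    doc = build_profile([text], n, cutoff)
    return min(profiles, key=lambda label: euclidean(doc, profiles[label]))


# Toy usage with two tiny, made-up category corpora.
profiles = {
    "en": build_profile(["the cat sat on the mat. the dog ran."], n=3),
    "es": build_profile(["el gato se sentó. el perro corrió."], n=3),
}
label = classify("the cat ran.", profiles, n=3)
```

The defaults `n=5` and `cutoff=5000` mirror the best-performing subject-classification configuration reported in the abstract; the toy usage uses `n=3` only because the example corpora are very short.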