Please use this identifier to cite or link to this item: http://repository.kln.ac.lk/handle/123456789/27346
Title: Impact of Feature Selection Towards Short Text Classification
Authors: Jayakody, J.R.K.C.
Vidanagama, V.G.T.N.
Perera, Indika
Herath, H.M.L.K.
Keywords: classification, feature selection, n grams, document frequency
Issue Date: 2023
Publisher: Department of Industrial Management, Faculty of Science, University of Kelaniya Sri Lanka
Citation: Jayakody J.R.K.C.; Vidanagama V.G.T.N.; Perera Indika; Herath H.M.L.K. (2023), Impact of Feature Selection Towards Short Text Classification, International Research Conference on Smart Computing and Systems Engineering (SCSE 2023), Department of Industrial Management, Faculty of Science, University of Kelaniya Sri Lanka. Page 8
Abstract: Feature selection is used in text classification pipelines to reduce the number of redundant or irrelevant features. Feature selection algorithms also help to decrease overfitting, reduce training time, and improve the accuracy of the built models. Similarly, frequency-based feature reduction techniques help eliminate unwanted features. Most existing work on feature selection is based on general text, and the behavior of feature selection has not been properly evaluated on short text datasets. Therefore, this research investigated how performance varies with the features selected by feature selection algorithms on short text datasets. Three publicly available datasets were selected for the experiment. Chi-square, information gain, and F-measure were examined, as these had been identified as the best algorithms for selecting features for text classification. We also examined the impact of these algorithms when selecting different types of features, such as 1-grams and 2-grams. Finally, we looked at the impact of frequency-based feature reduction techniques on the selected datasets. Our results showed that the information gain algorithm outperforms the other two algorithms. Moreover, selecting the best 20% of features with the information gain algorithm provides the same level of performance as the entire feature set. Further, we observed that the higher number of dimensions was due to bigrams, and we observed the impact of n-grams on the feature selection algorithms. It is also worth noting that removing the features that occur twice in a document would be ideal before applying feature selection techniques with different algorithms.
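The pipeline described in the abstract (1-gram and 2-gram features, a frequency-based reduction step, then keeping the best 20% of features by information gain) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the whitespace tokenizer, the binary presence features, the `min_df` document-frequency threshold, and the toy texts are all assumptions made for the example.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Return the set of 1-grams up to n_max-grams present in a text."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenizer
    feats = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.add(" ".join(tokens[i:i + n]))
    return feats


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(feature, docs, labels):
    """H(class) minus H(class | feature present or absent)."""
    with_f = Counter(l for d, l in zip(docs, labels) if feature in d)
    without_f = Counter(l for d, l in zip(docs, labels) if feature not in d)
    n = len(labels)
    n_with = sum(with_f.values())
    h_class = entropy(list(Counter(labels).values()))
    h_cond = (n_with / n) * entropy(list(with_f.values())) \
        + ((n - n_with) / n) * entropy(list(without_f.values()))
    return h_class - h_cond


def select_features(texts, labels, keep_fraction=0.2, min_df=2, n_max=2):
    """Prune rare n-grams, then keep the top fraction ranked by information gain."""
    docs = [ngrams(t, n_max) for t in texts]
    doc_freq = Counter(f for d in docs for f in d)
    # Assumption: drop features seen in fewer than min_df documents.
    candidates = [f for f, c in doc_freq.items() if c >= min_df]
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(candidates, key=lambda f: (-information_gain(f, docs, labels), f))
    k = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[:k]


# Hypothetical toy data, not taken from the paper's datasets.
texts = ["win cash now", "win cash prize", "lunch at noon", "lunch at nine"]
labels = ["spam", "spam", "ham", "ham"]
print(select_features(texts, labels))
```

Raising `n_max` enlarges the candidate set quickly, which is consistent with the abstract's observation that bigrams account for much of the dimensionality; the pruning threshold and the kept fraction are the two knobs to vary when reproducing that kind of comparison.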
URI: http://repository.kln.ac.lk/handle/123456789/27346
Appears in Collections: Smart Computing and Systems Engineering - 2023 (SCSE 2023)

Files in This Item:
File: Proceeding SCSE 2023 (3) 8.pdf (11.58 kB, Adobe PDF)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.