A Comparative Study of Clustering English News Articles Using Clustering Algorithms

Disayiram, N.; Rupasingha, R. A. H. M.

dc.contributor.author	Disayiram, N.
dc.contributor.author	Rupasingha, R. A. H. M.
dc.date.accessioned	2022-10-31T08:50:09Z
dc.date.available	2022-10-31T08:50:09Z
dc.date.issued	2022
dc.identifier.citation	Disayiram N.; Rupasingha R. A. H. M. (2022), A Comparative Study of Clustering English News Articles Using Clustering Algorithms, International Research Conference on Smart Computing and Systems Engineering (SCSE 2022), Department of Industrial Management, Faculty of Science, University of Kelaniya Sri Lanka. 108-113.	en_US
dc.identifier.uri	http://repository.kln.ac.lk/handle/123456789/25411
dc.description.abstract	The news informs us of what is going on in the world. People nowadays read the news that interests them on news websites. There are numerous categories of news, and each newsreader has a different preference among them. Sportspeople prioritize sports news, whereas technology fans pay attention to the technology segment of the news. At the end of the day, each news category is important. Every day, a large amount of information is released on news websites. News sites usually categorize the news; however, not all of the categories are published on those sites. Some categories are given higher attention by news outlets, while others receive less coverage. As a result, finding an appropriate category of news is difficult. These issues make it hard for newsreaders and content seekers to find relevant sections on news websites. Clustering English news articles by relevant category offers a solution to these issues. This research aims to use clustering algorithms to cluster news articles according to their relevant domain. We consider five news categories: politics, sports, health, technology, and business. The data collected online was converted into a vector format using term frequency-inverse document frequency (TF-IDF) vectorization. Three clustering algorithms, Expectation-Maximization (EM), Simple K-means, and Hierarchical Clustering based on an agglomerative approach, were then applied individually to the body of the news and to the news heading. The classes to clusters evaluation model of the Waikato Environment for Knowledge Analysis (WEKA) tool was used to calculate the accuracy. The EM method achieved the highest accuracy, 88.5%, with the best results in terms of correctly clustered instances. The comparison between the heading and the body of the news demonstrates that clustering on the body grouped the news items better than clustering on the heading.	en_US
dc.publisher	Department of Industrial Management, Faculty of Science, University of Kelaniya Sri Lanka	en_US
dc.subject	clustering, domain, Machine Learning, news article	en_US
dc.title	A Comparative Study of Clustering English News Articles Using Clustering Algorithms	en_US
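
The abstract reports accuracy from a classes to clusters evaluation, in which each cluster is matched to a news category and the items whose category agrees with their cluster's assigned category are counted as correct. The Python sketch below illustrates that idea only; it is not the authors' code and does not reproduce WEKA's exact procedure. The function name `classes_to_clusters_accuracy`, the brute-force one-to-one mapping, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def classes_to_clusters_accuracy(true_labels, cluster_ids):
    """Return (accuracy, mapping) for the best one-to-one cluster-to-class mapping.

    Hypothetical helper: it assumes there are no more clusters than classes
    and tries every assignment, which is only practical for a handful of
    categories (five in the abstract).
    """
    if not true_labels:
        raise ValueError("inputs must not be empty")
    if len(true_labels) != len(cluster_ids):
        raise ValueError("inputs must have the same length")

    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")

    # Contingency counts: how many items of each class fall in each cluster.
    counts = Counter(zip(cluster_ids, true_labels))

    best_correct, best_mapping = -1, {}
    # Try every way of giving each cluster its own distinct class.
    for assigned in permutations(classes, len(clusters)):
        correct = sum(counts[(c, k)] for c, k in zip(clusters, assigned))
        if correct > best_correct:
            best_correct = correct
            best_mapping = dict(zip(clusters, assigned))

    return best_correct / len(true_labels), best_mapping


# Toy data (invented for illustration, not taken from the study).
labels = ["sports", "sports", "politics", "politics", "health", "health"]
clusters = [0, 0, 1, 1, 1, 2]
accuracy, mapping = classes_to_clusters_accuracy(labels, clusters)
print(f"accuracy = {accuracy:.3f}, mapping = {mapping}")
```

In the toy data one health article lands in the politics cluster, so five of the six items are counted as correctly clustered.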