Browsing by Author "Gunasekara, R.P.T.H."

Optimization of SpdK-means Algorithm
(Faculty of Graduate Studies, University of Kelaniya, Sri Lanka, 2016) Gunasekara, R.P.T.H.; Wijegunasekara, M.C.; Dias, N.G.J.
This study was carried out to enhance the performance of the k-means data mining algorithm by using parallel programming methodologies. As a result, the Speedup k-means (SpdK-means) algorithm, an extension of the k-means algorithm, was implemented to reduce the cluster building time. Although SpdK-means speeds up the cluster building process, its main drawback was that the cumulative cluster density of the clusters it created differed from the initial population. This means that some elements (data points) were missed in the clustering process, which reduces cluster quality. The aim of this paper is to discuss how the drawback was identified and how the SpdK-means algorithm was optimized to overcome it. The SpdK-means clustering algorithm was applied to three datasets gathered from a Ceylon Electricity Board dataset, varying the number of clusters k. For k = 2, 3 and 4, there was no significant difference between the cumulative cluster density and the initial dataset. When the number of clusters was more than 4 (i.e., k >= 5), there was a significant difference in cluster densities. The density of each cluster was recorded, and it was found that the cumulative density of all clusters differed from the initial population: about 1% of the elements of the total population were missing after the clusters were formed. To overcome this drawback, the SpdK-means clustering algorithm was studied carefully, and it was found that elements at equal distances from several cluster centroids were missed in intermediate iterations. When an element was equidistant from two or more centroids, the SpdK-means algorithm was unable to determine which cluster the element should belong to, and as a result the element was not included in any cluster.
If such an element were included in all the clusters from which it is equidistant, and this were repeated for all such elements, the cumulative cluster density would greatly exceed the initial population. Therefore, SpdK-means was optimized by selecting one of the cluster centroids that are at equal distance from the element. After many selection methods and their outcomes had been studied, the SpdK-means algorithm was modified to find a suitable cluster for an equidistant element. Since such an element can belong to any of these clusters, it is not possible to give priority to any one of them. As all of these centroids are at equal distance from the element, the algorithm selects one of the equidistant centroids at random. The optimized SpdK-means algorithm successfully solved the identified problem by identifying the missing elements and including them in the correct clusters. An analysis of the iterations on the datasets showed that the number of iterations was reduced by 20% compared with the former SpdK-means algorithm. When applied to the above-mentioned datasets, the optimized SpdK-means algorithm also reduced the cluster building time by a further 10% to 12% compared with the former SpdK-means algorithm.
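The tie-handling idea in this abstract can be illustrated with a minimal sketch. It is not the authors' SpdK-means code, which is not given in the source: the function name `assign_with_random_ties`, the tolerance `tol`, and the toy data are all assumptions made for illustration. The sketch only shows the assignment step, where a point that is equally near to several centroids is given to one of them at random, so that it is neither dropped nor counted more than once.

```python
import math
import random


def assign_with_random_ties(points, centroids, rng=None, tol=1e-12):
    """Assign every point to exactly one nearest centroid.

    Hypothetical helper, not taken from the paper. When several centroids
    are equally near (within `tol`), one of them is picked at random.
    """
    rng = rng or random.Random()
    labels = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        nearest = min(dists)
        # Indices of all centroids tied for the minimum distance.
        tied = [i for i, d in enumerate(dists) if d - nearest <= tol]
        labels.append(rng.choice(tied))
    return labels


# Toy data (assumed): the last point is equidistant from both centroids.
points = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.0)]
centroids = [(0.0, 0.0), (2.0, 0.0)]
labels = assign_with_random_ties(points, centroids, rng=random.Random(0))

# The cumulative cluster density equals the population size,
# because every point receives exactly one label.
sizes = [labels.count(i) for i in range(len(centroids))]
assert sum(sizes) == len(points)
```

Comparing the sum of the cluster sizes with the size of the population, as in the last lines, is the same check the abstract describes for detecting missing elements.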
Performance of k-mean data mining algorithm with the use of WEKA-parallel
(University of Kelaniya, 2013) Gunasekara, R.P.T.H.; Dias, N.G.J.; Wijegunasekara, M.C.
This study is based on enhancing the performance of the k-means data mining algorithm by using parallel programming methodologies. To assess the effect of parallelization, the k-means algorithm was first studied using WEKA on a stand-alone machine, and its performance was then compared with that of k-means in WEKA-parallel. Data mining is a process for discovering whether data in a database or dataset exhibit similar patterns, in areas such as finance, the retail industry, science, statistics, medical sciences, artificial intelligence and neuroscience. To discover patterns in large datasets, clustering algorithms such as k-means, k-medoid and balanced iterative reducing and clustering using hierarchies (BIRCH) are used. In data mining, k-means clustering is a method of cluster analysis that aims to partition n observations into k clusters (where k is the number of selected groups), in which each observation belongs to the cluster with the nearest mean. The grouping is done by minimizing the sum of squared Euclidean distances between the items and the corresponding centroid (the center of mass of the cluster). As datasets are growing exponentially, high-performance technologies are needed to analyze them and to recognize patterns in them. The applications or algorithms used for these processes have to access data records repeatedly over several iterations. Therefore, on a very large scale, this process is very time consuming and consumes more device memory. During the study of enhancing the performance of data mining algorithms, it was found that the data mining algorithms developed for parallel processing were based on distributed, cluster or grid computing environments. Nowadays, algorithms need to be implemented for multi-core processors in order to utilize the full computational power of the processors.
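The description of k-means above (assign each observation to the nearest mean, then minimize the sum of squared Euclidean distances to the centroids) can be sketched in a few lines. This is a plain, single-threaded illustration and not the WEKA or WEKA-parallel implementation; the function names and the toy data are assumptions made for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    # min() returns the first minimum, so ties go to the lowest index here.
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def mean(points):
    """Component-wise mean (center of mass) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def kmeans_step(points, centroids):
    """One iteration: assign points, then recompute each centroid."""
    labels = [nearest_centroid(p, centroids) for p in points]
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == i]
        # An empty cluster keeps its previous centroid (a simplifying assumption).
        new_centroids.append(mean(members) if members else old)
    return labels, new_centroids


def within_cluster_ss(points, labels, centroids):
    """The objective: sum of squared distances to the assigned centroid."""
    return sum(squared_distance(p, centroids[lab])
               for p, lab in zip(points, labels))


# Toy data (assumed): two well-separated groups of two points each.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 1.0), (9.0, 3.0)]
centroids = [(0.0, 0.0), (10.0, 0.0)]
labels, centroids = kmeans_step(points, centroids)
objective = within_cluster_ss(points, labels, centroids)
```

In a full run, `kmeans_step` would be repeated until the labels stop changing. Each repetition passes over every data record, which is the iterative cost the abstract refers to.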
The widely used machine learning and data mining software WEKA was first chosen to analyze clusters and measure the performance of the k-means algorithm. The k-means clustering algorithm was applied to an electricity consumption dataset to generate k clusters. As a result, the dataset was partitioned into k clusters along with their mean values, and the time taken to build the clusters was recorded. (The dataset consists of 30000 entries and was collected from the Ceylon Electricity Board.) Secondly, to reduce the time consumed, a parallel environment was selected using WEKA-parallel (machine learning software). This is a new option of WEKA used for multi-core programming methodology that can be used to connect several servers and client machines. Here, threads are passed among machines to fulfill this task. WEKA-parallel was installed and set up on several distributed server machines with one client machine. The same electricity consumption dataset was used with k-means in WEKA-parallel. Clusters were built faster when the parallel software was used. However, the mean values of the clusters did not exactly match those of the previously obtained clusters. By visualizing both sets of clusters, it was found that some border elements of the first set of clusters had moved to other clusters. The mean values of the clusters changed because of those elements. The experiment was done on a single core i3, 3.3 GHz machine with the Linux operating system to find the execution time taken to create k clusters using WEKA for several different datasets. The same experiment was repeated on a cluster of machines with similar specifications to compute the execution time taken to create k clusters in a parallel environment using WEKA-parallel, varying the number of machines in the cluster. According to the results, WEKA-parallel significantly improves the speed of k-means clustering.
The results of the experiment for a dataset on the electricity consumption of consumers in the North Western Province are shown in Table 1. This study shows that the use of WEKA-parallel and parallel programming methodologies significantly improves the performance of the k-means data mining algorithm for building clusters.