Determination of Initial Centers in K-Means Clustering Method by NAMGY Algorithm

Meryem Goral Yildizli, Zeliha Nazan Alparslan

Abstract


Objective: With the development and widespread use of technology, the growing volume of data in many fields has accelerated digitization. The gains obtained by processing and interpreting large collections of data can contribute significantly to institutions and organizations in many managerial matters, from production to decision making. This has increased the use, across different fields, of data mining methods, which support the transformation of digitized large-scale data into information. Clustering is one of the increasingly popular techniques in data mining, and the K-means algorithm is a non-hierarchical clustering method suited to large amounts of data. The method is widely used in scientific studies; however, the number of clusters and the initial centers must be supplied as parameters, which is a disadvantage of the algorithm, especially for users unfamiliar with its mathematical details. K-means is very sensitive to its initial centers, and randomly generated initial centers often lead to non-optimal clustering results. More consistent results can be obtained by running K-means more than once, but it is difficult to decide how many runs are needed to reach the optimal result. An improvement to K-means in this respect would help overcome this disadvantage in scientific studies. To address this problem, the NAMGY (Neighborhood and Midpoint Gain Yield) algorithm has been developed; it includes methods that provide optimal selection of the parameters according to the properties of the objects. This article covers the application of the NAMGY method for determining the initial centers.

Method: To analyze the accuracy of the proposed method, both standard K-means and the NAMGY algorithm were applied to the labeled data sets Iris, Yeast and Segment challenge. In addition, the performance of the algorithms in terms of their working principle was evaluated on the VitaminB12 data set obtained from the Cukurova University Balcalı Hospital Information Management System. Euclidean distances were calculated between objects, and the data sets were transformed into values in the range [0, 1] by normalization. The adjusted Rand index was used to evaluate the validity of the clusterings.

Results: The applications that reveal the effects of the initial centers on the analysis process of the algorithms were carried out from several perspectives: the working principle of the algorithm, the effect of the initial centers on the clustering results, and the evaluation of clustering performance. It was again concluded that informed selection of parameters is required to increase the usability of a clustering algorithm and the reliability of its results. The NAMGY algorithm finds the initial centers systematically, which reduces the number of data set scans and yields better accuracy in fewer iterations. In these experiments, NAMGY performed better than traditional K-means in terms of result quality and analysis process. According to the results, NAMGY offers a competitive alternative that addresses this disadvantage of the standard K-means algorithm. However, further research is required to verify the capability of the algorithm on data sets with more complex objects.
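
The evaluation pipeline named in the Method (normalization to [0, 1], Euclidean distance, adjusted Rand index) and the sensitivity of K-means to its initial centers can be illustrated with a minimal sketch. This is not the NAMGY algorithm, whose details the abstract does not give; it is plain Lloyd-style K-means started from hand-picked centers. The function names, the choice of min-max scaling as the normalization, and the six-point toy data set are all assumptions made for illustration and do not come from the paper.

```python
import math
from collections import Counter


def min_max_normalize(rows):
    """Rescale every column to [0, 1]; a constant column maps to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


def kmeans(points, init_centers, max_iter=100):
    """Lloyd-style K-means started from explicitly supplied initial centers."""
    centers = [list(c) for c in init_centers]
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest center by Euclidean distance.
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index from the contingency table of two labelings."""
    def pairs(n):
        return n * (n - 1) // 2

    index = sum(pairs(n) for n in Counter(zip(truth, pred)).values())
    a = sum(pairs(n) for n in Counter(truth).values())
    b = sum(pairs(n) for n in Counter(pred).values())
    expected = a * b / pairs(len(truth))
    max_index = (a + b) / 2
    if max_index == expected:  # degenerate case, e.g. both labelings trivial
        return 1.0
    return (index - expected) / (max_index - expected)


# Toy data (hypothetical, not from the paper): three well-separated groups.
raw = [[0], [1], [10], [11], [20], [21]]
truth = [0, 0, 1, 1, 2, 2]
data = min_max_normalize(raw)

good_labels = kmeans(data, [data[0], data[2], data[4]])  # one start per group
bad_labels = kmeans(data, [data[0], data[1], data[2]])   # two starts in one group

ari_good = adjusted_rand_index(truth, good_labels)
ari_bad = adjusted_rand_index(truth, bad_labels)
print(ari_good, ari_bad)
```

On this toy data the first initialization recovers the three groups (adjusted Rand index 1.0), while the second converges to a poorer partition that splits one group and merges the other two (adjusted Rand index about 0.24). This is the kind of dependence on the starting centers that a systematic initialization aims to remove.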

Keywords: Clustering, K means, Initial center, Data mining

DOI: 10.7176/JSTR/7-01-05



ISSN (online) 2422-8702