Monday 29 August 2011

K-means clustering

K-means clustering is a partitioning (or top-down) clustering method. It lets the user choose the number of clusters, k, into which a given data set is divided, and it groups the data points according to some measure of similarity.

Procedure

· Randomly split the data into k groups, each containing roughly the same number of data points

· Calculate the centroid (centre point) of each group

· Reassign each data point to the group whose centroid it is most similar to

· Calculate a new centroid for each group, reassign the data points again, and so on

· Repeat this iteration until the centroids are stable (they no longer change)
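The procedure above can be sketched in a few lines of Python. This is only a minimal illustration, not a production implementation: it assumes that "similarity" means squared Euclidean distance, and the function names (kmeans, centroid, squared_distance), the seed parameter and the sample points are all made up for this example.

```python
import random


def centroid(points):
    """Centre point (coordinate-wise mean) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def squared_distance(p, q):
    """Squared Euclidean distance, used here as the (assumed) similarity measure."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    if k < 1 or k > len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)

    # Step 1: random split into k groups of (nearly) equal size.
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for rank, idx in enumerate(order):
        labels[idx] = rank % k

    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: centroid of each group (an emptied group keeps its old centroid).
        for j in range(k):
            members = [p for p, g in zip(points, labels) if g == j]
            if members:
                centroids[j] = centroid(members)

        # Step 3: reassign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]

        # Step 5: stop once nothing moves, so the centroids are stable.
        if new_labels == labels:
            break
        labels = new_labels  # Step 4: loop round and recompute the centroids.

    return labels, centroids


# Hypothetical sample data: two visibly separate groups of points.
sample = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(sample, k=2, seed=1)
print(labels)
print(centroids)
```

Which group ends up numbered 0 and which 1 depends on the random split, so only the grouping itself is meaningful, not the label values.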

Advantages

· With large data sets, K-means is faster than hierarchical clustering (provided the number of clusters is small).

· Data reduction is accomplished by replacing the coordinates of each point in a cluster with the coordinates of that cluster’s center point (centroid). This makes the data easier to handle.
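The data reduction idea can be illustrated with a small sketch. The points, labels and centroids below are invented for the example (the centroids are simply the means of the two hand-picked groups), and the helper names reduce_points and reconstruction_error are hypothetical.

```python
# Hypothetical data: four points already assigned to two clusters.
points = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 9.5)]
labels = [0, 0, 1, 1]
centroids = [(1.25, 1.9), (8.5, 8.75)]  # mean of each cluster above


def reduce_points(labels, centroids):
    """Replace every point by the centroid of the cluster it belongs to."""
    return [centroids[g] for g in labels]


def reconstruction_error(points, reduced):
    """Total squared distance lost by storing centroids instead of points."""
    return sum(
        (a - b) ** 2
        for p, q in zip(points, reduced)
        for a, b in zip(p, q)
    )


reduced = reduce_points(labels, centroids)
error = reconstruction_error(points, reduced)
print(reduced)
print(error)
```

After the reduction only k distinct coordinates remain, plus one cluster label per point. The error value gives a rough sense of how much detail was given up in exchange.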

Disadvantages

· The number of clusters must be specified as an input. The method cannot determine the appropriate number of clusters by itself and depends on the user's choice. It also builds the cluster centers from an initial assignment of the data. If the data has no clear group structure and the centers are not stable, then different runs (with different starting assignments) can give different results.
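The dependence on the starting point can be seen in a tiny, hand-made example. The sketch below assumes squared Euclidean distance as the similarity measure. The four points and the two starting splits are invented for illustration, and run_from is a hypothetical helper that assumes no group ever becomes empty.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def run_from(points, labels, k, max_iter=100):
    """Iterate 'compute centroids, reassign' from a given starting split.

    Returns the final labels and the total squared distance of the points
    to their centroids (smaller means tighter clusters).
    Assumes every group stays non-empty, which holds for the data below.
    """
    for _ in range(max_iter):
        cents = [
            centroid([p for p, g in zip(points, labels) if g == j])
            for j in range(k)
        ]
        new = [min(range(k), key=lambda j: sq_dist(p, cents[j])) for p in points]
        if new == labels:
            break
        labels = new
    total = sum(sq_dist(p, cents[g]) for p, g in zip(points, labels))
    return labels, total


# Four corners of a long, thin rectangle (made-up data).
corners = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

labels_a, total_a = run_from(corners, [0, 0, 1, 1], k=2)  # split left / right
labels_b, total_b = run_from(corners, [0, 1, 0, 1], k=2)  # split bottom / top
print(labels_a, total_a)
print(labels_b, total_b)
```

Both starting splits are already stable, so the procedure stops immediately in each case, yet the second one leaves the points much further from their centroids. A common workaround is to run the algorithm from several random starts and keep the result with the smallest total distance.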

Practical examples

1) Operations

An FMCG company is planning to enter India on a large scale. Apart from bringing its own brands into India, it wants to set up an entire supply chain network across the country. Setting up a warehouse in every city would not make sense, as it would add to the costs without additional benefits. A K-means framework helps in choosing suitable locations for its warehouses. The input needed is the number of warehouses the company chooses to set up in India (the value of k). The clustering is done based on the distances of the locations from the centroid of each cluster. The company can then set up a warehouse at each of the centroid locations, which can be used to cater to the other locations in that cluster.

2) Banking & financial services

A bank wants to analyse its corporate customers to decide whom to target, how much to offer and what loan rate to charge. It wants to group the companies by the strength of their business and their risk profile, so that every company in a group can receive the same treatment. The K-means algorithm can be applied here, based on certain financial parameters/ratios (such as cash ratio, inventory turnover, ROA, ROE etc). K-means clustering can be used to find the ideal company profile, one that is low on risk and high on strength, and the cluster of companies around it. The companies in the cluster of the ideal company can be targeted for loans and offered the same loan amount and rate. The companies in other clusters could be offered different rates based on their risk profile.

3) Marketing

A marketing company wants to conduct market research on its new product, which is expected to have a pan-India launch soon. Interviewing the entire population is not feasible, so instead it seeks responses from each group defined by certain characteristics (such as age, gender, ethnicity, religion etc). K-means cluster analysis attempts to identify relatively similar groups of respondents based on the selected characteristics, using a method that can handle large numbers of respondents.
