Monday 29 August 2011

Hierarchical Clustering - Step by step!!!


This article is especially for people like me who tend to forget things very easily (especially what is taught in class) and also for those who unfortunately couldn't make it to the class...!!

I shall try and describe (step by step) the things (only the ones I could retain by now) that we were taught in class today. Try solving the case along with this, it may give you more clarity.

So let's start..!!!

First of all, remember 3 steps that would help you in clustering your data.

1. Selection of variables - objective of forming clusters
2. Distance
3. Clustering criteria

Okay now let me directly go to the problem that we discussed - the "Cell_Inter". The variables in the data reflected the cell services and functions.

So we started with the Hierarchical clusters. (following is the step by step approach of hierarchical clustering)

To get started, choose the variables for the cluster. (We chose the functions of a cell phone like SMS, Alarm etc.)

Now go to "Analyze", choose "Classify", then "Hierarchical Cluster".

Next you need to select all the similar variables (all features in our case) and put them in the "variables" box

Now check "Agglomeration schedule" in the "Statistics" tab and check "Dendrogram" in the "Plots" tab. The agglomeration schedule will appear as a table in the Output window.
Go to the "Method" tab. You'll find the cluster method is set by default to "Between-groups linkage" (which is what we want). Under "Measure", select "Binary" and pick "Jaccard" from the drop-down. Jaccard is used to measure how similar two variables are. More clarity later on..!!

Now we need to figure out how many clusters we need. This can be done by trial and error method until we are satisfied with the clusters.

Draw a cut-off line at the point where the next object joins a cluster at a relatively longer distance (the first big jump in the dendrogram).
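Outside SPSS, the same idea can be sketched with scipy. This is a minimal illustration with made-up yes/no data (not the Cell_Inter file): scipy's "jaccard" metric gives 1 minus the Jaccard similarity, and "average" linkage is scipy's name for between-groups linkage.

```python
# Toy sketch of hierarchical clustering of binary variables with Jaccard,
# mirroring the SPSS steps above. Data and feature names are invented.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Rows = features, columns = respondents (True = YES, False = NO)
features = np.array([
    [1, 1, 1, 0, 1, 1],  # SMS
    [1, 1, 1, 0, 1, 0],  # Alarm
    [0, 0, 1, 1, 0, 1],  # Camera
    [0, 0, 0, 1, 0, 1],  # Games
], dtype=bool)

# pdist's "jaccard" metric returns 1 - Jaccard similarity (a distance)
dist = pdist(features, metric="jaccard")

# "average" linkage = between-groups linkage (UPGMA)
Z = linkage(dist, method="average")

# Cut the tree into 2 clusters -- the trial-and-error step from the post
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # SMS & Alarm end up together, Camera & Games together
```

Calling `scipy.cluster.hierarchy.dendrogram(Z)` would draw the same tree SPSS plots, and you can eyeball where to place the cut-off line.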

To calculate the Jaccard index (between Alarm and SMS), divide the number of joint YES's by the TOTAL minus the number of joint NO's:

--> 162/(206-12) = 0.835
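The same arithmetic in a couple of lines of Python (the counts are taken from the post; the variable names are mine):

```python
# Jaccard index between Alarm and SMS, using the numbers above.
total = 206      # total respondents
both_yes = 162   # said YES to both Alarm and SMS
both_no = 12     # said NO to both

# Jaccard ignores the joint-NO cell: joint YES's / (total - joint NO's)
jaccard = both_yes / (total - both_no)
print(round(jaccard, 3))  # 0.835
```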

The proximity matrix shows the Jaccard value for every pair of variables. A higher value means the two variables are more similar (closer to each other).


Some important points to be remembered:

1. In hierarchical clustering a maximum of 50 variables can be clustered. Cases are normally not clustered hierarchically because they are too many.
2. K-means is used when there are more than 50 objects to cluster.
3. We select "Binary" when the variables take only Yes/No values.
4. For binary data we use only the Jaccard or simple matching measures.
5. In the proximity matrix, Jaccard values range from 0 to 1: 1 means the two variables are identical, and 0 means they have nothing in common.
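Point 5 is easy to see with a tiny hand-rolled Jaccard function (the function and vectors here are illustrative, not from the case data):

```python
# Demonstrating the 0-to-1 range of the Jaccard similarity.
def jaccard(a, b):
    """Jaccard similarity of two yes/no vectors (True = YES)."""
    both_yes = sum(x and y for x, y in zip(a, b))
    either_yes = sum(x or y for x, y in zip(a, b))
    return both_yes / either_yes if either_yes else 1.0

v = [True, True, False]
w = [False, False, True]
print(jaccard(v, v))  # 1.0 -- identical variables
print(jaccard(v, w))  # 0.0 -- no YES answers in common
```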

I hope this proves to be useful, especially around the exam time!!!

P.S. I may have missed some steps and the steps mentioned above may also be jumbled up.

Signing off
Ajvad Rehmani
Group - Marketing1
