Monday, 29 August 2011

Dendrograms: Limitations

So here I am, writing a blog post on dendrograms! Sounds different, sounds interesting! Where do I start? Do I write about the concepts learnt in class about dendrograms, or should I go a step further and write something new? I guess I’ll choose the latter.

So we all know what a dendrogram is: a tree-shaped picture that helps us in clustering. It is basically used to visualize how clusters are formed, one merge at a time.

We know all the good things associated with a dendrogram. However, what about its limitations? Is it always good and what are the shortcomings, if any? We need to address this question as well when we are trying to understand any concept.

A dendrogram is practical only for a small number of observations. As the number of observations increases, it becomes difficult to distinguish the individual leaves. Another valid point is that the vertical axis represents the level of the criterion at which any two clusters are joined. Successive joining of clusters implies a hierarchical structure, which means dendrograms are suitable only for hierarchical cluster analysis.
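To make the "vertical axis" point concrete, here is a minimal sketch of single-linkage agglomerative clustering on five made-up one-dimensional points. The function name, the data and the choice of single linkage are my own assumptions for illustration; the heights it records are the values a dendrogram would draw on its vertical axis.

```python
def single_linkage_merges(points):
    """Return the merges as (cluster_a, cluster_b, height), in the order they happen."""
    # Every observation starts as its own cluster (a tuple of point indices).
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(abs(points[a] - points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


points = [1.0, 2.0, 6.0, 7.0, 15.0]  # made-up data
for left, right, height in single_linkage_merges(points):
    print(left, "+", right, "joined at height", height)
```

Every run produces exactly n - 1 merges, each at its own height, so the picture is a tree by construction. That is why the plot has no natural counterpart for a method that does not merge clusters step by step.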

For large numbers of observations, hierarchical cluster algorithms become time-consuming. The computational complexity of the three popular linkage methods is of order O(n^2), whereas the most popular non-hierarchical cluster algorithm, k-means ([R] cluster kmeans), is only of order O(kn), where k is the number of clusters and n the number of observations (Hand et al., 2001). This is one reason why k-means, a non-hierarchical method, is emerging as a popular choice in the data mining community.
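A rough back-of-the-envelope sketch of that difference: the snippet below only counts distance evaluations, which is a simplification I am assuming for illustration. The hierarchical count is the number of pairs in the distance matrix, and the k-means count is for a single pass over the data (a real run repeats this for several iterations). The function names and the choice k = 5 are hypothetical.

```python
def pairwise_distance_count(n):
    """Distances between all pairs of n observations: grows like n^2."""
    return n * (n - 1) // 2


def kmeans_distance_count(n, k, iterations=1):
    """Distances from n observations to k centres, per iteration: grows like k*n."""
    return n * k * iterations


for n in (100, 10_000, 1_000_000):
    print(n, pairwise_distance_count(n), kmeans_distance_count(n, k=5))
```

For 10,000 observations that is about 50 million pairwise distances against 50,000 distances per k-means pass, so even a few dozen passes stay far cheaper.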

Hence another graph is now being used as the number of observations grows. It is referred to as the “Clustergram.” This graph is useful in exploratory analysis for non-hierarchical clustering algorithms like k-means, and for hierarchical cluster algorithms when the number of observations is large enough to make dendrograms impractical.

The Clustergram can be understood as a type of parallel coordinates plot in which each observation is given a vector. The vector contains the observation’s location for each number of clusters the dataset was split into. The scale of the vector is the scale of the first principal component of the data.
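Here is one possible reading of that description as code, not a reference implementation. I assume one-dimensional made-up data, so the value itself stands in for the first principal component score, and I take an observation's "location" for a given k to be the centre of the cluster it falls into. The tiny k-means helper and all names below are hypothetical.

```python
def kmeans_1d(values, k, iterations=20):
    """Tiny Lloyd-style k-means on 1-D data; returns (labels, centres)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centres across the sorted data.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]

    def assign():
        # Label each value with the index of its nearest centre.
        return [min(range(k), key=lambda c: abs(v - centres[c])) for v in values]

    for _ in range(iterations):
        labels = assign()
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # keep the old centre if a cluster ends up empty
                centres[c] = sum(members) / len(members)
    return assign(), centres


def clustergram_coordinates(values, max_k):
    """One vector per observation: its cluster centre for k = 1 .. max_k."""
    coords = [[] for _ in values]
    for k in range(1, max_k + 1):
        labels, centres = kmeans_1d(values, k)
        for row, lab in zip(coords, labels):
            row.append(centres[lab])
    return coords


values = [1.0, 2.0, 6.0, 7.0, 15.0]  # made-up data
for value, row in zip(values, clustergram_coordinates(values, max_k=3)):
    print(value, row)
```

Plotting each row against k = 1, 2, 3 would give one line per observation. Lines that travel together belong to the same cluster, and the points where they fan out show where a cluster splits as k grows.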

I guess this much about the dendrogram and the Clustergram is sufficient for the day. One can always find PDFs and research papers to dig deeper into the subject and unravel many more interesting facts and observations.

Harshala D (Roll No 13174)

Finance: Group 5
