Monday 29 August 2011

Limitations of Cluster Analysis

There are several things to be aware of when conducting cluster analysis:

1. The different methods of clustering usually give very different results. This occurs because each method uses a different criterion for merging clusters (including individual cases). It is important to think carefully about which method best suits what you are interested in looking at.

2. With the exception of single linkage, the results will be affected by the way in which the variables are ordered.

3. The analysis is not stable when cases are dropped: this occurs because the selection of a case (or the merger of clusters) depends on the similarity of one case to the cluster. Dropping one case can drastically affect the course the analysis takes.

4. The hierarchical nature of the analysis means that early ‘bad judgments’ cannot be rectified.
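The first point in the list above can be illustrated with a minimal sketch. The function `agglomerate` and the six one-dimensional scores are hypothetical, chosen only so that the gaps between neighbouring cases widen slowly; this is a naive illustration, not the procedure SPSS uses internally.

```python
from itertools import combinations

def agglomerate(points, linkage, n_clusters):
    """Naive agglomerative clustering of 1-D points.

    `linkage` is min (nearest neighbour) or max (furthest neighbour),
    applied to all pairwise distances between two clusters.
    """
    clusters = [[p] for p in points]  # every case starts as its own cluster
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(
                abs(a - b) for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)  # a merge is never undone later
    return sorted(sorted(c) for c in clusters)

# Hypothetical scores: gaps of 10, 11, 12, 13 and 14 between neighbours.
cases = [0, 10, 21, 33, 46, 60]
single_result = agglomerate(cases, min, 2)
complete_result = agglomerate(cases, max, 2)
print(single_result)    # [[0, 10, 21, 33, 46], [60]]
print(complete_result)  # [[0, 10, 21, 33], [46, 60]]
```

On the same data, nearest neighbour chains the cases together and leaves one case on its own, while furthest neighbour produces a four-case and a two-case cluster.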

Differences between Hierarchical Clustering and Non-hierarchical Clustering

Hierarchical clustering

• No decision about the number of clusters is needed in advance

• Problems when the data contain a high level of error

• Can be very slow

• Initial decisions are more influential (one-step only)

Non-hierarchical clustering

• Faster, more reliable

• Need to specify the number of clusters (arbitrary)

• Need to set the initial seeds (arbitrary)
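The need to choose initial seeds can be sketched with k-means, used here as one representative non-hierarchical method. That choice, the helper names `kmeans` and `within_ss`, and the four points are all assumptions made for illustration.

```python
def kmeans(points, seeds, max_iter=100):
    """Plain k-means (Lloyd's algorithm) on 2-D points with explicit seeds."""
    centres = list(seeds)
    for _ in range(max_iter):
        # Assign each point to its nearest centre (squared Euclidean distance).
        groups = [[] for _ in centres]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centres]
            groups[d.index(min(d))].append(p)
        # Recompute each centre as the mean of its group (keep it if empty).
        new_centres = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else c
            for g, c in zip(groups, centres)
        ]
        if new_centres == centres:
            break  # assignments can no longer change
        centres = new_centres
    return [sorted(g) for g in groups]

def within_ss(groups):
    """Total within-cluster sum of squared distances to the cluster means."""
    total = 0.0
    for g in groups:
        cx = sum(x for x, _ in g) / len(g)
        cy = sum(y for _, y in g) / len(g)
        total += sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in g)
    return total

# Hypothetical data: the corners of a long, flat rectangle.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
good = kmeans(points, seeds=[(0, 0), (4, 0)])  # one seed at each end
poor = kmeans(points, seeds=[(0, 0), (0, 1)])  # both seeds at the same end
print(good, within_ss(good))  # left/right split, within-cluster SS 1.0
print(poor, within_ss(poor))  # top/bottom split, within-cluster SS 16.0
```

With two clusters requested in both runs, the only thing that differs is the seeds, yet the second run settles on a much worse solution and stays there.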

Distance between Cluster Pairs

The most frequently used methods for combining clusters at each stage are available in SPSS. These methods define the distance between two clusters at each stage of the procedure. If cluster A has cases 1 and 2 and if cluster B has cases 5, 6, and 7, you need a measure of how different or similar the two clusters are.

Nearest neighbour (single linkage). If you use the nearest neighbour method to form clusters, the distance between two clusters is defined as the smallest distance between two cases in the different clusters. That means the distance between cluster A and cluster B is the smallest of the distances between the following pairs of cases: (1,5), (1,6), (1,7), (2,5), (2,6), and (2,7). At every step, the distance between two clusters is taken to be the distance between their two closest members.

Furthest neighbour (complete linkage). With the furthest neighbour method, the distance between two clusters is defined as the largest distance between two cases in the different clusters, that is, the distance between their two furthest members.
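The nearest and furthest neighbour definitions can be compared on the cluster A and cluster B example. The scores assigned to cases 1, 2, 5, 6 and 7 below are hypothetical values on a single variable, so distance is simply the absolute difference.

```python
# Hypothetical scores on one variable, keyed by case number.
cluster_a = {1: 2.0, 2: 4.0}
cluster_b = {5: 7.0, 6: 9.0, 7: 12.0}

# Distances for the six between-cluster pairs:
# (1,5), (1,6), (1,7), (2,5), (2,6), (2,7).
pair_distances = {
    (i, j): abs(x - y)
    for i, x in cluster_a.items()
    for j, y in cluster_b.items()
}

nearest_pair = min(pair_distances, key=pair_distances.get)
furthest_pair = max(pair_distances, key=pair_distances.get)

print(nearest_pair, pair_distances[nearest_pair])    # (2, 5) 3.0  -> single linkage
print(furthest_pair, pair_distances[furthest_pair])  # (1, 7) 10.0 -> complete linkage
```

The same six pairs are inspected in both cases; only the choice of smallest versus largest distance changes.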

Average linkage within groups. The average linkage between groups method, often called UPGMA (unweighted pair-group method using arithmetic averages), considers only distances between pairs of cases in different clusters. A variant of it, average linkage within groups, combines clusters so that the average distance between all cases in the resulting cluster is as small as possible. Thus, the distance between two clusters is the average of the distances between all possible pairs of cases in the resulting cluster.

The next three methods use squared Euclidean distances.
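The difference between averaging over between-cluster pairs only and averaging over every pair in the merged cluster can be sketched as follows. The scores for the two clusters are hypothetical values on a single variable.

```python
from itertools import combinations, product
from statistics import mean

# Hypothetical scores on one variable for two clusters.
cluster_a = [2.0, 4.0]
cluster_b = [7.0, 9.0, 12.0]

# Between groups (UPGMA): only pairs with one case from each cluster (2 x 3 = 6 pairs).
between = mean(abs(x - y) for x, y in product(cluster_a, cluster_b))

# Within groups: every pair of cases in the merged cluster (10 pairs for 5 cases).
merged = cluster_a + cluster_b
within = mean(abs(x - y) for x, y in combinations(merged, 2))

print(round(between, 3))  # 6.333
print(within)             # 5.0
```

The within-groups figure is lower here because it also counts the short distances between cases that were already in the same cluster.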

Ward’s method. For each cluster, the means for all variables are calculated. Then, for each case, the squared Euclidean distance to the cluster means is calculated. These distances are summed for all of the cases. At each step, the two clusters that merge are those that result in the smallest increase in the overall sum of the squared within-cluster distances. The coefficient in the agglomeration schedule is the within-cluster sum of squares at that step, not the distance at which clusters are joined.
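A minimal sketch of one step of Ward's method follows, assuming three hypothetical clusters measured on a single variable (with several variables, the squared deviations would be summed over all of them). The helper name `sse` is an assumption.

```python
from itertools import combinations

def sse(values):
    """Sum of squared deviations of the cases from their cluster mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

# Hypothetical clusters on one variable.
clusters = {"A": [1.0, 3.0], "B": [6.0, 8.0], "C": [12.0, 14.0]}

# Increase in the overall within-cluster sum of squares for each candidate merge.
increase = {
    (p, q): sse(clusters[p] + clusters[q]) - sse(clusters[p]) - sse(clusters[q])
    for p, q in combinations(clusters, 2)
}

to_merge = min(increase, key=increase.get)

# Agglomeration-schedule coefficient: total within-cluster SS after the merge.
coefficient = sum(sse(v) for v in clusters.values()) + increase[to_merge]

print(increase)     # {('A', 'B'): 25.0, ('A', 'C'): 121.0, ('B', 'C'): 36.0}
print(to_merge)     # ('A', 'B')
print(coefficient)  # 31.0
```

The coefficient of 31.0 is the total within-cluster sum of squares after A and B are joined, not a distance between A and B.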

Centroid method. This method calculates the distance between two clusters as the sum of distances between cluster means for all of the variables. In the centroid method, the centroid of a merged cluster is a weighted combination of the centroids of the two individual clusters, where the weights are proportional to the sizes of the clusters. One disadvantage of the centroid method is that the distance at which clusters are combined can actually decrease from one step to the next. This is an undesirable property because clusters merged at later stages are expected to be more dissimilar than those merged at earlier stages.
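The decrease in merging distance described above can be reproduced with three cases arranged as a nearly equilateral triangle. The coordinates and the helper names `sq_dist` and `centroid` are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(cases):
    """Mean of each variable over the cases in a cluster."""
    n = len(cases)
    return tuple(sum(c[k] for c in cases) / n for k in range(len(cases[0])))

# Hypothetical cases measured on two variables.
p1, p2, p3 = (0.0, 0.0), (2.0, 0.0), (1.0, 1.8)

# Step 1: p1 and p2 are the closest pair (4.0 versus about 4.24 for the other two pairs).
first_merge = sq_dist(p1, p2)

# Step 2: the new centroid (1.0, 0.0) sits closer to p3 than either p1 or p2 did.
second_merge = sq_dist(centroid([p1, p2]), p3)

print(first_merge)             # 4.0
print(round(second_merge, 2))  # 3.24
```

The second merge happens at a smaller distance than the first, so the agglomeration coefficients do not rise steadily from step to step.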

Median method. With this method, the two clusters being combined are weighted equally in the computation of the centroid, regardless of the number of cases in each. This allows small groups to have an equal effect on the characterization of larger clusters into which they are merged.
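The difference in weighting between the centroid and median methods can be sketched with one large and one small cluster. The function `merged_centre`, the cluster sizes, and the centre coordinates are hypothetical.

```python
def merged_centre(centre_a, n_a, centre_b, n_b, method):
    """Centre of a merged cluster under the centroid or median method."""
    if method == "centroid":
        # Weights proportional to cluster size.
        w_a, w_b = n_a / (n_a + n_b), n_b / (n_a + n_b)
    elif method == "median":
        # Equal weights, regardless of the number of cases in each cluster.
        w_a = w_b = 0.5
    else:
        raise ValueError(f"unknown method: {method}")
    return tuple(w_a * a + w_b * b for a, b in zip(centre_a, centre_b))

# Hypothetical clusters: a large one with 3 cases and a single outlying case.
large_centre, large_n = (0.0, 0.0), 3
small_centre, small_n = (8.0, 4.0), 1

by_centroid = merged_centre(large_centre, large_n, small_centre, small_n, "centroid")
by_median = merged_centre(large_centre, large_n, small_centre, small_n, "median")

print(by_centroid)  # (2.0, 1.0): pulled only a quarter of the way towards the small cluster
print(by_median)    # (4.0, 2.0): exactly halfway between the two centres
```

Under the median method the single case moves the merged centre as far as the three-case cluster does, which is the equal effect the paragraph above describes.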

Source - http://www.norusis.com/pdf/SPC_v13.pdf

Author - Sampath

Group - Operations 3


