Monday, 29 August 2011

An Overview of Clustering

Agglomerative Hierarchical Clustering Overview

Agglomerative hierarchical clustering is a bottom-up clustering method where clusters have sub-clusters, which in turn have sub-clusters, and so on. The classic example of this is species taxonomy. Gene expression data might also exhibit this hierarchical quality (e.g. neurotransmitter gene families). Agglomerative hierarchical clustering starts with every object (gene or sample) in its own cluster. Then, in each successive iteration, it agglomerates (merges) the closest pair of clusters according to some similarity criterion, until all of the data is in one cluster.

The hierarchy within the final cluster has the following properties:

· Clusters generated in early stages are nested in those generated in later stages.

· Clusters with different sizes in the tree can be valuable for discovery.

A Matrix Tree Plot visually demonstrates the hierarchy within the final cluster, where each merger is represented as a node in a binary tree (a dendrogram).

Process

· Assign each object to a separate cluster.

· Evaluate all pair-wise distances between clusters.

· Construct a distance matrix using the distance values.

· Look for the pair of clusters with the shortest distance.

· Remove the pair from the matrix and merge them.

· Evaluate all distances from this new cluster to all other clusters, and update the matrix.

· Repeat until the distance matrix is reduced to a single element.
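
The merge loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, assuming Euclidean distance, average linkage, and a small made-up dataset; in practice a library routine (e.g. scipy.cluster.hierarchy) would normally be used instead.

import numpy as np

def agglomerate(points, k=1):
    # Start with every object in its own cluster (step 1 above).
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        # Evaluate all pairwise cluster distances (average linkage here).
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.mean([np.linalg.norm(points[i] - points[j])
                             for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Merge the closest pair and drop the absorbed cluster.
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

data = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
print(agglomerate(data, k=2))   # expected: [[0, 1], [2, 3]]

Each pass recomputes the full set of pairwise cluster distances, which is what makes the naive algorithm expensive on large datasets.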


Advantages


· It can produce an ordering of the objects, which may be informative for data display.

· Smaller clusters are generated, which may be helpful for discovery.


Disadvantages


· No provision is made for relocating objects that were 'incorrectly' grouped at an early stage, so the result should be examined closely to ensure it makes sense.

· Use of different distance metrics for measuring distances between clusters may generate different results. Performing multiple experiments and comparing the results is recommended to assess how robust the original results are.
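
To illustrate the second point, the short sketch below (using scipy on made-up 2-D data) runs the same dataset through three different linkage methods so the resulting groupings can be compared side by side.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two made-up groups of points in 2-D.
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(3, 0.5, (10, 2))])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method, metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, labels)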

K-Means Clustering

K-Means clustering generates a specific number of disjoint, flat (non-hierarchical) clusters. It is well suited to generating globular clusters. The K-Means method is numerical, unsupervised, non-deterministic and iterative.


K-Means Algorithm Properties

· There are always K clusters.

· There is always at least one item in each cluster.

· The clusters are non-hierarchical and they do not overlap.

· Every member of a cluster is closer to the center of its own cluster than to the center of any other cluster.


The K-Means Algorithm Process


· The dataset is partitioned into K clusters and the data points are randomly assigned to the clusters resulting in clusters that have roughly the same number of data points.

· For each data point:

· Calculate the distance from the data point to each cluster.

· If the data point is closest to its own cluster, leave it where it is. If the data point is not closest to its own cluster, move it into the closest cluster.

· Repeat the above step until a complete pass through all the data points results in no data point moving from one cluster to another. At this point the clusters are stable and the clustering process ends.

· The choice of initial partition can greatly affect the final clusters that result, in terms of inter-cluster and intra-cluster distances and cohesion.
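
A minimal sketch of this loop in Python is shown below, assuming Euclidean distance and randomly chosen initial centers on made-up data; production implementations (e.g. scikit-learn's KMeans) add smarter initialization and other safeguards.

import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # Pick k distinct points as the initial cluster centers (a simple random start).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each point to its closest center.
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        # Recompute each center as the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # stable: no point changed cluster
            break
        centers = new_centers
    return labels, centers

X = np.array([[0.0, 0.1], [0.2, 0.0], [4.9, 5.0], [5.1, 4.8]])
labels, centers = kmeans(X, k=2)
print(labels)   # e.g. [0 0 1 1] (numbering depends on the random start)

The loop stops once a full pass leaves every assignment unchanged, which matches the stability condition described above.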


Advantages to Using this Technique

· With a large number of variables, K-Means may be computationally faster than hierarchical clustering (if K is small).

· K-Means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.


Disadvantages to Using this Technique

· Difficulty in comparing the quality of the clusters produced (e.g. different initial partitions or values of K can affect the outcome).

· The number of clusters K is fixed in advance, and it can be difficult to predict what K should be.

· Does not work well with non-globular clusters.

· Different initial partitions can result in different final clusters. It is helpful to rerun the clustering with the same as well as different K values and compare the results achieved.
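
For example, one way to compare runs is to look at the within-cluster sum of squares (inertia) for several values of K. The sketch below uses scikit-learn on made-up data; inertia always decreases as K grows, so the usual heuristic is to look for an "elbow" rather than the minimum.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Three made-up groups of points in 2-D.
X = np.vstack([rng.normal(c, 0.4, (30, 2)) for c in (0, 3, 6)])

for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))   # within-cluster sum of squares for each K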


Source: http://www.improvedoutcomes.com


Written by: Afreen Khan

Group: Marketing1
