Monday, 29 August 2011

Clustering... The Hierarchical & K-Means Saga...!!!

Cluster analysis can be divided into hierarchical and non-hierarchical clustering techniques. Examples of hierarchical techniques are single linkage, complete linkage, average linkage, median, and Ward. Non-hierarchical techniques include k-means, adaptive k-means, k-medoids, and fuzzy clustering.

Which algorithm is best depends on the type of data available and the particular purpose of the analysis. More objectively, the stability of the resulting clusters can be investigated in simulation studies.

The problem of selecting the “best” algorithm/parameter setting is a difficult one. A good clustering algorithm ideally should produce groups with distinct non-overlapping boundaries, although a perfect separation cannot typically be achieved in practice. Figure of merit measures (indices) such as the silhouette width or the homogeneity index can be used to evaluate the quality of separation obtained using a clustering algorithm.
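As an illustration, the silhouette width can be computed directly once a clustering has been produced. The sketch below assumes scikit-learn is available and uses a synthetic dataset generated with make_blobs; neither the library nor the data is part of the original post.

# Sketch: silhouette width as a figure-of-merit for a clustering result (assumes scikit-learn).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic, illustrative data: 300 points drawn around 3 centres.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Silhouette width lies in [-1, 1]; values close to 1 indicate well-separated clusters.
print("silhouette width:", silhouette_score(X, labels))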

The concept of the stability of a clustering algorithm has also been studied as a validation approach. The idea behind it is that an algorithm should be rewarded for consistency.

K-means is an unsupervised clustering algorithm that classifies the input data into multiple clusters based on their distance from each other, using the specified distance metric. The points are clustered around centroids μi, i = 1, ..., k, which are obtained by minimizing the objective

V = Σ_{i=1}^{k} Σ_{xj ∈ Si} (xj − μi)²

where there are k clusters Si, i = 1, 2, ..., k, and μi is the mean point of all the points xj ∈ Si.
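As a concrete check of this objective, the short NumPy sketch below computes V for a given assignment of points to clusters; the function and variable names are illustrative only.

import numpy as np

def kmeans_objective(X, labels, centroids):
    """Within-cluster sum of squares: V = Σ_i Σ_{xj ∈ Si} (xj − μi)²."""
    # Subtract from each point the centroid of the cluster it belongs to.
    diffs = X - centroids[labels]
    return np.sum(diffs ** 2)

# Illustrative data: four 2-D points forming two tight groups.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == i].mean(axis=0) for i in range(2)])
print(kmeans_objective(X, labels, centroids))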

The algorithm has the following steps:

1. Compute the distribution of the intensities.

2. Choose k centroids randomly.

3. Repeat the following two steps until the cluster assignments no longer change:

4. Cluster the points based on the distance of their intensities from the centroid intensities:

c(i) := arg min_j ‖ x(i) − μj ‖²

5. Compute the new centroid (mean point) for each cluster:

μj := ( Σ_{i=1}^{m} 1{c(i) = j} x(i) ) / ( Σ_{i=1}^{m} 1{c(i) = j} )

where k is the number of clusters to be found, m is the number of data points, i iterates over all the intensities, j iterates over all the centroids, and μj are the centroid intensities.
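Putting the assignment and update steps together, a minimal Lloyd-style iteration can be sketched as follows; this assumes NumPy and general numeric data rather than image intensities, and the names are my own, not part of the original description.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch on a data matrix X of shape (m, d)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: choose k centroids randomly from the data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(n_iters):
        # Step 4: assign each point to its nearest centroid, c(i) = arg min_j ||x(i) - μj||².
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 3: stop once the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 5: recompute each centroid as the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

labels, centroids = kmeans([[0.0], [0.2], [5.0], [5.1]], k=2)
print(labels, centroids)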

Traditional K-Means:

Step 1: Accept the number of clusters to group data into and the dataset to cluster as input values.

Step 2: Initialize the first K clusters

- Take the first k instances, or

- Take a random sample of k elements

Step 3: Calculate the arithmetic mean of each cluster formed in the dataset.

Step 4: K-means assigns each record in the dataset to only one of the initial clusters

- Each record is assigned to the nearest cluster using a measure of distance (e.g., Euclidean distance).

Step 5: K-means re-assigns each record in the dataset to the most similar cluster and re-calculates the arithmetic mean of all the clusters in the dataset.
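The same procedure is available off the shelf; the snippet below is an illustrative usage sketch with scikit-learn (assumed installed) on synthetic records that are not from the original post.

from sklearn.cluster import KMeans
import numpy as np

# Illustrative dataset: two obvious groups of 2-D records.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# n_clusters is the K accepted in Step 1; init='random' mirrors the random sampling of Step 2.
model = KMeans(n_clusters=2, init='random', n_init=10, random_state=0).fit(X)

print("cluster assignments:", model.labels_)        # one cluster per record (Steps 4-5)
print("final cluster means:", model.cluster_centers_)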

Hierarchical Clustering:

Given a set of N items to be clustered, and an N*N distance (or similarity) matrix, the basic process of hierarchical clustering is:

  1. Start by assigning each item to a cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
  2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one cluster less.
  3. Compute distances (similarities) between the new cluster and each of the old clusters.
  4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

Step 3 can be done in different ways, which is what distinguishes single-linkage from complete-linkage and average-linkage clustering.
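The whole procedure can be sketched generically, with the linkage rule of step 3 passed in as a function; this is a minimal, illustrative implementation over a precomputed distance matrix, and the names are not from the original post.

import numpy as np

def agglomerative(D, linkage=min):
    """Agglomerative clustering sketch on an N x N distance matrix D.

    `linkage` combines the pairwise item distances between two clusters:
    min -> single linkage, max -> complete linkage,
    lambda ds: sum(ds) / len(ds) -> average linkage.
    Returns the sequence of merges as (members_a, members_b, distance).
    """
    # Step 1: every item starts in its own cluster.
    clusters = {i: [i] for i in range(len(D))}
    merges = []
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters under the chosen linkage rule.
        best = None
        for a in clusters:
            for b in clusters:
                if a < b:
                    d = linkage([D[i][j] for i in clusters[a] for j in clusters[b]])
                    if best is None or d < best[2]:
                        best = (a, b, d)
        a, b, d = best
        merges.append((clusters[a], clusters[b], d))
        # Step 3 happens implicitly: cluster-to-cluster distances are recomputed
        # from D on the next pass through the loop.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

D = np.array([[0, 1, 5, 6],
              [1, 0, 4, 7],
              [5, 4, 0, 2],
              [6, 7, 2, 0]], dtype=float)
print(agglomerative(D, linkage=min))   # single linkage
print(agglomerative(D, linkage=max))   # complete linkage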

In single-linkage clustering (also called the connectedness or minimum method), we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster.

If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster.

In complete-linkage clustering (also called the diameter or maximum method), we consider the distance between one cluster and another cluster to be equal to the greatest distance from any member of one cluster to any member of the other cluster.

In average-linkage clustering, we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.

A variation on average-link clustering is the UCLUS method, which uses the median distance; the median is much more outlier-proof than the average distance.
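In practice these criteria are available in SciPy's hierarchical clustering routines; the following sketch assumes SciPy is installed and uses an illustrative random dataset. Note that SciPy's 'median' method is median (WPGMC) linkage, which is related to, but not the same as, UCLUS's use of median pairwise distances.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))               # illustrative data: 10 points in 2-D
D = pdist(X)                               # condensed Euclidean distance matrix

# 'single', 'complete', and 'average' correspond to the linkage rules described above.
for method in ("single", "complete", "average", "median"):
    Z = linkage(D, method=method)          # the merge sequence (steps 2-4 above)
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
    print(method, labels)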

This kind of hierarchical clustering is called agglomerative because it merges clusters iteratively. There is also divisive hierarchical clustering, which does the reverse by starting with all objects in one cluster and subdividing it into smaller pieces. Divisive methods are less widely available and have rarely been applied.

By Shruthi Mylaram

Group 2 – Ops
