Monday 29 August 2011

Hierarchical clustering

This method of clustering identifies relatively homogeneous groups of cases (or variables) based on selected characteristics, using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left. The hierarchy of clusters created may be represented in a tree structure called a dendrogram. The root of the tree consists of a single cluster containing all observations, and the leaves correspond to individual observations.

Distance or similarity measures are generated by the Proximities procedure.

Algorithm

Algorithms for hierarchical clustering are generally either agglomerative, in which one starts at the leaves and successively merges clusters together; or divisive, in which one starts at the root and recursively splits the clusters.

Any non-negative-valued function may be used as a measure of distance (dissimilarity) between pairs of observations. The choice of which clusters to merge or split is determined by a linkage criterion, which expresses the dissimilarity between clusters as a function of the pairwise distances between their observations.
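To make the idea concrete, here is a minimal sketch of agglomerative clustering in Python with SciPy (not the SPSS procedure itself); the toy data, the Euclidean metric, Ward's linkage, and the three-cluster cut are all illustrative assumptions.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Hypothetical toy data: six observations measured on two quantitative variables.
X = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 8.0],
              [5.1, 7.8], [9.0, 1.0], [9.2, 1.1]])

# Agglomerative clustering: start with each observation in its own cluster and
# repeatedly merge the two closest clusters, as judged by the linkage criterion.
Z = linkage(X, method="ward", metric="euclidean")

# Cut the resulting tree into, say, three flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# dendrogram(Z) draws the tree (the dendrogram) when a plotting backend is available.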

Data

· The variables can be quantitative, binary, or count data.

· Scaling of variables is an important issue: differences in scaling may affect your cluster solution(s).

· If your variables have large differences in scaling (for example, one variable is measured in dollars and the other is measured in years), you should consider standardizing them; the Hierarchical Cluster Analysis procedure can do this automatically, and a small standardization sketch appears after this list.

· You should include all relevant variables in your analysis. Omission of influential variables can result in a misleading solution.
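For the standardization point above, here is a minimal sketch assuming simple z-scores computed with NumPy; in SPSS itself the Hierarchical Cluster Analysis dialog can standardize for you, so the code and data are only illustrative.

import numpy as np

# Hypothetical data: the first column is in dollars, the second in years.
X = np.array([[45000.0, 3.0],
              [52000.0, 10.0],
              [39000.0, 1.0],
              [61000.0, 7.0]])

# Z-score standardization: give every variable mean 0 and standard deviation 1
# so that the dollar variable does not dominate the distances by sheer scale.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.round(2))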

Example:

[Figures: the data to be clustered and the resulting hierarchical clustering dendrogram]

K-means clustering

Relatively homogeneous groups of cases are identified based on selected characteristics, using an algorithm that requires you to specify the number of clusters in advance. You can select one of two methods for classifying cases: updating cluster centers iteratively or classifying only.

Algorithm

The k-means algorithm assigns each point to the cluster whose center (also called the centroid) is nearest. The center is the average of all the points in the cluster; that is, its coordinates are the arithmetic mean, for each dimension separately, over all the points in the cluster.

Algorithm steps (a minimal code sketch of these steps follows the list):

· Choose the number of clusters, k.

· Randomly generate k clusters and determine the cluster centers, or directly generate k random points as cluster centers.

· Assign each point to the nearest cluster center, where "nearest" is defined with respect to a distance measure, typically simple Euclidean distance (as noted under Data below).

· Recompute the new cluster centers.

· Repeat the two previous steps until some convergence criterion is met (usually that the assignment hasn't changed).
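The steps above can be written out directly. The following is a minimal sketch in plain NumPy with hypothetical toy data and a fixed random seed, not the SPSS K-Means Cluster procedure itself.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Choose k of the points at random as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Assign each point to the nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop once the assignment no longer changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recompute each center as the mean of the points assigned to it.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 8.5], [9.0, 1.0]])
labels, centers = kmeans(X, k=2)
print(labels)

Because the initial centers are chosen at random, different seeds can produce different clusterings, which is the disadvantage noted below.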

Data:

· Variables should be quantitative at the interval or ratio level. If your variables are binary or counts, use the Hierarchical Cluster Analysis procedure.

· Distances are computed using simple Euclidean distance. If you want to use another distance or similarity measure, use the Hierarchical Cluster Analysis procedure.

· Scaling of variables is an important consideration. If your variables are measured on different scales (for example, one variable is expressed in dollars and another variable is expressed in years), your results may be misleading. In such cases, you should consider standardizing your variables before you perform the k-means cluster analysis.

· The procedure assumes that you have selected the appropriate number of clusters and that you have included all relevant variables. If you have chosen an inappropriate number of clusters or omitted important variables, your results may be misleading.

Advantage

The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets.

Disadvantage

Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.

Box Plot

A box plot displays the distribution of a scale variable and pinpoints outliers. We can create a 2-D box plot that is summarized for each category of a categorical variable, or a 1-D box plot that is summarized over all cases in the data.

We can create box plots by clicking on Graphs → Legacy Dialogs → Boxplot

We can choose either the Simple or the Clustered option depending on the kind of box plot we want, and the summaries can be for groups of cases or for separate variables, as appropriate. Next, we move the variable from the left panel to the right one. Clicking "OK" gives us the box plot, which looks something like the one described below.

The top of the box represents the 75th percentile, the bottom of the box represents the 25th percentile, and the line in the middle represents the 50th percentile. The whiskers (the lines that extend out the top and bottom of the box) represent the highest and lowest values that are not outliers or extreme values. Outliers and extreme values are represented by circles beyond the whiskers.
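Outside SPSS, the same kind of plot can be drawn with matplotlib; this is only a sketch with made-up sample values.

import matplotlib.pyplot as plt

# Hypothetical sample; the value 40 is far enough from the rest to plot as an outlier.
scores = [12, 15, 16, 18, 18, 19, 20, 21, 22, 24, 40]

fig, ax = plt.subplots()
# The box spans the 25th to 75th percentiles, the line inside marks the median,
# the whiskers reach the most extreme non-outlier values, and outliers appear
# as individual points beyond the whiskers.
ax.boxplot(scores)
ax.set_ylabel("score")
plt.show()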

Sources:

http://academic.udayton.edu/gregelvers/psy216/spss/graphs.htm

http://en.wikipedia.org

http://www.kovcomp.co.uk/support/XL-Tut/demo-cluster2.html

Group: Marketing 3

Author of the Article: Sahil Kotru (13099)
