Monday, 29 August 2011

Cluster Analysis Helps Link Up Diapers with Alcohol!! (Cluster Analysis + Data Mining + Retail)

After going through 6 marathon lectures on business analytics, most of which focussed on cluster analysis and its applications, I was reminded of one of the subjects I studied back in my engineering days: data mining and warehousing. There were loads of other concepts I studied under the same topic, and something related to ROLAP and MOLAP was the one I found most closely related to cluster analysis. Both work on queries to generate the results a user requires. The user may be an engineer or someone working for a retail company.

Now, where I found data mining to be very closely linked to cluster analysis was in the way it works. A highly specialised field with extremely complex algorithms, it normally deals with large amounts of data, none of which makes any sense at first glance. There is also usually no fixed agenda when, say, the data of a retail outlet is studied via data mining. What it aims to achieve is to identify relationships between attributes which can be exploited to boost sales and improve the operational efficiency of retail outlets. As an example, in a study conducted at a retail store chain in the USA, it was found that on Saturdays married couples purchased beer and diapers together in a lot of cases. Though the relationship was not understood at first, further analysis suggested that these young married couples drink a lot on the weekend and, knowing they might forget to attend to their babies, buy diapers as a precautionary measure. Now how would any algorithm identify such a pattern from a raw set of data? The answer, as I understood in these past few lectures, is through cluster analysis. Data is fed into processing systems, wherein the algorithm (which works on cluster analysis techniques) groups together various attributes (visualised via dendrograms). It does so for all permutations and combinations of the various attributes until it identifies clusters which show a relationship or exhibit strong dependence on each other.
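The beer-and-diapers story is classically an association (market-basket) finding rather than clustering as such, but the first computational step is the same in spirit: measure how strongly attributes go together. A minimal Python sketch, with made-up basket data, that counts how often each pair of items is purchased together:

```python
from collections import Counter
from itertools import combinations

def pair_counts(transactions):
    """Count how often each unordered pair of items appears in the same basket."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# Hypothetical Saturday baskets (illustrative only)
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
]
counts = pair_counts(baskets)
# ("beer", "diapers") co-occurs in 3 of the 4 baskets
```

A real retail system would then rank such pairs by support and confidence; clustering enters when customers or stores are grouped by these attribute profiles.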

Once such relations are identified for a store chain in the retail sector, they can be used to plan customer schemes that boost sales for the same set of products. They also help in re-planning the store layout to keep together products which sell together and sell a lot, and in identifying non-complementary products which, for whatever reason, have a strong connection of being sold together. Thus there are numerous applications of cluster analysis, be it for the retail industry or for medical purposes wherein genes and DNA are studied; what matters in the end is how an analyst interprets the results and implements them commercially for the benefit of all. Merely finding clusters is not the end of the job; the most important task is to give them shape and form so that they can be understood by all.

Thanks !!

Submitted by:-

Kartik Arora

Roll No. 13140


Group 4


“Where is the Life we have lost in living?

Where is the wisdom we have lost in knowledge?

Where is the knowledge we have lost in information?”

- T. S. Eliot,

Choruses from the Rock

(1888 – 1965)

Today we have been taught about Cluster Analysis. I never knew the far-reaching applications of something like ‘cluster analysis’ taught in a b-school.

This is what my understanding of Cluster Analysis is:

Cluster analysis is a collection of statistical methods which identifies groups of samples that behave similarly or show similar characteristics. It is used to reduce the complexity of data: it generates groups that are homogeneous within themselves and as heterogeneous as possible from other groups. The data usually consist of objects or persons, and the segmentation is based on more than two variables.

Examples of datasets used for cluster analysis:

- Socio-economic criteria: income, education, profession, age, number of children, size of city of residence

- Psychographic criteria: interest, life style, motivation, values, involvement

- Criteria linked to the buying behaviour: price range, type of media used, intensity of use, choice of retail outlet, fidelity, buyer/non-buyer, buying intensity

Interesting Applications: Mapping Crime

A hot spot is a condition indicating some form of clustering in a spatial distribution. However, not all clusters are hot spots because the environments that help generate crime—the places where people are—also tend to be clusters. Hot spots are small places in which the occurrence of crime is so frequent that it is highly predictable, at least over a 1-year period.

Cluster analysis methods depend on the proximity of incident points. Typically, an arbitrary starting point ("seed") is established. This seed point could be the center of the map. The program then finds the data point statistically farthest from there and makes that point the second seed, thus dividing the data points into two groups. Then distances from each seed to other points are repeatedly calculated, and clusters based on new seeds are developed so that the sums of within-cluster distances are minimized. The figure shown below illustrates hot spots derived from the Spatial and Temporal Analysis of Crime (STAC) method, which performs the functions of radial search and identification of events concentrated in a given area (Levine, 1996).
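The seed-splitting idea described here can be sketched in a few lines of Python. This is not the actual STAC implementation, just a toy illustration of the farthest-point step, with hypothetical incident coordinates:

```python
import math

def farthest_point_split(points, first_seed):
    """Make the point farthest from the first seed the second seed,
    then split the points into two groups by nearest seed."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    second_seed = max(points, key=lambda p: dist(p, first_seed))
    groups = ([], [])
    for p in points:
        nearer = 0 if dist(p, first_seed) <= dist(p, second_seed) else 1
        groups[nearer].append(p)
    return second_seed, groups

# Hypothetical incident coordinates
incidents = [(0, 0), (1, 0), (0, 1), (9, 9), (10, 9)]
seed2, (group1, group2) = farthest_point_split(incidents, (0, 0))
# seed2 is the incident farthest from the starting seed
```

A full hot-spot method would then repeat this splitting and re-seeding until within-cluster distances stop shrinking.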


I never thought that cluster analysis could have such unimaginable and practical applications in such unexpected fields. I hope we will be able to apply these techniques and methods efficiently in various tasks.

Posted By –

Kunal Mandpe

Marketing 6

Roll No: 13084

Basics of Hierarchical Clustering

Considering the fact that blogging does not really interest me, the very thought of it,
that too as an assignment, viewable to the whole BA batch, makes me nervous.
And as I was looking for a topic to write on (based on what I understood in the class),
the idea of reflecting back on the basics of hierarchical clustering struck my mind.
So let's begin with the baby steps to understanding hierarchical clustering.

Hierarchical clustering is typically used when the number of cases is small (say, fewer than 50), while K-means is preferred for larger datasets.

If, for example, a set of 9 items is to be clustered, with a 9×9 distance matrix, then we should follow the steps given below:

a). Assign each item to its own cluster, such that if you have 9 items, you now have 9 clusters,
each containing just one item. Let the distances between the clusters equal the distances between the items they contain.

b). Find the closest (most similar) pair of clusters and merge them into a single cluster.

c). Compute distances (similarities) between the new cluster and each of the old clusters.

d). Repeat steps b and c until all items are clustered into a single cluster of size 9.

Computation of distances can be done in many ways, viz. single-link, complete-link or average-link clustering.

Single-linkage: the distance between the two closest elements in the two clusters.
Complete-linkage: the longest distance from any member of one cluster to any member of the other cluster.
Average-linkage: the average distance from any member of one cluster to any member of the other cluster.
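The steps a–d above, together with the three linkage criteria, can be sketched in plain Python. This is a toy illustration, not an optimized implementation (real tools update a distance matrix rather than recomputing all pairwise distances at every merge):

```python
def euclidean(a, b):
    """Straight-line distance between two items (tuples of coordinates)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster_distance(c1, c2, linkage="average"):
    """Distance between two clusters under the chosen linkage criterion."""
    dists = [euclidean(a, b) for a in c1 for b in c2]
    if linkage == "single":
        return min(dists)               # closest pair
    if linkage == "complete":
        return max(dists)               # farthest pair
    return sum(dists) / len(dists)      # average over all pairs

def agglomerative(items, linkage="average"):
    """Steps a-d: start with singleton clusters, repeatedly merge the
    closest pair until one cluster remains; return the merge history."""
    clusters = [[it] for it in items]   # step a: one cluster per item
    merges = []
    while len(clusters) > 1:
        # step b: find the closest (most similar) pair of clusters
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], linkage),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # step c/d: rebuild the cluster list and repeat
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Four 1-D items: the two near pairs merge first
history = agglomerative([(0,), (1,), (5,), (6,)], linkage="single")
```

The `merges` list plays the role of the agglomeration schedule; drawing it as a tree gives the dendrogram.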

The agglomeration schedule coefficients depend upon the distances between the merged clusters. In the dendrogram, we use
average-linkage clustering.
The cut-off line is drawn where the distance between successive merges is the longest. We then make use
of crosstabs, the proximity matrix, boxplots (used to find outliers), etc.
Hope this adds to your understanding of hierarchical clustering.

Group- HR1
Author- Tage Otung

To Infinity and Beyond....

Having missed the first session of Business Analytics, and the mighty scare received from the classmates, meant just one thing: panic! On top of that, a group member informed me about the blogging idea Sir had bamboozled us with! Ohh eMm Gee!!! At my group mate's suggestion I volunteered to write the next day (the earlier the better, or in my case, safer), and hence ended up trying to hang on to every word of the prof and (trying) to take copious notes.

Midway through the first lecture, I had my eureka moment: ‘This isn’t difficult’. At all. Fun, actually. Ha! What’s a bunch of numbers to someone who had come out alive and unscathed (yes!) from the dangerous waters of engineering!

I digress. So when I got around to discussing with dad the cool new subject we were learning, I realised that we have all this new fancy software to crunch and spew numbers in a matter of seconds. But how much of it actually makes sense? Okay, so more men are using the prepaid connection of a certain telecom service provider than women studying in class XI in a particular locality who would rather hear some mellifluous jazz than heavy metal. You get my drift ;)

But, SO WHAT??

You have a lot of answers but no solutions. Like having terabytes of memory but no imagination! Now what use would that be? And ah! It does feel glorious to step in as an almost-MBA brandishing the shiny new sword of strategic implication. To the rescue...

All this data and results you finally come up with and those eerily inviting diagrams with strangely twisted forks and twinkly stars do have to eventually get you somewhere.

My friends here have already written ballads in love of dendrograms, agglomeration schedules, proximity matrices et al., and it would be unjust to even start to believe I could do a better job.

As I write this post, I am wondering about the academic usefulness of it and frankly, I am as stumped as you are. But you see, it is an assignment (or so I have been led to believe) so I have to justify the academic relevance of it. I did come up with so much gyan already; a little bit more won’t hurt now, will it?

As you do plough on through this post, and thank you for that, I hope you realise the importance of sifting through this hillock of data. So mid way through another session of BA, if you put up your hands in sheer frustration and decide to call it quits (away from the ever watchful eyes of the prof) do remember this one thing,

“Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.”

So goes a quote popularly attributed to Albert Einstein. But then again, our cradle is, after all, the birthplace of Business Leaders. So fight it out, friends, soldiers and countrymen. And may you dazzle the world with your Analytics in Business.

P.S.- You see, my absence from the class when Sir discussed the blog writing is my plea of innocence in case of non-conformity of this blog post and its silly title.

Group :Marketing 3

Author of the Article : Aruna Iyer

Hierarchical clustering

This method of clustering identifies relatively homogeneous groups of cases (or variables) based on selected characteristics, using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left. The hierarchy of clusters created may be represented in a tree structure called a dendrogram. The root of the tree consists of a single cluster containing all observations, and the leaves correspond to individual observations.

Distance or similarity measures are generated by the Proximities procedure.


Algorithms for hierarchical clustering are generally either agglomerative, in which one starts at the leaves and successively merges clusters together; or divisive, in which one starts at the root and recursively splits the clusters.

Any non-negative-valued function may be used as a measure of similarity between pairs of observations. The choice of which clusters to merge or split is determined by a linkage criterion, which is a function of the pairwise distances between observations.
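As a small illustration of what the Proximities step produces, here is a sketch that builds a symmetric matrix of pairwise Euclidean distances; the observations are made up:

```python
def euclidean(a, b):
    """Straight-line distance between two observations."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def proximity_matrix(observations):
    """Symmetric matrix of pairwise Euclidean distances."""
    n = len(observations)
    return [[euclidean(observations[i], observations[j]) for j in range(n)]
            for i in range(n)]

obs = [(0, 0), (3, 4), (0, 8)]
m = proximity_matrix(obs)
# m[0][1] and m[1][2] are 5.0; m[0][2] is 8.0; the diagonal is 0.0
```

The linkage criterion then operates on exactly these pairwise distances when deciding which clusters to merge or split.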


· The variables can be quantitative, binary, or count data.

· Scaling of variables is an important issue: differences in scaling may affect your cluster solution(s).

· If your variables have large differences in scaling (for example, one variable is measured in dollars and the other is measured in years), you should consider standardizing them (this can be done automatically by the Hierarchical Cluster Analysis procedure).

· You should include all relevant variables in your analysis. Omission of influential variables can result in a misleading solution.

Example :

(Figure: the data to be clustered, and the resulting hierarchical clustering dendrogram)

K- Means clustering

Relatively homogeneous groups of cases are identified based on selected characteristics. An algorithm is used which requires one to specify the number of clusters. You can select one of two methods for classifying cases, either updating cluster centers iteratively or classifying only.


The k-means algorithm assigns each point to the cluster whose center (also called the centroid) is nearest. The center is the average of all the points in the cluster: its coordinates are the arithmetic mean for each dimension, taken separately over all the points in the cluster.

Algorithm steps:

· Choose the number of clusters, k.

· Randomly generate k clusters and determine the cluster centers, or directly generate k random points as cluster centers.

· Assign each point to the nearest cluster center, where "nearest" is defined with respect to one of the distance measures discussed above.

· Recompute the new cluster centers.

· Repeat the two previous steps until some convergence criterion is met (usually that the assignment hasn't changed).
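The algorithm steps listed above can be sketched in plain Python. This is a toy version of Lloyd's algorithm, not SPSS's implementation; the sample points are hypothetical:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: pick k random centers, then alternate
    assignment and centroid-update until assignments stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean distance)
        new_assignment = [
            min(range(k), key=lambda c: sum((p[d] - centers[c][d]) ** 2
                                            for d in range(len(p))))
            for p in points
        ]
        if new_assignment == assignment:     # convergence: no change
            break
        assignment = new_assignment
        # recompute each center as the per-dimension mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(p[d] for p in members) / len(members)
                                   for d in range(len(members[0])))
    return centers, assignment

# Two obvious groups of 2-D points
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(pts, 2)
```

Because the initial centers are random, different seeds can give different clusterings, which is exactly the instability noted below.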

Data :

· Variables should be quantitative at the interval or ratio level. If your variables are binary or counts, use the Hierarchical Cluster Analysis procedure.

· Distances are computed using simple Euclidean distance. If you want to use another distance or similarity measure, use the Hierarchical Cluster Analysis procedure.

· Scaling of variables is an important consideration. If your variables are measured on different scales (for example, one variable is expressed in dollars and another variable is expressed in years), your results may be misleading. In such cases, you should consider standardizing your variables before you perform the k-means cluster analysis.

· The procedure assumes that you have selected the appropriate number of clusters and that you have included all relevant variables. If you have chosen an inappropriate number of clusters or omitted important variables, your results may be misleading.
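The standardization advice above amounts to converting each variable to z-scores. A minimal sketch (note: this uses the population standard deviation; SPSS's Z-score option may use the sample formula instead):

```python
def standardize(values):
    """Z-scores: subtract the mean, divide by the (population) std dev."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

# Hypothetical variables on very different scales
dollars = [100.0, 200.0, 300.0]
years = [1.0, 2.0, 3.0]
# After standardization, both variables have mean 0 and the same spread,
# so neither dominates the Euclidean distances used by k-means.
```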


The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets.


Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.

Box Plot

A box plot displays the distribution of a scale variable and pinpoints outliers. We can create a 2-D box plot that is summarized for each category of a categorical variable, or a 1-D box plot that is summarized over all cases in the data.

We can create box plots by clicking on Graphs > Legacy Dialogs > Box Plot.

We can choose either the Simple or the Clustered option depending on the kind of box plot we want. Also, this could be shown as summaries for groups of cases or for separate variables, as the case may be. Next, we move the variable from the left panel to the right one. Clicking on "OK" gives us the box plot.

The top of the box represents the 75th percentile, the bottom of the box represents the 25th percentile, and the line in the middle represents the 50th percentile. The whiskers (the lines that extend out the top and bottom of the box) represent the highest and lowest values that are not outliers or extreme values. Outliers and extreme values are represented by circles beyond the whiskers.
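The percentile-and-whisker description above can be computed directly. A sketch using one common quartile convention (median of each half) and the 1.5 × IQR outlier rule; SPSS's exact percentile method may differ slightly:

```python
def quartiles(sorted_vals):
    """Median-of-halves quartiles (one common convention; SPSS may differ)."""
    def median(v):
        n = len(v)
        mid = n // 2
        return v[mid] if n % 2 else (v[mid - 1] + v[mid]) / 2
    half = len(sorted_vals) // 2
    return (median(sorted_vals[:half]),     # 25th percentile
            median(sorted_vals),            # 50th percentile
            median(sorted_vals[-half:]))    # 75th percentile

def boxplot_stats(values):
    """Box, whiskers and outliers under Tukey's 1.5*IQR rule."""
    vals = sorted(values)
    q1, q2, q3 = quartiles(vals)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inliers = [v for v in vals if lo <= v <= hi]
    return {"q1": q1, "median": q2, "q3": q3,
            "whisker_low": min(inliers), "whisker_high": max(inliers),
            "outliers": [v for v in vals if v < lo or v > hi]}

# One extreme value: 50 falls beyond the upper whisker
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
```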


Group : Marketing 3

Author of the Article: Sahil Kotru (13099)

k-Means Clustering - Views of a Marketer!

Of k-Means Clustering
Today we will talk about k-Means clustering: what it is and where the term came from.
As students of marketing, we will also try to understand a little about how k-Means clustering can be used in marketing.

The term "k-means" was first used by James MacQueen in 1967, though the idea goes back to Hugo Steinhaus in 1957. The standard algorithm was first proposed by Stuart Lloyd in 1957 as a technique for pulse-code modulation, though it wasn't published until 1982.
In statistics and data mining, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. Its main advantages are that it is simple and the most popular method of partitioning data.
To start with, today was my second class in Business Analytics on SPSS... not to mention how I dreaded this class even before it started, due to my weakness with numbers. I never realized that SPSS could be so simple a tool to handle, even though it is loaded with various resources for analysis, and one of them is what I am going to talk about.

The benefits of k-Means clustering to a marketing student span various fields:
Retail: can be used to cluster similar merchandise.
Market Research: can be used to cluster similar data, like demographics, etc.
Academics: the k-Means clustering algorithm can be used for prediction of students' academic performance.

At the end I would like to say that k-means is a very convenient way to cluster a large amount of data, and is also helpful to a marketer.

Author: Archit Tamakhuwala
Group Name: Marketing 2

Source: Wikipedia

Best Use of Cluster Analysis - An Enthusiast's Viewpoint

How can cluster analysis be best used?
The objective of cluster analysis is to assign observations to groups (clusters) so that observations within each group are similar to one another with respect to variables or attributes of interest and the groups themselves stand apart from one another. In other words, the objective is to divide the observations into homogeneous and distinct groups. In contrast to the classification problem where each observation is known to belong to one of a number of groups and the objective is to predict the group to which a new observation belongs, cluster analysis seeks to discover the number and composition of the groups.
Before we do a clustering of the data, we must understand the objectives of the exercise. Or else, at the end of everything, we will be left with loads of charts and graphs and nothing valuable to decipher from all that data. So we must understand why we need to cluster information under one head and what value can be added by correlating the different clusters that we make. For example, if we group the features and attributes of a product such as a mobile phone (SMS, Wi-Fi, 3G, alarm, clock, radio, music player, torch, etc.), then we might be able to tell what we should focus on, depending on customer preferences and dislikes. We can tell whether any feature is underutilised in any of the groups of people, whether some features add no value to the customer even though he has to pay for them, and whether a bundling of features can be offered to customers that will make their mobile experience more useful, time-saving and convenient.
Uses of cluster analysis in marketing
The primary use of cluster analysis in marketing has been for market segmentation, and market segmentation has now become an important tool for both academic research and applied marketing. Today the marketing world is equipped with hundreds of market segmentation techniques, and most of these techniques, instead of bringing simplicity, confuse the marketer. These techniques have served to shift researchers' discussions away from more substantive issues and toward meta-research directed at integrating market segmentation research. One of the major areas of future research should be the evaluation of the conditions under which various data-analytical techniques are most appropriate.
All segmentation research, regardless of the method used, is designed to identify groups of entities (people, markets, organizations) that share certain common characteristics (attitudes, purchase propensities, media habits, etc.). Stripped of the specific data employed and the details of the purposes of a particular study, segmentation becomes a grouping task.
It has been noticed that researchers tend to select grouping methods largely on the basis of familiarity, availability, and cost rather than on the basis of the methods’ characteristics and appropriateness. These practices can be attributed to the lack of research on similarity measures, grouping (clustering) algorithms, and effects of various data transformations.
A second and equally important use of cluster analysis has been in seeking a better understanding of buyer behaviour by identifying homogeneous groups of buyers. Cluster analysis has been less frequently applied to this type of theory-building problem, possibly because of theorists' discomfort with a set of procedures which appear ad hoc; yet cluster analysis is one means of developing taxonomies of buyers.
Cluster analysis has been employed in the development of potential new product opportunities. By clustering brands/products, competitive sets within the larger market structure can be determined. Thus, a firm can examine its current offerings vis-à-vis those of its competitors. The firm can determine the extent to which a current or potential product offering is uniquely positioned or is in a competitive set with other products. Although cluster analysis has not been used frequently in such applications, largely because of the availability of other techniques such as multidimensional scaling, it is not uncommon to find cluster analysis used as an adjunct to these other techniques. Cluster analysis has also been suggested as an alternative to factor and discriminant analysis; in such applications cluster analysis would not be used as a classification technique, and the analyst would face a different set of issues from those addressed here.
Cluster analysis has also been employed by several researchers in the problem of test market selection. Such applications are concerned with the identification of relatively homogeneous sets of test markets which may become interchangeable in test market studies. The identification of such homogeneous sets allows generalization of the results obtained in one test market to other test markets in the same cluster, thereby reducing the number of test markets required.
Finally, cluster analysis has been used as a general data reduction technique, to develop aggregates of data which are more general and more easily managed than individual observations. For example, limits on the number of observations that can be used in multidimensional scaling programs often necessitate an initial clustering of observations; homogeneous clusters then become the unit of analysis for the multidimensional scaling procedure.
The lack of specificity about the clustering method in some marketing studies points to the problems associated with the use of cluster analysis. The lack of detailed reporting suggests either ignorance of, or a lack of concern for, the important parameters of the clustering method used. Failure to provide specific information about the method also tends to inhibit replication and provides little guidance for other researchers who might seek an appropriate method of cluster analysis. Use of specific program names rather than the more general algorithm names impedes inter-study comparisons. This situation suggests a need for a sound review of clustering methodology for the market researcher. Though some reviews mention empirical work on the characteristics of these measures and algorithms, they are primarily catalogues of techniques and marketing applications; relatively little guidance is provided to the researcher who is seeking to discover the characteristics and limitations of various grouping procedures.

Chayan Ray

K-means clustering

K-means clustering is a partitioning (or top-down) clustering method: an algorithm for clustering data that lets the user choose the number of clusters to form from a given data set, grouping items based on some similarity measure.


· Randomly split the data into k groups of equal number of variables/data

· Calculate the centroid (centre point) of each group

· Reassign each variable to the centroid to which it is most similar

· Calculate a new centroid for each group, reassign variables etc

· This iteration goes on until the centroids are stable (the centroids remain constant)


· With large data, K-Means is faster than hierarchical clustering (if the number of clusters is small).

· Data reduction is accomplished by replacing the coordinates of each point in a cluster with the coordinates of that cluster’s center point (centroid). Handling data becomes easy now.


· One must specify the number of clusters as an input. The model is not capable of determining the appropriate number of clusters and depends upon user input. The method tries to establish the centers of clusters from the initial data set; if the data is very random and the centers are not stable, every run will give different results.

Practical examples

1) Operations

An FMCG company is looking at entering India on a big scale. Apart from getting its own brands into India, it is looking at setting up an entire supply chain network all over India. Setting up warehouses in each city wouldn't make sense, as it might just add to costs without additional benefits. The K-means framework helps in choosing optimal locations for its warehouses: the number of clusters, k, is the number of warehouses it chooses to set up in India. The clustering is done based on the average distances of locations from the centroid of each cluster. The company can set up a warehouse at each centroid location, which could then cater to the other locations in its cluster.

2) Banking & financial services

A bank is looking at analysing customers in the corporate world to make decisions on whom to target, how much to offer and the loan rate. The bank is looking at grouping companies in terms of the strength of the business and the risk profile of the company, so that the treatment can be the same within each group. The K-means algorithm can be introduced here, based on certain financial parameters/ratios (like cash ratio, inventory turnover, ROA, ROE, etc.). K-means clustering can be used to find the ideal company, low on risk and high on strength, and then create clusters of companies around it. The companies in the cluster of the ideal company can be targeted for loans, with the same loan amount and rate; companies in other clusters could get different rates based on their risk profiles.

3) Marketing

A marketing company is looking at conducting market research on its new product, which is expected to have a pan-India launch soon. Interviewing the entire universe of the population is not feasible; instead, it seeks responses from each group based on certain characteristics (age, gender, ethnicity, religion, etc.). K-means cluster analysis attempts to identify relatively similar groups of respondents based on selected characteristics, using a method that can handle large numbers of respondents.