## Sunday, 28 August 2011

### An afterthought about Crosstabs ....

It is of paramount importance to reduce the complexity of data. Data is present in abundance, but it becomes useful only once we segregate it and bring it into a form that can be analysed.

Analyzing data from online surveys is the toughest part of the game.

Frequency analysis provides broad answers to the questions put across in the survey. Some basic questions include:

What % of respondents are below 18 years old?

What is the average height of children who will test the product sample?

What is the average age of the people using public transport?

Cross tabulation analysis or crosstab gives greater meaning to the data. Some basic questions include:

What % of males took the test in the last year?

Are school children more likely to visit the new ice cream parlour than grown-ups?

What % of cricketers tested positive for the banned drug in the last series?

What is a Cross Tab Analysis?
- A cross tabulation analysis is useful for showing how respondents answered on two or more questions at the same time.
- Shows a distribution between two variables
- Usually presented as a matrix in the form of a table

Why Would You Use It?
- Easy to understand and draw quick conclusions
- Tables can provide greater insight than single statistics.
- They are simple to create with Checkbox Survey.

As a basic rule, the control group or independent variable (such as age, gender, education, etc.) goes on the X-axis, while the dependent variable or group under study goes on the Y-axis.

Example Cross Tab:

| | Ages 20-29 | Ages 30-39 | Ages 40-49 |
|---|---|---|---|
| Read publications online | 84% | 61% | 36% |
| Pay bills online | 65% | 42% | 28% |
| 4+ hrs daily on Internet | 87% | 68% | 47% |
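A table like this can be built directly from raw survey responses. The sketch below uses Python's pandas with invented data (the column names and values are hypothetical, not from any real survey):

```python
import pandas as pd

# Hypothetical survey responses: each row is one respondent.
df = pd.DataFrame({
    "age_group": ["20-29", "20-29", "30-39", "30-39",
                  "40-49", "40-49", "20-29", "40-49"],
    "reads_online": ["yes", "yes", "yes", "no", "no", "yes", "no", "no"],
})

# Cross-tabulate the independent variable (age group, columns) against
# the dependent variable (rows); normalize="columns" turns raw counts
# into per-column proportions, as in the percentage table above.
ct = pd.crosstab(df["reads_online"], df["age_group"], normalize="columns")
print((ct * 100).round(1))
```

With `normalize="columns"`, each age-group column sums to 100%, which is what makes the groups directly comparable.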

To make a cross tab within Checkbox, you need a few completed survey responses. Then navigate to the Reports Manager and auto-generate a report to start. Add a "Cross Tab" item to that report, pick the two questions you want to cross tabulate, and run the report. You're done. Add a style sheet if you want.

Cross-tabs, or cross tabulation, is a quantitative research method appropriate for analyzing the relationship between two or more variables. Data about the variables is recorded in a table or matrix. A sample is used to gather information about the variables. The most common type of data collected in cross tabulation is a count of occurrences of the variables; this count or number is referred to as frequency. The matrix used to show the frequency of occurrences of the variables being studied is called a frequency distribution, and it is used to show and analyze frequencies for a particular group or designation.

Cross Tabulation Provides Structure for Quantitative Data.

Raw data is easier to manage and understand when it has structure. Tables permit data about variables to be organized; these tables are often called contingency tables, where "contingency" refers to the possibility that a relationship exists. Variables describe an attribute of a person, group, place, thing, or idea. Variables can be either categorical (qualitative) or quantitative. Categorical variables are descriptive, often indicating something about the group from which the data is derived; examples of categorical variables are attribute labels or names.

Study One Variable with Cross-Tabs

Researchers refer to frequency tables by names that indicate the number or arrangement of variables being studied. A univariate frequency table shows data about one variable. Often the data in a univariate table is put into groups that consist of a range of values, or designations that have been given a value or rank; the ranks are then put in order. An example of univariate data would be the frequency at which students earn grade points and fall into the A, B, C (or 4.0, 3.5, 3.0) categories for a college course.
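A univariate frequency table like the grade example can be sketched as follows (the grades are hypothetical, using pandas):

```python
import pandas as pd

# Hypothetical letter grades earned in one college course.
grades = pd.Series(["A", "B", "B", "C", "A", "B", "C", "C", "B", "A"])

# A univariate frequency table: one count per category, in rank order.
freq = grades.value_counts().reindex(["A", "B", "C"])
print(freq)
```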

Study Multiple Variables with Cross-Tabs

When a frequency table shows data for more than one variable, it is called a joint or bivariate contingency table. Bivariate frequency tables often show data in a two-way arrangement. An example of bivariate data would be the frequency with which people from different regions (north, south, east or west) of the country select crunchy snack bars or chewy snack bars.
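The region-versus-snack example might be tabulated like this (the responses are invented; pandas assumed):

```python
import pandas as pd

# Hypothetical responses: region of each respondent and snack preference.
df = pd.DataFrame({
    "region": ["north", "north", "south", "east",
               "west", "south", "east", "west"],
    "snack":  ["crunchy", "chewy", "crunchy", "crunchy",
               "chewy", "chewy", "crunchy", "crunchy"],
})

# Two-way (bivariate) frequency table, with row and column totals.
table = pd.crosstab(df["region"], df["snack"], margins=True)
print(table)
```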

Quantitative variables fall into one of two types: discrete or continuous. Discrete variables can take only separate, countable values, such as whole-number counts. Continuous variables can take any value between the permitted or agreed-upon maximum and minimum values of a range. As a general rule, discrete and continuous variables are not used together in the same frequency distribution.

Cross-Tabs Permit Comparisons of Frequency Distributions

Data from frequency distributions can also be shown in a visual way, as in a graph. Distributions are compared by looking at four features of the data: Centre, spread, shape, and irregularities. Centre refers to the point at which half of the data falls on either side of a central point. Spread refers to the variability of the data, with a wide spread indicating greater variability than a narrow spread. Shape refers to the symmetry, skewness, or peaks and valleys of the distribution. Irregularities refer to gaps or outliers in the data pattern.

A crosstab differs from a frequency distribution because the latter shows the distribution of one variable only. In a cross table, each cell shows the number of respondents who gave a particular combination of replies.

An example of cross tabulation would be a 3 x 2 contingency table. One variable would be age group, with three age ranges: 10-30, 31-50, and 51-up. The other variable would be the choice of Ray Ban or Fast Track shades. With a crosstab, it would be easy for a company to see what the shade choices are for the three age groups. For instance, the table might show that 45% of those aged 31-50 prefer Ray Ban, while only 10% of those aged 51-up prefer Fast Track. With this information, Ray Ban or Fast Track can make moves that will benefit their business.

Cross tabulations are popular choices for statistical reporting because of their simplicity and clear layout. They can be used with any level of data, whether ordinal, nominal, interval or ratio, because the crosstab treats all of it as nominal data. Crosstab tables provide more detailed insight than a single statistic in a simple way, and they avoid the problem of empty or sparse cells.
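Counts from such a 3 x 2 contingency table can also be tested for association between the two variables. A minimal sketch with invented counts, assuming scipy is available:

```python
from scipy.stats import chi2_contingency

# Hypothetical counts for a 3 x 2 contingency table: three age ranges
# (rows) against two sunglass brands (columns). The numbers are invented.
observed = [
    [40, 10],   # ages 10-30
    [45, 25],   # ages 31-50
    [12, 30],   # ages 51-up
]

# Pearson chi-square test of independence between age group and brand.
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
```

A small p-value (conventionally below 0.05) indicates that brand preference differs across age groups rather than being independent of age.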

Cross tabulations, or cross tabs, are a good way to compare two subgroups of information. Cross tabs allow you to compare data from two questions to determine if there is a relationship between them. Like frequency tables, cross tabs appear as a table of data showing answers to one question as a series of rows and answers to another question as a series of columns.

| Base Question | Female | Male |
|---|---|---|
| Product Manager | 57.2% | 53.4% |
| Director | 12.6% | 14.2% |
| Product Marketing Manager | 24.7% | 23.1% |
| Program Manager | 2.8% | 1.5% |
| Technical Product Manager | 2.8% | 7.7% |
| Total Counts | 215 | 337 |

Cross tabs are used most frequently to look at answers to a question among various demographic groups. The intersections of the columns and rows, commonly called cells, contain the percentages of people who gave each response. In the example above, females and males had a relatively similar distribution among job titles, with the exception of the title "Technical Product Manager", which about 2.5 times as many males held as females. For analysis purposes, cross tabs are a great way to do comparisons.

Since cross tabulation is widely used in statistics, many statistical processes and terms are closely associated with it. Most of these processes are methods to test the strength of crosstabs, which is needed to maintain consistency and produce accurate results, because the data laid out in a crosstab may come from a wide variety of sources. Companies find data warehouses indispensable, but a warehouse can hold billions of records, most of them unrelated; without the aid of tools, this data is of little use to a company. The data is not homogeneous: it may come from various sources, often from other data suppliers and warehouses belonging to branches in other geographical locations. Software applications such as relational database management systems include cross tabulation functionality that allows end users to correlate and compare any piece of data. Crosstab analysis engines can examine dozens of tables quickly and efficiently, and can even produce full statistical output with a few clicks of the mouse or keyboard.

Name : Siddhartha Singh

Roll : 13165

Marketing Group 6

### FREQUENCY AND CROSS TAB ANALYSIS:

In order to go from data to information, to knowledge and to wisdom, we need to reduce the complexity of the data.

Analyzing data from online surveys is probably one of the most interesting aspects of the whole "Online Survey" experience. It is important to understand the "Numbers" before you can claim your research to be successful.

Some of the tools that make data analysis easy are:

1) Frequency analysis.

This gives you an overall insight into the responses to your survey. It answers general questions like:

• What % of the people who responded to my survey are male?
• What is the average age of people who responded to the survey?

2) Cross tabulation analysis or crosstab.

Crosstabs give you more insight into your data. They answer questions like:

• What % of males made a purchase within the last 2 months?
• Are males more satisfied with our products than females?

A crosstab should never be mistaken for a frequency distribution, because the latter shows the distribution of one variable only. In a cross table, each cell shows the number of respondents who gave a particular combination of replies.

An example of cross tabulation would be a 3 x 2 contingency table. One variable would be age group, with three age ranges: 11-21, 22-30, and 31-up. The other variable would be the choice of Tommy Hilfiger jeans or cotton pants. With a crosstab, it would be easy for a company to see what the clothing choices are for the three age groups. For instance, the table might show that 35% of those aged 11-21 prefer Tommy Hilfiger jeans, while only 10% of those aged 31-up prefer cotton pants. With this information, the company can make moves that will benefit its business.

Cross tabulations are popular choices for statistical reporting because they are very easy to understand and are laid out in a clear format. They can be used with any level of data, whether ordinal, nominal, interval or ratio, because the crosstab treats all of it as nominal data. Crosstab tables provide more detailed insight than a single statistic in a simple way, and they avoid the problem of empty or sparse cells.

Since cross tabulation is widely used in statistics, many statistical processes and terms are closely associated with it. Most of these processes are methods to test the strength of crosstabs, which is needed to maintain consistency and produce accurate results, because the data laid out in a crosstab may come from a wide variety of sources.

Companies find data warehouses indispensable, but a warehouse can hold billions of records, most of them unrelated. Without the aid of tools, this data will not make any sense to a company. The data is not homogeneous: it may come from various sources, often from other data suppliers and warehouses belonging to branches in other geographical locations.

Software applications such as relational database management systems include cross tabulation functionality that allows end users to correlate and compare any piece of data. Crosstab analysis engines can examine dozens of tables quickly and efficiently, and can even produce full statistical output with a few clicks of the mouse or keyboard.

Author : Rinzing

Group OPS 3

### SPSS - Telecom's Boon!

SPSS Spreading its Magical Wings over Telecommunication Giants
SPSS is a comprehensive and flexible statistical analysis and data management solution. SPSS can take data from almost any type of file and use it to generate tabulated reports, charts and plots of distributions and trends, and descriptive statistics, and to conduct complex statistical analyses that yield insight and drive business planning.
You will find SPSS customers in virtually every industry, including telecommunications, banking, finance, insurance, healthcare, manufacturing, retail, consumer packaged goods, higher education, government, and market research.
SPSS Statistics Base is easy to use and forms the foundation for many types of statistical analyses.

The procedures within SPSS Statistics Base will enable you to get a quick look at your data, formulate hypotheses for additional testing, and then carry out a number of statistical and analytic procedures to help clarify relationships between variables, create clusters, identify trends and make predictions.

SPSS and Telecommunication
Using predictive analytics from SPSS, telecom companies gain the insight they need to make better, faster, more effective decisions. By learning more about their customers, and those customers’ preferences and needs, telecom companies can be more successful in this highly competitive industry.
SPSS plays an important role in the telecommunication sector with respect to various verticals:
1. Analytical customer relationship management (CRM)
2. Marketing and sales analysis
3. Segmentation management
4. Fraud detection

SPSS helps reduce customer churn, and acquire and retain customers
Predictive analytics enables telecom companies to develop more effective customer retention strategies by identifying both their “at risk” and most valuable customers. Telecom providers can:
1. Increase customer retention
2. Acquire profitable customers
3. Create more effective cross-selling and up-selling strategies

For example, by identifying the greatest number of customers likely to churn within a small percentage of the customer base, telecommunications companies can develop effective customer retention solutions and reduce costs. Companies can also identify their most profitable customers by value, as well as propensity to churn. With this knowledge, they can target the right customers with offers, such as a package that bundles DSL and long-distance services, to keep them loyal.
Develop more focused marketing and sales campaigns
Predictive analytics enables marketers in telecom companies to plan marketing programs and campaigns—and closely monitor results—using skills they already have. Marketers receive a complete, current view of their customers, and insight into customer attitudes and behaviour.
Target messages to the right customers
Understanding the similarities and differences among customers in specific geographic regions and demographic segments can make all the difference to telecom companies—especially if they serve a large and varied population. Predictive analytics can help companies more effectively customize their strategies, offerings, and campaigns by providing a clearer understanding of the common characteristics or behaviors of certain groups.
Segmenting enables a telecom company to create more precise campaigns, rather than sending general offers to the entire customer database. A company could look at customer segments to create targeted customer retention strategies for certain segments. For example, it might create a special campaign for small business customers located in suburban office parks that are likely to be lured by a competitor’s lower prices and extended local calling areas. Or, looking at geographic regions and demographic segments, the company might design a specific bundle for single women in the northern region, and another bundle for families with multiple lines in the northwest region.
Identify patterns common to fraud—to stop fraudulent activity
Addressing fraud is a challenge the telecom industry faces every day. Fraud detection and prevention can be very difficult, affect a diverse range of departments, and significantly strain resources. Predictive analytics can help telecom companies identify patterns that are common to fraud. Providers can easily detect and investigate possible cases of fraud, including unauthorized use of another subscriber’s minutes, billing fraud, and fraudulent payments. All this enables telecom companies to recoup more money and put a stop to fraudulent activity.
Understand what matters most to customers
Survey and market research software helps telecom companies gauge their customers’ opinions and analyze results. With SPSS’ solutions for market assessment and testing, telecom providers can:
1. Predict which customers are more likely to churn over time
2. Gauge satisfaction with regard to a particular service or rate plan
3. Evaluate the effectiveness of customer service
4. Understand which features and options encourage customer retention
5. Determine the likelihood a new logo or tagline will succeed in the marketplace

Gaurav Kumar
Group 2 - Marketing

### Business analytics - Session 1

Analytics has been used in business for a very long time. There can be many academic applications for analytics, but the real application is in business. In academics one can find a lot of data, give observations and interpret them; the tough job is to produce meaningful results that have some business significance.
So as MBA students we should not use SPSS just as an academic tool, but learn how to use it for business.

We learnt that a cross tab is bivariate analysis. Bivariate analysis is one of the simplest forms of quantitative (statistical) analysis. It involves the analysis of two variables (often denoted X and Y) for the purpose of determining the empirical relationship between them. To see whether the variables are related to one another, it is common to measure how the two variables change together.

This can be applied to operations management. For example, in a warehouse we can check whether placing items based on value affects production time.

A warehouse has a certain size and different locations within it; some locations are close to the production line and some are not. Items may also be stacked on top of one another. Inventory has to be picked from the warehouse and moved to the production line, which consumes effort and time. Items also vary by value, and with it by weight and size, so the warehouse locations assigned to items based on value may affect production time.

Analytics will help find out whether it really does.

Posted by
Akash Sarkar (13004)
Operations Group 1

Source : Wikipedia

### Of CrossTabulations and Clusters

Of CrossTabulations and Clusters

As we begin, let me explain what cluster analysis is.

Cluster Analysis is a statistical tool that can be applied to data that exhibit natural groupings (sic). A cluster is a group of homogeneous or similar observations. Now there are two major ways of going about clustering which are as follows:

· Hierarchical Clustering: This type of clustering separates data into clusters in hierarchical order and represents them in a tree like structure called a dendrogram. This in turn can be divided into two different types.

o Agglomerative Clustering: A bottom-up approach in which each observation starts as its own cluster; clusters are merged step by step until a single overall cluster remains at the apex

o Divisive Clustering: A top-down approach in which we start with one all-inclusive cluster and split it repeatedly into smaller clusters until each observation stands alone or no further division is useful

· K-Means Clustering: A form of non-hierarchical clustering in which clusters are determined by the centroids of the data points. Its main advantage is simplicity, and it is the most popular method of partitioning data.
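The K-means idea above can be sketched in a few lines. This is a toy implementation on made-up points, not the SPSS procedure; the function name and data are illustrative only:

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means sketch: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k randomly chosen distinct points.
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Distance from every point to every centroid.
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update each centroid (keep the old one if its cluster is empty).
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Two obvious hypothetical groups, around (0, 0) and (10, 10).
pts = np.array([[0, 0], [0, 1], [1, 0],
                [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = kmeans(pts, k=2)
print(labels)
```

On such well-separated data the two natural groups are recovered regardless of which points seed the centroids.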

In today's class, my second time working with a tool like SPSS, I found that data handling could be so easy. Excel of course is the name that comes to mind immediately when somebody talks about data analysis but just the first look suggests that the innumerable resources that SPSS provides can make sorting through data a cakewalk.

For example, today we worked with Frequencies and Crosstabs. Sounds simple, doesn't it? Only it doesn't seem so simple when you work with it. SPSS provides innumerable tools, like row and column percentages and chi-squared tests, at the click of a button. Life made simple!

Come to think of it, a store manager might like to know why his sales are decreasing, or increasing (the optimist that I always am). Operating a store, he or she knows that thousands visit the store to buy different kinds of items. Not everybody can be asked for feedback, and not everybody's voice is heeded. What they could do is use SPSS to chart a relationship between the likely reasons behind a nagging pain for the company or store.

Coming to real-life examples, I just look up and see a router with its light beeping constantly. Do you know how the Internet on campus works? By clustering. The boys' wing and the girls' wing have different net connection settings. They are nothing but clusters; K-means clusters, in fact. Nobody is given preference. Everybody connects through the same Internet gateway (I had to do a little research to obtain that term).

So at the end of the day, clustering is all around you. It's in the way you make your own choices, how you make friends, and so on.

Group name: Finance 3

Author: Anjali Meena

### Identification of false financial statements through SPSS

An external stakeholder can determine whether a financial statement is falsified before using it as a basis for investing, appraising, and other decisions.

A research paper presented at the American University of Nigeria tested the efficacy of a 'CPT Analyses' model for identifying false financial statements using SPSS. The paper demonstrates that conducting and implementing the proposed 'CPT Analyses' for detection of false financial statements will undoubtedly be helpful to professionals such as auditors, forensic accountants, insolvency practitioners, tax authorities, investors, consultants, banks and other users of financial statements.

SPSS helps organizations succeed in combating fraud, waste, and abuse.

Another major field of contribution is the health care sector, where deliberately submitting false claims to private health insurance plans or tax-funded insurance programs such as Medicare or Medicaid is a serious and growing nationwide crime trend.

Likewise, there can be many other areas where SPSS can serve and help processes become smoother.

http://www.sybase.com/content/1025531/iws_fraud_l02341.pdf

http://www.spss.com.sg/software_and_solutions.aspx?pageid=337&secid=300

Posted by

Swati Agarwal

Finance

### DAY 1 at a glance !!

SPSS software is used in data mining, analyzing quantitative data and for analyzing various variables that can affect a business. It is used by market researchers, health researchers, survey companies, government, education researchers, marketing organizations and others.

Statistics included in the base software are as follows:

• Descriptive statistics: Cross tabulation, Frequencies, Descriptives, Explore, Descriptive Ratio Statistics
• Bivariate statistics: Means, t-test, ANOVA, Correlation (bivariate, partial, distances), Non-parametric tests
• Prediction for numerical outcomes: Linear regression
• Prediction for identifying groups: Factor analysis, cluster analysis (two-step, K-means, hierarchical), Discriminant

(http://en.wikipedia.org/wiki/SPSS)

SPSS datasets have a table structure where the rows typically represent cases (such as individuals or households, which can be given names and labels) and the columns represent measurements (such as age, sex or household income, which are given values). The values, names or labels are provided by clicking on the variable view and then filling in.

The 'Data View' shows a spreadsheet view of the rows and columns. Unlike spreadsheets, the data cells can only contain numbers or text. The 'Variable View' displays the information or characteristics where each row represents a variable and shows the variable name, variable label, value, print width, measurement type and a variety of other characteristics.

When we want to understand how often each value of a single factor occurs, frequencies are analyzed. This is done by clicking the Analyze tab, then Descriptive Statistics, then Frequencies. This is used for univariate analysis.

For bivariate analysis, cross tabulation is selected: click the Analyze tab, then Descriptive Statistics, then Crosstabs. The chi-square can also be calculated by selecting 'Statistics' and ticking Pearson's chi-square.

In the output window a complete table of both variables and their relationship will appear. The factor on which the hypothesis is tested will be the row variable. If the significance value of the chi-square test is less than 0.05, the null hypothesis of independence is rejected, meaning the variables are related.
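The same Crosstabs-plus-chi-square workflow can be sketched outside SPSS. The data, variable names and 0.05 threshold here are illustrative (pandas and scipy assumed):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical survey data: gender vs. satisfaction with a product.
df = pd.DataFrame({
    "gender":    ["M", "F"] * 20,
    "satisfied": ["yes"] * 25 + ["no"] * 15,
})

# The Crosstabs step: a two-way table of counts.
table = pd.crosstab(df["gender"], df["satisfied"])

# Pearson chi-square test of independence on that table.
chi2, p, dof, expected = chi2_contingency(table)

# Decision rule: p < 0.05 rejects the null hypothesis of independence.
decision = "reject independence" if p < 0.05 else "fail to reject independence"
print(round(p, 3), decision)
```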

Hence, data can be correlated and analyzed to conclude a hypothesis.

Cluster analysis is the process of grouping a set of observations together, in order to analyse similar data.

There are different kinds of analysis :
• Hierarchical clustering: find successive clusters using previously established clusters. These algorithms are usually either bottom-up or top-down.
• Divisive clustering (top-down): start at the top with all documents in one cluster, then split the cluster using a flat clustering algorithm.
• Agglomerative clustering (bottom-up): treat each document as a singleton cluster at the outset, then successively merge (agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all documents.
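The agglomerative (bottom-up) merging described above is what scipy's `linkage` performs. A small sketch with invented observations, cutting the resulting tree into two clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six hypothetical observations forming two natural groups.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 5.1], [5.1, 4.9]])

# Agglomerative clustering: each point starts as its own cluster and
# the closest pair of clusters is merged at every step.
Z = linkage(X, method="average")

# Cutting the tree at two clusters recovers the two groups.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The linkage matrix `Z` is also what a dendrogram plot would be drawn from.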

Clustering process involves:
• Selection of variables.
• Distance measurement
• Clustering criteria.

Dendrograms:

One product of cluster analysis is a tree diagram representing the entire process of going from individual points to one big cluster. This diagram is called a dendrogram. A dendrogram is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. Dendrograms are often used in business analytics to illustrate the clustering of similar variables or samples.

Deciding the number of clusters to map can be aided by looking at the dendrogram. There are three key pieces of information that you can get from the dendrogram. They are:
• Weight - the rough percentage of all individuals that fall within each cluster
• Compactness - how similar to one another the elements of a cluster are
• Distinctness - how different one cluster is from its closest neighbor

A hierarchical clustering dendrogram looks like an upside-down tree: individual observations sit at the bottom and merge, pair by pair, into branches that join in a single cluster at the top.

Written by Anita Nair

Group: Marketing1

### Fund Managers use SPSS to Select Stocks

Business analytics (BA) refers to the skills, technologies, applications and practices for continuous iterative exploration and investigation of past business performance to gain insight and drive business planning. BA makes extensive use of data, statistical and quantitative analysis, explanatory and predictive modeling, and fact-based management to drive decision making.

SPSS is a tool that is commonly used to drill down into various details of the business. It makes use of functions like “Frequency” and “Cross tab bivariate analysis” to dig deep into the various aspects of the business. This simplifies the decision making process by focusing on the key aspects which otherwise may not be very visible.

The Process starts with the Hypothesis where you want to find whether a relationship exists between two variables or not. As a finance person, I can think of innumerable possibilities of its applications.

SPSS uses different variables in the Data View tab. These are shown as columns. The cases exist in the rows. The values of different variable of a particular case are entered in the Data view tab. The different attributes of these variables can be changed in the Variable view tab.

Let us assume that we want to establish a relation between “Returns given by the stocks and its PE ratio over a long period of time (say 5years)”

Our hypothesis (the null) will be: "There is no significant relation between low P/E stocks (also called value stocks) and the returns given over a 5-year horizon."

The Cases will be its details including the name & the variable values. These variables will be the Returns, P/E ratio, Market capitalization, Book value-to-Price ratio.

We can group the P/E values using "Recode into Different Variables". Here we define ranges for P/E and thus create a new variable: each range is given a value, and the labels can be Value Stocks (low P/E) and Growth Stocks (high P/E). Similarly, ranges can be given to the returns, with labels such as very high, high, moderate, low and negative returns. The P/E groups go in the rows while the return labels go in the columns. We should use percentages to see what percentage of low P/E stocks had very high (or high) returns over the last 5 years.

Using Crosstabs we can add further layers based on the market capitalization and the book-to-price ratio of the stock; these layers give us a better understanding. Using the chi-square test's significance value (the p-value, compared against 5%), we can accept or reject our null hypothesis. If the p-value is less than 5%, we reject the null hypothesis and conclude that there is a significant relationship between low P/E stocks and high returns over a 5-year period when, say, the book-to-price ratio is high (the third variable introduced as a layer).
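The chi-square decision rule above can be worked through by hand on a made-up 2x2 table (SPSS reports this automatically; here the Pearson statistic is computed from scratch and compared against the 5% critical value for one degree of freedom, which is 3.841):

```python
# Hypothetical contingency table: P/E category vs. high / not-high returns
observed = [[30, 10],   # Low P/E : high returns, not high
            [12, 28]]   # High P/E: high returns, not high

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
grand = sum(row_totals)

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells,
# where E is the expected count under independence
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi2 += (o - expected) ** 2 / expected

# For a 2x2 table, df = 1; the critical value at the 5% level is 3.841.
# Statistic above the critical value means p-value below 5%: reject the null.
CRITICAL_5PCT_DF1 = 3.841
reject_null = chi2 > CRITICAL_5PCT_DF1
print(round(chi2, 2), reject_null)
```

Here the statistic (about 16.24) far exceeds 3.841, so on this toy data we would reject the null hypothesis of no relationship.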

Therefore, a smart fund manager can use SPSS to pick the stocks he should invest in (given a long enough investment horizon). In this way SPSS can prove extremely useful in stock picking, and such funds can deliver returns well above the market.

Now we know one secret of why some fund managers outperform the rest!

By-

Vyom Saini

Finance Grp-6 (13114)

### USING THE CROSS TAB TOOL FOR A LOAN REPAYMENT SCENARIO

Before attending the lecture, business analytics to me was about doing data analysis to meet certain objectives of the organization. It is about that, but the application of the subject became much clearer. The exercises done in class showed how much the SPSS software can do to simplify the business-analysis function. We used a number of tools today, such as the chi-square test and the frequency check, which is part of descriptive analysis.

One of the analysis tools used is the cross tab. It is a bivariate analysis tool that shows the relationship between two variables, throwing some light on cause and effect. Banks can use it to analyse defaults in loan repayment by their borrowers.

Say a bank has 10 branches in a city. The amount of default at each branch can be checked with the frequency function, and the branch with the maximum default can be analysed further. The reasons for a high level of default can be many, such as the loan amount being higher than the borrower's capacity to repay, or a faulty loan-collection mechanism.

At the branch level, further analysis can be done using the cross tab tool. We can see which customers are defaulting, retail or corporate. If retail customers are defaulting more, we can cross-tabulate the income levels of the defaulting customers against the amounts of the loans granted to them. It is possible that the loans granted to these customers are high compared with their repayment capacity. The loans granted to customers with the same income level at all branches can then be compared.
If the loans granted to such customers at the branch under analysis are larger than their repayment capacity, we know the fault lies with the branch management for inflating the loan size.
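The two-step workflow described above (a frequency check per branch, then a crosstab at the worst branch) can be sketched in Python with entirely made-up loan records:

```python
# Hypothetical loan records: (branch, customer_type, defaulted)
loans = [
    ("B1", "retail", True), ("B1", "retail", True), ("B1", "corporate", False),
    ("B2", "retail", False), ("B2", "corporate", False),
    ("B1", "retail", True), ("B1", "corporate", True), ("B2", "retail", True),
]

# Step 1: frequency check — count defaults per branch, find the worst branch
defaults_by_branch = {}
for branch, _, defaulted in loans:
    if defaulted:
        defaults_by_branch[branch] = defaults_by_branch.get(branch, 0) + 1
worst = max(defaults_by_branch, key=defaults_by_branch.get)

# Step 2: crosstab of customer type vs. default status at the worst branch
crosstab = {}
for branch, ctype, defaulted in loans:
    if branch == worst:
        row = crosstab.setdefault(ctype, {"default": 0, "ok": 0})
        row["default" if defaulted else "ok"] += 1

print(worst, crosstab)
```

On this toy data the worst branch is B1, and its crosstab shows retail customers accounting for most defaults, which is the signal that would trigger the income-vs-loan-size comparison described in the text.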
Therefore, this analysis tool can be used in varied fields to meet different business objectives.
By:
Sakshi Tripathi (13101)
Group: Finance6

### Types of clustering:

Hierarchical clustering

Hierarchical clustering creates a hierarchy of clusters which may be represented in a tree structure called a dendrogram. The root of the tree consists of a single cluster containing all observations, and the leaves correspond to individual observations. Hierarchical clustering can be either agglomerative or divisive.

Agglomerative: This is a bottom-up approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. Once a cluster is formed it cannot be split; it can only be combined with other clusters, so cases never separate from a cluster they have joined.

Divisive: A top-down method that is less commonly used. It works like agglomerative clustering but in the opposite direction: it starts with a single cluster containing all objects, then successively splits the resulting clusters until only clusters of individual objects remain.

To form clusters using a hierarchical cluster analysis, you must select:

· A criterion for determining similarity or distance between cases

· A criterion for determining which clusters are merged at successive steps

· The number of clusters you need to represent your data

Types of clusters:

· Well separated clusters: Each cluster point is closer to all of the points in its cluster than to any point in another cluster

· Centre based clusters: Each point is closer to the centre of its cluster than to centre of any other cluster

· Contiguity based clusters: Each point is closer to at least one point in its cluster than to any point in another cluster

· Density based cluster: Clusters are regions of high density separated by regions of low density

· Conceptual clusters: Points in a cluster share some general property that derives from the entire set of points.

Distance between Cluster Pairs:

These methods define the distance between two clusters at each stage of the clustering procedure:

Nearest neighbor (single linkage): If you use the nearest neighbor method to form clusters, the distance between two clusters is defined as the smallest distance between two cases in the different clusters.

Furthest neighbor (complete linkage): If you use a method called furthest neighbor (also known as complete linkage), the distance between two clusters is defined as the distance between the two furthest points.

Ward’s method: For each cluster, the means for all variables are calculated. Then, for each case, the squared Euclidean distance to the cluster means is calculated. These distances are summed for all of the cases. At each step, the two clusters that merge are those that result in the smallest increase in the overall sum of the squared within-cluster distances. The coefficient in the agglomeration schedule is the within-cluster sum of squares at that step, not the distance at which clusters are joined.

Centroid method: This method calculates the distance between two clusters as the sum of distances between cluster means for all of the variables. In the centroid method, the centroid of a merged cluster is a weighted combination of the centroids of the two individual clusters, where the weights are proportional to the sizes of the clusters.

Median method: With this method, the two clusters being combined are weighted equally in the computation of the centroid, regardless of the number of cases in each. This allows small groups to have an equal effect on the characterization of larger clusters into which they are merged.
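The nearest-neighbour (single-linkage) rule above is the simplest of these to sketch. The toy example below runs agglomerative clustering on made-up 1-D data: at each step the two clusters whose closest members are nearest are merged, and merged clusters are never split:

```python
# Made-up 1-D observations; each case starts in its own cluster
points = [1.0, 1.5, 5.0, 5.2, 9.0]
clusters = [[p] for p in points]

def single_link(c1, c2):
    # Single linkage: smallest distance between any two cases
    # drawn from the two different clusters
    return min(abs(a - b) for a in c1 for b in c2)

target = 2  # stop when this many clusters remain
while len(clusters) > target:
    # Find the pair of clusters with the smallest single-linkage distance
    i, j = min(((i, j) for i in range(len(clusters))
                       for j in range(i + 1, len(clusters))),
               key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
    clusters[i] = clusters[i] + clusters[j]  # merge; never split again
    del clusters[j]

print(clusters)
```

With this data the points near 1 and near 5 end up in one chained cluster, while the isolated point at 9 stays alone; chaining like this is a known characteristic of single linkage.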

K-means clustering :

This is the most popular partitioning method. Here we have to specify the number of clusters to extract in advance. The procedure classifies a given data set into a fixed number of clusters (say k) as follows. First, define k centroids, one for each cluster. Next, take each point in the data set and associate it with the nearest centroid; when no point is left pending, an initial grouping is done. Then re-calculate the k centroids of the clusters resulting from the previous step, and bind each data point to the nearest new centroid again. This loop repeats, and the k centroids change their location step by step until they no longer move. The algorithm thus aims at minimizing an objective function, in this case a squared-error function.
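The assign-then-update loop just described can be written as a minimal pure-Python sketch (1-D made-up data, k fixed a priori, starting centroids chosen arbitrarily):

```python
points = [1.0, 1.2, 0.8, 8.0, 8.4, 7.6]   # made-up observations
k = 2
centroids = [points[0], points[3]]          # a priori: pick k starting centroids

while True:
    # Assignment step: bind each point to its nearest centroid
    groups = [[] for _ in range(k)]
    for p in points:
        nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
        groups[nearest].append(p)

    # Update step: recompute each centroid as the mean of its group
    new_centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in enumerate(groups)]

    if new_centroids == centroids:          # centroids no longer move: done
        break
    centroids = new_centroids

print(centroids)
```

Each pass through the loop cannot increase the total squared error, so on a finite data set the centroids eventually stop moving; on this toy data they settle near 1 and 8.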


Group: Marketing 6

Author: Soupa Soundararajan (13109)