Monday, 29 August 2011

An Insight Into Jaccard Measure

The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for comparing the similarity and diversity of sample sets. It measures similarity between sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1, or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union:

$$J'(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$$
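The two set-based definitions can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `jaccard_similarity` and `jaccard_distance` are hypothetical, and treating two empty sets as identical (coefficient 1) is an assumption, since the definition leaves that case undefined.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical (J = 1),
        # because the ratio 0/0 is otherwise undefined.
        return 1.0
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Dissimilarity: one minus the Jaccard coefficient."""
    return 1.0 - jaccard_similarity(a, b)


if __name__ == "__main__":
    A = {1, 2, 3, 4}
    B = {3, 4, 5, 6}
    # Intersection {3, 4} has 2 elements; union {1, ..., 6} has 6.
    print(jaccard_similarity(A, B))
    print(jaccard_distance(A, B))
```

Here the two sets share 2 of 6 distinct elements, so the coefficient is 2/6 and the distance is 4/6.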

Jaccard’s coefficient (a measure of similarity) and Jaccard’s distance (a measure of dissimilarity) are measures of asymmetric information on binary and non-binary variables.

Example: Similarity of asymmetric binary attributes

Given two objects, A and B, each with n binary attributes, the Jaccard coefficient is a useful measure of the overlap between the attributes of A and B. Each attribute of A and B can be either 0 or 1. The number of attributes falling into each combination of values for A and B is specified as follows:

M11 represents the total number of attributes where A and B both have a value of 1.

M01 represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1.

M10 represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0.

M00 represents the total number of attributes where A and B both have a value of 0.

Each attribute must fall into one of these four categories, meaning that

$$M_{11} + M_{01} + M_{10} + M_{00} = n$$

Jaccard similarity coefficient J is given by

$$J = \frac{M_{11}}{M_{11} + M_{01} + M_{10}}$$

Jaccard distance J’ is given by

$$J' = \frac{M_{01} + M_{10}}{M_{11} + M_{01} + M_{10}}$$
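The binary-attribute formulas above can be illustrated with a short Python sketch. The helper names `binary_counts`, `jaccard_binary` and `jaccard_binary_distance` are hypothetical, and returning a coefficient of 1 when no attribute is 1 in either object is an assumption, since the formula is undefined in that case.

```python
def binary_counts(a, b):
    """Count attributes in each of the four categories M11, M01, M10, M00."""
    if len(a) != len(b):
        raise ValueError("A and B must have the same number of attributes")
    m11 = m01 = m10 = m00 = 0
    for x, y in zip(a, b):
        if x == 1 and y == 1:
            m11 += 1
        elif x == 0 and y == 1:
            m01 += 1
        elif x == 1 and y == 0:
            m10 += 1
        else:
            m00 += 1
    return m11, m01, m10, m00


def jaccard_binary(a, b):
    """J = M11 / (M11 + M01 + M10); M00 is ignored (asymmetric attributes)."""
    m11, m01, m10, _ = binary_counts(a, b)
    denom = m11 + m01 + m10
    if denom == 0:
        # Assumption: objects with no 1-valued attributes are treated as identical.
        return 1.0
    return m11 / denom


def jaccard_binary_distance(a, b):
    """J' = (M01 + M10) / (M11 + M01 + M10), computed as 1 - J."""
    return 1.0 - jaccard_binary(a, b)


if __name__ == "__main__":
    A = [1, 0, 1, 1, 0, 0]
    B = [1, 1, 0, 1, 0, 0]
    print(binary_counts(A, B))   # counts M11, M01, M10, M00
    print(jaccard_binary(A, B))
    print(jaccard_binary_distance(A, B))
```

For these two objects M11 = 2, M01 = 1, M10 = 1 and M00 = 2, so the counts sum to n = 6 and both the coefficient and the distance equal 2/4 = 0.5. The two shared zeros do not affect the result, which is what makes the measure suitable for asymmetric binary attributes.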

Sakshi Goel

13100

Finance Group-1