
## Introductory Data Science

### Similarity measures

High similarity between a pair of points indicates that the points are close to each other; low similarity indicates a large distance between them. The literature covers several similarity measures.

## Jaccard index or Jaccard coefficient

Jaccard index/coefficient/similarity is generally computed between two sets of items. It is the ratio of the items common to both sets to all items across the two sets. If X and Y are two sets, then the Jaccard index between them is computed as the ratio of the size of their intersection to the size of their union. $\text{Jaccard}(X, Y)=\frac{\left |X\cap Y\right |} {\left |X\cup Y\right |}$

If X={a, b, c}, and Y={b, c, d, e} then, the size of the intersection between X and Y is: $|X\cap Y|= |\left \{ b, c \right \}|=2$

and the size of the union of X and Y is: $|X\cup Y|= |\left \{ a, b, c, d, e \right \}|=5$

Therefore, for given X={a, b, c} and Y={b, c, d, e}: $\text{Jaccard}(X, Y)=\frac{\left |X\cap Y\right |} {\left |X\cup Y\right |}=\frac{2}{5}=0.4$

Jaccard similarity varies between 0 to 1. A value of zero indicates no similarity between the two sets at all. A value of 1.0 indicates that the two sets are the same.
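The set-based Jaccard index translates directly into code. Below is a minimal Python sketch (the function name `jaccard` is our own choice) that reproduces the worked example above:

```python
def jaccard(x: set, y: set) -> float:
    """Set-based Jaccard index: |X intersect Y| / |X union Y|."""
    return len(x & y) / len(x | y)

# The sets from the worked example:
X = {"a", "b", "c"}
Y = {"b", "c", "d", "e"}
print(jaccard(X, Y))  # 0.4
```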

## Weighted Jaccard index/coefficient/similarity

Jaccard index can be computed between two vectors too. Jaccard index computed between two vectors/data points/objects is called a weighted Jaccard index. Given X and Y — two vectors each of length n — the formula for weighted Jaccard index or similarity between them is: $\text{Jaccard}(X, Y)=\frac{\sum_{k=1}^{n}\text{min}(X_k, Y_k)}{\sum_{k=1}^{n}\text{max}(X_k, Y_k)}$

Suppose we have a four-dimensional dataset (Features 1 through 4).

| | Feature 1 | Feature 2 | Feature 3 | Feature 4 |
| --- | --- | --- | --- | --- |
| Row 1 | 10 | 3 | 3 | 5 |
| Row 2 | 5 | 4 | 5 | 3 |
| Row 3 | 9 | 4 | 6 | 4 |
| Row 4 | 8 | 6 | 2 | 6 |
| Row 5 | 20 | 15 | 10 | 20 |

Let us compute the Jaccard similarity between Row 1 and Row 3.

Row 1 contains (10, 3, 3, 5). Row 3 contains (9, 4, 6, 4).

Weighted Jaccard similarity between Row 1 and Row 3 is: $\text{Jaccard}(\text{Row 1}, \text{Row 3})=\frac{\text{min}(10,9)+\text{min}(3,4)+\text{min}(3,6)+\text{min}(5,4)}{\text{max}(10,9)+\text{max}(3,4)+\text{max}(3,6)+\text{max}(5,4)}=\frac{9+3+3+4}{10+4+6+5}=\frac{19}{25}=0.76$

Let us compute the Jaccard similarity between Row 1 and Row 5.

Row 1 contains (10, 3, 3, 5). Row 5 contains (20, 15, 10, 20).

Weighted Jaccard index between Row 1 and Row 5 is: $\text{Jaccard}(\text{Row 1}, \text{Row 5})=\frac{10+3+3+5}{20+15+10+20}=\frac{21}{65}=0.323076923$

That means Row 3 is more similar to Row 1 than Row 5 is.

Weighted Jaccard similarity may vary between 0 and 1.0. A value of 1.0 indicates that the two vectors are the same. A value of 0 indicates no similarity between the two vectors.

Notice that the set-based Jaccard similarity we discussed earlier in this lesson is a special case of weighted Jaccard similarity — in the set-based Jaccard similarity, the weight of an item (feature) can be either 1 (present) or 0 (absent.)
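The weighted Jaccard similarity can be sketched in plain Python (the function name `weighted_jaccard` is our own); it reproduces both row comparisons above:

```python
def weighted_jaccard(x, y):
    """Weighted Jaccard: sum of element-wise minimums
    over sum of element-wise maximums."""
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    return numerator / denominator

row1 = (10, 3, 3, 5)
row3 = (9, 4, 6, 4)
row5 = (20, 15, 10, 20)
print(weighted_jaccard(row1, row3))  # 0.76
print(weighted_jaccard(row1, row5))  # 21/65, about 0.3231
```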

## Cosine similarity

Cosine similarity between two vectors X and Y is computed using the following formula. $\text{cosine}(X, Y)=\frac{X\cdot Y}{\left \| X \right \| \left \| Y \right \|}$

$X\cdot Y$ is the dot product of the two vectors, each of length n. $X\cdot Y=\sum_{k=1}^{n}\left ( X_k\times Y_k \right )$

||X|| refers to the L2-norm of a vector X that has a length of n. $||X||=\sqrt{\sum_{k=1}^{n}(X_k)^2}$

Hence, $||Y||=\sqrt{\sum_{k=1}^{n}(Y_k)^2}$.

Example: Consider Row 1 and Row 3 of the four-dimensional data table above. Row 1 contains (10, 3, 3, 5) and Row 3 contains (9, 4, 6, 4). What is the cosine similarity between Row 1 and Row 3? $\text{cosine(Row 1, Row 3)}\\ =\frac{10\times 9+3\times 4+3\times 6+5\times 4} {\sqrt{10^2+3^2+3^2+5^2}\times\sqrt{9^2+4^2+6^2+4^2}}\\ =\frac{140}{145.969174828}\\ =0.95910660702$

Now, compute the cosine similarity between Row 1 and Row 5. Row 1 contains (10, 3, 3, 5). Row 5 contains (20, 15, 10, 20). The cosine similarity should be 0.93494699.

Therefore, Row 3 is more similar to Row 1 than Row 5.

Cosine similarity may vary between -1 and 1. However, it is widely used in the positive space, where the similarity varies between 0 and 1. It is especially used in the positive space for document datasets, where document vectors have non-negative values. We hope to discuss document vectors further in the future.

Also, note that when computing cosine similarity, if one of the vectors contains all zeros, the computation results in a division by zero. An assumption here is that the origin (0, 0, …, 0) is not a data point in the data.
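As a sketch, cosine similarity can be computed in plain Python (the function name `cosine` is our own); note how an all-zero vector triggers the division by zero mentioned above:

```python
import math

def cosine(x, y):
    """Cosine similarity: dot(X, Y) / (||X|| * ||Y||).
    Raises ZeroDivisionError if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

row1 = (10, 3, 3, 5)
row3 = (9, 4, 6, 4)
row5 = (20, 15, 10, 20)
print(round(cosine(row1, row3), 6))  # 0.959107
print(round(cosine(row1, row5), 6))  # 0.934947
```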

## Tanimoto coefficient/index/similarity

Tanimoto similarity between two vectors X and Y is computed using the following formula. $\text{Tanimoto}(X, Y)=\frac{X\cdot Y}{\left \| X \right \|^2 + \left \| Y \right \|^2 -X\cdot Y}$

Example: Consider Row 1 and Row 3 of the four-dimensional data table that we have been using in this lesson. Row 1 contains (10, 3, 3, 5) and Row 3 contains (9, 4, 6, 4). What is the Tanimoto index between Row 1 and Row 3? $\text{Tanimoto(Row 1, Row 3)}\\ =\frac{10\times 9+3\times 4+3\times 6+5\times 4} {(10^2+3^2+3^2+5^2)+(9^2+4^2+6^2+4^2)-(10\times 9+3\times 4+3\times 6+5\times 4)}\\ =\frac{140}{143+149-140}\\ =\frac{140}{152}\\ =0.921052632$

Now, compute the Tanimoto similarity between Row 1 and Row 5. Row 1 contains (10, 3, 3, 5). Row 5 contains (20, 15, 10, 20). The Tanimoto similarity should be 0.419932811 (please do the calculation as a practice.)

Row 1 has a larger Tanimoto similarity with Row 3 than with Row 5. Therefore, Rows 1 and 3 are more similar than Rows 1 and 5.
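A Python sketch of Tanimoto similarity, using the standard (extended Jaccard) definition X·Y / (||X||² + ||Y||² − X·Y); the function name `tanimoto` is our own:

```python
def tanimoto(x, y):
    """Tanimoto (extended Jaccard) similarity:
    dot(X, Y) / (||X||^2 + ||Y||^2 - dot(X, Y))."""
    dot = sum(a * b for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    return dot / (sq_x + sq_y - dot)

row1 = (10, 3, 3, 5)
row3 = (9, 4, 6, 4)
row5 = (20, 15, 10, 20)
print(round(tanimoto(row1, row3), 6))  # 140/152, about 0.921053
print(round(tanimoto(row1, row5), 6))  # 375/893, about 0.419933
```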

## Concluding remarks on similarity

Our focus in this lesson was similarity measures between two vectors (and also two sets.) Any data that can be represented in tables can leverage the similarity measures explained in this lesson. Many other similarity measures may exist for different types of data. For example, there are graph similarity measures for graph data. Time series data may have other similarity measures too.

Why are we discussing distance and similarity measures? Distance and similarity measures are at the core of many data science and artificial intelligence algorithms. We will use some of these distance and similarity measures in the algorithms covered in this course.
