Distance measures

Given a pair of vectors (data points, or objects, or rows of a table), we can use some existing distance measures to compute how different or similar the vectors are. We will start with a distance measure that we are already familiar with from geometry — the Euclidean distance.

You might recall from geometry that the distance between two points $A(x_1, y_1)$ and $B(x_2, y_2)$ is equal to $\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$ . The distance between two rows of a data table can be computed using the same Euclidean distance formula because tabular data forms a Euclidean space, as discussed before.

For example, consider Row 2 and Row 5 of the following table. Row 2 has (5, 4), and Row 5 has (9, 2). Therefore, the distance between Row 2 and Row 5 is equal to $\sqrt{(5-9)^2+(4-2)^2}=4.472135954999579$

	Feature 1	Feature 2
Row 1	10	3
Row 2	5	4
Row 3	10	4
Row 4	8	6
Row 5	9	2

What happens when we have a three-dimensional dataset like the following one, instead of a two-dimensional dataset?

	Feature 1	Feature 2	Feature 3
Row 1	10	3	3
Row 2	5	4	5
Row 3	10	4	6
Row 4	8	6	2
Row 5	9	2	1

The principle of Euclidean distance is the same for a higher-dimensional space — the square root of the summation of squared differences in each dimension. That is, for a three-dimensional dataset or space, the distance between two data points $A(x_1, y_1, z_1)$ and $B(x_2, y_2, z_2)$ is equal to $\sqrt{(x_1-x_2)^2+(y_1-y_2)^2+(z_1-z_2)^2}$ .

Row 2 of the table above contains (5, 4, 5) and Row 5 contains (9, 2, 1). The Euclidean distance between Row 2 and Row 5 of the data above is: $\sqrt{(5-9)^2+(4-2)^2+(5-1)^2}=6$

What happens if you have n-dimensional data?

	Feature 1	Feature 2	Feature 3	Feature 4	—	Feature n
Row 1	10	3	3	5	—	10
Row 2	5	4	5	3	—	6
Row 3	10	4	6	4	—	9
Row 4	8	6	2	6	—	3
Row 5	9	2	1	2	—	11

Euclidean distance

Here is the general formula to compute the Euclidean distance between two data points or vectors, $A(f_1, f_2, f_3, f_4, \ldots, f_n)$ and $B(g_1, g_2, g_3, g_4, \ldots, g_n)$ , of a n-dimensional dataset:

$\sqrt{(f_1-g_1)^2+(f_2-g_2)^2+(f_3-g_3)^2+(f_4-g_4)^2+\ldots+(f_n-g_n)^2}$

The formula can be generalized further:

$\sqrt{\sum_{i=1}^{n}{(f_i-g_i)^2}}$

This formula to compute the distance between two points or vectors $A(f_1, f_2, f_3, f_4, \ldots, f_n)$ and $B(g_1, g_2, g_3, g_4, \ldots, g_n)$ is sometimes written with two double bars at left and right. The double bars generally indicate an L²-norm of the vector inside the bars.

Euclidean distance between $A$ and $B$ =

$\left \| A-B \right \|=\sqrt{\sum_{i=1}^{n}{(f_i-g_i)^2}}$

Definitely, Euclidean distance satisfies the metric-properties or the four distance properties that we discussed earlier.

Manhattan distance

A simpler difference in every dimension can be computed without squaring the differences and without using the square root. That is, just sum up the differences between the two vectors in every dimension. The measure that just sums up the differences in each dimension of two points or vectors $A(f_1, f_2, f_3, f_4, \ldots, f_n)$ and $B(g_1, g_2, g_3, g_4, \ldots, g_n)$ is called the Manhattan distance. Manhattan distance between two vectors $A$ and $B$ is sometimes denoted using a lone-bar at the left and right of the difference of the vectors. The right bar has a subscript of 1, indicating it is an L¹-norm.

Manhattan distance between $A$ and $B$ =

$\left | A-B \right |_1=\sum_{i=1}^{n}{\left |f_i-g_i \right |}$

As an example, let us assume that we have a four-dimensional dataset. A pair of vectors in this dataset are: A=(5, 4, 9, 2) and B(4, 9, 2, 7).

Therefore, the Manhattan distance between A and B=

$\left | A-B \right |_1=|5-4|+|4-9|+|9-2|+|2-7|=18$

Manhattan distance satisfies the four distance properties that we discussed earlier.

What happens when there are strings

Things are simpler when data contains only numbers. Things become complicated when you have strings or text in your data cells. It might be difficult to imagine a Euclidean space with string data. Even if we cannot imagine a Euclidean space with strings, if we can come up with a distance measure that satisfies the four distance properties, then we have a Euclidian space.

In this document, we discuss two distance measures — hamming distance and edit distance — for strings.

Hamming distance

Hamming distance between two strings is the number of positions where the strings have different letters. Obviously, to be able to compute the Hamming distance between two strings, the strings must have the same length.

As an example, consider that we are computing the distance between the following two strings.

apricot

abrikop

The two strings are different in positions 2, 5, and 7. That is, they are different in three positions. Therefore,

Hamming distance (apricot, abrikop)=3

Hamming distance is sometimes used to compute the distance between binary bit vectors of the same length. Sometimes, binary streams are repeated during transmission to increase reliability. If the original binary message and the repeated one have a non-zero edit distance, it is confirmed that some bits were corrupted during the transmission. Of course, if both the bits in the same position are changed to the same bit, then the Hamming distance will not be able to capture the corruption.

Hamming distance is a true distance measure. That is, it satisfies the four distance properties.

Edit distance (LCS-based)

The edit distance between two strings refers to the minimum number of edits required to convert one string to the other. What do we mean by the “number of edits?” Edit refers to one of the two following tasks: delete or insert. That is, the edit distance between two strings is the minimum number of delete and insert operations required to convert one string to the other.

Please note that given an infinite number of edits it is always possible to convert one string to another. We are talking about the “minimum” number of edits.

As an example, consider that we have two strings S₁ and S₂.

S₁=xyzmn

S₂=yzmopn

Let us try to convert S₁ to S₂.

We have, S₁=xyzmn

Delete x from S₁. S₁ is now yzmn.

Insert o in the fourth position. S₁ now becomes yzmon.

Insert p in the fifth position. S₁ now becomes yzmopn.

Notice that S₁ has become S₂=yzmopn.

To convert S₁ to S₂, we needed three edits (one delete and two insertions.) We cannot do this conversion with any lesser edits than 3.

Therefore, the edit distance between xyzmn and yzmopn is 3.

Edit distance is a true distance measure. Therefore, it holds all the four axioms required in a distance formula. As a result,

edit distance (S₁, S₂)=edit distance (S₂, S₁)

If you attempt to convert yzmopn to yzmon you will see that you need a minimum of 3 edits. (Please do it as an exercise.)

There is an alternative way to compute the edit distance. The alternative relies on the length of the longest common subsequence (LCS). The formula is:

edit distance(S₁, S₂) = |S₁| + |S₂| – 2 |LCS(S₁, S₂)|

|S₁| or |S₂| indicates the length of the corresponding string. |LCS(S₁, S₂)| is the length of the longest common subsequence between S₁ and S₂.

LCS refers to the longest sequence of letters that are present in both the strings from left to right. Let us take our example strings S₁=xyzmn and S₂=yzmopn. Notice that the yzm is present in both the strings from left to right. However, yzm is not the longest sequence that is present in both the strings from left to right. The longest sequence that is present in both the strings from left to right is yzmn.

Note here that to be a subsequence, the subsequence need not be a contiguous substring. The letters of the subsequence may appear in the same sequence in a given string. For example, although yzmn is a substring in |S₁|, yzmn does not appear as a substring in |S₂|. In |S₂|, yzm appears at the beginning of the string and n appears at the end. Still, yzmn is considered a subsequence of |S₂|. That is, a substring is a subsequence but a subsequence may not be a substring.

Therefore,

edit distance(S₁, S₂) = edit distance(xyzmn, yzmopn)= |xyzmn| + |yzmopn| – 2 |yzmn|=5+6-2*4=3

Notice that the LCS-based formula is giving us the same edit distance as before for the given pair of strings.

The LCS-based formula for edit distance is widely used because there are existing algorithms in the literature to compute LCS. Hence, computing edit distance is not difficult when someone uses an existing algorithm to compute LCS.

15 Comments

KANAFISIMI MIKAILU

August 28, 2023 4:31 am

Hello Sir
The edit distance between S1=banana & S2, cbnaa
=3

Kimia

March 18, 2021 11:23 am

Hello,

S1=banana
S2=cbnaa
The edit distance (banana,cbnaa)= 6+5_(2*4) =11_8=3
we have to,
delete “c”
and
insert “a” & “n”

I hope i got it right

M. Shahriar Hossain

March 21, 2021 4:38 pm

That is correct! Thank you.

Farida

February 15, 2021 1:29 pm

Thank you for such a explanation

M. Shahriar Hossain

February 16, 2021 12:38 pm

I am glad to know that the explanation was useful. Thank you for visiting the page.

houam

December 6, 2020 4:02 pm

Hi,
Euc_D = 7.874008
Manh_D = 16
Edit_D(S1, S2) = 3
Thanks a lot for your excellent explanation.

Shahriar

December 6, 2020 4:57 pm

These are all correct answers. Thank you for summarizing all the answers here.

Lotfi Houam

December 6, 2020 3:55 pm

Hi,
Edit_D(s1, s2) = 3.
Thanks a lot for your excellent explanation.

Shahriar

December 6, 2020 4:55 pm

That is correct! Thank you for visiting and providing your feedback.

Lotfi Houam

December 6, 2020 3:35 pm

Eucd = 7.874008

Manhd = 16

Shahriar

December 6, 2020 4:55 pm

These are correct answers. Thank you!

Mahmoudidirs

April 23, 2020 9:48 am

the Euclidean distance between raw 2 and raw 5 is 7.87
while the Manhattan distance is 16
….
while this is intuitive, I would love to know how this math came out because I don’t think I fully understand.
thank you very much for the knowledge you’re sharing.

Shahriar

April 23, 2020 1:52 pm

Thank you for your comment. I did not explain more because it will increase the complexity of the video. It is great to hear that you are interested to know more about it. In practice, the distance measures discussed on this page are coming from geometry. The following wikipedia page will give you a little more explanation: https://en.wikipedia.org/wiki/Taxicab_geometry.

Thank you. Have a wonderful day.

Marc Edval

March 31, 2020 7:26 pm

Hello,
The euclidean distance between row2 & row5 = 7.874
Manhattan distance between row2 & row5 = 16
Edit distance (S1,S2) = 3

Shahriar

April 2, 2020 1:44 pm

All of the three answers are correct! Thank you for solving them and sending the answers. Best regards.

Distance measures

Euclidean distance

Manhattan distance

What happens when there are strings

Hamming distance

Edit distance (LCS-based)

15 Comments

Leave A Reply Cancel reply

Login with your site account

Register a new account

Modal title