Introductory Data Science

Nearest neighbors

Since data forms space and we are already familiar with distance and similarity measures, we can find out points nearby a point. Given a data point, finding several closest points is called the computation of the nearest neighbors. Finding k nearest data points of a given one is know as the knn problem. knn stands for k-nearest neighbors.

Problem statement for knn

Formally, the knn problem can be written as:

Given a vector x and a data matrix D, order all n vectors of D such that D=\{x_1, x_2, x_3, \ldots, x_n\} and \text{distance}(x, x_i)\leq\text{distance}(x, x_{i+1}). Return the first k vectors S=\{x_1, x_2, x_3, \ldots, x_k\} where k\leq n

For numerical objects, the length of x must be equal to the number of features/dimensions in D to be able to use a distance or similarity measure.

From the programming perspective, knn returns the indices (row serial number) of the top k-nearest neighbors, instead of returning k complete vectors from the data.

The distance function can be any proximity measuring function that we are already familiar with from the previous lessons. You might need a distance measure for certain applications; for other applications, a similarity measure might be alright.

Example of knn

The table on the right side of the following figure has eight rows (objects) and two columns (features.) A point x=(13, 17) is provided. We are attempting to find 4 nearest neighbors of x in the data table. After computing distances between x and each of the points, we found that Row 5 is the nearest point to x. The second nearest point is Row 6 of the table. The third nearest point is Row 3. Row 8 contains the fourth nearest neighbor of  x. A 2D depiction of the points in space and the nearest neighbors are shown in the figure.

k nearest neighbor

The process to compute the nearest neighbors of a given point x

To compute the k nearest neighbors of the given point x=(13, 17) in the table above, we first computed the Euclidean distance of x with each row of the data. The data point (or row) that has the smallest distance is the first nearest neighbor; the data point with the second smallest distance is the second nearest neighbor. When k=4, we select the rows for the first, second, third, and fourth in ascending order of the computed distance with x

The calculations are shown below.

Computing knn

For the example above, knn will return an array with content [5, 6, 3, 8], which indicates that Row 5 is the first nearest neighbor, Row 6 is the second nearest neighbor, Row 3 is the third nearest neighbor, and Row 8 is the fourth nearest neighbor.

Remarks

We used examples of two-dimensional data on this page because it is easy to visualize two-dimensional space. In reality, the data can have any number of dimensions. To use Euclidian distance for knn, the given vector x must have the same number of dimensions.

To find knn from string data, given a string one can compute the edit distance between the given string and each string object in the data.

0 replies

Leave a Reply

Want to join the discussion?
Feel free to contribute!

Leave a Reply

Your email address will not be published. Required fields are marked *