An Introduction to Data
Data Dimensionality and Space
Proximity in Data Science Context
Since data forms a space and we are already familiar with distance and similarity measures, we can find the points that lie near a given point. Given a data point, finding its closest points is called computing its nearest neighbors. Finding the k nearest data points of a given point is known as the knn problem, where knn stands for k-nearest neighbors.
Problem statement for knn
Formally, the knn problem can be written as:
Given a vector q and a data matrix X with rows x_1, x_2, ..., x_n, order the rows of X such that d(q, x_(1)) <= d(q, x_(2)) <= ... <= d(q, x_(n)), where d is a distance function. Return the first k vectors x_(1), ..., x_(k), where k <= n.
For numerical objects, the length of the given vector must equal the number of features/dimensions in the data in order to use a distance or similarity measure.
From the programming perspective, knn returns the indices (row numbers) of the top k nearest neighbors, rather than returning k complete vectors from the data.
The distance function can be any proximity measure that we are already familiar with from the previous lessons. Some applications call for a distance measure; for others, a similarity measure is more appropriate.
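As a minimal sketch of this problem statement, assuming the data is stored as a NumPy array and using Euclidean distance (the function name knn and its signature are illustrative, not a standard API):

```python
import numpy as np

def knn(q, data, k):
    """Return the (0-based) row indices of the k rows of `data`
    that are nearest to the query vector `q` under Euclidean
    distance. Any other distance or similarity measure could be
    substituted, depending on the application."""
    # One Euclidean distance per row of the data matrix.
    dists = np.linalg.norm(data - q, axis=1)
    # Indices of the rows sorted by ascending distance; keep the first k.
    return np.argsort(dists)[:k]
```

Note that the function returns row indices, not the rows themselves, matching the convention described above.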
Example of knn
The table on the right side of the following figure has eight rows (objects) and two columns (features). A point q = (13, 17) is provided, and we are attempting to find the 4 nearest neighbors of q in the data table. After computing the distance between q and each of the points, we find that Row 5 is the nearest point to q, Row 6 is the second nearest, Row 3 is the third nearest, and Row 8 contains the fourth nearest neighbor. A 2D depiction of the points in space and the nearest neighbors is shown in the figure.
The process to compute the nearest neighbors of a given point
To compute the k nearest neighbors of the given point q = (13, 17) in the table above, we first compute the Euclidean distance of q from each row of the data. The row with the smallest distance is the first nearest neighbor; the row with the second smallest distance is the second nearest neighbor, and so on. With k = 4, we select the four rows in ascending order of their computed distance to q.
The calculations are shown below.
For the example above, knn will return an array with content [5, 6, 3, 8], which indicates that Row 5 is the first nearest neighbor, Row 6 is the second nearest neighbor, Row 3 is the third nearest neighbor, and Row 8 is the fourth nearest neighbor.
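This computation can be sketched in Python with NumPy. The exact table values appear only in the figure, not in the text, so the coordinates below are hypothetical values chosen solely to illustrate the process; with them, the same ordering results:

```python
import numpy as np

# Hypothetical 8-row, 2-feature table (the figure's exact values
# are not reproduced in the text; these coordinates only
# illustrate the procedure).
data = np.array([
    [ 2,  3],   # Row 1
    [25, 30],   # Row 2
    [10, 12],   # Row 3
    [ 1, 20],   # Row 4
    [14, 16],   # Row 5
    [11, 18],   # Row 6
    [30,  5],   # Row 7
    [18, 21],   # Row 8
])
q = np.array([13, 17])

# Euclidean distance from q to every row.
dists = np.sqrt(((data - q) ** 2).sum(axis=1))
# Row indices in ascending order of distance (0-based),
# converted to the table's 1-based row numbers; k = 4.
rows = np.argsort(dists)[:4] + 1
print(rows.tolist())   # → [5, 6, 3, 8]
```

Note the conversion from NumPy's 0-based indices to the table's 1-based row numbers before reporting the result.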
We used examples of two-dimensional data on this page because two-dimensional space is easy to visualize. In reality, the data can have any number of dimensions. To use Euclidean distance for knn, the given vector must have the same number of dimensions as the data.
To find the k nearest neighbors in string data, one can compute the edit distance between the given string and each string object in the data.
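As a sketch of this idea, the snippet below implements the classic Levenshtein edit distance and uses it to rank strings by proximity to a query (the helper names edit_distance and knn_strings are our own, and the example strings are hypothetical):

```python
def edit_distance(a, b):
    """Levenshtein edit distance between strings a and b,
    computed by dynamic programming with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def knn_strings(query, strings, k):
    """Indices of the k strings nearest to `query` by edit distance,
    mirroring the numeric knn: sort by distance, keep the first k."""
    order = sorted(range(len(strings)),
                   key=lambda i: edit_distance(query, strings[i]))
    return order[:k]
```

For example, `knn_strings("cat", ["hat", "caterpillar", "cart", "dog"], 2)` returns `[0, 2]`, since "hat" and "cart" are each one edit away from "cat".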