
An Introduction to Data

Data Dimensionality and Space

Proximity in Data Science Context

Clustering algorithms
Nearest neighbors
Since data forms a space, and we are already familiar with distance and similarity measures, we can find the points near a given point. Given a data point, finding the several closest points in the data is called computing its nearest neighbors. Finding the k nearest data points of a given point is known as the knn problem, where knn stands for k-nearest neighbors.
Problem statement for knn
Formally, the knn problem can be written as:
Given a vector q and a data matrix D with n rows, order the rows d1, d2, …, dn of D such that dist(q, d1) ≤ dist(q, d2) ≤ … ≤ dist(q, dn). Return the first k vectors of this ordering, where k ≤ n.
For numerical objects, the length of q must be equal to the number of features/dimensions in D to be able to use a distance or similarity measure.
From the programming perspective, knn returns the indices (row serial numbers) of the top k nearest neighbors, instead of returning k complete vectors from the data.
The distance function can be any proximity measure we are already familiar with from the previous lessons. Some applications call for a distance measure; for others, a similarity measure works better.
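The idea above can be sketched as a short function. This is a minimal illustration, not a library implementation: the names `knn` and `euclidean` are our own, and the distance function is a parameter so it can be swapped for any proximity measure.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two equal-length numeric vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(query, data, k, dist=euclidean):
    # Return the row indices of the k rows of `data` closest to `query`,
    # ordered from nearest to farthest.
    order = sorted(range(len(data)), key=lambda i: dist(query, data[i]))
    return order[:k]
```

Note that, as described above, the function returns row indices rather than the rows themselves.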
Example of knn
The table on the right side of the following figure has eight rows (objects) and two columns (features). A point q = (13, 17) is provided. We are attempting to find the 4 nearest neighbors of q in the data table. After computing the distance between q and each of the points, we found that Row 5 is the nearest point to q. The second nearest point is Row 6 of the table. The third nearest point is Row 3. Row 8 contains the fourth nearest neighbor of q. A 2D depiction of the points in space and the nearest neighbors is shown in the figure.
The process to compute the nearest neighbors of a given point
To compute the k nearest neighbors of the given point q = (13, 17) in the table above, we first computed the Euclidean distance of q with each row of the data. The data point (or row) that has the smallest distance is the first nearest neighbor; the data point with the second smallest distance is the second nearest neighbor. When k = 4, we select the rows with the first, second, third, and fourth smallest computed distances to q, in ascending order.
The calculations are shown below.
For the example above, knn will return an array with content [5, 6, 3, 8], which indicates that Row 5 is the first nearest neighbor, Row 6 is the second nearest neighbor, Row 3 is the third nearest neighbor, and Row 8 is the fourth nearest neighbor.
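The process can be reproduced in a few lines of code. Since the figure's table is not reproduced in the text, the coordinates below are hypothetical stand-in values, chosen only so that the resulting ordering matches the one described above.

```python
import math

# Hypothetical stand-in for the figure's table (row numbers start at 1);
# the original values from the figure are not reproduced here.
rows = {1: (2, 3), 2: (25, 30), 3: (16, 17), 4: (5, 25),
        5: (12, 16), 6: (14, 19), 7: (20, 5), 8: (13, 21)}
q = (13, 17)

# Euclidean distance from q to every row, then sort row numbers by distance.
dist = {i: math.dist(q, p) for i, p in rows.items()}
nearest4 = sorted(rows, key=dist.get)[:4]
print(nearest4)  # [5, 6, 3, 8]
```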
Remarks
We used examples of two-dimensional data on this page because two-dimensional space is easy to visualize. In reality, the data can have any number of dimensions. To use Euclidean distance for knn, the given vector must have the same number of dimensions as the data.
To find knn from string data, given a string, one can compute the edit distance between the given string and each string object in the data.
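As a sketch of this idea, the block below pairs a standard dynamic-programming edit (Levenshtein) distance with the same sort-by-distance scheme used for numeric data. The function names are our own illustrations, not from a particular library.

```python
def edit_distance(s, t):
    # Levenshtein distance between strings s and t, computed row by row.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # substitution
        prev = cur
    return prev[-1]

def knn_strings(query, strings, k):
    # Indices of the k strings closest to `query` by edit distance.
    order = sorted(range(len(strings)),
                   key=lambda i: edit_distance(query, strings[i]))
    return order[:k]
```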