An Introduction to Data
Data science is a field of study that focuses on techniques and algorithms for extracting knowledge from data. The area combines data mining and machine learning with domain-specific knowledge about the data. This section focuses on defining "data" before moving on to more complicated topics.
Data Dimensionality and Space
This section focuses on defining the common terminology widely used in data science. The video lectures in this section cover terms such as objects, data points, features, dimensions, vectors, high-dimensional data, and mathematical space.
Proximity in Data Science Context
Many data mining and machine learning algorithms rely on the distance or similarity between objects/data points. The video lectures in this section cover the standard proximity measures used in data science. The section also explains how to use proximity measures to examine the neighborhood of a given point.
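As a quick illustration (a sketch of our own, not code from the lectures), the snippet below computes two standard proximity measures, Euclidean distance and cosine similarity, between two made-up vectors:

import numpy as np

# Two made-up data points with three features each
a = np.array([2.0, 4.0, 1.0])
b = np.array([3.0, 1.0, 5.0])

# Euclidean distance: smaller means closer
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Cosine similarity: larger means more similar (range -1 to 1)
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("Euclidean distance:", euclidean)
print("Cosine similarity:", cosine)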
A large portion of data science focuses on exploratory analysis. Scientists and practitioners use statistical techniques to understand the data. One way to explore the data is to check whether there are clusters of data points. A cluster is a group of data points that have similar features. This section explains clustering algorithms.
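As a brief, hedged illustration of this idea, the sketch below uses scikit-learn's KMeans, one of many clustering algorithms, on made-up two-dimensional points; nearby points end up with the same cluster label:

import numpy as np
from sklearn.cluster import KMeans

# Made-up 2D points: two visually separated groups
points = np.array([[1, 2], [1, 3], [2, 2],
                   [8, 9], [9, 8], [8, 8]])

# Ask KMeans to find two clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Each point gets a cluster label; points in the same group share a label
print(kmeans.labels_)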
k-nearest neighbors: Python code
Since data forms a space, and we are already familiar with distance and similarity measures, we can find the points near a given point. Finding points near a point is called computing the nearest neighbors. Given a data point, finding the k closest points is called computing the k-nearest neighbors, also known as knn. This article contains Python code to compute knn from scratch. Additionally, it provides an example of computing knn using the machine learning package scikit-learn in Python.
Problem statement for knn
Formally, the knn problem can be written as:
Given a vector q and a data matrix D with n rows (vectors), order the vectors d1, d2, ..., dn of D such that dist(q, d1) ≤ dist(q, d2) ≤ ... ≤ dist(q, dn). Return the first k vectors, where k ≤ n.
For numerical objects, the length of q must be equal to the number of features/dimensions in D to be able to use a distance or similarity measure.
To make a program efficient, knn returns the indices (row numbers) of the top k nearest neighbors instead of returning k complete vectors from the data.
The distance function can be any proximity function that we are already familiar with from the previous lessons. Certain applications call for a distance measure; for others, a similarity measure works just as well.
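For instance, here is a minimal sketch of our own (the helper knn_by_cosine and the data values are made up, not part of the lesson) showing that with a similarity measure such as cosine similarity, the sort order flips: the neighbors are the points with the largest values rather than the smallest.

import numpy as np

def knn_by_cosine(vec, data, k):
    # Cosine similarity of vec with every row of data
    sims = data @ vec / (np.linalg.norm(data, axis=1) * np.linalg.norm(vec))
    # Most similar first: sort in descending order of similarity
    return np.argsort(-sims)[:k].tolist()

# Made-up data for demonstration
data = np.array([[10.0, 3.0, 3.0], [5.0, 4.0, 5.0], [10.0, 4.0, 6.0]])
print(knn_by_cosine(np.array([10.0, 3.0, 3.0]), data, 2))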
Example of knn
The table on the left side of the following figure has eight rows (objects) and two columns (features). A point q = (13, 17) is provided. We are attempting to find the 4 nearest neighbors of q in the data table. After computing the distance between q and each of the points, we find that Row 5 is the nearest point to q. The second nearest point is Row 6 of the table. The third nearest point is Row 3. Row 8 contains the fourth nearest neighbor of q. A 2D depiction of the points in space and the nearest neighbors is shown in the figure.
The process to compute the nearest neighbors of a given point
To compute the k nearest neighbors of the given point q = (13, 17) in the table above, we first compute the Euclidean distance of q from each row of the data. The data point (or row) with the smallest distance is the first nearest neighbor; the data point with the second smallest distance is the second nearest neighbor. With k=4, we select the four rows with the smallest computed distances to q, in ascending order.
The calculations apply the Euclidean distance formula to each row: for a row r = (r1, r2), dist(q, r) = sqrt((13 - r1)^2 + (17 - r2)^2).
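For instance, with a hypothetical row r = (12, 15) (the actual row values appear in the figure's table), the computation works out as follows:

import numpy as np

q = np.array([13, 17])
r = np.array([12, 15])  # hypothetical row, for illustration only

# sqrt((13-12)^2 + (17-15)^2) = sqrt(1 + 4) = sqrt(5), approximately 2.236
print(np.sqrt(np.sum((q - r) ** 2)))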
For the example above, knn will return an array with content [5, 6, 3, 8], which indicates that Row 5 is the first nearest neighbor, Row 6 is the second nearest neighbor, Row 3 is the third nearest neighbor, and Row 8 is the fourth nearest neighbor.
We used two-dimensional data in the examples above because two-dimensional space is easy to visualize. In reality, the data can have any number of dimensions. To use Euclidean distance to compute k-nearest neighbors, the given vector must have the same number of dimensions as the data.
In the Python code examples below, we use higher-dimensional data.
To find knn from string data, given a string, one can compute the edit distance between the given string and each string object in the data, as sketched below.
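Here is a minimal sketch of our own (the functions edit_distance and knn_strings and the word list are illustrative, not from the lesson) using the classic dynamic-programming edit distance:

def edit_distance(s, t):
    # Classic Levenshtein distance via dynamic programming
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def knn_strings(query, strings, k):
    # Sort string IDs by edit distance to the query, smallest first
    order = sorted(range(len(strings)), key=lambda i: edit_distance(query, strings[i]))
    return order[:k]

words = ["kitten", "sitting", "mitten", "knitting", "smitten"]
print(knn_strings("kitten", words, 3))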
Python coding to compute k-nearest neighbors
Given a vector, we will find the row numbers (IDs) of the k closest data points. We will compute the k-nearest neighbors (knn) using Python from scratch.
We will create the dataset in the code and then find the nearest neighbors of a given vector. In the example, our given vector is Row 0.
knn from scratch using Python
Here is the code. The main function is knearest. It uses two other functions. One is euclid, which computes the Euclidean distance between two data points or vectors. The other is sortkey, which is the comparator for sorting indices based on distance.
# knn Python code
import numpy as np

def euclid(vec1, vec2):
    # Euclidean distance between two data points (vectors)
    euclidean_dist = np.sqrt(np.sum((vec1 - vec2) ** 2))
    return euclidean_dist

def sortkey(item):
    # Comparator for sorting [row, distance] pairs by distance
    return item[1]

def knearest(vec, data, k):
    result = []
    for row in range(0, len(data)):
        distance = euclid(vec, data[row])
        result.append([row, distance])
    sortedResult = sorted(result, key=sortkey)
    indices = []
    if k < len(data):
        for r in range(0, k):
            indices.append(sortedResult[r][0])
    else:
        indices = [i[0] for i in sortedResult]
    return indices

# 7 data points, each with 5 features
data = np.array([[10, 3, 3, 5, 10],
                 [5, 4, 5, 3, 6],
                 [10, 4, 6, 4, 9],
                 [8, 6, 2, 6, 3],
                 [10, 3, 3, 5, 8],
                 [9, 2, 1, 2, 11],
                 [9, 3, 1, 2, 11]])

referenceVec = data[0]  # We will find knn of Row 0

# Find 4 nearest neighbors of the reference vector
k = 4
knn = knearest(referenceVec, data, k)
print("Row IDs of", k, "nearest neighbors:")
print(knn)
Once the Python script is executed, the output will be the following.
Row IDs of 4 nearest neighbors:
[0, 4, 2, 6]
Since we are finding the nearest neighbors of Row 0, the first nearest neighbor is Row 0 itself. Row 4, Row 2, and Row 6 are the second, third, and fourth nearest neighbors, respectively.
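If the reference vector is itself a row of the data and you do not want it reported as its own neighbor, one simple adjustment (ours, not part of the original script) is to request k+1 neighbors and drop the first one. Continuing the script above:

# Find 4 nearest neighbors of Row 0, excluding Row 0 itself
neighbors = knearest(data[0], data, k + 1)[1:]
print(neighbors)  # prints [4, 2, 6, 5]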
knn using Python scikit-learn
scikit-learn for Python already provides functionality for computing k-nearest neighbors more efficiently, using special data structures such as the ball tree. Here is an example of how we may use the NearestNeighbors class to find the nearest neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# 7 data points, each with 5 features
data = np.array([[10, 3, 3, 5, 10],
                 [5, 4, 5, 3, 6],
                 [10, 4, 6, 4, 9],
                 [8, 6, 2, 6, 3],
                 [10, 3, 3, 5, 8],
                 [9, 2, 1, 2, 11],
                 [9, 3, 1, 2, 11]])

# Find 4 nearest neighbors of the reference vector
k = 4

# Reference vector: Row 0
ReferenceVec = data[0]

## Using sklearn to find knn
nbrs = NearestNeighbors(n_neighbors=k, algorithm='ball_tree').fit(data)
distances, indices = nbrs.kneighbors([ReferenceVec])

# "distances" contains the distances to the k nearest points
# "indices" contains the row IDs of the k nearest points
print("Row IDs of", k, "nearest neighbors:")
print(indices)
print("Distances of these", k, "nearest neighbors:")
print(distances)
The code shows how we can find the nearest neighbors of Row 0 (ReferenceVec). The indices variable provides the indices of the nearest neighbors of Row 0. The distances variable provides the distance between Row 0 and each of the rows in indices. The output of the program is provided below.
Row IDs of 4 nearest neighbors:
[[0 4 2 6]]
Distances of these 4 nearest neighbors:
[[0. 2. 3.46410162 3.87298335]]
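The fitted model is not limited to rows of the data. Continuing the script above, we can query the neighbors of a new vector as well (the query values here are made up for illustration):

# A new 5-feature query vector
query = np.array([[9, 3, 2, 3, 10]])
distances, indices = nbrs.kneighbors(query)
print(indices)    # row IDs of the k nearest rows to the query
print(distances)  # the corresponding Euclidean distances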