
What are “space” and a “high-dimensional space”?
What do we mean by space in data science?
In data science, we use the word “space” to refer to a mathematical space. For example, if we have a two-dimensional dataset like the following one, a space with two axes is formed.
Name  Salary ($)  Age (Years) 
Jane  90000  52 
John  85000  48 
Delilah  75000  32 
Dave  90000  53 
Ellen  82000  44 
One of the two axes will be Age, and the other will be Salary. Let us put Age on the horizontal axis and Salary on the vertical axis.
Notice that each object of the dataset has become a point in the space. The space is two-dimensional (it has two axes) because the dataset is two-dimensional (it has two features). Hence, the dimension of a dataset refers to the dimension of the space the dataset creates.
Jane and Dave have the same salary; that is why their positions on the vertical axis are the same. Dave is one year older than Jane; that is why the marker for Dave is slightly to the right of the marker for Jane.
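The same idea can be sketched in a few lines of Python. This is only an illustration built from the table above; the names `people`, `dimension`, `same_height`, and `horizontal_offset` are assumptions made for this sketch. Each row is stored as an (age, salary) pair, that is, a point with one coordinate per feature.

```python
# Each row of the two-feature table becomes one point (age, salary).
# The variable names are illustrative assumptions, not part of the original text.
people = {
    "Jane": (52, 90000),
    "John": (48, 85000),
    "Delilah": (32, 75000),
    "Dave": (53, 90000),
    "Ellen": (44, 82000),
}

# Coordinates per point = number of features = dimension of the space.
dimension = len(next(iter(people.values())))

jane_age, jane_salary = people["Jane"]
dave_age, dave_salary = people["Dave"]

# Same salary means the same position on the vertical axis.
same_height = jane_salary == dave_salary
# Dave is one year older, so his marker sits one unit to the right of Jane's.
horizontal_offset = dave_age - jane_age

print(dimension, same_height, horizontal_offset)
```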
With three features (Salary, Age, and Years of service), the data becomes three-dimensional.
Name  Salary ($)  Age (Years)  Years of service 
Jane  90000  52  10 
John  85000  48  20 
Delilah  75000  32  30 
Dave  90000  53  40 
Ellen  82000  44  20 
As a result, our data space becomes three-dimensional. We can plot our points, which are the rows of the data, in a three-dimensional space like the following one.
You might have recalled by now that the space we are talking about is the Euclidean space from geometry, where there is an origin point with a value of zero on every axis. Each axis has two sides: one side contains positive values and the other contains negative values. The most common Euclidean spaces used in geometry are two-dimensional and three-dimensional.
An example of a two-dimensional Euclidean space is as follows. Each point in a two-dimensional Euclidean space has two values (an x value and a y value). These values correspond to the two features of the two-dimensional dataset.
The Euclidean space above is generated from the following data of two features.
1  3 
3  2 
1  1 
3  1 
1  3 
2  3 
3  1 
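As a small sketch of how these rows map to points, the Python snippet below treats the first column as x and the second as y. That reading is an assumption, since the table has no header, and the names `rows` and `distinct_points` are illustrative. One detail worth noticing is that identical rows land on the same point, so the seven rows occupy fewer than seven distinct locations.

```python
# The seven rows of the two-feature data above, read as (x, y) pairs.
# Assumption: first column is x, second column is y (the table has no header).
rows = [(1, 3), (3, 2), (1, 1), (3, 1), (1, 3), (2, 3), (3, 1)]

# Identical rows map to the same point; a set keeps one copy of each location.
distinct_points = set(rows)

for x, y in sorted(distinct_points):
    print(f"x = {x}, y = {y}")

print(len(rows), "rows,", len(distinct_points), "distinct points")
```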
What happens if we have four features? We then have a four-dimensional space. However, we cannot visualize a four-dimensional space because we live in a three-dimensional one. Note that we still have the data and the corresponding four-dimensional space, which we can use for any mathematical operation. We just cannot visualize the space.
An example of a four-dimensional dataset is provided below.
Salary ($)  Age (Years)  Years of service  Another feature 
90000  52  10  1 
85000  48  20  2 
75000  32  30  3 
90000  53  40  2 
82000  44  20  1 
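To illustrate that mathematical operations still work in a space we cannot draw, here is a minimal sketch that computes the Euclidean distance between the first and last rows of the four-feature table. The choice of distance as the example operation and the names `row_1`, `row_5`, `manual`, and `builtin` are assumptions made for this illustration. `math.dist` is part of the Python standard library from version 3.8.

```python
import math

# Two rows of the four-feature table:
# (salary, age, years_of_service, another_feature).
row_1 = (90000, 52, 10, 1)
row_5 = (82000, 44, 20, 1)

# Euclidean distance written out: square root of the summed squared differences.
manual = math.sqrt(sum((a - b) ** 2 for a, b in zip(row_1, row_5)))

# math.dist (Python 3.8+) computes the same quantity for any number of dimensions.
builtin = math.dist(row_1, row_5)

print(manual, builtin)
```

The formula is the same as in two or three dimensions; it simply sums over four coordinate differences instead of two or three.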
The same holds for five features: a five-dimensional dataset gives rise to a five-dimensional space, but we cannot visualize anything with more than three dimensions because our eyes can only process up to three dimensions. An example of a five-dimensional dataset is as follows.
Salary ($)  Age (Years)  Years of service  Another feature  Another another feature 
90000  52  10  1  10 
85000  48  20  2  2 
75000  32  30  3  3 
90000  53  40  2  5 
82000  44  20  1  5 
If we have 100 features, then we have a 100-dimensional space. If we have 1000 features, then we have a 1000-dimensional space.
In general, if we have k features, we have a k-dimensional dataset and a k-dimensional space.
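The rule can be written as a short helper. This is a hedged sketch: the function name `dimensionality` and the variable `five_feature_rows` are hypothetical, and the data is the five-feature table from above.

```python
def dimensionality(rows):
    """Return the number of features (k) shared by every row."""
    k = len(rows[0])
    # Every object must have a value for every feature.
    if any(len(row) != k for row in rows):
        raise ValueError("all rows must have the same number of features")
    return k


five_feature_rows = [
    (90000, 52, 10, 1, 10),
    (85000, 48, 20, 2, 2),
    (75000, 32, 30, 3, 3),
    (90000, 53, 40, 2, 5),
    (82000, 44, 20, 1, 5),
]

k = dimensionality(five_feature_rows)
print(f"{len(five_feature_rows)} objects in a {k}-dimensional space")
```

The number of rows plays no role in the result: five objects or five million, the space has as many dimensions as there are features.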
What is a high-dimensional space?
A dataset with more than three dimensions is generally referred to as high-dimensional data. However, the phrase “high-dimensional” is vague, and what counts as high depends on the kind of data. Text data typically has several thousand to several tens of thousands of dimensions. Data that stores people’s health information typically has a few tens of dimensions to a little over a hundred.
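To see why text data ends up with so many dimensions, consider a sketch that assumes a simple word-count (bag-of-words) representation, in which every distinct word is one feature. The text above does not say which representation is meant, so this is an assumption, and the three documents are made up for illustration.

```python
# Three tiny, made-up documents (illustrative only).
documents = [
    "data science uses data",
    "space has many dimensions",
    "text data has many words",
]

# Each distinct word becomes one feature, i.e., one axis of the space.
vocabulary = sorted({word for doc in documents for word in doc.split()})

# Each document becomes a point: one word count per vocabulary entry.
vectors = [[doc.split().count(word) for word in vocabulary] for doc in documents]

print(len(vocabulary), "dimensions")
for vector in vectors:
    print(vector)
```

Even three short sentences need more axes than we can draw. A real collection of documents contains thousands of distinct words, which is where the dimension counts mentioned above come from under this representation.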
Many of the algorithms we will learn are highly impacted by the number of dimensions or number of features of the data. This is why the number of dimensions is an important factor.
You might ask: isn’t the number of rows a factor as well? Yes, it is. But it is naturally expected that the more objects (rows) you have, the more time an algorithm will take. With more features, the runtime sometimes increases quite unexpectedly. You will hear statements like “this algorithm is good for a large dataset with low-dimensional features” or “that algorithm works better for high-dimensional datasets than this one.” The phrase “big data” refers not only to the number of objects but also to the number of features or dimensions.
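One possible reason for such surprises, offered here as an illustration rather than as a description of any particular algorithm: work that touches each value once grows in proportion to the number of features, but work that must consider every combination of intervals along every axis grows exponentially with it. The function names `single_pass_cost` and `grid_cells` are hypothetical and count abstract operations, not seconds.

```python
def single_pass_cost(n_rows, k):
    # Touch every value once: grows in proportion to rows and to features.
    return n_rows * k


def grid_cells(bins_per_axis, k):
    # Split every axis into the same number of intervals:
    # one cell per combination of intervals, so the count is bins ** k.
    return bins_per_axis ** k


for k in (2, 3, 10):
    print(k, single_pass_cost(1000, k), grid_cells(10, k))
```

Going from 2 to 10 features multiplies the single-pass cost by 5, while the number of grid cells with 10 intervals per axis goes from 100 to 10,000,000,000.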