
What is a dimension?
You might have heard the word “dimension,” or heard people use the term “high-dimensional data.” Let us discuss what the term “dimension” means.
Here is the tabular data from the previous lesson.
Name  Salary ($)  Age (Years) 
Jane  90000  52 
John  85000  48 
Delilah  75000  32 
Dave  90000  53 
Ellen  82000  44 
We said that the actual data part in the table above is:
90000  52 
85000  48 
75000  32 
90000  53 
82000  44 
In this running example, we have two features or two columns, as explained in the previous lesson. We have five objects or five rows.
We call the data of our running example a two-dimensional dataset because the table has two features, or columns. In general, the number of features equals the number of dimensions of the dataset:
Number of features = number of dimensions
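As a minimal sketch of this rule, assume the data part of the running example is stored as a Python list of rows. The variable names are illustrative and not part of the lesson. The number of dimensions is then simply the number of values in each row:

```python
# Data part of the running example: one row per object,
# one column per feature (salary in $, age in years).
data = [
    [90000, 52],
    [85000, 48],
    [75000, 32],
    [90000, 53],
    [82000, 44],
]

n_objects = len(data)      # number of rows
n_features = len(data[0])  # number of columns
n_dimensions = n_features  # number of features = number of dimensions

print(n_objects, "objects,", n_dimensions, "dimensions")
# prints: 5 objects, 2 dimensions
```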
If we had three features, or three columns, we would call this a three-dimensional dataset. An example is provided below; the table has three features and five objects.
90000  52  10 
85000  48  20 
75000  32  30 
90000  53  40 
82000  44  20 
If we had four features, or four columns, we would call this a four-dimensional dataset. An example is below.
90000  52  10  50 
85000  48  20  60 
75000  32  30  30 
90000  53  40  35 
82000  44  20  40 
The idea should be clear by now. If the dataset has 1 feature, it is called 1-dimensional; with 2 features, it is called 2-dimensional; and so on. With n features, or n columns, the data is called n-dimensional.
Feature 1  Feature 2  Feature 3  Feature 4  ...  Feature n 
90000  52  10  50  ...  43 
85000  48  20  60  ...  2 
75000  32  30  30  ...  73 
90000  53  40  35  ...  36 
82000  44  20  40  ...  90 
Notice one thing here: regardless of the number of features (that is, the number of columns or dimensions), the data table can be stored in a two-dimensional array. Even a one-hundred-dimensional dataset can be kept in a 2D array, or 2D matrix.
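To illustrate, here is a small sketch that assumes a hypothetical dataset of 5 objects and 100 features, filled with placeholder zeros rather than real measurements. The helper `array_rank` is an illustrative name, not a standard function; it counts how many indices are needed to reach a single cell of a non-empty nested list.

```python
def array_rank(obj):
    """Count the levels of list nesting (assumes non-empty lists)."""
    rank = 0
    while isinstance(obj, list):
        rank += 1
        obj = obj[0]  # step one level deeper
    return rank

n_objects, n_features = 5, 100

# Placeholder zeros stand in for real measurements.
table = [[0.0] * n_features for _ in range(n_objects)]

rank = array_rank(table)  # indices needed to address one cell
n_dims = len(table[0])    # dimensions in the data science sense

print(rank, n_dims)
# prints: 2 100
```

Any cell is reached with exactly two indices, for example `table[3][99]`, no matter how many features the dataset has.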
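In programming, the “dimension” of an array is the number of indices needed to locate a cell: a 2D array needs a row index and a column index. In data science, the word “dimension” has a different meaning. It refers to the dimension of a mathematical space, such as a Euclidean space, in which each object is a point and each feature is an axis.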
As an example, the following data table has three columns or three features. There are five objects or five rows.
90000  52  10 
85000  48  20 
75000  32  30 
90000  53  40 
82000  44  20 
In programming, we say that this table can be stored in a 2-dimensional array of size 5 by 3. That is, it has five rows and three columns.
In data science, this table is called a three-dimensional dataset because its rows are points in a mathematical space of three dimensions.
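The two viewpoints can be seen side by side in a short sketch. It assumes the table is stored as a Python list of rows; `shape` and `points` are illustrative names, not terms from the lesson.

```python
# The five-by-three table from the text.
table = [
    [90000, 52, 10],
    [85000, 48, 20],
    [75000, 32, 30],
    [90000, 53, 40],
    [82000, 44, 20],
]

# Programming view: a 2D array described by (rows, columns).
shape = (len(table), len(table[0]))

# Data science view: each row is one point, and the number of
# coordinates per point is the dimension of the space.
points = [tuple(row) for row in table]
space_dimension = len(points[0])

print(shape, space_dimension)
# prints: (5, 3) 3
```

The array always has two axes (rows and columns), while the dimension of the space grows with the number of columns.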
Similarly, a data table with four columns, such as the following one, is referred to as a four-dimensional dataset even though we store it in a two-dimensional array.
90000  52  10  50 
85000  48  20  60 
75000  32  30  30 
90000  53  40  35 
82000  44  20  40 
That is, a higher number of features means a mathematical space with a higher number of dimensions. The physical memory space, on the other hand, is the memory occupied by the corresponding two-dimensional array. Physical memory is a programming concept, and a dataset of any dimensionality can be stored in a 2-dimensional array.