What is a classification problem?
In a classification problem, you are given a set of data points, each with a label. Such data is commonly called labeled data. The task is to build a model from the labeled data that can predict the label of any new data point whose label is unknown.
For example, the following table contains (fictitious) data about coronavirus infection. Each row holds the information for one patient.
(All data tables on this page are fictitious. These data tables are prepared to explain the concept of classification. The data provided on this page must NOT be considered factual and must NOT be used to understand coronavirus symptoms.)
Average temperature | Oxygen processing (%) | Congestion (%) | Cough (per hour) | Factor in blood | Coronavirus |
102 | 70 | 50 | 30 | 5 | Infected |
98 | 85 | 80 | 10 | 3 | Infected |
97 | 99 | 10 | 1 | 1 | Not infected |
100 | 82 | 60 | 15 | 2.5 | Infected |
97 | 85 | 30 | 30 | 2 | Not infected |
99 | 90 | 10 | 20 | 2 | Infected |
The last column of the table above, titled Coronavirus, records a class label for each patient. There are two class labels: Infected and Not infected. A classification dataset always contains a column that provides class labels.
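To make the split between data and labels concrete, here is a minimal Python sketch of the table above. The variable names (features, labels, label_counts) are illustrative choices, not anything defined by this page, and the numbers are the same fictitious values.

```python
from collections import Counter

# One inner list per patient, in the column order of the table:
# temperature, oxygen processing (%), congestion (%), cough (per hour), factor in blood.
features = [
    [102, 70, 50, 30, 5],
    [98, 85, 80, 10, 3],
    [97, 99, 10, 1, 1],
    [100, 82, 60, 15, 2.5],
    [97, 85, 30, 30, 2],
    [99, 90, 10, 20, 2],
]

# The Coronavirus column: one class label per patient, in the same row order.
labels = [
    "Infected",
    "Infected",
    "Not infected",
    "Infected",
    "Not infected",
    "Infected",
]

# Count how many patients carry each class label.
label_counts = Counter(labels)
print(label_counts)
```

Keeping the labels in a separate list, aligned row by row with the features, mirrors how the label column sits beside the data columns in the table.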
A classification model
The first five columns in the table above hold the features, that is, the actual data that can help us build a classification model. A classification model studies the numbers in the table and comes up with a mathematical generalization of which combinations of feature values are associated with which label.
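One simple way to picture such a generalization is a nearest-centroid model, which summarizes each class by the average of its feature values. This is only an illustrative sketch under that assumption, not the method this page prescribes; the function name build_model and the variable names are hypothetical.

```python
# Fictitious training data from the first table (features and class labels).
train_features = [
    [102, 70, 50, 30, 5],
    [98, 85, 80, 10, 3],
    [97, 99, 10, 1, 1],
    [100, 82, 60, 15, 2.5],
    [97, 85, 30, 30, 2],
    [99, 90, 10, 20, 2],
]
train_labels = [
    "Infected", "Infected", "Not infected",
    "Infected", "Not infected", "Infected",
]


def build_model(features, labels):
    """Return {label: centroid}, where a centroid is the per-column mean of that class."""
    rows_by_label = {}
    for row, label in zip(features, labels):
        rows_by_label.setdefault(label, []).append(row)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in rows_by_label.items()
    }


model = build_model(train_features, train_labels)
for label, centroid in model.items():
    print(label, centroid)
```

Here the "model" is just two lists of averages, one per class. Real classification algorithms learn richer generalizations, but the idea of compressing labeled rows into something reusable is the same.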
A classifier
After the model is built, a classifier uses it to predict a label for any new data point whose label is unknown. For example, the table below has information about four new patients, but it has no column with class labels (Infected or Not infected). In a classification task, a classifier uses a classification model, pre-built from labeled data such as the previous table, to predict class labels for data that has none, such as the table below.
Average temperature | Oxygen processing (%) | Congestion (%) | Cough (per hour) | Factor in blood |
101 | 83 | 62 | 17 | 2.6 |
97.8 | 84 | 31 | 32 | 2.1 |
99 | 88 | 81 | 11 | 4 |
97 | 91 | 11 | 21 | 2 |
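As a rough illustration of a classifier at work, the sketch below labels the four new patients above by finding the nearest class average (a nearest-centroid approach). This technique is an assumption chosen for brevity, not the algorithm this page teaches, and build_model and classify are hypothetical names. The features are not rescaled, so columns with large values dominate the distance; a real classifier would usually address that.

```python
import math

# Fictitious labeled data from the first table.
train_features = [
    [102, 70, 50, 30, 5],
    [98, 85, 80, 10, 3],
    [97, 99, 10, 1, 1],
    [100, 82, 60, 15, 2.5],
    [97, 85, 30, 30, 2],
    [99, 90, 10, 20, 2],
]
train_labels = [
    "Infected", "Infected", "Not infected",
    "Infected", "Not infected", "Infected",
]

# Fictitious unlabeled data from the second table.
new_patients = [
    [101, 83, 62, 17, 2.6],
    [97.8, 84, 31, 32, 2.1],
    [99, 88, 81, 11, 4],
    [97, 91, 11, 21, 2],
]


def build_model(features, labels):
    """Return {label: centroid}, where a centroid is the per-column mean of that class."""
    rows_by_label = {}
    for row, label in zip(features, labels):
        rows_by_label.setdefault(label, []).append(row)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in rows_by_label.items()
    }


def classify(model, row):
    """Return the label whose centroid is closest to row (Euclidean distance)."""
    return min(model, key=lambda label: math.dist(model[label], row))


model = build_model(train_features, train_labels)
predictions = [classify(model, row) for row in new_patients]
print(predictions)
```

With this sketch, the first and third patients are predicted as Infected and the second and fourth as Not infected. These predictions come from fictitious data and a deliberately simple method, so a different algorithm could reasonably disagree.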
Training dataset
Training data is the data used to build the classification model. A training dataset contains a column with class labels. For example, the first table on this page can be used as training data to build a classification model.
Test data
Test data, as described on this page, has all the columns of the training data except the column for class labels. That is, test data lacks only the label column (such as the Coronavirus column).
The second table on this page can be used to test the classification model built from the first table. That is, a classifier that uses a classification model built from the first table (training data) can classify each row of the second table (test data) as either Infected or Not infected.
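The relationship between the two tables can also be checked mechanically: the test columns should equal the training columns with the label column removed. The sketch below assumes the column names from the tables on this page; the helper names feature_columns and is_valid_test_schema are hypothetical.

```python
# Column headers of the first table (training data) and second table (test data).
TRAIN_COLUMNS = [
    "Average temperature",
    "Oxygen processing (%)",
    "Congestion (%)",
    "Cough (per hour)",
    "Factor in blood",
    "Coronavirus",
]
TEST_COLUMNS = [
    "Average temperature",
    "Oxygen processing (%)",
    "Congestion (%)",
    "Cough (per hour)",
    "Factor in blood",
]
LABEL_COLUMN = "Coronavirus"


def feature_columns(columns, label_column):
    """Return the columns with the label column removed, keeping the original order."""
    return [name for name in columns if name != label_column]


def is_valid_test_schema(train_columns, test_columns, label_column):
    """True when test data has every training column except the label column."""
    return test_columns == feature_columns(train_columns, label_column)


print(is_valid_test_schema(TRAIN_COLUMNS, TEST_COLUMNS, LABEL_COLUMN))
```

A check like this catches a common mistake, such as test data with a missing or reordered column, before any classifier is applied.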
We will discuss classification algorithms in the upcoming lessons.