What is a classification problem?

In a classification problem, you are given a set of data points, each with a label. Such data is commonly called labeled data. The task is to build a model from the labeled data so that the model can predict the label of any new data point whose label is unknown.

For example, the following table contains (fictitious) data regarding coronavirus infection. Each row holds the information for one patient.

(All data tables on this page are fictitious. These data tables are prepared to explain the concept of classification. The data provided on this page must NOT be considered factual and must NOT be used to understand coronavirus symptoms.)

Average temperature	Oxygen processing (%)	Congestion (%)	Cough (per hour)	Factor in blood	Coronavirus
102	70	50	30	5	Infected
98	85	80	10	3	Infected
97	99	10	1	1	Not infected
100	82	60	15	2.5	Infected
97	85	30	30	2	Not infected
99	90	10	20	2	Infected

The last column of the table above, titled Coronavirus, records a class label for each patient: Infected or Not infected. A labeled classification dataset always contains such a column of class labels.
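As a minimal sketch, the labeled table above could be held in Python as a list of rows and then split into feature values and class labels. The names `FEATURE_NAMES`, `labeled_rows`, `features`, and `labels` are illustrative assumptions, not part of the lesson.

```python
# Fictitious training table from this page: five feature values plus a class label.
# The short column names below are assumed abbreviations of the table headers.
FEATURE_NAMES = ["temperature", "oxygen", "congestion", "cough", "blood_factor"]

labeled_rows = [
    (102, 70, 50, 30, 5, "Infected"),
    (98, 85, 80, 10, 3, "Infected"),
    (97, 99, 10, 1, 1, "Not infected"),
    (100, 82, 60, 15, 2.5, "Infected"),
    (97, 85, 30, 30, 2, "Not infected"),
    (99, 90, 10, 20, 2, "Infected"),
]

# Separate the feature values (first five columns) from the class labels (last column).
features = [list(row[:-1]) for row in labeled_rows]
labels = [row[-1] for row in labeled_rows]

# The distinct class labels present in the dataset.
class_labels = sorted(set(labels))
print(class_labels)
```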

A classification model

The first five columns in the table above hold the features, the actual data from which a classification model is built. A classification model is a mathematical generalization, learned from the numbers in the table, of which combinations of feature values are associated with which label.
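To make "mathematical generalization" concrete, here is a sketch of one very simple kind of model, a nearest-centroid model, which summarizes each class by the average of its feature values. This is only an illustrative choice, not necessarily the algorithm the upcoming lessons use, and `fit_centroids` is a hypothetical helper name.

```python
from collections import defaultdict


def fit_centroids(features, labels):
    """Return {label: mean feature vector} computed from labeled data."""
    grouped = defaultdict(list)
    for row, label in zip(features, labels):
        grouped[label].append(row)
    # zip(*rows) walks the rows column by column, so each mean is per feature.
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


# Fictitious training data from the first table on this page.
training_features = [
    [102, 70, 50, 30, 5],
    [98, 85, 80, 10, 3],
    [97, 99, 10, 1, 1],
    [100, 82, 60, 15, 2.5],
    [97, 85, 30, 30, 2],
    [99, 90, 10, 20, 2],
]
training_labels = [
    "Infected", "Infected", "Not infected",
    "Infected", "Not infected", "Infected",
]

model = fit_centroids(training_features, training_labels)
for label, centroid in model.items():
    print(label, centroid)
```

Here the "model" is just two vectors of averages, one per class label. Real classification models are usually more elaborate, but they play the same role: a compact summary learned from the labeled rows.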

A classifier

After the model is built, a classifier uses the classification model to predict a label for any new data point whose label is unknown. For example, the table below has information about four new patients but no column of class labels (Infected or Not infected). In a classification task, a classifier uses a classification model, pre-built from labeled data such as the previous table, to predict class labels for unlabeled data such as the table below.

Average temperature	Oxygen processing (%)	Congestion (%)	Cough (per hour)	Factor in blood
101	83	62	17	2.6
97.8	84	31	32	2.1
99	88	81	11	4
97	91	11	21	2
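The following sketch shows what a classifier might do with the four unlabeled rows above. It assumes a pre-built nearest-centroid model whose two vectors are the per-class feature averages of the first table. The model choice, the `classify` helper, and the use of unscaled Euclidean distance are all assumptions made for illustration.

```python
import math

# Assumed pre-built model: per-class feature means of the first (labeled) table.
model = {
    "Infected": [99.75, 81.75, 50.0, 18.75, 3.125],
    "Not infected": [97.0, 92.0, 20.0, 15.5, 1.5],
}


def classify(model, row):
    """Return the label whose centroid is closest to row (Euclidean distance)."""
    return min(model, key=lambda label: math.dist(model[label], row))


# The four new patients from the second table; no class labels are available.
new_patients = [
    [101, 83, 62, 17, 2.6],
    [97.8, 84, 31, 32, 2.1],
    [99, 88, 81, 11, 4],
    [97, 91, 11, 21, 2],
]

predictions = [classify(model, row) for row in new_patients]
for row, label in zip(new_patients, predictions):
    print(row, "->", label)
```

Because the features are not rescaled in this sketch, columns with large numeric ranges, such as congestion and cough, influence the distance far more than the blood factor does. A different model or a different scaling could assign different labels to the same rows.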

Training dataset

Training data is the data used to build the classification model. A training dataset contains a column with class labels. For example, the first table on this page can be used as the training data to build a classification model.

Test data

Test data contains the same feature columns as the training data but does not have the column of class labels. That is, test data lacks only the label column (such as the Coronavirus column).

The second table on this page can be used to test the classification model built using the first table. That is, a classifier that uses a classification model built from the first table (training data) can classify each row of the second table (test data) as either Infected or Not infected.

We will discuss classification algorithms in the upcoming lessons.
