Evaluation of classification results
Evaluation of a classification model refers to determining the model’s accuracy and effectiveness in correctly categorizing test data into pre-defined classes. It helps to determine if the model is suitable for the task it was designed for and if it needs to be modified or improved.
In classification algorithms, the data used to train the model is referred to as the training data, and the data used to evaluate the model’s performance is called the test data. That is, training data is a set of labeled examples that are used to teach a machine learning algorithm how to classify new data points. The training data is typically a subset of the overall data set and should represent the data the model will encounter in the future.
Test data, on the other hand, is a set of examples used to evaluate the performance of the trained model. It measures the accuracy of the model’s predictions on unseen data. The test data should be independent of the training data, and it should be representative of the data that the model will encounter in the future. Like the training data, the test data is typically a subset of the overall data set: it is systematically held out from the overall data so that the strength of a classification algorithm can be evaluated against labels that are known and therefore available for assessment.
In summary, training data is used to teach the model, while test data is used to evaluate the model’s performance. Both are important components of a classification algorithm and are used to improve the accuracy and generalization ability of the model.
Common classification evaluation metrics are described below.
Accuracy
Accuracy is the proportion of correct predictions of test data made by the model.
Accuracy= (Number of Correct Predictions) / (Total Number of Predictions)
Pros: Easy to understand, widely used.
Cons: Accuracy can be misleading when classes are imbalanced (i.e., one class has many more samples than the other).
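As a quick illustration, the following Python sketch computes accuracy directly from the definition above; the label lists are made-up example data.

```python
# A minimal sketch of the accuracy formula, using made-up example labels.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual labels of the test data
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # labels predicted by the model

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)
print(f"Accuracy: {accuracy:.2f}")  # 8 correct out of 10 -> 0.80
```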
True positive, false positive, true negative, and false negative
True positive, false positive, true negative and false negative are terms used in binary classification, where a model is trained to predict one of two classes of test data points, typically referred to as positive and negative.
A true positive is a prediction made by the model that an instance belongs to the positive class when the instance actually belongs to the positive class. For example, a medical diagnostic test that correctly identifies a patient as having a certain disease is a true positive.
A false positive is a prediction made by the model that an instance belongs to the positive class, but the instance actually belongs to the negative class. For example, a medical diagnostic test incorrectly identifying a healthy patient as having a certain disease is a false positive.
A true negative is a prediction made by the model that an instance belongs to the negative class, and the instance actually does belong to the negative class. For example, a medical diagnostic test that correctly identifies a healthy patient as not having a certain disease is a true negative.
A false negative is a prediction made by the model that an instance belongs to the negative class, but the instance actually belongs to the positive class. For example, a medical diagnostic test that incorrectly identifies a patient as not having a certain disease when the patient actually has the disease is a false negative.
These terms are used to calculate other important metrics in binary classification, such as precision, recall, accuracy, and F1-score, which provide a more comprehensive understanding of a model’s performance.
Confusion Matrix
A confusion matrix is a table commonly used to evaluate the performance of a classification algorithm when it is applied to the test data. The confusion matrix summarizes the predictions made by the algorithm on a dataset by comparing the predicted class labels with the true class labels of the instances in the dataset.
Confusion matrix for binary classification
In a binary classification problem, a confusion matrix typically has four entries: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). True positives represent the number of instances correctly classified as positive by the algorithm, while false positives represent the number of instances incorrectly classified as positive. Similarly, true negatives represent the number of instances correctly classified as negative, while false negatives represent the number of instances incorrectly classified as negative.
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
The False Positive count in the confusion matrix is also called a Type 1 error, whereas the False Negative count is called a Type 2 error.
Let us say that we have a test dataset with ten rows. The following table gives us the actual labels and the labels predicted by a classification algorithm.
| Actual | Predicted |
|---|---|
| Win | Win |
| Lose | Win |
| Lose | Lose |
| Win | Win |
| Win | Lose |
| Lose | Lose |
| Lose | Win |
| Win | Win |
| Lose | Lose |
| Lose | Lose |
We consider one label to be positive and the other label to be negative in binary classification. Let us assume Win is positive and Lose is negative. The confusion matrix for these actual and predicted labels is the following.
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP = how many actual “Win”s were predicted as “Win” = 3 | FN = how many actual “Win”s were predicted as “Lose” = 1 |
| Actual Negative | FP = how many actual “Lose”s were predicted as “Win” = 2 | TN = how many actual “Lose”s were predicted as “Lose” = 4 |
That is, the confusion matrix is:
| | Predicted Win | Predicted Lose |
|---|---|---|
| Actual Win | 3 | 1 |
| Actual Lose | 2 | 4 |
The summation of all the numbers in the confusion matrix equals the number of data points in the test data, in this case, 10 (=3+1+2+4).
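The same counts can be reproduced programmatically. The sketch below counts TP, FN, FP, and TN for the Win/Lose example above in plain Python, treating Win as the positive class.

```python
# Reproducing the Win/Lose confusion matrix above by direct counting.
# "Win" is the positive class and "Lose" is the negative class.
actual    = ["Win", "Lose", "Lose", "Win", "Win", "Lose", "Lose", "Win", "Lose", "Lose"]
predicted = ["Win", "Win",  "Lose", "Win", "Lose", "Lose", "Win",  "Win", "Lose", "Lose"]

tp = sum(a == "Win"  and p == "Win"  for a, p in zip(actual, predicted))
fn = sum(a == "Win"  and p == "Lose" for a, p in zip(actual, predicted))
fp = sum(a == "Lose" and p == "Win"  for a, p in zip(actual, predicted))
tn = sum(a == "Lose" and p == "Lose" for a, p in zip(actual, predicted))

print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")  # TP=3, FN=1, FP=2, TN=4
```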
Confusion matrix for a multi-class classification problem
It is possible to construct a confusion matrix for a multi-class classification problem based on how many samples of an actual class label are classified as what class label. Consider that we have three labels — Sports, Politics, and Fashion — in a dataset of 301 data points. Let us say that we have the following confusion matrix for this dataset after a classification algorithm is applied.
| | Predicted Sports | Predicted Politics | Predicted Fashion |
|---|---|---|---|
| Actual Sports | 70 | 10 | 20 |
| Actual Politics | 20 | 65 | 15 |
| Actual Fashion | 15 | 20 | 66 |
The rows represent the actual labels (Sports, Politics, and Fashion), while the columns represent the predicted labels. The numbers in the matrix cells represent the counts of instances that belong to each label.
For example, 70 instances belonging to the Sports class were correctly predicted as Sports (true positives for Sports), 10 Sports instances were incorrectly predicted as Politics, and 20 Sports instances were incorrectly predicted as Fashion (both false negatives for Sports).
Similarly, 20 instances belonging to the Politics class were incorrectly predicted as Sports (false negatives for Politics), 65 Politics instances were correctly predicted as Politics (true positives for Politics), and 15 Politics instances were incorrectly predicted as Fashion (false negatives for Politics).
Finally, 15 instances belonging to the Fashion class were incorrectly predicted as Sports (false negatives for Fashion), 20 Fashion instances were incorrectly predicted as Politics (false negatives for Fashion), and 66 Fashion instances were correctly predicted as Fashion (true positives for Fashion).
Therefore, it is possible to construct a binary confusion matrix for each label: when one label is considered positive, all the other labels are considered negative.
The binary confusion matrix for Sports will be as follows.
| | Predicted Sports (Positive) | Predicted Non-Sports (Negative) |
|---|---|---|
| Actual Sports (Positive) | 70 | 30 (=10+20) |
| Actual Non-Sports (Negative) | 35 (=20+15) | 166 (=65+15+20+66) |
The binary confusion matrix for Politics will be as follows.
| | Predicted Politics (Positive) | Predicted Non-Politics (Negative) |
|---|---|---|
| Actual Politics (Positive) | 65 | 35 (=20+15) |
| Actual Non-Politics (Negative) | 30 (=10+20) | 171 (=70+20+15+66) |
The binary confusion matrix for Fashion will be as follows.
| | Predicted Fashion (Positive) | Predicted Non-Fashion (Negative) |
|---|---|---|
| Actual Fashion (Positive) | 66 | 35 (=15+20, from the last row of the original confusion matrix) |
| Actual Non-Fashion (Negative) | 35 (=20+15, from the last column of the original confusion matrix) | 165 (=70+10+20+65) |
Note that for each of the three binary matrices — for Sports, Politics, and Fashion — the summation of the cells is 301 because the original confusion matrix had a summation of 301. That is, the data had 301 data points/samples/instances/objects.
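The sketch below shows how these one-vs-rest counts can be derived programmatically from the 3×3 confusion matrix above; it is plain Python and uses only the numbers already shown.

```python
# Deriving one-vs-rest binary confusion matrices from the 3x3 matrix above.
# Rows are actual labels and columns are predicted labels.
labels = ["Sports", "Politics", "Fashion"]
cm = [
    [70, 10, 20],   # actual Sports
    [20, 65, 15],   # actual Politics
    [15, 20, 66],   # actual Fashion
]
total = sum(sum(row) for row in cm)                  # 301 data points

for i, label in enumerate(labels):
    tp = cm[i][i]                                    # this label predicted as itself
    fn = sum(cm[i]) - tp                             # this label predicted as another label
    fp = sum(cm[r][i] for r in range(len(cm))) - tp  # other labels predicted as this label
    tn = total - tp - fn - fp                        # everything else
    print(f"{label}: TP={tp}, FN={fn}, FP={fp}, TN={tn}")

# Sports:   TP=70, FN=30, FP=35, TN=166
# Politics: TP=65, FN=35, FP=30, TN=171
# Fashion:  TP=66, FN=35, FP=35, TN=165
```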
Pros: Provides detailed information about the model’s performance, including the number of true positive, false positive, true negative, and false negative predictions.
Cons: It is difficult to interpret for non-experts.
Precision (Positive Predictive Value)
Precision is the ratio of true positive predictions to the model’s total number of positive predictions. That is, it provides information about the proportion of positive predictions that are actually correct. Precision is the accuracy of the positive predictions.
Precision=(True Positive) / (True Positive + False Positive)
Why do we need precision?
Precision is an important metric because it helps to determine the reliability of the positive predictions made by a classification model. In some cases, a high number of false positive predictions can be more damaging than a high number of false negative predictions. For example, in a medical diagnosis scenario, a false positive prediction (i.e., predicting a disease when the patient is actually healthy) can lead to unnecessary and potentially harmful treatment. In contrast, a false negative prediction (i.e., not predicting a disease when the patient is actually sick) can lead to a delay in treatment and potentially serious consequences.
In what situation should we use precision?
Precision is most useful when we are concerned about the accuracy of positive predictions, and the consequences of false positive predictions are damaging. In cases where the positive class is rare, precision becomes an even more important metric, as a small number of false positive predictions can significantly impact the overall precision of the model. Precision is also commonly used in information retrieval and recommendation systems, where it is important to provide relevant and accurate results.
Pros: It provides information about the proportion of positive predictions that are actually correct.
Cons: Can be affected by the imbalance of the classes.
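As a small worked example, the sketch below applies the precision formula to the Win/Lose confusion matrix from the earlier section (TP = 3, FP = 2).

```python
# Precision for the Win/Lose example above: TP = 3, FP = 2.
tp, fp = 3, 2
precision = tp / (tp + fp)
print(f"Precision: {precision:.2f}")  # 3 / (3 + 2) = 0.60
```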
Recall (Sensitivity, Hit Rate, True Positive Rate)
Recall is the ratio of true positive predictions to the total number of actual positive samples. A high recall value indicates that the model can identify most of the positive instances in the data, even if it also predicts some negative instances as positive (false positives). Recall measures the completeness of the positive predictions, in the sense that it measures how well the model is able to identify all the positive instances in the data.
Recall= (True Positive) / (True Positive + False Negative)
Why is recall required?
Recall is an important metric because it helps to determine the ability of the model to correctly identify all of the positive samples in a dataset. In some cases, a high number of false negative predictions can be more damaging than a high number of false positive predictions. For example, in a fraud detection scenario, a false negative prediction (i.e., not detecting a fraudulent transaction) can lead to a significant financial loss, whereas a false positive prediction (i.e., incorrectly flagging a legitimate transaction as fraudulent) can be easily corrected.
When should we use recall?
Recall is most useful when we are concerned about missing positive predictions, and the consequences of false negative predictions are quite damaging. In cases where the positive class is rare, recall becomes an even more important metric, as a small number of false negative predictions can significantly impact the overall recall of the model. The recall is also commonly used in anomaly detection, where it is important to identify all instances of the positive class, even if it results in a high number of false positive predictions.
Pros: It provides information about the proportion of actual positive samples that were correctly predicted as positive.
Cons: It can be affected by the imbalance of the classes.
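Using the same Win/Lose counts from the confusion matrix example (TP = 3, FN = 1), the recall formula works out as in this sketch.

```python
# Recall for the Win/Lose example above: TP = 3, FN = 1.
tp, fn = 3, 1
recall = tp / (tp + fn)
print(f"Recall: {recall:.2f}")  # 3 / (3 + 1) = 0.75
```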
True Negative Rate (TNR / Specificity / Selectivity)
True Negative Rate, also known as specificity, measures the proportion of actual negative samples correctly classified as negative by a binary classification model. It is calculated as:
True Negative Rate = True Negatives / (True Negatives + False Positives)
In other words, the True Negative Rate measures how well the model is able to identify negative samples as negative. A high True Negative Rate indicates that the model is good at identifying negative samples, while a low True Negative Rate indicates that the model is incorrectly classifying negative samples as positive.
The True Negative Rate is an important evaluation metric in binary classification tasks, especially when the negative class is of particular interest. It can be used in combination with other evaluation metrics such as True Positive Rate, False Positive Rate, and Precision to get a more complete picture of the performance of a classification model.
Cons: TNR can be heavily influenced by imbalanced data, leading to a high TNR for models that perform poorly overall. It does not take into account the true positive rate (TPR) or false positive rate (FPR) of the model, and may not be sufficient to evaluate the model’s overall performance. It does not provide any information about the magnitude of the errors made by the model, and may not be sufficient to evaluate the practical implications of the model’s performance.
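Continuing the Win/Lose example (TN = 4, FP = 2), the True Negative Rate is computed in the sketch below.

```python
# True Negative Rate for the Win/Lose example above: TN = 4, FP = 2.
tn, fp = 4, 2
tnr = tn / (tn + fp)
print(f"True Negative Rate: {tnr:.2f}")  # 4 / (4 + 2) = 0.67
```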
F1 Score
The F1 score is a metric that combines precision and recall into a single score that represents the overall performance of a binary classifier. The F1 score is the harmonic mean of precision and recall, and it balances the two metrics by considering both the completeness of the positive predictions (recall) and the accuracy of the positive predictions (precision).
F1 score= 2 * ((Precision * Recall) / (Precision + Recall))
A high F1 score indicates that the model has a good balance between precision and recall and that it is able to make accurate positive predictions while also identifying most of the positive instances in the data.
The F1 score is particularly useful when the positive class is rare, or when the costs associated with false positive and false negative errors are not equal. In such cases, precision and recall may provide conflicting information, and the F1 score can be used to balance the trade-off between precision and recall.
When should we use F1 score?
In general, the F1 score should be used whenever we want to have a single score that represents the overall performance of a binary classifier, especially when the positive class is rare or when false positive and false negative errors have different costs.
Pros: It provides a balance between precision and recall and is a good measure for evaluating models when the classes are imbalanced.
Cons: It can be affected by the imbalance of the classes.
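The sketch below combines the precision (0.60) and recall (0.75) computed earlier for the Win/Lose example into an F1 score.

```python
# F1 score from the precision and recall of the Win/Lose example above.
precision, recall = 0.60, 0.75
f1 = 2 * (precision * recall) / (precision + recall)
print(f"F1 score: {f1:.3f}")  # 2 * 0.45 / 1.35 ≈ 0.667
```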
ROC Curve (Receiver Operating Characteristic Curve)
The ROC curve represents the relationship between the true positive rate (recall) and the false positive rate (1 – specificity) for a binary classifier.
Pros: It visually represents the trade-off between false positive and false negative rates.
Cons: It is difficult to interpret for non-experts.
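As a rough illustration, the sketch below plots a ROC curve with scikit-learn and matplotlib (both assumed to be installed); the true labels and predicted probabilities are made-up values, not output of any particular model.

```python
# A sketch of a ROC curve using scikit-learn and matplotlib (assumed installed).
# The labels and predicted probabilities below are made-up illustrative values.
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

y_true  = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]                          # actual classes
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9, 0.65, 0.3]   # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # false/true positive rates at each threshold
auc = roc_auc_score(y_true, y_score)                # area under the ROC curve

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False Positive Rate (1 - specificity)")
plt.ylabel("True Positive Rate (recall)")
plt.legend()
plt.show()
```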
Cross-Validation
Cross-validation is a technique used to evaluate the performance of a model by dividing the data into several folds and using different folds for training and testing. Because different data are used for training and testing in each round, it helps to detect overfitting (i.e., when the model performs well on the training data but poorly on unseen data) and gives a more reliable estimate of how the model will generalize.
Cross-validation works by dividing the original dataset into several smaller parts, called “folds.” The model is trained on a portion of the data, called the training set, and evaluated on another portion, called the validation set. This process is repeated multiple times, with each fold being used as the validation set in turn, and the average performance of the model is calculated based on all of the validation sets.
There are several types of cross-validation, including k-fold cross-validation and leave-one-out cross-validation.
k-fold cross-validation:
Suppose we have a dataset with 1000 samples, and we want to perform k-fold cross-validation with k = 10. This means we will randomly divide the 1000 samples into 10 equal-sized folds, each containing 100 samples. The model will be trained on 9 folds (900 samples) and evaluated on the remaining fold (100 samples), 10 times, with each fold being used once as the validation set. The average performance of the model on the validation sets is calculated and used as an estimate of the model’s performance.
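A sketch of this procedure with scikit-learn (assumed installed) is shown below, using a synthetic 1000-sample dataset and logistic regression as a stand-in model.

```python
# A sketch of 10-fold cross-validation with scikit-learn (assumed installed).
# The dataset is synthetic and logistic regression is only a stand-in model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

model = LogisticRegression(max_iter=1000)
kfold = KFold(n_splits=10, shuffle=True, random_state=42)   # 10 folds of 100 samples each

scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
print("Accuracy per fold:", scores)
print(f"Mean accuracy: {scores.mean():.3f}")
```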
Leave-one-out cross-validation:
Suppose we have a dataset with 100 samples, and we want to perform leave-one-out cross-validation. This means that the model will be trained on 99 samples and evaluated on the remaining sample. This process is repeated 100 times, with each sample being used once as the validation set. The average performance of the model on the validation sets is calculated and used as an estimate of the model’s performance.
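The same idea can be sketched with scikit-learn’s leave-one-out splitter (again assuming scikit-learn is installed and using a synthetic dataset).

```python
# A sketch of leave-one-out cross-validation with scikit-learn (assumed installed),
# again on a synthetic dataset; with 100 samples, 100 models are trained.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="accuracy")

print("Number of validation folds:", len(scores))   # 100, one per sample
print(f"Estimated accuracy: {scores.mean():.3f}")
```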
In both types, cross-validation provides a more reliable estimate of a model’s performance than a single training and validation set, as it considers the model’s performance on various subsets of the data.