[Data Science] Confusion matrix & AUROC
- Minwu Kim
- February 26, 2024
- 3 min read
Last modified: February 29, 2024
NOTION LINK
There are numerous metrics beyond accuracy for assessing the performance of a prediction model. This is especially important when it comes to imbalanced datasets like activism campaigns.
For example, say there is a fraudulent transaction detection model, and only 1 among 10,000 cases is fraud. If we simply make a model that predicts every transaction as “not fraud”, it will boast an accuracy as high as 99.99%. However, this is obviously a useless model.
In order to resolve this issue, we need to introduce some other metrics to better assess the performance.
1. Confusion Matrix
Everything starts from TP, TN, FP, and FN. We need to digest these concepts before delving into the metrics.
True positive: predicted as positive, and actually positive
True negative: predicted as negative, and actually negative
False positive: predicted as positive, but actually negative
False negative: predicted as negative, but actually positive
Notes:
T and F indicate whether the prediction is correct or wrong.
P and N indicate whether the prediction is positive or negative.
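As a quick sanity check, here is a minimal Python sketch that counts the four cells from toy labels and predictions (y_true and y_pred are made-up values, not real data):
```python
y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (1 = positive)
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # model predictions

# Count each cell of the confusion matrix
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print(tp, tn, fp, fn)  # 3 3 1 1
```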
2. Different performance measures
2.1. Accuracy:
Accuracy: Among all the data, how many did the model predict correctly?
Note:
Accuracy might not be a good indicator of performance when it comes to heavily imbalanced data.
For example, say only 1 out of 10,000 phone calls is phishing. A model that predicts everything as FALSE is useless, yet it boasts high accuracy.
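A minimal sketch with hypothetical counts chosen to mimic such an imbalanced dataset (the tp, fn, fp, tn values are assumptions, not real results):
```python
# Hypothetical counts: 100 actual positives, 9900 actual negatives
tp, fn, fp, tn = 80, 20, 30, 9870

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.995, even though 20 of the 100 positives were missed
```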
2.2. Precision:
Precision: Among all the data predicted as positive, how many are actually positive?
Note:
If the threshold is higher, precision tends to be higher.
This is because a higher threshold means the prediction is more conservative, making it less likely to produce false positives.
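Using the same hypothetical counts as in the accuracy sketch:
```python
tp, fn, fp, tn = 80, 20, 30, 9870   # hypothetical counts

precision = tp / (tp + fp)   # among predicted positives, the share that are truly positive
print(round(precision, 3))   # 0.727
```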
2.3. Recall (sensitivity):
Recall: Among all the positive data, how many did the model predict correctly?
Note:
If the threshold is lower, recall tends to be higher.
This is because a lower threshold means the prediction is less conservative (or more sensitive), making it less likely to produce false negatives.
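Again with the same hypothetical counts:
```python
tp, fn, fp, tn = 80, 20, 30, 9870   # hypothetical counts

recall = tp / (tp + fn)   # among actual positives, the share the model caught
print(recall)             # 0.8
```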
2.4. F1-score:
F1-score: the harmonic mean of precision and recall, indicating the balance between the two.
For example, say the threshold of a classification model is set very high. This will yield high precision but low recall. The F1 score naturally penalizes the imbalance between the two.
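A minimal sketch, still using the hypothetical counts from above:
```python
tp, fn, fp, tn = 80, 20, 30, 9870   # hypothetical counts

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
print(round(f1, 3))  # 0.762
```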
2.5. Specificity (recall for negative cases):
Specificity: Among all the actual negative data, how many did the model predict correctly?
A conservative (or less sensitive) threshold tends to yield higher specificity.
When the threshold is higher, there are fewer false positives, which leads to higher specificity.
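With the same hypothetical counts:
```python
tp, fn, fp, tn = 80, 20, 30, 9870   # hypothetical counts

specificity = tn / (tn + fp)   # among actual negatives, the share correctly rejected
print(round(specificity, 3))   # 0.997
```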
2.6. False positive rate (FPR):
FPR: Among all the actual negative data, how many did the model incorrectly predict as positive? (FPR = FP / (FP + TN) = 1 − specificity)
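And the corresponding sketch, still with the hypothetical counts:
```python
tp, fn, fp, tn = 80, 20, 30, 9870   # hypothetical counts

fpr = fp / (fp + tn)   # among actual negatives, the share falsely flagged as positive
print(round(fpr, 3))   # 0.003, which equals 1 - specificity
```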
3. AUROC
ROC: Receiver Operating Characteristic curve
AUC: Area Under the Curve
3.1. Explanation:
X-axis: FPR
Y-axis: Recall (or TPR or sensitivity)
In other words:
The ROC curve is the collection of (FPR, TPR) points over all possible thresholds from 0 to 1.
When the threshold goes up:
Recall: true positives go down → recall goes down
FPR: false positives go down → FPR goes down
When the threshold falls, more data are predicted as positive, so both recall and FPR rise.
To plot the ROC curve, we plot a point for every possible threshold.
Moving from the bottom-left corner to the top-right corner, the threshold decreases.
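To make the sweep concrete, here is a minimal sketch that computes (TPR, FPR) pairs at a few thresholds over made-up scores (y_true and scores are toy values, not output from a real model):
```python
y_true = [0, 0, 1, 1, 0, 1, 0, 1]                    # toy labels
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]   # toy predicted scores

def tpr_fpr(threshold):
    """Return (TPR, FPR) when scores >= threshold are called positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, y_true))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, y_true))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, y_true))
    tn = sum(p == 0 and t == 0 for p, t in zip(preds, y_true))
    return tp / (tp + fn), fp / (fp + tn)

for thr in [0.0, 0.25, 0.5, 0.75, 1.0]:
    tpr, fpr = tpr_fpr(thr)
    print(f"threshold={thr:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```
Running it shows both TPR and FPR shrinking as the threshold rises, which is exactly the trade-off described above; plotting the (FPR, TPR) pairs gives the ROC curve.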
3.2. Three extreme cases
For recall, higher is better; for FPR, lower is better. Exploiting this property, we can use the AUC to measure the performance of a model.
To understand it better, let’s see three extreme cases:
Case 1 - perfect classifier:
returns 1 for TRUE, 0 for FALSE.
Recall is always 1 and FPR is always 0, so the ROC curve hugs the top-left corner and AUC = 1.
Case 2 - 0.5 (random classifier):
The scores carry no information, so TPR roughly equals FPR at every threshold; the ROC follows the diagonal and AUC = 0.5.
Case 3 - 0 (perfectly inverted classifier):
returns 0 for TRUE, 1 for FALSE.
Recall is always 0 and FPR is always 1, so AUC = 0.
In real life, for models with some real predictive ability, the ROC AUC falls between 0.5 and 1.
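If scikit-learn happens to be available, the AUROC can be computed directly from labels and scores; a minimal sketch with the same toy data as above (the 0.875 value is specific to these made-up numbers):
```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                    # toy labels
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]   # toy predicted scores

print(roc_auc_score(y_true, scores))  # 0.875; closer to 1.0 means better ranking
```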