Tilani Gunawardena
Machine Learning and Data Mining
Evaluation and Credibility
Outline
• Introduction
• Train, Test and Validation sets
• Evaluation on Large / Unbalanced data
• Evaluation on Small data
– Cross validation
– Bootstrap
• Comparing data mining schemes
– Significance test
– Lift Chart / ROC curve
• Numeric Prediction Evaluation
Model’s Evaluation in the KDD Process
How to Estimate the Metrics?
• We can use:
– Training data;
– Independent test data;
– Hold-out method;
– k-fold cross-validation method;
– Leave-one-out method;
– Bootstrap method;
– And many more…
Estimation with Training Data
• The accuracy/error estimates on the training data are not
good indicators of performance on future data.
– Q: Why?
– A: Because new data will probably not be exactly the same as
the training data!
• The accuracy/error estimates on the training data
measure the degree of the classifier’s overfitting.
(Diagram: the classifier is both built and evaluated on the same training set.)
Estimation with Independent Test Data
• Estimation with independent test data is used when we
have plenty of data and there is a natural way of forming
training and test data.
• For example: Quinlan in 1987 reported experiments in a
medical domain for which the classifiers were trained on
data from 1985 and tested on data from 1986.
(Diagram: the classifier is built from the training set and evaluated on a separate test set.)
Hold-out Method
• The hold-out method splits the data into training data and
test data (usually 2/3 for train, 1/3 for test). Then we build a
classifier using the train data and test it using the test data.
• The hold-out method is usually used when we have
thousands of instances, including several hundred instances
from each class.
(Diagram: the data is split into a training set and a test set; the classifier is built from the training set and evaluated on the test set.)
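A minimal hold-out sketch of the split described above, assuming scikit-learn is available; the dataset, the 2/3–1/3 split and the decision-tree learner are illustrative choices, not part of the original slides:

```python
# Hold-out evaluation: build on 2/3 of the data, test on the remaining 1/3.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)  # hypothetical data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0, stratify=y)   # keep class proportions in both parts

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("hold-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```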
Classification: Train, Validation, Test Split
(Diagram: data with known results is split into a training set and a validation set. The classifier builder learns from the training set and is evaluated on the validation set; the chosen model is then evaluated once on a separate final test set for the final evaluation.)
The test data can’t be used for parameter tuning!
k-Fold Cross-Validation
• k-fold cross-validation avoids overlapping test sets:
– First step: data is split into k subsets of equal size;
– Second step: each subset in turn is used for testing and the
remainder for training.
• The estimates are averaged to yield an overall estimate.
(Diagram for k = 3: each fold in turn is the test set — [train | train | test], [train | test | train], [test | train | train].)
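A short sketch of 3-fold cross-validation under the same assumptions (scikit-learn, a hypothetical dataset and learner); each subset is used once for testing, and the fold accuracies are averaged into one overall estimate:

```python
# k-fold cross-validation (k = 3): every instance is tested exactly once.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, random_state=0)   # hypothetical data
scores = []
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
    clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print("fold accuracies:", np.round(scores, 3))
print("cross-validation estimate:", np.mean(scores))
```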
Example: collect data from the real world (photographs and labels).
(Figure slides) Method 1: Training Process — giving students the answers before giving them the exam.
Method 2. Cross-Validation Error. Method 3 — if the world happens to be well represented by our dataset.
• Model Selection
• Evaluating our selection method: CV
The Bootstrap
• CV uses sampling without replacement
– The same instance, once selected, can not be selected again for a
particular training/test set
• The bootstrap uses sampling with replacement to form
the training set
– Sample a dataset of n instances n times with replacement to form
a new dataset of n instances
– Use this data as the training set
– Use the instances from the original
dataset that don’t occur in the new
training set for testing
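A sketch of one bootstrap resample, assuming NumPy; the dataset size n = 150 is illustrative. The training set is drawn with replacement, and the instances never drawn form the out-of-bag test set:

```python
# One bootstrap resample: n draws with replacement for training,
# the never-drawn (out-of-bag) instances for testing.
import numpy as np

rng = np.random.default_rng(0)
n = 150                                   # dataset size (hypothetical)
indices = np.arange(n)

train_idx = rng.choice(indices, size=n, replace=True)   # sampling WITH replacement
test_idx = np.setdiff1d(indices, train_idx)             # instances that were never picked

print("distinct training instances:", len(np.unique(train_idx)))   # ~63.2% of n
print("out-of-bag test instances:", len(test_idx))                 # ~36.8% of n
```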
Example
• Draw samples of the same size N (with replacement), M times
• e.g., N = 4, M = 3
• e.g., N = 150, M = 5000
• This gives M = 5000 means of random samples of X
The 0.632 bootstrap
• Also called the 0.632 bootstrap
– A particular instance has a probability of 1–1/n of not being picked
– Thus its probability of ending up in the test data is:
  (1 − 1/n)^n ≈ e^(−1) ≈ 0.368
– This means the training data will contain approximately 63.2% of
the instances
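A quick numeric check of this limit, using only the Python standard library:

```python
# (1 - 1/n)^n approaches e^(-1) ≈ 0.368 as n grows.
import math

for n in (10, 100, 1000, 10000):
    print(f"n = {n:5d}: (1 - 1/n)^n = {(1 - 1/n) ** n:.4f}")
print(f"e^-1        = {math.exp(-1):.4f}")
```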
Estimating error
with the bootstrap
• The error estimate on the test data will be very pessimistic
– Trained on just ~63% of the instances
• Therefore, combine it with the resubstitution error:
  err = 0.632 × e(test instances) + 0.368 × e(training instances)
• The resubstitution error gets less weight than the error on
the test data
• Repeat process several times with different replacement
samples; average the results
More on the bootstrap
• Probably the best way of estimating performance for very
small datasets
• However, it has some problems
– Completely random dataset with two classes of equal size. The true
error rate is 50% for any prediction rule.
– Consider the random dataset from above: a classifier that simply
memorizes the training data achieves 0% resubstitution error and
~50% error on the test data
– Bootstrap estimate for this classifier:
  err = 0.632 × 50% + 0.368 × 0% = 31.6%
– True expected error: 50%
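The same estimate as a tiny plain-Python sketch, with the error values of the example hard-coded:

```python
# 0.632 bootstrap estimate for the random two-class example above.
e_test, e_resub = 0.50, 0.00                 # ~50% out-of-bag error, 0% resubstitution error
err = 0.632 * e_test + 0.368 * e_resub
print(f"bootstrap estimate: {err:.1%}")      # 31.6%, although the true error is 50%
```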
• It is a straightforward way to derive estimates
of standard errors and confidence intervals for
complex estimators of complex parameters of
the distribution
Evaluation Summary:
• Use Train, Test, Validation sets for “LARGE”
data
• Balance “un-balanced” data
• Use Cross-validation for Middle size/small
data
• Use the leave-one-out and bootstrap methods
for small data
• Don’t use test data for parameter tuning - use
separate validation data
Agenda
• Quantifying learner performance
– Cross validation
– Error vs. loss
– Precision & recall
• Model selection
Accuracy Vs Precision
accuracy refers to the
closeness of a
measurement or estimate
to the TRUE value.
precision (or variance) refers to
the degree of agreement for a
series of measurements.
Precision Vs Recall
precision: Percentage of
retrieved documents that
are relevant.
recall: Percentage of relevant
documents that are returned.
Scenario
• We use a dataset with known classes to build a
model
• We use another dataset with known classes to
evaluate the model (this dataset could be part
of the original dataset)
• We compare/count the predicted classes
against the actual classes
Confusion Matrix
• A confusion matrix shows the number of
correct and incorrect predictions made by the
classification model compared to the actual
outcomes (target values) in the data
• The matrix is N×N, where N is the number of
target values (classes)
• The performance of such models is commonly
evaluated using the data in the matrix
Two Types of Error
False negative (“miss”), FN
alarm doesn’t sound but person is carrying metal
False positive (“false alarm”), FP
alarm sounds but person is not carrying metal
How to evaluate the Classifier’s
Generalization Performance?
• Assume that we test a classifier on some
test set and we derive at the end the
following confusion matrix (two-class)
• Also called a contingency table

                Predicted class
Actual class    Pos    Neg
Pos (P)         TP     FN
Neg (N)         FP     TN
Measures in Two-Class Classification
Example:
1) How many images of Gerhard Schroeder are in the data set?
2) How many predictions of Gerhard Schroeder are there?
3) What is the probability that Hugo Chavez is classified correctly by our learning algorithm?
4) Your learning algorithm predicted/classified an image as Hugo Chavez.
What is the probability it is actually Hugo Chavez?
5) Recall(“Hugo Chavez”) =
6) Precision(“Hugo Chavez”) =
7) Recall(“Colin Powell”) =
8) Precision(“Colin Powell”) =
9) Recall(“George W Bush”) =
10) Precision(“George W Bush”) =
1) True Positive (“Tony Blair”) =
2) False Positive (“Tony Blair”) =
3) False Negative (“Tony Blair”) =
4) True Positive (“Donald Rumsfeld”) =
5) False Positive (“Donald Rumsfeld”) =
6) False Negative (“Donald Rumsfeld”) =
Metrics for Classifier’s Evaluation
                Predicted class
Actual class    Pos    Neg
Pos (P)         TP     FN
Neg (N)         FP     TN

• Accuracy = (TP+TN)/(P+N)
• Error = (FP+FN)/(P+N)
• Precision = TP/(TP+FP)
• Recall/TP rate = TP/P
• FP Rate = FP/N
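A sketch computing these measures from two-class counts in plain Python; the counts are hypothetical:

```python
# Two-class evaluation metrics from confusion-matrix counts (hypothetical values).
TP, FN, FP, TN = 60, 40, 20, 80
P, N = TP + FN, FP + TN                  # actual positives and negatives

accuracy  = (TP + TN) / (P + N)
error     = (FP + FN) / (P + N)
precision = TP / (TP + FP)
recall    = TP / P                       # TP rate / sensitivity
fp_rate   = FP / N

print(f"accuracy={accuracy}, error={error}, precision={precision}, "
      f"recall={recall}, FP rate={fp_rate}")
```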
Example: 3 classifiers
True
Predicted
pos neg
pos 60 40
neg 20 80
True
Predicted
pos neg
pos 70 30
neg 50 50
True
Predicted
pos neg
pos 40 60
neg 30 70
Classifier 1
TPR =
FPR =
Classifier 2
TPR =
FPR =
Classifier 3
TPR =
FPR =
Example: 3 classifiers
True
Predicted
pos neg
pos 60 40
neg 20 80
True
Predicted
pos neg
pos 70 30
neg 50 50
True
Predicted
pos neg
pos 40 60
neg 30 70
Classifier 1
TPR = 0.4
FPR = 0.3
Classifier 2
TPR = 0.7
FPR = 0.5
Classifier 3
TPR = 0.6
FPR = 0.2
Multiclass-Things to Notice
• The total number of test examples of any class is the
sum of the corresponding row (i.e., the TP + FN for that class)
• The total number of FN’s for a class is the sum of values in the
corresponding row (excluding the TP)
• The total number of FP’s for a class is the sum of values in
the corresponding column (excluding the TP)
• The total number of TN’s for a certain class is the sum
of all columns and rows excluding that class's column and row
Predicted
Actual A B C D E
A TPA EAB EAC EAD EAE
B EBA TPB EBC EBD EBE
C ECA ECB TPC ECD ECE
D EDA EDB EDC TPD EDE
E EEA EEB EEC EED TPE
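A sketch that applies these row/column rules to an N×N confusion matrix, assuming NumPy; the matrix values are taken from the three-class worked example later in these slides:

```python
# Per-class TP, FN, FP, TN from an N x N confusion matrix (rows = actual, columns = predicted).
import numpy as np

cm = np.array([[25,  5,  2],    # actual A
               [ 3, 32,  4],    # actual B
               [ 1,  0, 15]])   # actual C

for i, label in enumerate("ABC"):
    tp = cm[i, i]
    fn = cm[i, :].sum() - tp            # rest of the row
    fp = cm[:, i].sum() - tp            # rest of the column
    tn = cm.sum() - tp - fn - fp        # everything outside that row and column
    print(label, "TP", tp, "FN", fn, "FP", fp, "TN", tn)
```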
Multi-class
            Predicted
Actual      A      B      C
A           TPA    EAB    EAC
B           EBA    TPB    EBC
C           ECA    ECB    TPC

                Predicted class
Actual class    P      N
P               TP     FN
N               FP     TN

One-vs-rest tables to fill in:

            Predicted
Actual      A      Not A
A
Not A

            Predicted
Actual      B      Not B
B
Not B

            Predicted
Actual      C      Not C
C
Not C
Multi-class
            Predicted
Actual      A      B      C
A           TPA    EAB    EAC
B           EBA    TPB    EBC
C           ECA    ECB    TPC

                Predicted class
Actual class    P      N
P               TP     FN
N               FP     TN

            Predicted
Actual      A                Not A
A           TPA              EAB + EAC
Not A       EBA + ECA        TPB + EBC + ECB + TPC

            Predicted
Actual      B                Not B
B           TPB              EBA + EBC
Not B       EAB + ECB        TPA + EAC + ECA + TPC

            Predicted
Actual      C                Not C
C           TPC              ECA + ECB
Not C       EAC + EBC        TPA + EAB + EBA + TPB
Example:

            Predicted
Actual      A     B     C
A           25    5     2
B           3     32    4
C           1     0     15

Overall Accuracy =
Precision A =
Recall B =
Example:

            Predicted
Actual      A     B     C
A           25    5     2
B           3     32    4
C           1     0     15

Overall Accuracy = (25+32+15)/(25+5+2+3+32+4+1+0+15)
Precision A = 25/(25+3+1)
Recall B = 32/(32+3+4)
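The same arithmetic as a short plain-Python sketch, verifying the answers above:

```python
# Worked 3-class example: overall accuracy, precision for A, recall for B.
cm = [[25, 5, 2],     # actual A
      [3, 32, 4],     # actual B
      [1,  0, 15]]    # actual C
total = sum(sum(row) for row in cm)

overall_accuracy = (25 + 32 + 15) / total    # diagonal over everything
precision_A = 25 / (25 + 3 + 1)              # TP_A over the predicted-as-A column
recall_B = 32 / (3 + 32 + 4)                 # TP_B over the actual-B row

print(round(overall_accuracy, 3), round(precision_A, 3), round(recall_B, 3))
```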
Counting the Costs
• In practice, different types of classification
errors often incur different costs
• Examples:
– Terrorist profiling
• “Not a terrorist” is correct 99.99% of the time
– Loan decisions
– Fault diagnosis
– Promotional mailing
Cost Matrices
                Hypothesized class
True class      Pos        Neg
Pos             TP Cost    FN Cost
Neg             FP Cost    TN Cost

Usually, TP Cost and TN Cost are set equal to 0.
Lift Charts
• In practice, decisions are usually made by comparing
possible scenarios taking into account different costs.
• Example:
• Promotional mail-out to 1,000,000 households: if we
mail to all households, we get a 0.1% response rate (1,000 responses).
• A data mining tool identifies
– a subset of 100,000 households with a 0.4% response rate
(400); or
– a subset of 400,000 households with a 0.2% response rate
(800).
• Depending on the costs we can make final decision
using lift charts!
• A lift chart allows a visual comparison for measuring
model performance
Generating a Lift Chart
• Given a scheme that outputs probability, sort the
instances in descending order according to the predicted
probability
• In a lift chart, the x-axis is the sample size and the y-axis is the number of
true positives.
Rank    Predicted Probability    Actual Class
1       0.95                     Yes
2       0.93                     Yes
3       0.93                     No
4       0.88                     Yes
…       …                        …
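A small sketch of the sorting step in plain Python; the first four (probability, class) pairs are from the table above, and the remaining rows are hypothetical padding:

```python
# Lift chart data: sort by predicted probability, accumulate true positives.
predictions = [(0.95, "Yes"), (0.93, "Yes"), (0.93, "No"), (0.88, "Yes"),
               (0.80, "No"), (0.75, "Yes")]              # last two rows are hypothetical

predictions.sort(key=lambda p: p[0], reverse=True)        # descending predicted probability
cumulative_tp = 0
for rank, (prob, actual) in enumerate(predictions, start=1):
    cumulative_tp += (actual == "Yes")
    # x-axis: sample size (rank), y-axis: number of true positives so far
    print(rank, prob, actual, cumulative_tp)
```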
Gains Chart
Example 01: Direct Marketing
• A company wants to do a mail marketing campaign
• It costs the company $1 for each item mailed
• They have information on 100,000 customers
• Create cumulative gains and lift charts from the
following data
• Overall Response Rate: If we assume we have no
model other than the prediction of the overall
response rate, then we can predict the number of
positive responses as a fraction of the total customers
contacted
• Suppose the response rate is 20%
• If all 100,000 customers are contacted we will receive
around 20,000 positive responses
Cost ($)    Total Customers Contacted    Positive Responses
100,000     100,000                      20,000
• Prediction of Response Model: A
response model predicts who will
respond to a marketing campaign
• If we have a response model, we can
make more detailed predictions
• For example, we use the response
model to assign a score to all
100,000 customers and predict the
results of contacting only the top
10,000 customers, the top 20,000
customers, etc.
Cost ($)    Total Customers Contacted    Positive Responses
10,000      10,000                       6,000
20,000      20,000                       10,000
30,000      30,000                       13,000
40,000      40,000                       15,800
50,000      50,000                       17,000
60,000      60,000                       18,000
70,000      70,000                       18,800
80,000      80,000                       19,400
90,000      90,000                       19,800
100,000     100,000                      20,000
Cumulative Gains Chart
• The y-axis shows the percentage of positive responses.
This is a percentage of the total possible positive
responses (20,000 as the overall response rate shows)
• The x-axis shows the percentage of customers
contacted, which is a fraction of the 100,000 total
customers
• Baseline (overall response rate): if we contact X% of
customers then we will receive X% of the total positive
responses
• Lift Curve: Using the predictions of the response
model, calculate the percentage of positive responses
for each percentage of customers contacted and map these
points to create the lift curve
Lift Chart
• Shows the actual lift.
• To plot the chart: Calculate the points on the lift
curve by determining the ratio between the
result predicted by our model and the result
using no model.
• Example: For contacting 10% of customers, using
no model we should get 10% of responders and
using the given model we should get 30% of
responders. The y-value of the lift curve at 10% is
30 / 10 = 3
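A plain-Python sketch that reproduces the cumulative gains and lift values for the direct-marketing table above (for example, lift 3.0 at 10% of customers contacted):

```python
# Cumulative gains and lift from the direct-marketing response table.
contacted = [10_000 * i for i in range(1, 11)]
responses = [6_000, 10_000, 13_000, 15_800, 17_000,
             18_000, 18_800, 19_400, 19_800, 20_000]
total_customers, total_responses = 100_000, 20_000

for c, r in zip(contacted, responses):
    pct_contacted = 100 * c / total_customers     # x-axis of both charts
    gain = 100 * r / total_responses              # cumulative gains: % of all responders reached
    lift = gain / pct_contacted                   # lift: model result vs. no-model baseline
    print(f"{pct_contacted:5.1f}% contacted  gain {gain:5.1f}%  lift {lift:.2f}")
```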
Lift Chart
Cumulative gains and lift charts are a graphical
representation of the advantage of using a predictive
model to choose which customers to contact
Example 2:
• Using the response model
P(x)=100-AGE(x) for
customer x and the data
table shown below,
construct the cumulative
gains and lift charts.
Calculate P(x) for each person x
1. Calculate P(x) for each person x
2. Order the people according to rank
P(x)
3. Calculate the percentage of total
responses for each cutoff point
Response Rate = Number of Responses /
Total Number of Responses
Total Customers Contacted    # of Responses    Response Rate
2
4
6
8
10
12
14
16
18
20
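A sketch of the three steps above in plain Python. The real age/response table is only shown as an image in the slides, so the ten (age, responded) pairs below are hypothetical stand-ins:

```python
# Example 2 sketch: score with P(x) = 100 - AGE(x), rank, then accumulate responses.
people = [(22, True), (25, True), (31, True), (38, False), (44, True),
          (48, False), (55, True), (61, False), (67, False), (72, False)]  # hypothetical

scored = [(100 - age, responded) for age, responded in people]   # 1. score each person
scored.sort(key=lambda s: s[0], reverse=True)                    # 2. order by decreasing P(x)

total_responses = sum(responded for _, responded in scored)
cumulative = 0
for cutoff, (score, responded) in enumerate(scored, start=1):    # 3. % of responses per cutoff
    cumulative += responded
    print(f"top {cutoff:2d} contacted: {100 * cumulative / total_responses:5.1f}% of responses")
```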
Cumulative Gains vs Lift Chart
The lift curve and the baseline have the same
values for 10%-20% and 90%-100%.
ROC Curves
• ROC curves are similar to lift charts
– Stands for “receiver operating characteristic”
– Used in signal detection to show tradeoff between
hit rate and false alarm rate over noisy channel
• Differences from gains chart:
– x axis shows percentage of false positives in
sample, rather than sample size
ROC Curve
(Figure: score distributions for non-diseased and diseased cases, with a decision threshold separating them.)
ROC Curves and Analysis
True
Predicted
pos neg
pos 60 40
neg 20 80
True
Predicted
pos neg
pos 70 30
neg 50 50
True
Predicted
pos neg
pos 40 60
neg 30 70
Classifier 1
TPr = 0.4
FPr = 0.3
Classifier 2
TPr = 0.7
FPr = 0.5
Classifier 3
TPr = 0.6
FPr = 0.2
ROC analysis
• True Positive Rate
– TPR = TP / (TP+FN)
– also called sensitivity
– true abnormals called abnormal by the observer
• False Positive Rate
– FPR = FP / (FP+TN)
• Specificity (TNR)= TN / (TN+FP)
– True normals called normal by the observer
• FPR = 1 - specificity
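A tiny sketch of these definitions in plain Python; the counts passed in are from the classifier with TPR 0.7 and FPR 0.5 shown above:

```python
# TPR, FPR and specificity from two-class counts.
def roc_point(tp, fn, fp, tn):
    tpr = tp / (tp + fn)                 # sensitivity
    fpr = fp / (fp + tn)
    specificity = tn / (tn + fp)         # so FPR = 1 - specificity
    return tpr, fpr, specificity

print(roc_point(tp=70, fn=30, fp=50, tn=50))   # (0.7, 0.5, 0.5)
```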
Evaluating classifiers (via their ROC curves):
Classifier A can’t distinguish between normal and abnormal. B is better but makes some mistakes. C makes very few mistakes. “Perfect” means no false positives and no false negatives.
Quiz 4:
1) How many images of Gerhard Schroeder are in the data set?
2) How many predictions of Gerhard Schroeder are there?
3) What is the probability that Hugo Chavez is classified correctly by our learning algorithm?
4) Your learning algorithm predicted/classified an image as Hugo Chavez.
What is the probability it is actually Hugo Chavez?
5) Recall(“Hugo Chavez”) =
6) Precision(“Hugo Chavez”) =
7) Recall(“Colin Powell”) =
8) Precision(“Colin Powell”) =
9) Recall(“George W Bush”) =
10) Precision(“George W Bush”) =