Evaluation*
in ubiquitous computing
* (quantitative evaluation, qualitative studies are a field of their own!)
Nils Hammerla, Research Associate @ Open Lab
Evaluation? Why?
• Performance evaluation
• How well does it work?
• Model selection
• Which approach works best?
• Parameter selection
• How do I tune my approach for best results?
• Demonstration of feasibility
• Is it worthwhile investing more time/money?
Training and test-sets
Basic elements, segments, images, etc.
Training and test-sets
• Hold-out validation
• Choose part of the data as training set and part as test set
• By some heuristic (e.g. 20% test), or
• Depending on study design
• Gives a single performance estimate
• Performance may critically depend on the chosen test set
[Figure: data timeline split into training, validation and test portions; splits drawn either across users (User 1–6) or across recording runs (Runs 1–4)]
• User-dependent performance
• How would the system perform on more data from a known user?
• User-independent performance
• How would the system perform for a new user?
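The two hold-out flavours differ only in how the split is drawn. A minimal sketch, assuming a hypothetical data layout of (user_id, sample) pairs — splitting by user gives a user-independent estimate, splitting by random sample a user-dependent one:

```python
import random

def holdout_split(samples, test_users=None, test_fraction=0.2, seed=42):
    """Split (user_id, sample) pairs into training and test sets.

    test_users given  -> user-independent: all data from those users is
                         held out, simulating a completely new user.
    test_users absent -> user-dependent: a random fraction of samples is
                         held out, so test users also appear in training.
    """
    if test_users is not None:
        train = [s for s in samples if s[0] not in test_users]
        test = [s for s in samples if s[0] in test_users]
    else:
        rng = random.Random(seed)
        shuffled = samples[:]
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        test, train = shuffled[:n_test], shuffled[n_test:]
    return train, test

# Hypothetical data: 3 users, 4 samples each
data = [(u, i) for u in ("user1", "user2", "user3") for i in range(4)]

# User-independent: user3 never appears in training
tr, te = holdout_split(data, test_users={"user3"})
assert all(u != "user3" for u, _ in tr) and all(u == "user3" for u, _ in te)

# User-dependent: held-out samples come from users seen in training
tr, te = holdout_split(data, test_fraction=0.25)
```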
Training and test-sets
• Repeated hold-out
• Split the data into k contiguous folds
• A split into k folds yields k performance estimates
• Gives a more reliable performance estimate
• But: this depends on how the folds are constructed!
• Variants
• Leave-one-subject-out (LOSO) — user-independent
• Leave-one-run-out — user-dependent
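Leave-one-subject-out simply repeats the user-independent hold-out once per user; a sketch, again over hypothetical (user_id, sample) pairs:

```python
def leave_one_subject_out(samples):
    """Yield (held_out_user, train, test) once per user.

    samples: list of (user_id, sample) pairs. Each fold holds out all
    data of one user, giving one user-independent estimate per user.
    """
    users = sorted({u for u, _ in samples})
    for held_out in users:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test

data = [(u, i) for u in ("user1", "user2", "user3") for i in range(4)]
folds = list(leave_one_subject_out(data))
assert len(folds) == 3                       # one estimate per user
assert all(len(test) == 4 for _, _, test in folds)
```

Leave-one-run-out is the same construction with run identifiers in place of user identifiers.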
Training and test-sets
• (Random, stratified) cross-validation
• Split the data into k folds at the level of individual samples
• Folds are constructed to be stratified w.r.t. the class distribution
• Popular in general machine learning
• Standard approach in many ML frameworks
• Gives a user-dependent performance estimate
• Requires samples to be statistically independent
• This is rarely the case in ubicomp!
Pitfalls
• User-dependent vs. user-independent performance
• Assumption: user-dependent performance is the upper bound of possible system performance
• Cross-validation is therefore popular in ubicomp to demonstrate feasibility of e.g. a new technical approach
[Figure: accuracy (0–100%) of cross-validation (xval) vs. leave-one-subject-out (loso) reported in six studies from related work]
Pitfalls
• Cross-validation in segmented time-series
• When testing on a segment i, it is very likely that segments i−1 or i+1 are in the training set
• Neighbouring segments typically overlap
• They are therefore very similar!
• This biases the results and leads to inflated performance figures
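The overlap between neighbouring segments can be made concrete: with a sliding window and 50% overlap (window length and data size below are hypothetical), adjacent segments share half their raw samples, so a random fold assignment almost guarantees a near-duplicate of each test segment in the training set:

```python
def segment_indices(n_samples, window=128, overlap=0.5):
    """Return (start, end) index pairs of a sliding window over a time-series."""
    step = int(window * (1 - overlap))
    return [(s, s + window) for s in range(0, n_samples - window + 1, step)]

segs = segment_indices(n_samples=1024, window=128, overlap=0.5)

# Neighbouring segments share exactly half of their raw samples:
(s0, e0), (s1, e1) = segs[0], segs[1]
shared = len(set(range(s0, e0)) & set(range(s1, e1)))
assert shared == 64  # 50% of a 128-sample window
```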
Pitfalls
• Cross-validation in segmented time-series
• This is an issue, as cross-validation is widely used
• For user-dependent evaluation
• For parameter estimation (e.g. in neural networks)
• A simple extension of cross-validation alleviates the bias
• There will be a talk on this on Friday morning!
Pitfalls
• Reporting results
• Always report mean and standard deviation of results
• T-tests for differences in mean performance
• Whether t-tests on cross-validated results are meaningful is still debated in statistics
• You have to use the same folds when comparing different approaches!
• Better: repeat the whole experiment multiple times (if computationally feasible)
Performance metrics
Confusion matrix (rows: ground truth, columns: prediction):

       A     B     C
A   1022    13   144
B    300   542    24
C     12    55   132

Basic elements for each class (A, B, C):
• true positives (tp)
• false positives (fp)
• false negatives (fn)
• true negatives (tn)
Performance metrics
For each class (or for two-class problems):
• Precision / PPV: tp / (tp + fp)
• Recall / Sensitivity: tp / (tp + fn)
• Specificity: tn / (tn + fp)
• Accuracy: (tp + tn) / (tp + fp + fn + tn)
• F1-score: 2 · prec · sens / (prec + sens)
Aggregated over all classes:
• Overall accuracy
• Mean accuracy
• Weighted F1-score
• Mean F1-score
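The per-class elements and metrics can be read directly off a confusion matrix; a sketch, tested against the example matrix that appears later in these slides:

```python
def per_class_metrics(cm, c):
    """Compute tp/fp/fn/tn and derived metrics for class index c.

    cm: square confusion matrix as a list of rows, rows = ground truth,
        columns = prediction.
    """
    n = sum(sum(row) for row in cm)
    tp = cm[c][c]
    fn = sum(cm[c]) - tp                      # rest of the row
    fp = sum(row[c] for row in cm) - tp       # rest of the column
    tn = n - tp - fn - fp
    prec = tp / (tp + fp)
    sens = tp / (tp + fn)
    return {
        "precision": prec,
        "recall": sens,
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / n,
        "f1": 2 * prec * sens / (prec + sens),
    }

# Confusion matrix from the example slide (classes A, B, C):
cm = [[1022, 13, 144],
      [300, 542, 24],
      [12, 55, 132]]
m = per_class_metrics(cm, 0)  # class A
assert round(m["precision"], 2) == 0.77 and round(m["recall"], 2) == 0.87
```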
Performance metrics
For a set of classes C = {A, B, C, …}, with N samples in total and n_c samples in class c:
• Mean accuracy: (1/|C|) · Σ_{c∈C} acc_c
• Overall accuracy: (1/N) · Σ_{c∈C} tp_c
• Weighted F1-score: (2/N) · Σ_{c∈C} n_c · (prec_c · sens_c) / (prec_c + sens_c)
• Mean F1-score: (2/|C|) · Σ_{c∈C} (prec_c · sens_c) / (prec_c + sens_c)
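The aggregate scores follow the same pattern; a sketch computing all four from a confusion matrix, checked against the example matrix used later in these slides:

```python
def aggregate_metrics(cm):
    """Overall/mean accuracy and weighted/mean F1 from a confusion matrix.

    cm: list of rows, rows = ground truth, columns = prediction.
    """
    k = len(cm)
    n = sum(sum(row) for row in cm)
    overall_acc = sum(cm[c][c] for c in range(k)) / n
    accs, f1s, weights = [], [], []
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(row[c] for row in cm) - tp
        tn = n - tp - fn - fp
        prec = tp / (tp + fp) if tp + fp else 0.0
        sens = tp / (tp + fn) if tp + fn else 0.0
        accs.append((tp + tn) / n)
        f1s.append(2 * prec * sens / (prec + sens) if prec + sens else 0.0)
        weights.append(sum(cm[c]) / n)         # n_c / N
    return {
        "overall_accuracy": overall_acc,
        "mean_accuracy": sum(accs) / k,
        "weighted_f1": sum(w * f for w, f in zip(weights, f1s)),
        "mean_f1": sum(f1s) / k,
    }

cm = [[1022, 13, 144],
      [300, 542, 24],
      [12, 55, 132]]
agg = aggregate_metrics(cm)
assert round(agg["overall_accuracy"], 2) == 0.76
assert round(agg["mean_f1"], 2) == 0.69
```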
Performance metrics
Normalised confusion matrices
• Divide each row by its sum to get confusion probabilities
[Figure a: confusion matrix on the HOME data-set — prediction vs. annotation (diary) for classes asleep / off / on / dys]
[Figure b: confusion matrix on the LAB data-set — prediction vs. annotation (clinician) for the same classes]
[Hammerla, Nils Y., et al. "PD Disease State Assessment in Naturalistic Environments using Deep Learning." AAAI 2015]
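Row-normalising a confusion matrix is a one-liner per row; a minimal sketch:

```python
def normalise_rows(cm):
    """Divide each ground-truth row by its sum: entries become
    P(prediction | true class)."""
    return [[v / sum(row) for v in row] for row in cm]

cm = [[1022, 13, 144], [300, 542, 24], [12, 55, 132]]
norm = normalise_rows(cm)
assert all(abs(sum(row) - 1.0) < 1e-9 for row in norm)  # rows sum to 1
```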
Performance metrics
Receiver Operating Characteristic (ROC)
• Illustrates the trade-off between
• True Positive Rate (sensitivity / recall), and
• False Positive Rate (1 − specificity)
• Useful if the approach has a simple parameter, like a threshold
[Figure: ROC curves for different classifiers — knn, c45, logR, PCA-logR]
[Ladha, Cassim, et al. "ClimbAX: skill assessment for climbing enthusiasts." Ubicomp 2013]
Performance metrics
Probabilistic models:
• Log likelihood
• AIC / BIC
• KL-divergence
Regression:
• Mean Square Error (MSE)
• Root Mean Square Error (RMSE)
• Residuals
• Correlation coefficients
Example
Confusion matrix (rows: ground truth, columns: prediction):

       A     B     C
A   1022    13   144
B    300   542    24
C     12    55   132

Per-class metrics:

              A     B     C
PPV          0.77  0.89  0.44
Sensitivity  0.87  0.63  0.66
Specificity  0.71  0.95  0.92
Accuracy     0.79  0.83  0.90
F1-score     0.81  0.73  0.53

Overall accuracy:  0.76
Weighted F1-score: 0.76
Mean F1-score:     0.69
Which metric should I not use?
• Specificity tn / (tn + fp), or per-class accuracy (tp + tn) / (tp + fp + fn + tn)
• Less meaningful as the number of classes n grows:
• #TN cells grow in O(n²),
• #FN, #FP cells grow in O(n)!
• Leads to e.g. incredible specificity figures (> 0.99)
• Better: PPV tp / (tp + fp) or Recall tp / (tp + fn)
Which metric should I not use?
• Overall accuracy: (1/N) · Σ_{c∈C} tp_c
• In unbalanced data-sets, overall accuracy can be misleading:

10,000    10   2        9,950   40   22
   100     1   0           40   45   15
    20     1   0           10    1   10

Overall accuracy = 0.99   ~=  Overall accuracy = 0.99
Weighted F1-score = 0.98  <   Weighted F1-score = 0.99
Mean F1-score = 0.34      <<  Mean F1-score = 0.59
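The left-hand matrix above can be checked numerically. Under the convention that F1 = 0 for a class that is never predicted correctly, the dominant class drives overall accuracy to 0.99 while mean F1 collapses:

```python
def mean_f1_and_accuracy(cm):
    """Mean F1 (0 where undefined) and overall accuracy for a confusion
    matrix (rows: ground truth, columns: prediction)."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    f1s = []
    for c in range(k):
        tp = cm[c][c]
        fp = sum(row[c] for row in cm) - tp
        fn = sum(cm[c]) - tp
        prec = tp / (tp + fp) if tp + fp else 0.0
        sens = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * sens / (prec + sens) if prec + sens else 0.0)
    return sum(f1s) / k, sum(cm[c][c] for c in range(k)) / n

# Left-hand matrix from the slide above:
cm = [[10000, 10, 2],
      [100, 1, 0],
      [20, 1, 0]]
mean_f1, acc = mean_f1_and_accuracy(cm)
assert round(acc, 2) == 0.99 and round(mean_f1, 2) == 0.34
```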
Which metric should I use?

              Balanced                          Unbalanced
Two-class     Accuracy; F1-score;               PPV, Recall for the rare class;
              Sensitivity, Specificity          Mean F1-score
Multi-class   Accuracy; Weighted F1;            Mean F1 vs. weighted F1;
              Mean F1; Avg. PPV, Sensitivity    Confusion matrices

No single metric reflects all aspects of performance!
Which metric should I use?
• Application requirements
• Different applications have different requirements
• What type of error is more critical?
• False positives can hurt if activities of interest are rare
• Low precision can be a problem in medical applications
• There is always a trade-off between different aspects of performance
Ward et al. (2011)
• Continuous recognition
• Measures like accuracy treat each error the same
• But: The predictions are time-series themselves, and different errors
are possible
• Which type of error is more acceptable in my application?
[Jamie Ward et al., Performance Metrics for Activity Recognition]
Ward et al. (2011)
• Continuous recognition
• Similar to bioinformatics:
• Merge, Deletion, Confusion, Fragmentation, Insertion
• Each type of error is scored differently (cost matrix)
• Compound score reflects (adjustable) performance metric
• Systems that show very good performance in regular metrics may
show poor performance in one of these aspects!
[Jamie Ward et al., Performance Metrics for Activity Recognition]
Summary
• Be careful when selecting training and test sets
• Different aspects of performance (user-dependent / user-independent)
• Avoid random cross-validation for segmented time-series!
• Performance metrics
• No single metric reflects all aspects of performance
• Depends on your application!
• More difficult for unbalanced multi-class problems
• Continuous recognition
• Different types of errors have different impact on the perceived performance of a system
• Scoring systems can be tailored to individual applications
BIRDS  DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptxBIRDS  DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptx
BIRDS DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptx
 
Call Girls Versova ♨️ +91-9920725232 👈Open 24/7 at Top Mumbai Call Girls Service
Call Girls Versova ♨️ +91-9920725232 👈Open 24/7 at Top Mumbai Call Girls ServiceCall Girls Versova ♨️ +91-9920725232 👈Open 24/7 at Top Mumbai Call Girls Service
Call Girls Versova ♨️ +91-9920725232 👈Open 24/7 at Top Mumbai Call Girls Service
 
BIOLOGY ANIMAL KINGDOM CLASS. 11 NCERT..
BIOLOGY ANIMAL KINGDOM CLASS. 11 NCERT..BIOLOGY ANIMAL KINGDOM CLASS. 11 NCERT..
BIOLOGY ANIMAL KINGDOM CLASS. 11 NCERT..
 
Buy Best T-shirts for Men Online Buy Best T-shirts for Men Online
Buy Best T-shirts for Men Online Buy Best T-shirts for Men OnlineBuy Best T-shirts for Men Online Buy Best T-shirts for Men Online
Buy Best T-shirts for Men Online Buy Best T-shirts for Men Online
 
22PH503 - Astronomy and Astrophysics - Unit 2 - Spectral Classification of Stars
22PH503 - Astronomy and Astrophysics - Unit 2 - Spectral Classification of Stars22PH503 - Astronomy and Astrophysics - Unit 2 - Spectral Classification of Stars
22PH503 - Astronomy and Astrophysics - Unit 2 - Spectral Classification of Stars
 
(Shilpa) ➤ Call Girls Lucknow 🔥 9352988975 🔥 Real Fun With Sexual Girl Availa...
(Shilpa) ➤ Call Girls Lucknow 🔥 9352988975 🔥 Real Fun With Sexual Girl Availa...(Shilpa) ➤ Call Girls Lucknow 🔥 9352988975 🔥 Real Fun With Sexual Girl Availa...
(Shilpa) ➤ Call Girls Lucknow 🔥 9352988975 🔥 Real Fun With Sexual Girl Availa...
 
23PH301 - Optics - Unit 2 - Interference
23PH301 - Optics - Unit 2 - Interference23PH301 - Optics - Unit 2 - Interference
23PH301 - Optics - Unit 2 - Interference
 
Synopsis presentation VDR gene polymorphism and anemia (2).pptx
Synopsis presentation VDR gene polymorphism and anemia (2).pptxSynopsis presentation VDR gene polymorphism and anemia (2).pptx
Synopsis presentation VDR gene polymorphism and anemia (2).pptx
 
Ross Wilson solved MCQS (Watan Dost).pdf
Ross Wilson solved MCQS (Watan Dost).pdfRoss Wilson solved MCQS (Watan Dost).pdf
Ross Wilson solved MCQS (Watan Dost).pdf
 
Noida Call Girls Number 9999965857 Vip Call Girls Lady Of Your Dream Ready To...
Noida Call Girls Number 9999965857 Vip Call Girls Lady Of Your Dream Ready To...Noida Call Girls Number 9999965857 Vip Call Girls Lady Of Your Dream Ready To...
Noida Call Girls Number 9999965857 Vip Call Girls Lady Of Your Dream Ready To...
 
一比一原版(macewan学位证书)加拿大麦科文大学毕业证如何办理
一比一原版(macewan学位证书)加拿大麦科文大学毕业证如何办理一比一原版(macewan学位证书)加拿大麦科文大学毕业证如何办理
一比一原版(macewan学位证书)加拿大麦科文大学毕业证如何办理
 
Embracing Deep Variability For Reproducibility and Replicability
Embracing Deep Variability For Reproducibility and ReplicabilityEmbracing Deep Variability For Reproducibility and Replicability
Embracing Deep Variability For Reproducibility and Replicability
 
20240610_INSIGHT_PartnerProfile-02_Tampere.pdf
20240610_INSIGHT_PartnerProfile-02_Tampere.pdf20240610_INSIGHT_PartnerProfile-02_Tampere.pdf
20240610_INSIGHT_PartnerProfile-02_Tampere.pdf
 
SAP Unveils Generative AI Innovations at Annual Sapphire Conference
SAP Unveils Generative AI Innovations at Annual Sapphire ConferenceSAP Unveils Generative AI Innovations at Annual Sapphire Conference
SAP Unveils Generative AI Innovations at Annual Sapphire Conference
 
GBSN - Microbiology (Unit 2) Antimicrobial agents
GBSN - Microbiology (Unit 2) Antimicrobial agentsGBSN - Microbiology (Unit 2) Antimicrobial agents
GBSN - Microbiology (Unit 2) Antimicrobial agents
 
Rodents, Birds and locust_Pests of crops.pdf
Rodents, Birds and locust_Pests of crops.pdfRodents, Birds and locust_Pests of crops.pdf
Rodents, Birds and locust_Pests of crops.pdf
 
seed production, Nursery & Gardening.pdf
seed production, Nursery & Gardening.pdfseed production, Nursery & Gardening.pdf
seed production, Nursery & Gardening.pdf
 
Casein in different samples of milk chemistry project
Casein in different samples of milk chemistry projectCasein in different samples of milk chemistry project
Casein in different samples of milk chemistry project
 
Compositions of iron-meteorite parent bodies constrainthe structure of the pr...
Compositions of iron-meteorite parent bodies constrainthe structure of the pr...Compositions of iron-meteorite parent bodies constrainthe structure of the pr...
Compositions of iron-meteorite parent bodies constrainthe structure of the pr...
 

Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation

• 1. Evaluation* in ubiquitous computing
  * (quantitative evaluation; qualitative studies are a field of their own!)
  Nils Hammerla, Research Associate @ Open Lab
• 2. Evaluation? Why?
  • Performance evaluation: how well does it work?
  • Model selection: which approach works best?
  • Parameter selection: how do I tune my approach for best results?
  • Demonstration of feasibility: is it worthwhile investing more time/money?
• 3. Training and test-sets
  [diagram: data broken into basic elements (segments, images, etc.)]
• 4.-7. Training and test-sets
  • Hold-out validation
    • Choose part of the data as training and test-set,
      • by some heuristic (e.g. 20% test), or
      • depending on the study design.
    • Gives a single performance estimate
    • Performance may critically depend on the chosen test-set
  [diagram: data split into training / validation / test, either across users (User 1-6) or across runs (Runs 1-3 vs. Run 4)]
  • User-dependent performance: how would the system perform on more data from a known user?
  • User-independent performance: how would the system perform for a new user?
• 8. Training and test-sets
  • Repeated hold-out
    • Split the data into k contiguous folds
    • A split into n folds leads to n performance estimates
    • Gives a more reliable performance estimate
    • But: this depends on how the folds are constructed!
  • Variants
    • Leave-one-subject-out (LOSO): user-independent
    • Leave-one-run-out: user-dependent
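A leave-one-subject-out loop fits in a few lines. The sketch below assumes per-sample user labels are available; the `users` array is toy metadata, not from the original deck:

```python
import numpy as np

def leave_one_subject_out(users):
    """Yield (subject, train_idx, test_idx), holding out one subject at a time."""
    users = np.asarray(users)
    for subject in np.unique(users):
        test = np.where(users == subject)[0]   # all samples of the held-out subject
        train = np.where(users != subject)[0]  # samples from every other subject
        yield subject, train, test

# toy metadata: 8 samples from 3 users
users = ["u1", "u1", "u2", "u2", "u2", "u3", "u3", "u3"]
for subject, train, test in leave_one_subject_out(users):
    print(subject, len(train), len(test))
```

Leave-one-run-out works identically, with a per-sample run label in place of `users`.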
• 9. Training and test-sets
  • (Random, stratified) cross-validation
    • Split the data into k folds on the lowest level (individual samples)
    • Folds are constructed to be stratified w.r.t. the class distribution
    • Popular in general machine learning; the standard approach in many ML frameworks
    • Gives a user-dependent performance estimate
    • Requires samples to be statistically independent, which is rarely the case in ubicomp!
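A minimal sketch of stratified fold assignment: each class's samples are dealt round-robin across k folds, so every fold keeps roughly the original class proportions. The labels here are toy data:

```python
import numpy as np

def stratified_folds(labels, k=3, seed=42):
    """Assign each sample to one of k folds, preserving class proportions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    fold_of = np.empty(len(labels), dtype=int)
    for c in np.unique(labels):
        # shuffle this class's indices, then deal them round-robin across folds
        idx = rng.permutation(np.where(labels == c)[0])
        fold_of[idx] = np.arange(len(idx)) % k
    return fold_of

labels = [0] * 9 + [1] * 6
fold_of = stratified_folds(labels, k=3)
```

Note the caveat from the slide: this treats every sample as exchangeable, which sliding-window sensor data is not.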
• 10. Pitfalls
  • User-dependent vs. user-independent performance
    • Assumption: user-dependent performance is the upper bound of possible system performance
    • Cross-validation is therefore popular in ubicomp to demonstrate feasibility of e.g. a new technical approach
  [chart: LOSO vs. cross-validation (xval) accuracy reported across six studies in related work]
• 11. Pitfalls
  • Cross-validation in segmented time-series
    • When testing on a segment i, it is very likely that segments i-1 or i+1 are in the training set.
    • Neighbouring segments typically overlap, and are therefore very similar!
    • This biases the results and leads to inflated performance figures
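The leak is easy to see on a toy signal: with 50% overlap, adjacent windows share half their raw samples, and a random fold assignment routinely splits such near-duplicates between training and test. All values below are illustrative:

```python
import numpy as np

# a toy signal segmented into windows of length 10 with 50% overlap
signal = np.arange(100)
win, step = 10, 5
windows = np.stack([signal[s:s + win]
                    for s in range(0, len(signal) - win + 1, step)])

# adjacent windows share half their raw samples ...
shared = np.intersect1d(windows[3], windows[4])
print(len(shared))  # 5 of the 10 samples are identical

# ... so a random 5-fold assignment usually separates neighbours,
# putting a near-duplicate of each test window into the training set
rng = np.random.default_rng(0)
fold_of = rng.integers(0, 5, size=len(windows))
leaky_pairs = sum(fold_of[i] != fold_of[i + 1] for i in range(len(windows) - 1))
```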
• 12. Pitfalls
  • Cross-validation in segmented time-series
    • This is an issue, as cross-validation is widely used
      • for user-dependent evaluation
      • for parameter estimation (e.g. in neural networks)
    • A simple extension of cross-validation alleviates the bias
    • There will be a talk on this on Friday morning!
• 13. Pitfalls
  • Reporting results
    • Always report the mean and standard deviation of results
    • Use t-tests for differences in mean performance
      • Whether t-tests on cross-validated results are meaningful is still debated in statistics
    • You have to use the same folds when comparing different approaches!
    • Better: repeat the whole experiment multiple times (if computationally feasible)
• 14.-20. Performance metrics
  Example confusion matrix (rows: ground truth, columns: prediction):

                  A     B     C
     truth A   1022    13   144
           B    300   542    24
           C     12    55   132

  Basic elements for each class: true positives, false positives, false negatives, true negatives
  [slides highlight the corresponding cells in turn for classes A, B, and C]
• 21. Performance metrics
  For each class (or for two-class problems):
  • Precision / PPV:      tp / (tp + fp)
  • Recall / Sensitivity: tp / (tp + fn)
  • Specificity:          tn / (tn + fp)
  • Accuracy:             (tp + tn) / (tp + fp + fn + tn)
  • F1-score:             2 * prec * sens / (prec + sens)
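The definitions above translate directly into code. The counts used in the example are those of class A from the deck's confusion matrix (tp=1022, fp=312, fn=157, tn=753):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f1(tp, fp, fn):
    # harmonic mean of precision and recall
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# class A from the example matrix
print(round(precision(1022, 312), 2),  # 0.77
      round(recall(1022, 157), 2),     # 0.87
      round(f1(1022, 312, 157), 2))    # 0.81
```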
• 22. Performance metrics
  For a set of classes C = {A, B, C, …}, with N samples in total and n_c samples in class c:
  • Overall accuracy:  (1/N) * Σ_c tp_c
  • Mean accuracy:     (1/|C|) * Σ_c acc_c
  • Weighted F1-score: (2/N) * Σ_c n_c * (prec_c * sens_c) / (prec_c + sens_c)
  • Mean F1-score:     (2/|C|) * Σ_c (prec_c * sens_c) / (prec_c + sens_c)
• 23. Performance metrics
  Normalised confusion matrices
  • Divide each row by its sum to obtain confusion probabilities
  [figures: a) confusion on the HOME data-set (annotation: diary), b) confusion on the LAB data-set (annotation: clinician); classes asleep / off / on / dys]
  [Hammerla, Nils Y., et al. "PD Disease State Assessment in Naturalistic Environments using Deep Learning." AAAI 2015]
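Row-normalisation is a one-liner with broadcasting; a sketch using the example matrix from earlier slides:

```python
import numpy as np

conf = np.array([[1022,  13, 144],
                 [ 300, 542,  24],
                 [  12,  55, 132]], dtype=float)  # rows: ground truth

# divide each row by its sum -> P(prediction | true class)
norm = conf / conf.sum(axis=1, keepdims=True)
```

Each row of `norm` now sums to 1, so cell (i, j) reads as the probability of predicting class j when the true class is i.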
• 24. Performance metrics
  Receiver Operating Characteristic (ROC)
  • Illustrates the trade-off between the true positive rate (sensitivity / recall) and the false positive rate (1 - specificity)
  • Useful if the approach has a simple parameter, like a threshold
  [figure: ROC curves for different classifiers: knn, c45, logR, PCA-logR]
  [Ladha, Cassim, et al. "ClimbAX: skill assessment for climbing enthusiasts." Ubicomp 2013]
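An ROC curve is just a threshold sweep over classifier scores. A minimal sketch with made-up scores and labels:

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep a decision threshold over the scores; return (FPR, TPR) pairs."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pts = []
    for t in np.sort(np.unique(scores))[::-1]:  # highest threshold first
        pred = scores >= t
        tpr = (pred & (labels == 1)).sum() / (labels == 1).sum()
        fpr = (pred & (labels == 0)).sum() / (labels == 0).sum()
        pts.append((fpr, tpr))
    return pts

# toy classifier scores and binary ground truth
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1,   1,   0,   1,   0,    0]
pts = roc_points(scores, labels)
```

Plotting `pts` (FPR on x, TPR on y) gives the curve; looser thresholds move up and to the right.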
• 25. Performance metrics
  Beyond classification:
  • Probabilistic models: log-likelihood, AIC / BIC, KL-divergence
  • Regression: mean squared error (MSE), root mean squared error (RMSE), residuals, correlation coefficients
• 26. Example
  Confusion matrix (rows: ground truth, columns: prediction):

                  A     B     C
     truth A   1022    13   144
           B    300   542    24
           C     12    55   132

  Per-class metrics:
                    A     B     C
     PPV          0.77  0.89  0.44
     Sensitivity  0.87  0.63  0.66
     Specificity  0.71  0.95  0.92
     Accuracy     0.79  0.83  0.90
     F1-score     0.81  0.73  0.53

  Overall accuracy: 0.76   Weighted F1-score: 0.76   Mean F1-score: 0.69
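All of these figures can be derived from the confusion matrix alone; a sketch that reproduces the example:

```python
import numpy as np

conf = np.array([[1022,  13, 144],
                 [ 300, 542,  24],
                 [  12,  55, 132]])  # rows: ground truth, cols: prediction

N = conf.sum()
tp = np.diag(conf)
fp = conf.sum(axis=0) - tp  # predicted as c but actually another class
fn = conf.sum(axis=1) - tp  # actually c but predicted as another class

prec = tp / (tp + fp)
sens = tp / (tp + fn)
f1 = 2 * prec * sens / (prec + sens)

overall_acc = tp.sum() / N
mean_f1 = f1.mean()
weighted_f1 = (conf.sum(axis=1) * f1).sum() / N  # weight each class by n_c
```

Running this reproduces the slide's rounded values (0.76, 0.69, 0.76).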
• 27.-28. Which metric should I not use?
  • Specificity, or per-class accuracy
    • These become less meaningful as the number of classes n grows:
      • the number of TN cells grows in O(n^2), while
      • the number of FN and FP cells grows only in O(n)!
    • This leads to e.g. implausibly high specificity figures (> 0.99)
  • Better: PPV (tp / (tp + fp)) or recall (tp / (tp + fn))
• 29. Which metric should I not use?
  • Overall accuracy ((1/N) * Σ_c tp_c)
    • In unbalanced data-sets, overall accuracy can be misleading:

      Classifier 1:             Classifier 2:
      10,000    10     2        9,950    40    22
         100     1     0           40    45    15
          20     1     0           10     1    10

      Overall accuracy:  0.99   vs.  0.99   (~ equal)
      Weighted F1-score: 0.98   vs.  0.99   (~ equal)
      Mean F1-score:     0.34   vs.  0.59   (much lower!)
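The comparison can be checked numerically. The sketch below guards against empty classes (tp = 0 for class C of the first classifier) so the per-class F1 is simply 0 rather than undefined:

```python
import numpy as np

def mean_f1(conf):
    """Unweighted (macro) F1 from a confusion matrix (rows: truth, cols: prediction)."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    prec = tp / np.maximum(tp + fp, 1)  # 0 where the class is never predicted
    sens = tp / np.maximum(tp + fn, 1)
    f1 = np.where(prec + sens > 0,
                  2 * prec * sens / np.maximum(prec + sens, 1e-12), 0.0)
    return f1.mean()

clf1 = np.array([[10000, 10, 2], [100, 1, 0], [20, 1, 0]])
clf2 = np.array([[9950, 40, 22], [40, 45, 15], [10, 1, 10]])

acc1 = np.diag(clf1).sum() / clf1.sum()  # ~0.99
acc2 = np.diag(clf2).sum() / clf2.sum()  # ~0.99
```

Both classifiers reach ~0.99 overall accuracy, yet their mean F1 scores (~0.34 vs. ~0.59) reveal that the first one barely recognises the minority classes at all.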
• 30. Which metric should I use?

                 Balanced                       Unbalanced
   Two-class     - Accuracy                     - PPV, Recall for the rare class
                 - F1-score                     - Mean F1-score
                 - Sensitivity, Specificity

   Multi-class   - Accuracy                     - Avg. PPV, Sensitivity
                 - Weighted F1                  - Mean F1 vs. weighted F1
                 - Mean F1                      - Confusion matrices

   No single metric reflects all aspects of performance!
• 31.-32. Which metric should I use?
  • Application requirements
    • Different applications have different requirements
    • What type of error is more critical?
      • False positives can hurt if the activities of interest are rare
      • Low precision can be a problem in medical applications
    • There is always a trade-off between different aspects of performance
• 33. Ward et al. (2011)
  • Continuous recognition
    • Measures like accuracy treat each error the same
    • But: the predictions are time-series themselves, and different types of errors are possible
    • Which type of error is more acceptable in my application?
  [Jamie Ward et al., "Performance Metrics for Activity Recognition"]
• 34. Ward et al. (2011)
  • Continuous recognition
    • Similar to bioinformatics:
      • Merge, Deletion, Confusion, Fragmentation, Insertion
    • Each type of error is scored differently (cost matrix)
    • A compound score reflects the (adjustable) performance metric
    • Systems that show very good performance in regular metrics may perform poorly in one of these aspects!
  [Jamie Ward et al., "Performance Metrics for Activity Recognition"]
• 35. Summary
  • Be careful when selecting training and test-sets
    • Different aspects of performance (user-dependent / user-independent)
    • Avoid random cross-validation for segmented time-series!
  • Performance metrics
    • No single metric reflects all aspects of performance; it depends on your application!
    • More difficult for unbalanced multi-class problems
  • Continuous recognition
    • Different types of errors have different impact on the perceived performance of a system
    • Scoring systems can be tailored to the individual application