The Zen of Predictive Modelling
Eugene Dubossarsky
What This Talk Isn’t About
But worth mentioning anyway:
R and The Sydney Users of R Forum
Analyst First
My Courses
Sydney Users of R Forum
• Just 1 shy of 500 members
• Regular meetups
• Study groups: introduction to R, “Machine
Learning for Hackers”, “Elements of Statistical
• Do a Google image search for “ggplot2”
• Look for “r4stats”, “popularity”
• Join SURF
• Download R and start using it.
Analyst First
• Strategic, Cultural, Organisational, Human issues in
• Making analytics work in organisations
• Focus on the Human side of analytics
• International : Aust, NZ, Singapore, US, Japan, India, Hong
• analystfirst.com – see “core principles” and “what is analyst first?”
My Analytics Training Courses
• Predictive Modelling, Data Mining, R, Forensic
Analytics, Visualisation, Forecasting training courses
• Sydney, Melbourne, Canberra, Singapore
• Public and in-house
• Pre-prepared or customised
• Informal coaching/mentoring
• Strategy, Review, Advice and Assistance with Analytics Capability
Development in your organisation
The Zen of Predictive Modelling
• The Most Important Part of My “Predictive Modelling and Data Mining Course”
• What every user of predictive modelling should know
• What every manager and owner of predictive modelling capability must know
• “Open Secrets” known to the masters
The Zen of Predictive Modelling
• To save people time
• To see the forest for the trees
• To get real value out of predictive analytics
The Right Point of View
Which is unlike the other two?
• Kohonen neural network
• Backpropagation neural network
• CART decision tree
The Right Point of View
Which is unlike the other two?
• CART decision tree
• Random Forest
• Support Vector Machine
The Right Point of View
Which is unlike the other two?
• Backpropagation Neural Network
• Linear Model
• CART Decision Tree
The Right Point of View
• Out Of Sample Accuracy
• Robustness (Out of Time Accuracy)
• Interpretability
• Implementability
The Right Point of View
Why build predictive models?
• Insights
• Operational prediction
• “What-if” analysis
What Do All Predictive Models Have in Common?
All Predictive Models:
• Have a training set of predictors and outcomes
• Probably have a cross-validation and test set of predictors and outcomes too.
• Are “fit” (optimised) to minimise an error function between their actual and target
• Are probably cross-validated to control overfitting on an out-of-sample data set
• Provide information on the relationship between the predictors and outcomes in
the data
• Can be used to score new data (make new predictions)
• Can be deployed in IT systems
• Can be interrogated for insights
• Are only as accurate as the data allows
• Provide a (fairly) accurate estimate of how well they will predict on new data
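The points above can be sketched in a few lines of code. This is a minimal illustration only, not the method used in the course: it assumes a one-predictor linear model fitted by least squares, synthetic data, and a simple hold-out split. The names fit_line, predict and rmse are hypothetical helpers written for this sketch.

```python
import random

def fit_line(xs, ys):
    """Fit y = intercept + slope * x by least squares (minimises squared error)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(model, xs):
    """Score new data: apply the fitted formula to new predictor values."""
    intercept, slope = model
    return [intercept + slope * x for x in xs]

def rmse(actual, predicted):
    """Root mean squared error between outcomes and predictions."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5

# Synthetic data (an assumption for illustration): y = 3 + 2x + noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 1.0) for x in xs]

# Training set of predictors and outcomes, plus a held-out test set.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

model = fit_line(train_x, train_y)
train_rmse = rmse(train_y, predict(model, train_x))
test_rmse = rmse(test_y, predict(model, test_x))  # estimate of error on new data

print(f"intercept={model[0]:.2f} slope={model[1]:.2f}")
print(f"train RMSE={train_rmse:.2f} test RMSE={test_rmse:.2f}")
```

The test error cannot fall much below the noise level in the data, which is one way of reading “only as accurate as the data allows”.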
What Do All Predictive Model Insights Have in Common?
All Predictive Models:
• Have variable importance measures (a number of which can be applied to any
• Allow plotting predictors vs outcomes
• Have variable accuracy measures
• Can be resampled for more robust measures of accuracy
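One common way to resample an accuracy measure is the bootstrap. The sketch below is an assumption about what “resampled” could look like in practice, not necessarily the technique the talk had in mind; the hold-out labels are made up, and bootstrap_accuracy is a hypothetical helper.

```python
import random

def accuracy(actual, predicted):
    """Fraction of records where the predicted label matches the outcome."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def bootstrap_accuracy(actual, predicted, n_resamples=1000, seed=0):
    """Approximate 95% percentile interval for accuracy by resampling records."""
    rng = random.Random(seed)
    n = len(actual)
    scores = []
    for _ in range(n_resamples):
        # Draw n records with replacement and score that resample.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(accuracy([actual[i] for i in idx],
                               [predicted[i] for i in idx]))
    scores.sort()
    lower = scores[int(0.025 * n_resamples)]
    upper = scores[int(0.975 * n_resamples) - 1]
    return lower, upper

# Made-up hold-out results: 100 records, 80 predicted correctly.
actual = [1] * 50 + [0] * 50
predicted = [1] * 40 + [0] * 10 + [0] * 40 + [1] * 10

point = accuracy(actual, predicted)
low, high = bootstrap_accuracy(actual, predicted)
print(f"accuracy={point:.2f}, approx. 95% interval=({low:.2f}, {high:.2f})")
```

The interval is a reminder that a single accuracy figure from one test set is itself an estimate.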
What Do All Predictive Model Predictions Have in Common?
All Predictive Models:
• Make predictions that are numeric : estimates of amount for regression, and
probability for classification
• All predictions are applications of the underlying model structure and parameters
(formula) to new predictor data sets
• All predictions are deterministic. Once a model is fitted, the predictions for a given
record will be the same every time. (Though the prediction may be a distribution
rather than a fixed point. Also, note that model fitting itself may be random – some
models may differ slightly each time they are fitted to the same data set)
How Do Predictive Model Families Differ?
• Classification vs Regression (most families can do both)
• Predictive accuracy vs insights
• Predictive accuracy vs stability
• Deterministic fitting vs randomised fitting
• Specific insights
• Structure and complexity
• Model assumptions (linear models, neural nets)
• Model structure (trees vs additive models vs SVM vs Neural Nets etc)
• The kinds of insights models provide
• Tendency to overfit (most, but not all)
• Dependence on metrics
• Sensitivity to missing values and categorical variables
Becoming a Master of Modelling Kung Fu
• Predictive models should be thought of as a “black box” initially, with the
characteristics that all models have in common recognised
• The focus should be on the data, not the model.
• Focusing on the specific characteristics of the model is important when: deciding on
the degree of accuracy desired, and the kinds of insights desired.
• It is good to start by working with one highly accurate, simple to use method
(randomForest is a good choice) and one or two highly interpretable models (rpart
decision trees and (generalised) linear models are good here).
• In fact, you can go a long way with just randomForest alone.
Becoming a Master of Modelling Kung Fu
• Master an adequate tool.
• Empty your mind of the tool. It is an illusion.
• Meditate on the data.
Meditating on Data
• Start with a highly accurate, nonparametric model you are comfortable with.
• The accuracy of a highly accurate method is close to the theoretical limit of
accuracy possible on the data. World class experts may get closer, but not a whole
lot closer.
• So once you build the model, forget about the specific family you used. It is just a
• Each predictor may provide a unique amount of predictability to the model.
Measure it.
• Each predictor may be masked by other predictors. Be careful.
• Check relationships between data and strongest predictors
Meditating on Data
• There are at least 3 ways that a predictor can be important. They are not the same:
• What is the unique contribution of the predictor to the accuracy of the model?
• What is the individual predictive power of the predictor alone?
• How vital is the predictor to the structure of a particular model?
• The first two are about the data, the third is more about the specific model. Which
is more important?
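The difference between the first two kinds of importance, and the masking effect mentioned earlier, can be seen on synthetic data. The sketch below is illustrative only and rests on several assumptions: the data are simulated, x2 is a noisy copy of x1, the model is a small least-squares fit rather than randomForest, and fit_ols, r_squared and holdout_r2 are hypothetical helpers. The third kind of importance (role in a particular model's structure) is not shown.

```python
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def fit_ols(columns, y):
    """Least-squares fit of y on one or two predictor columns, with intercept."""
    n = len(y)
    means = [sum(c) / n for c in columns]
    mean_y = sum(y) / n
    cen = [[v - m for v in c] for c, m in zip(columns, means)]
    yc = [v - mean_y for v in y]
    if len(columns) == 1:
        coefs = [dot(cen[0], yc) / dot(cen[0], cen[0])]
    else:
        # Solve the 2x2 normal equations directly.
        a, b, d = dot(cen[0], cen[0]), dot(cen[0], cen[1]), dot(cen[1], cen[1])
        e, f = dot(cen[0], yc), dot(cen[1], yc)
        det = a * d - b * b
        coefs = [(d * e - b * f) / det, (a * f - b * e) / det]
    intercept = mean_y - sum(c * m for c, m in zip(coefs, means))
    return intercept, coefs

def r_squared(model, columns, y):
    """Share of outcome variance explained on the given data."""
    intercept, coefs = model
    preds = [intercept + sum(c * col[i] for c, col in zip(coefs, columns))
             for i in range(len(y))]
    mean_y = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, preds))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Simulated data: y depends on x1 only; x2 is a noisy copy of x1.
rng = random.Random(1)
n, split = 2000, 1500
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [v + rng.gauss(0, 0.3) for v in x1]
y = [v + rng.gauss(0, 0.5) for v in x1]

def holdout_r2(cols):
    """Fit on the first `split` records, measure R^2 on the rest."""
    model = fit_ols([c[:split] for c in cols], y[:split])
    return r_squared(model, [c[split:] for c in cols], y[split:])

full = holdout_r2([x1, x2])
alone = {"x1": holdout_r2([x1]), "x2": holdout_r2([x2])}
# Unique contribution: accuracy lost when the predictor is left out.
unique = {"x1": full - alone["x2"], "x2": full - alone["x1"]}

for name in ("x1", "x2"):
    print(f"{name}: alone R^2={alone[name]:.3f}, unique={unique[name]:.3f}")
```

Here x2 predicts well on its own but adds almost nothing once x1 is in the model, and x1's unique contribution looks small only because x2 masks it.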
The Predictive Modelling Master’s Data Meditation
• Start with a highly accurate, nonparametric model you are comfortable with.
• The accuracy of a highly accurate method is close to the theoretical limit of
accuracy possible on the data. World class experts may get closer, but not a whole
lot closer.
• So once you build the model, forget about the specific family you used. It is just a
• Measure model accuracy on out-of-sample data. Pay attention to any imbalances in
class or data subset accuracy.
• Measure model stability if necessary (it almost always is)
• Measure the importance of all variables, using the three main techniques.
• Measure again, holding some of the main predictors constant
• Measure (visualise) the effects of each predictor
• Build an interpretable model to help tell the story
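The advice to watch for imbalances in class accuracy can be illustrated with a deliberately bad classifier. The labels below are made up, and per_class_accuracy is a hypothetical helper written for this sketch.

```python
from collections import Counter

def per_class_accuracy(actual, predicted):
    """Accuracy computed separately for each actual class label."""
    totals = Counter(actual)
    correct = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {label: correct[label] / totals[label] for label in totals}

# Made-up hold-out set: 95 "no" outcomes and 5 "yes" outcomes.
actual = ["no"] * 95 + ["yes"] * 5
# A model that always predicts the majority class.
predicted = ["no"] * 100

overall = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
by_class = per_class_accuracy(actual, predicted)

print(f"overall accuracy: {overall:.2f}")
print(f"accuracy by class: {by_class}")
```

Overall accuracy looks high, yet the model never identifies the rare class, which is often the one the business cares about.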
The Master Sharpens the Sword : Getting More Accuracy
• There is never enough data
• Some model accuracy can result from trying other model families. Usually not
much, and not the best use of time, though for some reason the favourite activity of
new data miners.
• Some more model accuracy can result from tweaking model parameters. This is
perhaps less of a waste of time, but still not the ideal focus.
• The most dramatic improvement in model accuracy comes from new predictors.
• New predictors may be entirely new data sets, or complex new transformations of
existing data.
• A large, multi-tabular data set may well have information that has not been
captured in the data.
• The most common information of this type involves relations between individual
records (e.g. time series windows, geographic neighbourhoods or social network
statistics per record)
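A time series window is one example of a predictor built from relations between records. The sketch below assumes a hypothetical transaction table with customer, day and amount fields; add_trailing_mean is an illustrative helper, not part of any library. It uses only earlier records for each customer, so the new predictor does not leak the outcome being predicted.

```python
from collections import defaultdict, deque

def add_trailing_mean(records, window=3):
    """Add 'prev_mean': mean amount over the same customer's previous
    `window` records (None when the customer has no history yet)."""
    history = defaultdict(lambda: deque(maxlen=window))
    enriched = []
    for rec in sorted(records, key=lambda r: r["day"]):
        past = history[rec["customer"]]
        prev_mean = sum(past) / len(past) if past else None
        enriched.append({**rec, "prev_mean": prev_mean})
        # Update history only after the feature is computed.
        past.append(rec["amount"])
    return enriched

# Hypothetical records for two customers.
records = [
    {"customer": "a", "day": 1, "amount": 10.0},
    {"customer": "b", "day": 1, "amount": 5.0},
    {"customer": "a", "day": 2, "amount": 20.0},
    {"customer": "a", "day": 3, "amount": 30.0},
    {"customer": "b", "day": 4, "amount": 7.0},
    {"customer": "a", "day": 5, "amount": 40.0},
    {"customer": "a", "day": 6, "amount": 50.0},
]

enriched = add_trailing_mean(records)
for rec in enriched:
    print(rec["customer"], rec["day"], rec["prev_mean"])
```

The same pattern extends to geographic neighbourhoods or social network statistics: group records by the relation of interest, then summarise each record's neighbours.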
Illusions On the Path
• Colossal wastes of time can include
• Trying to find the “right” model family
• Getting stuck in data preprocessing trying to get all the predictors “right”
• Trying to figure out what the targets should be (usually a sign that the business
problem is not well understood)
• Trying to “improve” the model without defining what that means
The Sun Tzu of Modelling: Be Prepared
• Know what you are modelling and for what purpose.
• Know what your target variable is. You may have more than one.
• Do not hesitate: model with what you have, and add more predictors later.
• Messy data is better than no data
• Use the right error measures
• Know the connection between the model and your business
• Evaluate, interrogate the model accordingly
• Always question the business value of the analysis
• Always be ready to suggest the business use of the analysis
• Don’t assume that the client understands what to do with the model
Strategy and Tactics
• Why are you (re)building the model?
• If Strategic: what is going to be done with the insights? By whom?
• If Operational: what are the key metrics – accuracy, value, deployability?
Questions?

More Related Content

What's hot

Managing machine learning
Managing machine learningManaging machine learning
Managing machine learning
David Murgatroyd
Module 1 introduction to machine learning
Module 1  introduction to machine learningModule 1  introduction to machine learning
Module 1 introduction to machine learning
Sara Hooker
H2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandryH2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark Landry
Sri Ambati
6 Modelling Purposes
6 Modelling Purposes6 Modelling Purposes
6 Modelling Purposes
Bruce Edmonds
Artificial Intelligence Approaches
Artificial Intelligence  ApproachesArtificial Intelligence  Approaches
Artificial Intelligence Approaches
Jincy Nelson
Rm tutorial
Rm tutorialRm tutorial
Rm tutorial
Decision support systems
Decision support systemsDecision support systems
Decision support systems
910 plenary Elder
910 plenary Elder910 plenary Elder
910 plenary Elder
Rising Media, Inc.
Barga Data Science lecture 5
Barga Data Science lecture 5Barga Data Science lecture 5
Barga Data Science lecture 5
Roger Barga
Analytical Skills Tools and Attitudes 2013 Survey lavastorm analytics
Analytical Skills Tools and Attitudes 2013 Survey   lavastorm analyticsAnalytical Skills Tools and Attitudes 2013 Survey   lavastorm analytics
Analytical Skills Tools and Attitudes 2013 Survey lavastorm analytics
Creativity and Curiosity - The Trial and Error of Data Science
Creativity and Curiosity - The Trial and Error of Data ScienceCreativity and Curiosity - The Trial and Error of Data Science
Creativity and Curiosity - The Trial and Error of Data Science
MLSEV Virtual. State of the Art in ML
MLSEV Virtual. State of the Art in MLMLSEV Virtual. State of the Art in ML
MLSEV Virtual. State of the Art in ML
BigML, Inc
RESEARCH in software engineering
RESEARCH in software engineeringRESEARCH in software engineering
RESEARCH in software engineering
Ivano Malavolta
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and Evaluation
Sara Hooker
20211229120253D6323_PERT 06_ Ensemble Learning.pptx
20211229120253D6323_PERT 06_ Ensemble Learning.pptx20211229120253D6323_PERT 06_ Ensemble Learning.pptx
20211229120253D6323_PERT 06_ Ensemble Learning.pptx
Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...
Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...
Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...
IDEAS - Int'l Data Engineering and Science Association
Presentation research- chapter 10-11 istiqlal
Presentation research- chapter 10-11 istiqlalPresentation research- chapter 10-11 istiqlal
Presentation research- chapter 10-11 istiqlal
Machine Learning
Machine LearningMachine Learning
Machine Learning
Darshan Ambhaikar
Machine learning it is time...
Machine learning it is time...Machine learning it is time...
Machine learning it is time...
Sandip Chatterjee
Understanding your data with Bayesian networks (in Python) by Bartek Wilczyns...
Understanding your data with Bayesian networks (in Python) by Bartek Wilczyns...Understanding your data with Bayesian networks (in Python) by Bartek Wilczyns...
Understanding your data with Bayesian networks (in Python) by Bartek Wilczyns...

What's hot (20)

Managing machine learning
Managing machine learningManaging machine learning
Managing machine learning
Module 1 introduction to machine learning
Module 1  introduction to machine learningModule 1  introduction to machine learning
Module 1 introduction to machine learning
H2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandryH2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark Landry
6 Modelling Purposes
6 Modelling Purposes6 Modelling Purposes
6 Modelling Purposes
Artificial Intelligence Approaches
Artificial Intelligence  ApproachesArtificial Intelligence  Approaches
Artificial Intelligence Approaches
Rm tutorial
Rm tutorialRm tutorial
Rm tutorial
Decision support systems
Decision support systemsDecision support systems
Decision support systems
910 plenary Elder
910 plenary Elder910 plenary Elder
910 plenary Elder
Barga Data Science lecture 5
Barga Data Science lecture 5Barga Data Science lecture 5
Barga Data Science lecture 5
Analytical Skills Tools and Attitudes 2013 Survey lavastorm analytics
Analytical Skills Tools and Attitudes 2013 Survey   lavastorm analyticsAnalytical Skills Tools and Attitudes 2013 Survey   lavastorm analytics
Analytical Skills Tools and Attitudes 2013 Survey lavastorm analytics
Creativity and Curiosity - The Trial and Error of Data Science
Creativity and Curiosity - The Trial and Error of Data ScienceCreativity and Curiosity - The Trial and Error of Data Science
Creativity and Curiosity - The Trial and Error of Data Science
MLSEV Virtual. State of the Art in ML
MLSEV Virtual. State of the Art in MLMLSEV Virtual. State of the Art in ML
MLSEV Virtual. State of the Art in ML
RESEARCH in software engineering
RESEARCH in software engineeringRESEARCH in software engineering
RESEARCH in software engineering
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and Evaluation
20211229120253D6323_PERT 06_ Ensemble Learning.pptx
20211229120253D6323_PERT 06_ Ensemble Learning.pptx20211229120253D6323_PERT 06_ Ensemble Learning.pptx
20211229120253D6323_PERT 06_ Ensemble Learning.pptx
Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...
Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...
Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...
Presentation research- chapter 10-11 istiqlal
Presentation research- chapter 10-11 istiqlalPresentation research- chapter 10-11 istiqlal
Presentation research- chapter 10-11 istiqlal
Machine Learning
Machine LearningMachine Learning
Machine Learning
Machine learning it is time...
Machine learning it is time...Machine learning it is time...
Machine learning it is time...
Understanding your data with Bayesian networks (in Python) by Bartek Wilczyns...
Understanding your data with Bayesian networks (in Python) by Bartek Wilczyns...Understanding your data with Bayesian networks (in Python) by Bartek Wilczyns...
Understanding your data with Bayesian networks (in Python) by Bartek Wilczyns...

Viewers also liked

Best Practices In Predictive Analytics
Best Practices In Predictive AnalyticsBest Practices In Predictive Analytics
Best Practices In Predictive Analytics
Seismi Case Study | Oracle Mining Event | Santiago de Chile | 15 March 2012
Seismi Case Study | Oracle Mining Event | Santiago de Chile | 15 March 2012Seismi Case Study | Oracle Mining Event | Santiago de Chile | 15 March 2012
Seismi Case Study | Oracle Mining Event | Santiago de Chile | 15 March 2012
Seismi Limited
5 Benefits of Predictive Analytics for E-Commerce
5 Benefits of Predictive Analytics for E-Commerce5 Benefits of Predictive Analytics for E-Commerce
5 Benefits of Predictive Analytics for E-Commerce
Creating Your First Predictive Model In Python
Creating Your First Predictive Model In PythonCreating Your First Predictive Model In Python
Creating Your First Predictive Model In Python
Robert Dempsey
Webinar: The Whys and Hows of Predictive Modelling
Webinar: The Whys and Hows of Predictive Modelling Webinar: The Whys and Hows of Predictive Modelling
Webinar: The Whys and Hows of Predictive Modelling
Presentation Churn Management
Presentation Churn ManagementPresentation Churn Management
Presentation Churn Management
Introduction to Machine Learning (case studies)
Introduction to Machine Learning (case studies)Introduction to Machine Learning (case studies)
Introduction to Machine Learning (case studies)
Dmitry Efimov
Predictive analytics for E-commerce
Predictive analytics for E-commerce Predictive analytics for E-commerce
Predictive analytics for E-commerce
Niyuj - Delivering innovation
churn prediction in telecom
churn prediction in telecom churn prediction in telecom
churn prediction in telecom
Hong Bui Van
Artificial Intelligence, Predictive Modelling and Chatbots: Applications in P...
Artificial Intelligence, Predictive Modelling and Chatbots: Applications in P...Artificial Intelligence, Predictive Modelling and Chatbots: Applications in P...
Artificial Intelligence, Predictive Modelling and Chatbots: Applications in P...
Nick Brown
Amazon Machine Learning Case Study: Predicting Customer Churn
Amazon Machine Learning Case Study: Predicting Customer ChurnAmazon Machine Learning Case Study: Predicting Customer Churn
Amazon Machine Learning Case Study: Predicting Customer Churn
Amazon Web Services
AWS re:Invent 2016: Predicting Customer Churn with Amazon Machine Learning (M...
AWS re:Invent 2016: Predicting Customer Churn with Amazon Machine Learning (M...AWS re:Invent 2016: Predicting Customer Churn with Amazon Machine Learning (M...
AWS re:Invent 2016: Predicting Customer Churn with Amazon Machine Learning (M...
Amazon Web Services
Churn management
Churn managementChurn management
Churn management
Mohammed Akram Ayyubi
Data analytics telecom churn final ppt
Data analytics telecom churn final ppt Data analytics telecom churn final ppt
Data analytics telecom churn final ppt
Gunvansh Khanna
Churn Predictive Modelling
Churn Predictive ModellingChurn Predictive Modelling
Churn Predictive Modelling
Hugo E. Cisternas
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Xavier Amatriain
Big data ppt
Big data pptBig data ppt
Big data ppt
IDBI Bank Ltd.

Viewers also liked (17)

Best Practices In Predictive Analytics
Best Practices In Predictive AnalyticsBest Practices In Predictive Analytics
Best Practices In Predictive Analytics
Seismi Case Study | Oracle Mining Event | Santiago de Chile | 15 March 2012
Seismi Case Study | Oracle Mining Event | Santiago de Chile | 15 March 2012Seismi Case Study | Oracle Mining Event | Santiago de Chile | 15 March 2012
Seismi Case Study | Oracle Mining Event | Santiago de Chile | 15 March 2012
5 Benefits of Predictive Analytics for E-Commerce
5 Benefits of Predictive Analytics for E-Commerce5 Benefits of Predictive Analytics for E-Commerce
5 Benefits of Predictive Analytics for E-Commerce
Creating Your First Predictive Model In Python
Creating Your First Predictive Model In PythonCreating Your First Predictive Model In Python
Creating Your First Predictive Model In Python
Webinar: The Whys and Hows of Predictive Modelling
Webinar: The Whys and Hows of Predictive Modelling Webinar: The Whys and Hows of Predictive Modelling
Webinar: The Whys and Hows of Predictive Modelling
Presentation Churn Management
Presentation Churn ManagementPresentation Churn Management
Presentation Churn Management
Introduction to Machine Learning (case studies)
Introduction to Machine Learning (case studies)Introduction to Machine Learning (case studies)
Introduction to Machine Learning (case studies)
Predictive analytics for E-commerce
Predictive analytics for E-commerce Predictive analytics for E-commerce
Predictive analytics for E-commerce
churn prediction in telecom
churn prediction in telecom churn prediction in telecom
churn prediction in telecom
Artificial Intelligence, Predictive Modelling and Chatbots: Applications in P...
Artificial Intelligence, Predictive Modelling and Chatbots: Applications in P...Artificial Intelligence, Predictive Modelling and Chatbots: Applications in P...
Artificial Intelligence, Predictive Modelling and Chatbots: Applications in P...
Amazon Machine Learning Case Study: Predicting Customer Churn
Amazon Machine Learning Case Study: Predicting Customer ChurnAmazon Machine Learning Case Study: Predicting Customer Churn
Amazon Machine Learning Case Study: Predicting Customer Churn
AWS re:Invent 2016: Predicting Customer Churn with Amazon Machine Learning (M...
AWS re:Invent 2016: Predicting Customer Churn with Amazon Machine Learning (M...AWS re:Invent 2016: Predicting Customer Churn with Amazon Machine Learning (M...
AWS re:Invent 2016: Predicting Customer Churn with Amazon Machine Learning (M...
Churn management
Churn managementChurn management
Churn management
Data analytics telecom churn final ppt
Data analytics telecom churn final ppt Data analytics telecom churn final ppt
Data analytics telecom churn final ppt
Churn Predictive Modelling
Churn Predictive ModellingChurn Predictive Modelling
Churn Predictive Modelling
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Big data ppt
Big data pptBig data ppt
Big data ppt

Similar to The zen of predictive modelling

Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
HJ van Veen
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
Knowledge And Skill Forum
Modelling and evaluation
Modelling and evaluationModelling and evaluation
Modelling and evaluation
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Niko Vuokko
Informs presentation new ppt
Informs presentation new pptInforms presentation new ppt
Informs presentation new ppt
Salford Systems
Improving AI Development - Dave Litwiller - Jan 11 2022 - Public
Improving AI Development - Dave Litwiller - Jan 11 2022 - PublicImproving AI Development - Dave Litwiller - Jan 11 2022 - Public
Improving AI Development - Dave Litwiller - Jan 11 2022 - Public
Dave Litwiller
Statistical Learning and Model Selection module 2.pptx
Statistical Learning and Model Selection module 2.pptxStatistical Learning and Model Selection module 2.pptx
Statistical Learning and Model Selection module 2.pptx
Michael Bolton - Heuristics: Solving Problems Rapidly
Michael Bolton - Heuristics: Solving Problems RapidlyMichael Bolton - Heuristics: Solving Problems Rapidly
Michael Bolton - Heuristics: Solving Problems Rapidly
TEST Huddle
Barga Data Science lecture 2
Barga Data Science lecture 2Barga Data Science lecture 2
Barga Data Science lecture 2
Roger Barga
Data Analysis, Intepretation
Data Analysis, IntepretationData Analysis, Intepretation
Data Science 101
Data Science 101Data Science 101
Data Science 101
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignore
Turi, Inc.
Mixed Effects Models - Random Intercepts
Mixed Effects Models - Random InterceptsMixed Effects Models - Random Intercepts
Mixed Effects Models - Random Intercepts
Scott Fraundorf
Lecture 1
Lecture 1Lecture 1
Lecture 1
Aun Akbar
Multi task learning stepping away from narrow expert models 7.11.18
Multi task learning stepping away from narrow expert models 7.11.18Multi task learning stepping away from narrow expert models 7.11.18
Multi task learning stepping away from narrow expert models 7.11.18
Cloudera, Inc.
Data Analysis
Data AnalysisData Analysis
Ml2 production
Ml2 productionMl2 production
Ml2 production
Nikhil Ketkar

Similar to The zen of predictive modelling (20)

Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
Modelling and evaluation
Modelling and evaluationModelling and evaluation
Modelling and evaluation
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Informs presentation new ppt
Informs presentation new pptInforms presentation new ppt
Informs presentation new ppt
Improving AI Development - Dave Litwiller - Jan 11 2022 - Public
Improving AI Development - Dave Litwiller - Jan 11 2022 - PublicImproving AI Development - Dave Litwiller - Jan 11 2022 - Public
Improving AI Development - Dave Litwiller - Jan 11 2022 - Public
Statistical Learning and Model Selection module 2.pptx
Statistical Learning and Model Selection module 2.pptxStatistical Learning and Model Selection module 2.pptx
Statistical Learning and Model Selection module 2.pptx
Michael Bolton - Heuristics: Solving Problems Rapidly
Michael Bolton - Heuristics: Solving Problems RapidlyMichael Bolton - Heuristics: Solving Problems Rapidly
Michael Bolton - Heuristics: Solving Problems Rapidly
Barga Data Science lecture 2
Barga Data Science lecture 2Barga Data Science lecture 2
Barga Data Science lecture 2
Data Analysis, Intepretation
Data Analysis, IntepretationData Analysis, Intepretation
Data Analysis, Intepretation
Data Science 101
Data Science 101Data Science 101
Data Science 101
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignore
Mixed Effects Models - Random Intercepts
Mixed Effects Models - Random InterceptsMixed Effects Models - Random Intercepts
Mixed Effects Models - Random Intercepts
Lecture 1
Lecture 1Lecture 1
Lecture 1
Multi task learning stepping away from narrow expert models 7.11.18
Multi task learning stepping away from narrow expert models 7.11.18Multi task learning stepping away from narrow expert models 7.11.18
Multi task learning stepping away from narrow expert models 7.11.18
Data Analysis
Data AnalysisData Analysis
Data Analysis
Ml2 production
Ml2 productionMl2 production
Ml2 production

Recently uploaded

Chapter 6 - Test Tools Considerations V4.0
Chapter 6 - Test Tools Considerations V4.0Chapter 6 - Test Tools Considerations V4.0
Chapter 6 - Test Tools Considerations V4.0
Neeraj Kumar Singh
Database Management Myths for Developers
Database Management Myths for DevelopersDatabase Management Myths for Developers
Database Management Myths for Developers
John Sterrett
New ThousandEyes Product Features and Release Highlights: June 2024
New ThousandEyes Product Features and Release Highlights: June 2024New ThousandEyes Product Features and Release Highlights: June 2024
New ThousandEyes Product Features and Release Highlights: June 2024
How to Optimize Call Monitoring: Automate QA and Elevate Customer Experience
How to Optimize Call Monitoring: Automate QA and Elevate Customer ExperienceHow to Optimize Call Monitoring: Automate QA and Elevate Customer Experience
How to Optimize Call Monitoring: Automate QA and Elevate Customer Experience
ScyllaDB Topology on Raft: An Inside Look
ScyllaDB Topology on Raft: An Inside LookScyllaDB Topology on Raft: An Inside Look
ScyllaDB Topology on Raft: An Inside Look
Corporate Open Source Anti-Patterns: A Decade Later

The zen of predictive modelling

  • 1. Presciient Training The Zen of Predictive Modelling Eugene Dubossarsky eugene@presciient.com +61414573322 @cargomoose
  • 2. What This Talk Isn’t About But worth mentioning anyway: R and The Sydney Users of R Forum Analyst First My Courses
  • 3. Sydney Users of R Forum • Just 1 shy of 500 members • Regular meetups • Study groups: introduction to R, “Machine Learning for Hackers”, “Elements of Statistical Learning”
  • 4.
  • 5. R • Do a Google image search for “ggplot2” • Look for “r4stats”, “popularity” • Join SURF • Download R and start using it.
  • 6.
  • 7. Analyst First • Strategic, Cultural, Organisational, Human issues in analytics • Making analytics work in organisations • Focus on the Human side of analytics • International: Aust, NZ, Singapore, US, Japan, India, Hong Kong • analystfirst.com – see “core principles” and “what is analyst first?”
  • 8. My Analytics Training Courses • Predictive Modelling, Data Mining, R, Forensic Analytics, Visualisation, Forecasting training courses • Sydney, Melbourne, Canberra, Singapore • Public and in-house • Pre-prepared or customised • Informal coaching/mentoring • Strategy, Review, Advice and Assistance with Analytics Capability Development in your organisation
  • 9. The Zen of Predictive Modelling PredictiveModels • The Most Important Part of My “Predictive Modelling and Data Mining Course” • What every user of predictive modelling should know • What every manager and owner of predictive modelling capability must know • “Open Secrets” known to the masters
  • 10. The Zen of Predictive Modelling • To save people time • To see the forest for the trees • To get real value out of predictive analytics
  • 11. The Right Point of View Which is unlike the other two? • Kohonen neural network • Backpropagation neural network • CART decision tree
  • 12. The Right Point of View Which is unlike the other two? • CART decision tree • Random Forest • Support Vector Machine
  • 13. The Right Point of View Which is unlike the other two? • Backpropagation Neural Network • Linear Model • CART Decision Tree
  • 14. The Right Point of View • Out Of Sample Accuracy • Robustness (Out of Time Accuracy) • Interpretability • Implementability
  • 17. The Right Point of View Why build predictive models? • Insights • Operational prediction • “What-if” analysis
  • 18. What Do All Predictive Models Have in Common? All Predictive Models: • Have a training set of predictors and outcomes • Probably have a cross-validation and test set of predictors and outcomes too • Are “fit” (optimised) to minimise an error function between their actual and target outcomes • Are probably cross-validated on an out-of-sample data set to control overfitting • Provide information on the relationship between the predictors and outcomes in the data • Can be used to score new data (make new predictions) • Can be deployed in IT systems • Can be interrogated for insights • Are only as accurate as the data allows • Provide a (fairly) accurate estimate of how well they will predict on new data
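The workflow shared by all models on this slide can be sketched in a few lines. This is a minimal illustration in plain Python, not the course's own code: the data is synthetic, and `fit_mean`, `fit_line` and `mse` are hypothetical helper names. The point is that two very different "models" are fitted, scored and evaluated through exactly the same steps.

```python
import random

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Least-squares line y = a + b*x (minimises squared error on the training set)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given records."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Synthetic data (an assumption for illustration): y = 2x + noise
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2 * x + rng.gauss(0, 1) for x in xs]

# Hold out the last 50 records; the models never see them while fitting
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

results = {}
for name, fit in [("mean", fit_mean), ("line", fit_line)]:
    model = fit(train_x, train_y)
    results[name] = {
        "train_mse": mse(model, train_x, train_y),
        "test_mse": mse(model, test_x, test_y),  # estimate of error on new data
    }
    print(name, results[name])

# Once fitted, scoring the same record always gives the same prediction
line_model = fit_line(train_x, train_y)
same_prediction = line_model(3.0) == line_model(3.0)
```

The test-set error, not the training error, is the number to report as the estimate of how well the model will predict on new data.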
  • 19. What Do All Predictive Model Insights Have in Common? All Predictive Models: • Have variable importance measures (a number of which can be applied to any model) • Allow plotting predictors vs outcomes • Have variable accuracy measures • Can be resampled for more robust measures of accuracy
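One common way to resample for a more robust accuracy measure is k-fold cross-validation. The sketch below assumes a simple least-squares line as a stand-in model and synthetic data; `fit_line` and `k_fold_mse` are hypothetical names, and any model with a fit and predict step could be substituted.

```python
import random
import statistics

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope

def k_fold_mse(xs, ys, k=5, seed=0):
    """Out-of-sample MSE for each of k folds; every record is held out exactly once."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    scores = []
    for fold in range(k):
        held_out = idx[fold::k]
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(a + b * xs[i] - ys[i]) ** 2 for i in held_out]
        scores.append(statistics.fmean(errors))
    return scores

# Synthetic data (an assumption for illustration): y = 2x + noise
rng = random.Random(7)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 * x + rng.gauss(0, 1) for x in xs]

scores = k_fold_mse(xs, ys, k=5)
mean_score = statistics.fmean(scores)
spread = statistics.stdev(scores)
print(f"MSE per fold: {[round(s, 2) for s in scores]}")
print(f"mean = {mean_score:.2f}, spread = {spread:.2f}")
```

The spread across folds gives a rough sense of how much a single hold-out estimate could vary.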
  • 20. What Do All Predictive Model Predictions Have in Common? All Predictive Models: • Make predictions that are numeric: estimates of amount for regression, and probability for classification • All predictions are applications of the underlying model structure and parameters (formula) to new predictor data sets • All predictions are deterministic. Once a model is fitted, the predictions for a given record will be the same every time. (Though the prediction may be a distribution rather than a fixed point. Also, note that model fitting itself may be random: some models may differ slightly each time they are fitted to the same data set.)
  • 21. How Do Predictive Model Families Differ? • Classification vs Regression (most families can do both) • Predictive accuracy vs insights • Predictive accuracy vs stability • Deterministic fitting vs randomised fitting • Specific insights • Structure and complexity • Model assumptions (linear models, neural nets) • Model structure (trees vs additive models vs SVM vs Neural Nets etc) • The kinds of insights models provide • Tendency to overfit (most, but not all) • Dependence on metrics • Sensitivity to missing values and categorical variables
  • 22. Becoming a Master of Modelling Kung Fu • Predictive models should be thought of as a “black box” initially, with the characteristics that all models have in common recognised • The focus should be on the data, not the model. • Focusing on the specific characteristics of the model is important when deciding on the degree of accuracy desired and the kinds of insights desired. • It is good to start by working with one highly accurate, simple to use method (randomForest is a good choice) and one or two highly interpretable models (rpart decision trees and (generalised) linear models are good here). • In fact, you can go a long way with just randomForest alone.
  • 23. Becoming a Master of Modelling Kung Fu • Master an adequate tool. • Empty your mind of the tool. It is an illusion. • Meditate on the data.
  • 24. Meditating on Data • Start with a highly accurate, nonparametric model you are comfortable with. • The accuracy of a highly accurate method is close to the theoretical limit of accuracy possible on the data. World class experts may get closer, but not a whole lot closer. • So once you build the model, forget about the specific family you used. It is just a tool. • Each predictor may provide a unique amount of predictability to the model. Measure it. • Each predictor may be masked by other predictors. Be careful. • Check relationships between data and strongest predictors
  • 25. Meditating on Data • There are at least 3 ways that a predictor can be important. They are not the same: • What is the unique contribution of the predictor to the accuracy of the model? • What is the individual predictive power of the predictor alone? • How vital is the predictor to the structure of a particular model? • The first two are about the data, the third is more about the specific model. Which is more important?
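The difference between the first two kinds of importance, and the masking effect mentioned earlier, can be shown on a small synthetic example. This is a sketch under stated assumptions rather than the method taught in the course: the model is an ordinary least-squares fit written in plain Python, "unique contribution" is measured by refitting without the predictor, "individual power" by fitting on that predictor alone, and all names (`fit_ols`, `holdout_mse`, `x1`..`x3`) are hypothetical. Here `x2` is a noisy copy of `x1`, and `x3` is irrelevant.

```python
import random

def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def holdout_mse(cols):
    """Fit on the training rows using only `cols`; return MSE on the test rows."""
    pick = lambda rows: [[r[c] for c in cols] for r in rows]
    coef = fit_ols(pick(train_X), train_y)
    preds = [coef[0] + sum(b * v for b, v in zip(coef[1:], r)) for r in pick(test_X)]
    return sum((p - t) ** 2 for p, t in zip(preds, test_y)) / len(test_y)

# Synthetic data (an assumption for illustration)
rng = random.Random(1)
X, y = [], []
for _ in range(400):
    x1 = rng.uniform(0, 1)
    x2 = x1 + rng.gauss(0, 0.05)  # near-copy of x1
    x3 = rng.uniform(0, 1)        # unrelated to the outcome
    X.append([x1, x2, x3])
    y.append(3 * x1 + rng.gauss(0, 0.1))
train_X, test_X = X[:300], X[300:]
train_y, test_y = y[:300], y[300:]

names = ["x1", "x2", "x3"]
all_cols = [0, 1, 2]
full = holdout_mse(all_cols)
baseline = holdout_mse([])  # intercept-only model

# 1. Unique contribution: how much worse is the model without this predictor?
unique = {n: holdout_mse([c for c in all_cols if c != j]) - full
          for j, n in enumerate(names)}
# 2. Individual power: how much better than the baseline is this predictor alone?
alone = {n: baseline - holdout_mse([j]) for j, n in enumerate(names)}

for n in names:
    print(f"{n}: unique = {unique[n]:.4f}, alone = {alone[n]:.4f}")
```

In this setup `x2` is a strong predictor on its own but adds almost nothing once `x1` is in the model, which is the masking the slides warn about. The third kind of importance would be read from the fitted model itself (for example its coefficients or splits) and so depends on the model family chosen.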
  • 27. The Predictive Modelling Master’s Data Meditation • Start with a highly accurate, nonparametric model you are comfortable with. • The accuracy of a highly accurate method is close to the theoretical limit of accuracy possible on the data. World class experts may get closer, but not a whole lot closer. • So once you build the model, forget about the specific family you used. It is just a tool. • Measure model accuracy on out-of-sample data. Pay attention to any imbalances in class or data subset accuracy. • Measure model stability if necessary (it almost always is). • Measure the importance of all variables, using the three main techniques. • Measure again, holding some of the main predictors constant. • Measure (visualise) the effects of each predictor. • Build an interpretable model to help tell the story.
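The advice to watch for imbalances in class accuracy can be made concrete with a small check. The labels and predictions below are made up for illustration, and `accuracy_by_class` is a hypothetical helper: a model that looks accurate overall can still miss most of the rare class.

```python
from collections import Counter

def accuracy_by_class(y_true, y_pred):
    """Fraction of records of each true class that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

# Made-up out-of-sample results: 90 "no" records and 10 "yes" records
y_true = ["no"] * 90 + ["yes"] * 10
y_pred = ["no"] * 90 + ["no"] * 8 + ["yes"] * 2

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_class = accuracy_by_class(y_true, y_pred)

print("overall accuracy:", overall)      # 0.92
print("accuracy by class:", per_class)   # {'no': 1.0, 'yes': 0.2}
```

The same breakdown can be applied to any data subset (region, time period, customer segment) by grouping on that field instead of the true class.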
  • 28. The Master Sharpens the Sword: Getting More Accuracy • There is never enough data • Some model accuracy can result from trying other model families. Usually not much, and not the best use of time, though for some reason it is the favourite activity of new data miners. • Some more model accuracy can result from tweaking model parameters. This is perhaps less of a waste of time, but still not the ideal focus. • The most dramatic improvement in model accuracy comes from new predictors. • New predictors may be entirely new data sets, or complex new transformations of existing data. • A large, multi-tabular data set may well hold information that has not been captured in the existing predictors. • The most common information of this type involves relations between individual records (e.g. time series windows, geographic neighbourhoods or social network statistics per record).
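A time series window is one example of a predictor built from relations between records. The sketch below is an illustration only: the record layout (`customer`, `day`, `amount`) and the function name `add_trailing_mean` are assumptions. For each record it adds the mean of that customer's previous amounts, using only earlier records so that the new predictor does not leak the value being predicted.

```python
from collections import defaultdict, deque

def add_trailing_mean(records, window=3):
    """Add 'prev_mean': the mean of the same customer's previous `window` amounts.

    Records with no history get None. Output is sorted by customer, then day.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    out = []
    for rec in sorted(records, key=lambda r: (r["customer"], r["day"])):
        past = history[rec["customer"]]
        prev_mean = sum(past) / len(past) if past else None
        out.append({**rec, "prev_mean": prev_mean})
        past.append(rec["amount"])  # becomes history only for later records
    return out

# Made-up transactions for two customers
records = [
    {"customer": "A", "day": 1, "amount": 10},
    {"customer": "A", "day": 2, "amount": 20},
    {"customer": "B", "day": 1, "amount": 5},
    {"customer": "A", "day": 3, "amount": 30},
    {"customer": "B", "day": 2, "amount": 15},
    {"customer": "A", "day": 4, "amount": 40},
]

enriched = add_trailing_mean(records, window=2)
for rec in enriched:
    print(rec)
```

Geographic neighbourhood or social network statistics follow the same pattern: group related records, summarise them, and attach the summary to each record as a new column.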
  • 29. Illusions On the Path • Colossal wastes of time can include: • Trying to find the “right” model family • Getting stuck in data preprocessing trying to get all the predictors “right” • Trying to figure out what the targets should be (usually a sign that the business problem is not well understood) • Trying to “improve” the model without defining what that means
  • 30. The Sun Tzu of Modelling: Be Prepared • Know what you are modelling and for what purpose. • Know what your target variable is. You may have more than one. • Do not hesitate: model with what you have, and add more predictors later. • Messy data is better than no data. • Use the right error measures. • Know the connection between the model and your business. • Evaluate and interrogate the model accordingly. • Always question the business value of the analysis. • Always be ready to suggest the business use of the analysis. • Don’t assume that the client understands what to do with the model.
  • 31. Strategy and Tactics • Why are you (re)building the model? • If Strategic: what is going to be done with the insights? By whom? • If Operational: what are the key metrics (accuracy, value, deployability)?