1© Cloudera, Inc. All rights reserved.
Random Decision Forests at Scale
Todd M. Boetticher | Solution Consultant
Overview
• Decision trees
• Introduction to decision trees
• Building a decision tree with Spark
• Tuning your model with hyperparameters
• Random Decision Forests
• Introduction to Random Decision Forests
• Handicapping individual learners
• Deploying random decision trees in Spark
Decision Tree
The decision tree is one of the most commonly used classification techniques; according to a recent survey, it is the most common technique in use today.
A decision tree is a sequence of questions whose outcomes eventually lead to a decision. Like the game 20 Questions, decision trees use features to split the data into subsets that will give you the best results.
Benefits and disadvantages of Decision Trees
Pros
• Computationally cheap to use
• Easy for humans to understand learned results
• Missing values OK
• Capable of dealing with irrelevant features
Cons
• Prone to overfitting
• Lacks the performance available through other methods
• Small changes in the data can have enormous impacts on the learned tree
Iris Dataset
6.4, 3.2, 4.5, 1.5, Iris-versicolor
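To make the idea of feature-based splits concrete, the sketch below applies a small hand-written tree to the sample row above. The feature order (sepal length, sepal width, petal length, petal width) follows the standard Iris layout, and the two thresholds are illustrative assumptions rather than values taken from the slides; a trained tree would learn its own.

```python
# A hand-written decision tree for Iris (a sketch, not a trained model).
# Feature order: sepal length, sepal width, petal length, petal width (cm).
# The thresholds 2.45 and 1.75 are illustrative assumptions.

def classify_iris(sample):
    """Walk the tree from the root and return the predicted species."""
    sepal_length, sepal_width, petal_length, petal_width = sample
    if petal_length < 2.45:      # first question: is the petal short?
        return "Iris-setosa"
    if petal_width < 1.75:       # second question: is the petal narrow?
        return "Iris-versicolor"
    return "Iris-virginica"


sample = (6.4, 3.2, 4.5, 1.5)
prediction = classify_iris(sample)
print(prediction)
```

Each `if` is one question in the game: it splits the remaining data into two subsets, and the leaves carry the final decision.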
Possible Decision Trees
Interpreting Models
A simple graphical explanation of how decision boundaries divide the feature space into regions.
Decision Breakdown
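import subprocess
import sys

from sklearn.tree import export_graphviz


def visualize_tree(tree, feature_names):
    """Render a fitted decision tree to dt.png using Graphviz.

    tree -- fitted scikit-learn DecisionTreeClassifier.
    feature_names -- list of feature names.
    """
    # Write the tree in Graphviz DOT format.
    with open("dt.dot", "w") as f:
        export_graphviz(tree, out_file=f, feature_names=feature_names)

    # Convert the DOT file to a PNG with the Graphviz `dot` command.
    command = ["dot", "-Tpng", "dt.dot", "-o", "dt.png"]
    try:
        subprocess.check_call(command)
    except (OSError, subprocess.CalledProcessError):
        sys.exit("Could not run dot, i.e. Graphviz, to produce visualization")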
Building a Decision Tree
Name and Address Matching
• Potential uses in every industry
• Marketing
• Defense and law enforcement
• Bank Secrecy Act and Patriot Act Compliance
RN-KOMSOMOLSKY LLC, 352 DeerPath Ave SW, Leesburg, Virginia, United States
vs.
KOMSOMOLSKY REFINERY, 1621 Parkcrest Cir, Reston, Virginia, United States
Name and Address Matching
Name  Address  City  State  Country  Hit
0.81  0.0      0.0   0.78   1.0      True-Positive
0.6   0.6      0.2   0.0    0.0      False-Positive
0.2   0.0      0.0   0.0    0.0      False-Positive
0.91  0.91     0.91  0.91   1.0      True-Positive
0.78  0.0      0.0   0.78   0.78     True-Positive
0.4   0.4      0.36  0.0    1.0      False-Positive
1.0   0.0      0.0   0.0    0.0      False-Positive
Example row: 0.78, 0.0, 0.0, 0.78, 0.78, True-Positive
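The slides do not say how the per-field similarity scores were produced. As a minimal sketch of one way to turn a pair of records into such a feature vector, the example below assumes a plain string-similarity ratio from Python's standard `difflib` module; the field names and the `feature_vector` helper are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical field names, in the column order used by the table.
FIELDS = ("name", "address", "city", "state", "country")


def similarity(a, b):
    """Return a similarity ratio in [0, 1] for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def feature_vector(record_a, record_b):
    """Return one rounded similarity score per field, in FIELDS order."""
    return [round(similarity(record_a[f], record_b[f]), 2) for f in FIELDS]


candidate = {
    "name": "RN-KOMSOMOLSKY LLC",
    "address": "352 DeerPath Ave SW",
    "city": "Leesburg",
    "state": "Virginia",
    "country": "United States",
}
watchlist_entry = {
    "name": "KOMSOMOLSKY REFINERY",
    "address": "1621 Parkcrest Cir",
    "city": "Reston",
    "state": "Virginia",
    "country": "United States",
}

features = feature_vector(candidate, watchlist_entry)
print(dict(zip(FIELDS, features)))
```

Each such vector, paired with a reviewed label (true positive or false positive), becomes one training row for the classifier.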
Building our first decision tree in MLlib
Evaluating a Decision Tree
0.82243123866534172
~82.24% accuracy
Benchmark vs Random
0.2718652198532764
~27.19% accuracy
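The two figures above compare a trained tree against random guessing. The slides do not show how either number was computed, so the following is only a sketch of the comparison: accuracy is the fraction of predictions that match the true labels, and the baseline guesses a class uniformly at random. The labels are the Hits column from the matching table; the `predictions` list is made up for illustration.

```python
import random


def accuracy(predictions, labels):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def random_baseline(labels, classes, seed=0):
    """Accuracy of guessing a class uniformly at random for every row."""
    rng = random.Random(seed)
    guesses = [rng.choice(classes) for _ in labels]
    return accuracy(guesses, labels)


# Labels from the Hits column (TP = True-Positive, FP = False-Positive).
labels = ["TP", "FP", "FP", "TP", "TP", "FP", "FP"]
# Hypothetical model output with one mistake (third row).
predictions = ["TP", "FP", "TP", "TP", "TP", "FP", "FP"]

model_accuracy = accuracy(predictions, labels)
baseline_accuracy = random_baseline(labels, ["TP", "FP"])
print(f"model: {model_accuracy:.4f}  random baseline: {baseline_accuracy:.4f}")
```

A model is only interesting if it clears the baseline by a wide margin, which is the point of the benchmark slide.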
Hyperparameters
Impurity: Gini, Entropy, Variance
A measure of how mixed the labels at a node are. For classification it is generally computed with Gini impurity or entropy; variance is used for regression.
Maximum Depth
The maximum depth of the tree; nodes at this depth are not split further.
Maximum Bins
The number of bins used when discretizing continuous features to find candidate splits.
trainClassifier(trainingData, numClasses,
categoricalFeaturesInfo, numTrees, featureSubsetStrategy,
impurity, maxDepth, maxBins)
Impurity: Gini, Entropy, Variance
Gini Impurity
$G = 1 - \sum_{i=1}^{n} P_i^2$
Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. Here $P_i$ is the fraction of elements in the subset that carry label $i$.
Entropy
$H = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)$
Entropy is defined as the expected value of information. First, we need to define information. If you're classifying something that can take on multiple values, the information for symbol $x_i$ is defined as $l(x_i) = -\log_2 P(x_i)$, where $P(x_i)$ is the probability of that class.
Variance
$\frac{1}{n} \sum_{i=1}^{n} (y_i - \mu)^2$
$y_i$ is the label for an instance, $n$ is the number of instances, and $\mu$ is the mean given by $\mu = \frac{1}{n} \sum_{i=1}^{n} y_i$.
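The three impurity measures can be sketched in a few lines of plain Python. This is an illustration of the formulas only, not Spark MLlib's implementation; the function names are chosen here for readability.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class fractions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: expected value of -log2 P(x_i)."""
    n = len(labels)
    return -sum((count / n) * log2(count / n)
                for count in Counter(labels).values())


def variance(values):
    """Population variance of numeric labels (used for regression trees)."""
    n = len(values)
    mu = sum(values) / n
    return sum((y - mu) ** 2 for y in values) / n


mixed = ["match", "match", "no-match", "no-match"]
print(gini(mixed), entropy(mixed), variance([1, 2, 3, 4]))
```

A perfectly mixed two-class node has the highest impurity, while a node holding a single class scores zero under both Gini and entropy, so a good split is one that drives the children toward zero.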
Maximum Depth
• Maximum tree depth is a limit to
stop further splitting of nodes when
the specified tree depth has been
reached during the building of the
initial decision tree.
• The absolute maximum depth
would be N−1, where N is the
number of training samples. You
can derive this by considering that
the least effective split would be
peeling off one training example
per node.
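The N−1 bound can be checked with a tiny simulation of that least effective split. This is only a sketch of the counting argument; `worst_case_depth` is a name introduced here for illustration.

```python
def worst_case_depth(num_samples):
    """Depth reached when every split peels off exactly one sample."""
    depth = 0
    remaining = num_samples
    while remaining > 1:   # a node holding one sample cannot be split
        remaining -= 1     # one sample goes to a leaf, the rest continue
        depth += 1
    return depth


for n in (1, 2, 5, 100):
    print(n, worst_case_depth(n))
```

In practice the maximum depth hyperparameter is set far below this bound, since a tree that isolates every training example has memorized the data.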
From Decision Trees to Random Decision Forests
Wisdom of the crowds
• The wisdom of the crowd is the collective opinion of a group of individuals rather
than that of a single expert.
• A large group's aggregated answers to questions involving quantity estimation,
general world knowledge, and spatial reasoning have generally been found to be
as good as, and often better than, the answer given by any of the individuals
within the group. An explanation for this phenomenon is that there is
idiosyncratic noise associated with each individual judgment, and taking the
average over a large number of responses will go some way toward canceling the
effect of this noise.
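The noise-canceling effect can be sketched with a small simulation. The setup is an assumption made for illustration: each individual's guess is the true value plus independent Gaussian noise, and the crowd's answer is the plain average of all guesses.

```python
import random
from statistics import mean


def crowd_vs_individuals(truth, crowd_size, noise_sd, seed=0):
    """Return (error of the averaged guess, average individual error)."""
    rng = random.Random(seed)
    # Each guess = truth + idiosyncratic noise (assumed Gaussian here).
    guesses = [truth + rng.gauss(0, noise_sd) for _ in range(crowd_size)]
    crowd_error = abs(mean(guesses) - truth)
    individual_error = mean(abs(g - truth) for g in guesses)
    return crowd_error, individual_error


crowd_error, individual_error = crowd_vs_individuals(
    truth=100.0, crowd_size=1000, noise_sd=20.0
)
print(f"crowd error: {crowd_error:.2f}")
print(f"average individual error: {individual_error:.2f}")
```

The averaged guess can never be further from the truth than the average individual is, and with independent noise it is usually much closer. The same reasoning motivates averaging many noisy trees.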
What Is a Random Forest?
A random forest grows many classification trees. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
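The voting step can be sketched as follows. The three "trees" are hypothetical one-question classifiers written as plain functions, standing in for trained trees; only the majority-vote logic is the point.

```python
from collections import Counter


def forest_predict(trees, x):
    """Send x down every tree and return the class with the most votes."""
    votes = [tree(x) for tree in trees]
    # most_common(1) returns the top class; ties go to the class seen first.
    return Counter(votes).most_common(1)[0][0]


# Hypothetical stand-ins for trained trees. Input order:
# name, address, city, state, country similarity scores.
trees = [
    lambda x: "match" if x[0] > 0.7 else "no-match",   # asks about name
    lambda x: "match" if x[3] > 0.5 else "no-match",   # asks about state
    lambda x: "match" if x[1] > 0.5 else "no-match",   # asks about address
]

prediction = forest_predict(trees, [0.78, 0.0, 0.0, 0.78, 0.78])
print(prediction)
```

Here two of the three trees vote "match", so the forest's answer is "match" even though one tree disagrees.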
How to Create a Crowd?
Random Decision Forests with Spark MLlib
RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
Number of Trees
Number of trees in the forest. Increasing the number of trees will decrease the variance in predictions, improving the model’s test-time accuracy. Training time increases roughly linearly in the number of trees.
Feature Subset Strategy
Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low.
~98.67% accuracy
Trees See Subsets of Examples
Or Subsets of Features
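Both kinds of randomness can be sketched in plain Python. This is an illustration under stated assumptions, not Spark MLlib's internals: rows are drawn with replacement (a bootstrap sample), and the feature subset uses a square-root rule as one common choice of "function of the total number of features". The helper names are hypothetical.

```python
import math
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some rows repeat."""
    return [rng.choice(rows) for _ in rows]


def feature_subset(num_features, rng):
    """Pick about sqrt(num_features) distinct feature indices."""
    k = max(1, int(math.sqrt(num_features)))
    return sorted(rng.sample(range(num_features), k))


rng = random.Random(42)

# Similarity rows from the name and address matching table.
rows = [
    (0.81, 0.0, 0.0, 0.78, 1.0),
    (0.6, 0.6, 0.2, 0.0, 0.0),
    (0.2, 0.0, 0.0, 0.0, 0.0),
    (0.91, 0.91, 0.91, 0.91, 1.0),
    (0.78, 0.0, 0.0, 0.78, 0.78),
    (0.4, 0.4, 0.36, 0.0, 1.0),
    (1.0, 0.0, 0.0, 0.0, 0.0),
]

sample = bootstrap_sample(rows, rng)   # what one tree gets to see
subset = feature_subset(5, rng)        # candidate features for one split
print(len(sample), subset)
```

Because each tree sees different rows and considers different features, the trees make different mistakes, which is what lets the vote cancel individual errors.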
Thank you
tboetticher@cloudera.com

More Related Content

What's hot

Becoming Data-Driven Through Cultural Change
Becoming Data-Driven Through Cultural ChangeBecoming Data-Driven Through Cultural Change
Becoming Data-Driven Through Cultural Change
Cloudera, Inc.
 
The Five Markers on Your Big Data Journey
The Five Markers on Your Big Data JourneyThe Five Markers on Your Big Data Journey
The Five Markers on Your Big Data Journey
Cloudera, Inc.
 
Transforming Insurance Analytics with Big Data and Automated Machine Learning

Transforming Insurance Analytics with Big Data and Automated Machine Learning
Transforming Insurance Analytics with Big Data and Automated Machine Learning

Transforming Insurance Analytics with Big Data and Automated Machine Learning

Cloudera, Inc.
 
2016 Cybersecurity Analytics State of the Union
2016 Cybersecurity Analytics State of the Union2016 Cybersecurity Analytics State of the Union
2016 Cybersecurity Analytics State of the Union
Cloudera, Inc.
 
The Vortex of Change - Digital Transformation (Presented by Intel)
The Vortex of Change - Digital Transformation (Presented by Intel)The Vortex of Change - Digital Transformation (Presented by Intel)
The Vortex of Change - Digital Transformation (Presented by Intel)
Cloudera, Inc.
 
Preparing for the Cybersecurity Renaissance
Preparing for the Cybersecurity RenaissancePreparing for the Cybersecurity Renaissance
Preparing for the Cybersecurity Renaissance
Cloudera, Inc.
 
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Cloudera, Inc.
 
IoT-Enabled Predictive Maintenance
IoT-Enabled Predictive MaintenanceIoT-Enabled Predictive Maintenance
IoT-Enabled Predictive Maintenance
Cloudera, Inc.
 
Optimizing Regulatory Compliance with Big Data
Optimizing Regulatory Compliance with Big DataOptimizing Regulatory Compliance with Big Data
Optimizing Regulatory Compliance with Big Data
Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Cloudera, Inc.
 
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesWebinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
DataStax
 
Demystifying ML & AI
Demystifying ML & AIDemystifying ML & AI
Demystifying ML & AI
Cloudera, Inc.
 
Delivering improved patient outcomes through advanced analytics 6.26.18
Delivering improved patient outcomes through advanced analytics 6.26.18Delivering improved patient outcomes through advanced analytics 6.26.18
Delivering improved patient outcomes through advanced analytics 6.26.18
Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
Cloudera, Inc.
 
Enterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big DataEnterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big Data
Cloudera, Inc.
 
How Cloudera SDX can aid GDPR compliance 6.21.18
How Cloudera SDX can aid GDPR compliance 6.21.18How Cloudera SDX can aid GDPR compliance 6.21.18
How Cloudera SDX can aid GDPR compliance 6.21.18
Cloudera, Inc.
 
How Virtual Reality and Machine Learning Are Powering the New Age of Network ...
How Virtual Reality and Machine Learning Are Powering the New Age of Network ...How Virtual Reality and Machine Learning Are Powering the New Age of Network ...
How Virtual Reality and Machine Learning Are Powering the New Age of Network ...
DataStax
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
Cloudera, Inc.
 
Get Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber SolutionGet Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber Solution
Cloudera, Inc.
 
Data Science in Enterprise
Data Science in EnterpriseData Science in Enterprise
Data Science in Enterprise
Josh Yeh
 

What's hot (20)

Becoming Data-Driven Through Cultural Change
Becoming Data-Driven Through Cultural ChangeBecoming Data-Driven Through Cultural Change
Becoming Data-Driven Through Cultural Change
 
The Five Markers on Your Big Data Journey
The Five Markers on Your Big Data JourneyThe Five Markers on Your Big Data Journey
The Five Markers on Your Big Data Journey
 
Transforming Insurance Analytics with Big Data and Automated Machine Learning

Transforming Insurance Analytics with Big Data and Automated Machine Learning
Transforming Insurance Analytics with Big Data and Automated Machine Learning

Transforming Insurance Analytics with Big Data and Automated Machine Learning

 
2016 Cybersecurity Analytics State of the Union
2016 Cybersecurity Analytics State of the Union2016 Cybersecurity Analytics State of the Union
2016 Cybersecurity Analytics State of the Union
 
The Vortex of Change - Digital Transformation (Presented by Intel)
The Vortex of Change - Digital Transformation (Presented by Intel)The Vortex of Change - Digital Transformation (Presented by Intel)
The Vortex of Change - Digital Transformation (Presented by Intel)
 
Preparing for the Cybersecurity Renaissance
Preparing for the Cybersecurity RenaissancePreparing for the Cybersecurity Renaissance
Preparing for the Cybersecurity Renaissance
 
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
 
IoT-Enabled Predictive Maintenance
IoT-Enabled Predictive MaintenanceIoT-Enabled Predictive Maintenance
IoT-Enabled Predictive Maintenance
 
Optimizing Regulatory Compliance with Big Data
Optimizing Regulatory Compliance with Big DataOptimizing Regulatory Compliance with Big Data
Optimizing Regulatory Compliance with Big Data
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesWebinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
 
Demystifying ML & AI
Demystifying ML & AIDemystifying ML & AI
Demystifying ML & AI
 
Delivering improved patient outcomes through advanced analytics 6.26.18
Delivering improved patient outcomes through advanced analytics 6.26.18Delivering improved patient outcomes through advanced analytics 6.26.18
Delivering improved patient outcomes through advanced analytics 6.26.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 
Enterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big DataEnterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big Data
 
How Cloudera SDX can aid GDPR compliance 6.21.18
How Cloudera SDX can aid GDPR compliance 6.21.18How Cloudera SDX can aid GDPR compliance 6.21.18
How Cloudera SDX can aid GDPR compliance 6.21.18
 
How Virtual Reality and Machine Learning Are Powering the New Age of Network ...
How Virtual Reality and Machine Learning Are Powering the New Age of Network ...How Virtual Reality and Machine Learning Are Powering the New Age of Network ...
How Virtual Reality and Machine Learning Are Powering the New Age of Network ...
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Get Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber SolutionGet Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber Solution
 
Data Science in Enterprise
Data Science in EnterpriseData Science in Enterprise
Data Science in Enterprise
 

Similar to Random Decision Forests at Scale

Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Edureka!
 
TreeNet Overview - Updated October 2012
TreeNet Overview  - Updated October 2012TreeNet Overview  - Updated October 2012
TreeNet Overview - Updated October 2012
Salford Systems
 
Top 3 Considerations for Machine Learning on Big Data
Top 3 Considerations for Machine Learning on Big DataTop 3 Considerations for Machine Learning on Big Data
Top 3 Considerations for Machine Learning on Big Data
Datameer
 
Decision Trees
Decision TreesDecision Trees
Random Forests Lightning Talk
Random Forests Lightning TalkRandom Forests Lightning Talk
Random Forests Lightning Talk
Enplus Advisors, Inc.
 
The Data Science Product Management Toolkit
The Data Science Product Management ToolkitThe Data Science Product Management Toolkit
The Data Science Product Management Toolkit
Jack Moore
 
Introduction to Random Forest
Introduction to Random Forest Introduction to Random Forest
Introduction to Random Forest
Rupak Roy
 
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
Edureka!
 
unit 5 decision tree2.pptx
unit 5 decision tree2.pptxunit 5 decision tree2.pptx
unit 5 decision tree2.pptx
ssuser5c580e1
 
The Black Box: Interpretability, Reproducibility, and Data Management
The Black Box: Interpretability, Reproducibility, and Data ManagementThe Black Box: Interpretability, Reproducibility, and Data Management
The Black Box: Interpretability, Reproducibility, and Data Management
mark madsen
 
“Reinforcement Learning: a Practical Introduction,” a Presentation from Micro...
“Reinforcement Learning: a Practical Introduction,” a Presentation from Micro...“Reinforcement Learning: a Practical Introduction,” a Presentation from Micro...
“Reinforcement Learning: a Practical Introduction,” a Presentation from Micro...
Edge AI and Vision Alliance
 
Know How to Create and Visualize a Decision Tree with Python.pdf
Know How to Create and Visualize a Decision Tree with Python.pdfKnow How to Create and Visualize a Decision Tree with Python.pdf
Know How to Create and Visualize a Decision Tree with Python.pdf
Data Science Council of America
 
Operationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAOperationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEA
Adam Doyle
 
expeditions praneeth_june-2021
expeditions praneeth_june-2021expeditions praneeth_june-2021
expeditions praneeth_june-2021
Praneeth Vepakomma
 
Churn Modeling For Mobile Telecommunications
Churn Modeling For Mobile TelecommunicationsChurn Modeling For Mobile Telecommunications
Churn Modeling For Mobile Telecommunications
Salford Systems
 
An Introduction to Random Forest and linear regression algorithms
An Introduction to Random Forest and linear regression algorithmsAn Introduction to Random Forest and linear regression algorithms
An Introduction to Random Forest and linear regression algorithms
Shouvic Banik0139
 
TreeHuggr: Discovering Where Tree-based Classifiers are Vulnerable to Adversa...
TreeHuggr: Discovering Where Tree-based Classifiers are Vulnerable to Adversa...TreeHuggr: Discovering Where Tree-based Classifiers are Vulnerable to Adversa...
TreeHuggr: Discovering Where Tree-based Classifiers are Vulnerable to Adversa...
Bobby Filar
 
How to not fail at security data analytics (by CxOSidekick)
How to not fail at security data analytics (by CxOSidekick)How to not fail at security data analytics (by CxOSidekick)
How to not fail at security data analytics (by CxOSidekick)
Dinis Cruz
 
Software Faults, Failures and Their Mitigations | Turing100@Persistent
Software Faults, Failures and Their Mitigations | Turing100@PersistentSoftware Faults, Failures and Their Mitigations | Turing100@Persistent
Software Faults, Failures and Their Mitigations | Turing100@Persistent
Persistent Systems Ltd.
 
Big data and computing grid
Big data and computing gridBig data and computing grid
Big data and computing grid
Thang Nguyen
 

Similar to Random Decision Forests at Scale (20)

Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
 
TreeNet Overview - Updated October 2012
TreeNet Overview  - Updated October 2012TreeNet Overview  - Updated October 2012
TreeNet Overview - Updated October 2012
 
Top 3 Considerations for Machine Learning on Big Data
Top 3 Considerations for Machine Learning on Big DataTop 3 Considerations for Machine Learning on Big Data
Top 3 Considerations for Machine Learning on Big Data
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 
Random Forests Lightning Talk
Random Forests Lightning TalkRandom Forests Lightning Talk
Random Forests Lightning Talk
 
The Data Science Product Management Toolkit
The Data Science Product Management ToolkitThe Data Science Product Management Toolkit
The Data Science Product Management Toolkit
 
Introduction to Random Forest
Introduction to Random Forest Introduction to Random Forest
Introduction to Random Forest
 
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
 
unit 5 decision tree2.pptx
unit 5 decision tree2.pptxunit 5 decision tree2.pptx
unit 5 decision tree2.pptx
 
The Black Box: Interpretability, Reproducibility, and Data Management
The Black Box: Interpretability, Reproducibility, and Data ManagementThe Black Box: Interpretability, Reproducibility, and Data Management
The Black Box: Interpretability, Reproducibility, and Data Management
 
“Reinforcement Learning: a Practical Introduction,” a Presentation from Micro...
“Reinforcement Learning: a Practical Introduction,” a Presentation from Micro...“Reinforcement Learning: a Practical Introduction,” a Presentation from Micro...
“Reinforcement Learning: a Practical Introduction,” a Presentation from Micro...
 
Know How to Create and Visualize a Decision Tree with Python.pdf
Know How to Create and Visualize a Decision Tree with Python.pdfKnow How to Create and Visualize a Decision Tree with Python.pdf
Know How to Create and Visualize a Decision Tree with Python.pdf
 
Operationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAOperationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEA
 
expeditions praneeth_june-2021
expeditions praneeth_june-2021expeditions praneeth_june-2021
expeditions praneeth_june-2021
 
Churn Modeling For Mobile Telecommunications
Churn Modeling For Mobile TelecommunicationsChurn Modeling For Mobile Telecommunications
Churn Modeling For Mobile Telecommunications
 
An Introduction to Random Forest and linear regression algorithms
An Introduction to Random Forest and linear regression algorithmsAn Introduction to Random Forest and linear regression algorithms
An Introduction to Random Forest and linear regression algorithms
 
TreeHuggr: Discovering Where Tree-based Classifiers are Vulnerable to Adversa...
TreeHuggr: Discovering Where Tree-based Classifiers are Vulnerable to Adversa...TreeHuggr: Discovering Where Tree-based Classifiers are Vulnerable to Adversa...
TreeHuggr: Discovering Where Tree-based Classifiers are Vulnerable to Adversa...
 
How to not fail at security data analytics (by CxOSidekick)
How to not fail at security data analytics (by CxOSidekick)How to not fail at security data analytics (by CxOSidekick)
How to not fail at security data analytics (by CxOSidekick)
 
Software Faults, Failures and Their Mitigations | Turing100@Persistent
Software Faults, Failures and Their Mitigations | Turing100@PersistentSoftware Faults, Failures and Their Mitigations | Turing100@Persistent
Software Faults, Failures and Their Mitigations | Turing100@Persistent
 
Big data and computing grid
Big data and computing gridBig data and computing grid
Big data and computing grid
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
Cloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
Cloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
Cloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
Cloudera, Inc.
 
Cloudera SDX
Cloudera SDXCloudera SDX
Cloudera SDX
Cloudera, Inc.
 
Introducing Workload XM 8.7.18
Introducing Workload XM 8.7.18Introducing Workload XM 8.7.18
Introducing Workload XM 8.7.18
Cloudera, Inc.
 
Get started with Cloudera's cyber solution
Get started with Cloudera's cyber solutionGet started with Cloudera's cyber solution
Get started with Cloudera's cyber solution
Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19



Random Decision Forests at Scale

  • 1. © Cloudera, Inc. All rights reserved. Random Decision Forests at Scale. Todd M. Boetticher | Solution Consultant
  • 2. Overview. Decision trees: introduction to decision trees; building a decision tree with Spark; tuning your model with hyperparameters. Random Decision Forests: introduction to Random Decision Forests; handicapping individual learners; deploying random decision trees in Spark.
  • 3. Decision Tree. The decision tree is one of the most commonly used classification techniques. According to a recent survey, the decision tree is the most common technique used today. A decision tree is a sequence of questions whose outcomes eventually lead to a decision. Like the game 21 questions, decision trees use features to split the data into subsets that will give you the best results.
  • 4. Benefits and disadvantages of Decision Trees. Pros: computationally cheap to use; easy for humans to understand learned results; missing values OK; capable of dealing with irrelevant features. Cons: prone to overfitting; lacks the performance available through other methods; small changes in the data can have an enormous impact on the resulting tree.
  • 5. Iris Dataset. Example row: 6.4, 3.2, 4.5, 1.5, Iris-versicolor
  • 6. Possible Decision Trees
  • 7. Interpreting Models. Simple graphic explanation of how the feature space can be divided into decision boundaries.
  • 8. Decision Breakdown

        import subprocess
        from sklearn.tree import export_graphviz

        def visualize_tree(tree, feature_names):
            """Render a fitted tree to dt.png via graphviz.

            tree -- scikit-learn DecisionTree.
            feature_names -- list of feature names.
            """
            # Write the tree in DOT format, then render it with the dot CLI.
            with open("dt.dot", "w") as f:
                export_graphviz(tree, out_file=f, feature_names=feature_names)
            command = ["dot", "-Tpng", "dt.dot", "-o", "dt.png"]
            try:
                subprocess.check_call(command)
            except (OSError, subprocess.CalledProcessError):
                exit("Could not run dot, ie graphviz, to produce visualization")
  • 9. Building a Decision Tree
  • 10. Name and Address Matching. Potential uses in every industry: marketing; defense and law enforcement; Bank Secrecy Act and Patriot Act compliance. Example pair: RN-KOMSOMOLSKY LLC, 352 DeerPath Ave SW, Leesburg, Virginia, United States vs. KOMSOMOLSKY REFINERY, 1621 Parkcrest Cir, Reston, Virginia, United States
  • 11. Name and Address Matching. Example hit: 0.78, 0.0, 0.0, 0.78, 0.78, True Positive

        Name   Address   City   State   Country   Hits
        0.81   0.0       0.0    0.78    1.0       True-Positive
        0.6    0.6       0.2    0.0     0.0       False-Positive
        0.2    0.0       0.0    0.0     0.0       False-Positive
        0.91   0.91      0.91   0.91    1.0       True-Positive
        0.78   0.0       0.0    0.78    0.78      True-Positive
        0.4    0.4       0.36   0.0     1.0       False-Positive
        1.0    0.0       0.0    0.0     0.0       False-Positive
  • 12. Building our first decision tree in MLlib
  • 13. Evaluating a Decision Tree: 0.82243123866534172 (~82.24% accuracy)
  • 14. Benchmark vs. Random: 0.2718652198532764 (~27.19% accuracy)
  • 15. Hyperparameters: Impurity (Gini, Entropy, Variance), Maximum Depth, Maximum Bins. Impurity measures the expected value of the information; it is generally computed with Gini impurity or entropy. trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
  • 16. Impurity: Gini, Entropy, Variance. Gini impurity: $G = 1 - \sum_{i=1}^{n} P(x_i)^2$. Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. Entropy: $H = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)$. Entropy is defined as the expected value of information. First, we need to define information. If you're classifying something that can take on multiple values, the information for symbol $x_i$ is defined as $-\log_2 P(x_i)$. Variance: $\frac{1}{N} \sum_{i=1}^{N} (y_i - \mu)^2$, where $y_i$ is the label for an instance, $N$ is the number of instances, and $\mu$ is the mean given by $\frac{1}{N} \sum_{i=1}^{N} y_i$.
  • 17. Maximum Depth. Maximum tree depth is a limit that stops further splitting of nodes once the specified depth has been reached while building the initial decision tree. The absolute maximum depth would be N−1, where N is the number of training samples. You can derive this by considering that the least effective split would be peeling off one training example per node.
  • 18. Decision Trees to Random Decision Forests
  • 19. Wisdom of the crowds. The wisdom of the crowd is the collective opinion of a group of individuals rather than that of a single expert. A large group's aggregated answers to questions involving quantity estimation, general world knowledge, and spatial reasoning have generally been found to be as good as, and often better than, the answer given by any of the individuals within the group. An explanation for this phenomenon is that there is idiosyncratic noise associated with each individual judgment, and taking the average over a large number of responses will go some way toward canceling the effect of this noise.
  • 20. What is Random Forest? A random forest grows many classification trees. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
  • 21. How to Create a Crowd?
  • 22. Random Decision Forests with Spark MLlib. RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins). Number of Trees: number of trees in the forest. Increasing the number of trees will decrease the variance in predictions, improving the model's test-time accuracy. Training time increases roughly linearly in the number of trees. Feature Subset Strategy: number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low. Result: ~98.67% accuracy.
  • 23. Trees See Subsets of Examples
  • 24. Or Subsets of Features
  • 25. Thank you. tboetticher@cloudera.com
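
The three impurity measures named on slide 16 can be sketched in plain Python. This is a minimal illustration that assumes a non-empty list of labels; it is not the MLlib implementation, and the names `class_probabilities`, `gini`, `entropy`, and `variance` are hypothetical helpers chosen for this example.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Return the fraction of the labels that falls in each class."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


def variance(values):
    """Population variance of numeric labels, as used for regression trees."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
```

For a node holding two labels from each of two classes, `gini` gives 0.5 and `entropy` gives 1.0 bit; a pure node scores 0 under both, which is why lower is better when comparing candidate splits.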

Editor's Notes

  1. Entropy: an information theory concept (Claude Shannon). Measures the mixed-ness, or unpredictability, of a population. Lower is better. Gini: 1 minus the probability that a random guess i (probability p_i) is correct. Lower is better.
  2. Table of Contents
     Introduction: Why should I know this?
     Concept – Example: Describe the Iris data set; Explain what machine learning has to do with the Iris data set; Physical representation of the decision tree concept; Interpreting the model
     Building Decision Trees: Introduce data set; Implement a decision tree with data set; Explain how it worked; Split the data; Calculate entropy; Cache the training data; Show current accuracy; Benchmark against random; Confusion matrix
     Tuning the model (hyperparameters): Impurity; Maximum depth; Maximum bins; Try multiple parameters; Can the data be improved?
     Random Decision Forests: Wisdom of the crowd; Introduce concept of decision forests; How to create diversity of opinion; Handicap each tree; Subset of features; Aggregating results; How to implement that in Spark

     Introduction
     Good evening. Today I would like to talk about Random Decision Forests and how we can implement them at scale using the Hadoop ecosystem. I believe that this is an important discussion due to the method's popularity and its potential impact for most organizations. Let's begin.

     Concept – Example
     Before we can begin talking about Random Decision Forests, I believe it is important that we begin by discussing decision trees. If you have ever had the opportunity to play "21 questions", it is a game where an individual picks an object or place, and the competing individual asks binary questions one at a time, attempting to shrink the possibilities for the object. This often begins with the person asking whether it's a person or a thing. Generally, these questions are mutually exclusive and collectively exhaustive. The inquisitive individual then asks a follow-up question in order to lower the number of possibilities again and again until they know the answer or have run out of available questions. This is a perfect example of a decision tree.

     The decision tree is one of the most commonly used classification techniques and, according to a recent survey, it is among the most commonly implemented techniques today. The utilization of this technique ranges from agriculture to physics. In physics, decision trees have been used for the detection of physical particles and the classification of particle signatures. In other sectors, decision trees have even been used to build personal learning assistants and to classify sleep patterns. Because decision trees can be used for both classification and regression, the possibilities are almost endless.

     Pros vs. Cons
     Pros: The decision tree is used in so many problems due to its many strengths and short list of disadvantages. One of the key benefits of decision trees is that they implicitly perform variable screening, or feature selection. Additionally, the technique is computationally inexpensive and requires very little effort from the user during data preparation. Lastly, the best feature of using trees for analytics: they are easy to interpret and explain to executives! This is increasingly important in fields where individuals have to explain why their black box makes specific decisions at certain periods of time.
     Cons: Due to the overall flexibility of this technique, models using a decision tree can be prone to overfitting. Overfitting is where the model describes errors in the data and improperly fits a relationship that generally cannot be connected. In addition to being easily overfitted, the technique also lacks the performance available through other techniques. Lastly, decision trees can be significantly impacted by small changes in the data over time.

     Describe the Iris Data Set
     We don't need to understand the relationships in order to model them. We can learn them empirically from the dataset.
     Explain what machine learning has to do with the Iris data set. Attempt to plot the Iris data set. Graphical example of KNN on Iris. Physical representation of the decision tree concept. Interpreting the model. Introduction to decision trees: give example.

     Building Decision Trees
     Introduce data set. Implement a decision tree with the data set. Explain how it worked. Split the data. Cache the training data. Show current accuracy. Benchmark against random. Confusion matrix.

     Tuning the model (hyperparameters)
     Impurity: measure how much a decision decreases impurity, using Gini impurity (vs. entropy).
     Maximum Depth: limits the total depth of the tree, that is, the number of decisions allowed before terminating in a prediction.
     Maximum Bins: sets the total number of rules allowed.
     Fix the training data. Try multiple parameters. Can the data be improved?

     Random Decision Forests
     Wisdom of the crowd. Introduce concept of decision forests. Ensemble learning. How to create diversity of opinion: handicap each tree; subset of features; aggregating results. Hyperparameters in a forest. How to implement that in Spark. Determining results from Random Forest.
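
The notes on creating diversity of opinion (handicap each tree with a subset of examples or features, then aggregate the results) can be sketched as below. This is an illustrative pure-Python sketch, not how Spark MLlib implements its forests: `bootstrap_sample`, `feature_subset`, and `majority_vote` are hypothetical helper names, the seed is arbitrary, and the sample rows are the first three rows of similarity scores from slide 11.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so each tree sees different examples."""
    return [rng.choice(rows) for _ in rows]


def feature_subset(num_features, k, rng):
    """Pick k distinct feature indices for a tree to consider."""
    return sorted(rng.sample(range(num_features), k))


def majority_vote(predictions):
    """Return the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(42)  # fixed seed (arbitrary) so the sketch is repeatable
rows = [
    (0.81, 0.0, 0.0, 0.78, 1.0),
    (0.6, 0.6, 0.2, 0.0, 0.0),
    (0.2, 0.0, 0.0, 0.0, 0.0),
]
sample = bootstrap_sample(rows, rng)                      # examples for one tree
features = feature_subset(num_features=5, k=2, rng=rng)   # columns for one tree
winner = majority_vote(["True-Positive", "False-Positive", "True-Positive"])
```

Each tree would be trained on its own `sample` and `features`, and the forest's answer is whatever `majority_vote` returns over the individual tree predictions.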