A presentation on NGBoost (Natural Gradient Boosting) that I gave in the Information Theory and Probabilistic Programming course at the University of Oklahoma.
This document summarizes the NGBoost method for probabilistic regression. NGBoost uses gradient boosting to fit the parameters of an assumed probabilistic distribution for the target variable. It improves on existing probabilistic regression methods by using the natural gradient, which performs gradient descent in the space of distributions rather than the parameter space. This addresses issues with prior approaches and allows NGBoost to achieve state-of-the-art performance while remaining fast, flexible, and scalable. Future work may apply NGBoost to other problems like survival analysis or joint outcome regression.
2. Outline
• Introduction.
• What is probabilistic regression?
• Why is it useful?
• How do other methods compare to NGBoost?
• Gradient Boosting Algorithm.
• NGBoost:
  • Main components.
  • Steps.
  • Usage.
• Experiments & Results.
• Computational Complexity.
• Future Work.
• References.
3. Introduction
What is probabilistic regression?
(Figure: standard regression.)
Note: This use of conditional probability distributions is already the norm in classification
4. Why is probabilistic regression (prediction) useful?
The measure of uncertainty makes probabilistic prediction crucial in applications like healthcare and
weather forecasting.
5. Why is probabilistic regression (prediction) useful?
All in all, probabilistic regression (prediction) provides better insight than standard (scalar) regression.
(Figure: standard regression maps X=x to a point estimate E[Y|X=x], whereas probabilistic regression maps X=x to a full distribution P(Y|X=x).)
6. Problems with existing methods
Methods:
• Post-hoc variance.
• Generalized Additive Models for Location, Scale and Shape (GAMLSS).
• Bayesian methods like MCMC.
• Bayesian deep learning.
Problems:
• Inflexible.
• Slow.
• Require expert knowledge.
• Make strong assumptions about the nature of the data (homoscedasticity*).
Limitations of deep learning methods: difficult to use out-of-the-box.
• Require expert knowledge.
• Usually perform only on par with traditional methods on limited-size or tabular data.
• Require extensive hyperparameter tuning.
* Homoscedasticity means that all random variables in a sequence have the same finite variance.
7. Gradient Boosting Machines (GBMs)
• A set of highly modular methods that:
  • work out-of-the-box.
  • perform well on structured data, even with small datasets.
• Demonstrated empirical success on Kaggle and other data science competitions.
Source: what algorithms are most successful on Kaggle?
8. Problems related to GBMs
• Assume homoscedasticity: constant variance.
• Predicted distributions should have at least two degrees of freedom (two parameters) to effectively convey both the magnitude and the uncertainty of the predictions.
What is the solution then?
(Spoiler alert) it is NGBoost :)
NGBoost solves the problem of simultaneously boosting multiple parameters from the base learners using:
• A multiparameter boosting approach.
• Use of natural gradients.
9. Gradient Boosting Algorithm
• An ensemble of simple models is involved in making a prediction.
• Results in a prediction model in the form of an ensemble of weak models.
• Intuition: the best possible next model, when combined with previous models, minimizes the overall prediction error.
• Components:
  • A loss function to be optimized, e.g., MSE or logarithmic loss.
  • A weak learner to make predictions.
    • The most common choice is decision trees or regression trees.
    • It is common to constrain the learner, e.g. by specifying a maximum number of layers, nodes, splits or leaf nodes.
  • An additive model to add weak learners so as to minimize the loss function.
    • A gradient descent procedure is used to minimize the loss when adding trees.
11. Gradient Boosting Algorithm
Explanation:
Step 1: Initialize the prediction to a constant whose value minimizes the loss. You can solve for it with gradient descent, or by hand if the problem is trivial.
Step 2: Build the trees (weak learners).
(A) Compute residuals between the prediction and the observed data. Use the prediction of the previous step, F(x) = F_{m-1}(x), which is F_0(x) for m = 1.
(B) Fit a tree to the residuals (make the residuals the target output). Here j loops over the leaf nodes.
(C) Determine the output for each leaf in the tree. E.g., if a leaf contains 14.7 and 2.7, its output is the value of γ that minimizes the summation. Unlike Step 1, here we take the previous prediction F_{m-1}(x_i) into account.
(D) Make a new prediction for each sample. The summation accounts for the case where a single sample ends up in multiple leaves, so you take a scaled sum of the outputs γ for each leaf. Choosing a small learning rate η improves prediction.
Step 3: The final prediction is the prediction of the last tree.
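To make these steps concrete, below is a minimal sketch in Python (my own illustration, not code from the slides), assuming squared-error loss, scikit-learn regression trees as the weak learners, and arbitrary hyperparameter values.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    # Step 1: the constant that minimizes squared-error loss is the mean of y.
    f0 = float(np.mean(y))
    prediction = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        # Step 2(A): residuals are the negative gradient of the squared-error
        # loss with respect to the current prediction F_{m-1}(x).
        residuals = y - prediction
        # Step 2(B)/(C): fit a small regression tree to the residuals;
        # its leaf values play the role of the outputs gamma.
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        # Step 2(D): update the prediction with a scaled (learning-rate) step.
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def predict_gbm(X, f0, trees, learning_rate=0.1):
    # Step 3: the final prediction is the initial constant plus all scaled tree outputs.
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)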
To learn more:
• Paper: Greedy Function Approximation: A Gradient Boosting Machine, Jerome H. Friedman.
• Video explanations: Gradient Boost part 1, part 2, part 3, part 4.
• Decision Trees video explanation: Decision Trees.
• AdaBoost video explanation: AdaBoost.
12. NGBoost: Natural Gradient Boosting
• A method for probabilistic prediction with competitive state-of-the-art performance on a variety of datasets.
• Combines a multiparameter boosting algorithm with the natural gradient to efficiently estimate how the parameters of the presumed outcome distribution vary with the observed features.
• In a standard prediction setting:
  • the object of interest is an estimate of a scalar function E[y|x], where x is the vector of covariates (observed features) and y is the prediction target.
• For NGBoost:
  • The object of interest is a conditional probability distribution P_θ(y|x).
  • P_θ(y|x) is assumed to have a parametric form with p parameters, θ ∈ R^p (a vector of p parameters).
14. NGBoost: Natural Gradient Boosting
Steps:
1. Pick a scoring rule to grade our estimate of P(Y|X=x).
2. Assume that P(Y|X=x) has some parametric form.
3. Fit the parameters θ(x) as a function of x using gradient boosting.
4. Use the natural gradient to correct the training dynamics of this approach.
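These steps are implemented in the open-source ngboost Python package. Below is a minimal usage sketch; the class and argument names (NGBRegressor, Dist, Score, pred_dist) follow the package's documented API as I recall it, so check them against the installed version.

import numpy as np
from ngboost import NGBRegressor
from ngboost.distns import Normal
from ngboost.scores import LogScore

# Toy heteroscedastic data: the noise grows with x, so the fitted scale should vary with x.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1 + 0.05 * X[:, 0])

# Normal outcome distribution, NLL (log) score, natural gradients enabled.
ngb = NGBRegressor(Dist=Normal, Score=LogScore,
                   n_estimators=500, learning_rate=0.01,
                   natural_gradient=True)
ngb.fit(X, y)

point_preds = ngb.predict(X)     # point predictions (the distribution means)
pred_dists = ngb.pred_dist(X)    # full predictive distributions
# For the Normal distribution the fitted parameters are exposed per
# observation as 'loc' (mu) and 'scale' (sigma).
print(pred_dists.params['loc'][:3], pred_dists.params['scale'][:3])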
15. Proper Scoring Rule
A proper scoring rule S(P, y) must satisfy:
E_{y~Q}[S(Q, y)] ≤ E_{y~Q}[S(P, y)]   for all P, Q
where
Q: the true distribution of outcomes y
P: any other distribution, e.g. the predicted distribution of outcomes y
In other words, the scoring rule assigns a score to the forecast such that the true distribution Q of the outcomes gets the best score in expectation, compared to other distributions like P.
(Gneiting and Raftery, 2007. Strictly Proper Scoring Rules, Prediction, and Estimation.)
16. 1. Pick a scoring rule to grade our estimate of P(Y|X=x)
Point prediction: loss function.
Probabilistic prediction: scoring rule.
Example scoring rule: the negative log-likelihood (NLL).
Notes:
• A scoring rule in probabilistic regression is analogous to a loss function in standard regression.
• Minimizing the NLL yields the maximum likelihood estimate (MLE).
• Taking the log simplifies the calculus.
• NLL (MLE) is the most common proper scoring rule.
• CRPS is another good alternative to NLL.
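As a small worked illustration (my own example, not from the slides), the NLL score of a Gaussian forecast for a single observed outcome: a forecast centered on the outcome scores low, a confident but wrong forecast is penalized heavily, and a wrong but uncertain forecast lands in between.

from scipy.stats import norm

def nll_score(mu, sigma, y):
    # Negative log-likelihood of y under the forecast N(mu, sigma^2); lower is better.
    return -norm.logpdf(y, loc=mu, scale=sigma)

y_observed = 2.0
print(nll_score(mu=2.0, sigma=0.5, y=y_observed))  # centered on the outcome: ~0.23
print(nll_score(mu=0.0, sigma=0.5, y=y_observed))  # confident but wrong: ~8.23
print(nll_score(mu=0.0, sigma=3.0, y=y_observed))  # wrong but uncertain: ~2.24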
17. 2. Assume P(Y|X=x) has some parametric form
(Figure: example normal distributions with different parameters, e.g. μ = 1, σ = 1; μ = 2, σ = 0.5; μ = 2.5, σ = 0.75; μ = 3.5, σ = 1.5.)
Note: here they are assuming a normal distribution, but you can swap it out for any other distribution (Poisson, Bernoulli, etc.) that fits your application.
18. 3. Fit the parameters θ(x) as a function of x using gradient boosting
(Figure: the same example normal distributions, with parameters μ = 1, σ = 1; μ = 2, σ = 0.5; μ = 2.5, σ = 0.75; μ = 3.5, σ = 1.5.)
19. This approach performs poorly in practice.
What we get:
What we want:
The algorithm fails to adjust the mean, which hurts the predictions.
What could be the solution?
Use natural gradients instead of ordinary gradients.
20. What we typically do: gradient descent in the parameter space
• Pick a small region (ball) around your current value of θ.
• Ask which direction of step within that ball decreases the score; that direction is the gradient.
21. What we want to do: Gradient descent in the space of distributions
Every point in this space represents
some distribution.
22. Parametrizing the space of distributions
θ is just a "name" for P.
Each distribution has such a name (i.e. it is "identified" by its parameters).
23. The problem is:
Gradient descent in the parameter space is not gradient descent in the distribution space, because distances don't correspond.
That's because distances are not the same in both spaces: the two spaces have different shape and density.
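A small numerical illustration of this mismatch (my own example, not from the slides): two pairs of normal distributions whose means are exactly the same Euclidean distance apart in parameter space can be vastly different distances apart as distributions, here measured with the KL divergence.

import numpy as np

def kl_normal(mu1, s1, mu2, s2):
    # KL( N(mu1, s1^2) || N(mu2, s2^2) ), closed form for univariate normals.
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# Both pairs have means 1.0 apart, i.e. the same parameter-space distance ...
print(kl_normal(0.0, 0.1, 1.0, 0.1))    # narrow distributions: KL = 50.0 (barely overlap)
print(kl_normal(0.0, 10.0, 1.0, 10.0))  # wide distributions: KL = 0.005 (nearly identical)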
24. 4. Use the natural gradient to correct the training dynamics of this approach.
(Figure annotation: this is the natural gradient.)
Idea: do gradient descent in the distribution space by searching over parameters in the transformed region.
25. Here's the trick:
• Multiplying the ordinary gradient by the inverse of the Riemannian metric of the space of distributions implicitly transforms the optimal direction in parameter space into the optimal direction in the distribution space.
• The metric depends on the parametric form chosen and on the score function; if the score is NLL, it is the Fisher Information matrix.
• So we can conveniently compute the natural gradient by applying a linear transformation to the ordinary gradient.
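For concreteness, here is that transformation for a normal distribution parameterized as θ = (μ, log σ) under the NLL score (my own worked example; the diagonal Fisher information below is the standard closed form for this parameterization).

import numpy as np

def ordinary_and_natural_gradient(mu, log_sigma, y):
    sigma = np.exp(log_sigma)
    # Ordinary gradient of the NLL of N(mu, sigma^2) with respect to (mu, log sigma).
    grad = np.array([-(y - mu) / sigma**2,
                     1.0 - (y - mu)**2 / sigma**2])
    # Riemannian metric for the NLL score: the Fisher information,
    # which is diag(1/sigma^2, 2) in this parameterization.
    fisher = np.diag([1.0 / sigma**2, 2.0])
    # Natural gradient = inverse metric times ordinary gradient.
    natural = np.linalg.solve(fisher, grad)
    return grad, natural

grad, natural = ordinary_and_natural_gradient(mu=0.0, log_sigma=np.log(0.1), y=1.0)
print(grad)     # approx [-100., -99.]  -- huge ordinary gradient when sigma is small
print(natural)  # approx [-1., -49.5]   -- mu component rescaled by sigma^2, log-sigma halved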
27. NGBoost
Explanation:
1. Estimate a common θ^(0) that minimizes the scoring rule S.
2. For each iteration t:
  • Compute the natural gradient g_i^(t) of S with respect to the parameters predicted for each example up to that stage, θ_i^(t-1).
  • Fit base learners, one per parameter, to the natural gradients, e.g. f^(t) = (f_μ^(t), f_{log σ}^(t)).
  • Compute a scaling factor ρ^(t) (a scalar) that minimizes the true scoring rule along the projected gradient, in the form of a line search. In practice, the authors found that setting ρ = 1 and then successively halving it works well.
  • Update the predicted parameters.
Notes:
• The learning rate η is typically 0.1 or 0.01, following Friedman's convention.
• Sub-sampling mini-batches can improve computational performance on large datasets.
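Putting the pieces together, here is a simplified sketch of this fitting loop (my own pseudocode-level Python, not the authors' implementation), assuming a Normal distribution, the NLL score, and scikit-learn trees as base learners, with the line search and mini-batch subsampling omitted.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_ngboost_sketch(X, y, n_stages=200, learning_rate=0.01):
    # Stage 0: common initial parameters theta^(0) = (mu, log sigma) that
    # minimize the NLL on the training set (the empirical mean and std).
    mu = np.full(len(y), np.mean(y))
    log_sigma = np.full(len(y), np.log(np.std(y)))
    learners = []
    for _ in range(n_stages):
        sigma = np.exp(log_sigma)
        # Ordinary per-example gradients of the NLL w.r.t. (mu, log sigma).
        grad_mu = -(y - mu) / sigma**2
        grad_log_sigma = 1.0 - (y - mu)**2 / sigma**2
        # Natural gradients: rescale by the inverse Fisher information,
        # which is diag(sigma^2, 1/2) for this parameterization.
        nat_mu = grad_mu * sigma**2
        nat_log_sigma = grad_log_sigma / 2.0
        # Fit one base learner per parameter to the natural gradients.
        f_mu = DecisionTreeRegressor(max_depth=3).fit(X, nat_mu)
        f_ls = DecisionTreeRegressor(max_depth=3).fit(X, nat_log_sigma)
        # Update the predicted parameters (scaling factor rho fixed to 1 here).
        mu = mu - learning_rate * f_mu.predict(X)
        log_sigma = log_sigma - learning_rate * f_ls.predict(X)
        learners.append((f_mu, f_ls))
    # Prediction for a new x: start from the initial (mu, log sigma) and subtract
    # the scaled outputs of every fitted pair of learners, as during training.
    return np.mean(y), np.log(np.std(y)), learners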
28. Experiments
• UCI ML Repository benchmarks.
• Probabilistic regression:
  • Configuration:
    • Data split: 70% training, 20% validation, and 10% testing.
    • Repeated 20 times.
• Ablation:
  • 2nd-order boosting: use 2nd-order gradients instead of natural gradients.
  • Multiparameter boosting: use ordinary gradients instead of natural gradients.
  • Homoscedastic boosting: assume constant variance, to see the benefit of allowing parameters other than the conditional mean to vary across x.
  • Why? To demonstrate that multiparameter boosting and the natural gradient work together to improve performance.
• Point estimation.
29. Results
The results show equal or better performance than state-of-the-art probabilistic prediction methods.
33. Computational Complexity
Differences between NGBoost and other boosting algorithms:
• NGBoost must fit a series of learners for each parameter, whereas standard boosting fits only one series of learners.
• A p x p natural-gradient matrix I_θ^{-1} is computed at each step, where p is the number of parameters.
In practice:
• The matrix is small for the most commonly used distributions: only 2 x 2 for a Normal distribution.
• If the dataset is huge, it may still be expensive to compute a large number of such matrices at each iteration.
34. Future work
• Apply NGBoost to other prediction settings (e.g. classification, survival analysis).
• Joint prediction: P_θ(z, y|x).
• Technical innovations:
  • Better tree-based base learners and regularization are likely to improve performance, especially on large datasets.