Reminder: The error surface for a linear neuron
• The error surface lies in a space with a
horizontal axis for each weight and one vertical
axis for the error.
– For a linear neuron with a squared error, it is
a quadratic bowl.
– Vertical cross-sections are parabolas.
– Horizontal cross-sections are ellipses.
• For multi-layer, non-linear nets the error surface
is much more complicated.
– But locally, a piece of a quadratic bowl is
usually a very good approximation.
[Figure: quadratic-bowl error surface with vertical axis E and horizontal axes w1 and w2.]
How the learning goes wrong
• If the learning rate is big, the weights slosh to
and fro across the ravine.
– If the learning rate is too big, this
oscillation diverges.
• What we would like to achieve:
– Move quickly in directions with small but
consistent gradients.
– Move slowly in directions with big but
inconsistent gradients.
[Figure: cross-section of the error surface (E versus w) showing the weights oscillating across a ravine.]
Stochastic gradient descent
• If the dataset is highly redundant, the
gradient on the first half is almost
identical to the gradient on the
second half.
– So instead of computing the full
gradient, update the weights using
the gradient on the first half and
then get a gradient for the new
weights on the second half.
– The extreme version of this
approach updates weights after
each case. It's called “online”.
• Mini-batches are usually better
than online.
– Less computation is used
updating the weights.
– Computing the gradient for
many cases simultaneously
uses matrix-matrix
multiplies which are very
efficient, especially on
GPUs
• Mini-batches need to be
balanced for classes
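To make the mini-batch idea concrete, here is a minimal Python/NumPy sketch (not from the lecture; the linear-neuron model, function name, and default hyper-parameters are illustrative assumptions). Reshuffling each epoch keeps the mini-batches roughly representative of the data.

```python
import numpy as np

def minibatch_sgd(X, y, learning_rate=0.01, batch_size=32, epochs=10, seed=0):
    """Mini-batch gradient descent for a linear neuron with squared error.

    Weights are updated after each mini-batch, rather than after the full
    dataset (full-batch) or after every single case (online).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)            # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]    # errors on this mini-batch
            grad = X[idx].T @ residual / len(idx)
            w -= learning_rate * grad         # one update per mini-batch
    return w
```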
Two types of learning algorithm
If we use the full gradient computed from all
the training cases, there are many clever ways
to speed up learning (e.g. non-linear conjugate
gradient).
– The optimization community has
studied the general problem of
optimizing smooth non-linear
functions for many years.
– Multilayer neural nets are not typical
of the problems they study so their
methods may need a lot of adaptation.
For large neural networks with
very large and highly redundant
training sets, it is nearly always
best to use mini-batch learning.
– The mini-batches may
need to be quite big
when adapting fancy
methods.
– Big mini-batches are
more computationally
efficient.
A basic mini-batch gradient descent algorithm
• Guess an initial learning rate.
– If the error keeps getting worse
or oscillates wildly, reduce the
learning rate.
– If the error is falling fairly
consistently but slowly, increase
the learning rate.
• Write a simple program to automate
this way of adjusting the learning
rate.
• Towards the end of mini-batch
learning it nearly always helps to
turn down the learning rate.
– This removes fluctuations in the
final weights caused by the
variations between mini-
batches.
• Turn down the learning rate when
the error stops decreasing.
– Use the error on a separate
validation set
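The slide suggests automating this rule of thumb; a crude sketch follows (hypothetical function, with made-up thresholds and adjustment factors that would need tuning).

```python
def adjust_learning_rate(lr, recent_errors, up=1.1, down=0.5, slow_frac=0.01):
    """Adjust lr from the last three validation errors (most recent last)."""
    if len(recent_errors) < 3:
        return lr
    e0, e1, e2 = recent_errors[-3:]
    getting_worse = e2 > e1 > e0
    oscillating = (e1 - e0) * (e2 - e1) < 0          # error goes up and down
    slow_but_steady = e0 > e1 > e2 and (e0 - e2) < slow_frac * e0
    if getting_worse or oscillating:
        return lr * down     # error worsening or oscillating: reduce the rate
    if slow_but_steady:
        return lr * up       # falling consistently but slowly: increase it
    return lr
```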
Be careful about turning down the learning rate
• Turning down the learning
rate reduces the random
fluctuations in the error due
to the different gradients on
different mini-batches.
– So we get a quick win.
– But then we get slower
learning.
• Don’t turn down the
learning rate too soon!
[Figure: error versus epoch. When the learning rate is reduced, the error drops quickly and then decreases more slowly.]
Initializing the weights
• If two hidden units have exactly
the same bias and exactly the
same incoming and outgoing
weights, they will always get
exactly the same gradient.
– So they can never learn to be
different features.
– We break symmetry by
initializing the weights to
have small random values.
• If a hidden unit has a big fan-in,
small changes on many of its
incoming weights can cause the
learning to overshoot.
– We generally want smaller
incoming weights when the
fan-in is big, so initialize the
weights to be proportional to
1/sqrt(fan-in).
• We can also scale the learning
rate the same way.
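A minimal sketch of this initialization (the Gaussian draw and function name are assumptions; the key points are the 1/sqrt(fan-in) scale and the symmetry-breaking randomness):

```python
import numpy as np

def init_weights(fan_in, fan_out, seed=0):
    """Small random weights scaled by 1/sqrt(fan_in).

    Randomness breaks the symmetry between hidden units; the scaling keeps
    the total input to a unit from growing with its fan-in.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)
```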
Shifting the inputs
• When using steepest descent,
shifting the input values makes a big
difference.
– It usually helps to transform
each component of the input
vector so that it has zero mean
over the whole training set.
• The hyperbolic tangent (which is
2*logistic - 1) produces hidden
activations that are roughly zero
mean.
– In this respect it's better than the
logistic.
[Figure: error surface over w1 and w2. Training cases 101, 101 → 2 and 101, 99 → 0 give a very elongated error surface; the shifted cases 1, 1 → 2 and 1, -1 → 0 give a nearly circular one. Color indicates the training case.]
Scaling the inputs
• When using steepest descent,
scaling the input values
makes a big difference.
– It usually helps to
transform each
component of the input
vector so that it has unit
variance over the whole
training set.
[Figure: error surface over w1 and w2. Training cases 1, 1 → 2 and 1, -1 → 0 give a roughly circular error surface; the unscaled cases 0.1, 10 → 2 and 0.1, -10 → 0 give a highly elongated one. Color indicates the weight axis.]
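Both transformations (zero mean from the previous slide, unit variance from this one) take only a few lines; a sketch, assuming the statistics are computed on the training set and reused for any held-out data (the function name and eps guard are illustrative):

```python
import numpy as np

def standardize(X_train, X_test=None, eps=1e-8):
    """Shift each input component to zero mean and scale it to unit variance,
    using statistics computed over the whole training set."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + eps      # eps guards against constant components
    X_train_std = (X_train - mean) / std
    if X_test is None:
        return X_train_std
    return X_train_std, (X_test - mean) / std
```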
A more thorough method: Decorrelate the input components
• For a linear neuron, we get a big win by decorrelating each component of the
input from the other input components.
• There are several different ways to decorrelate inputs. A reasonable method is
to use Principal Components Analysis.
– Drop the principal components with the smallest eigenvalues.
• This achieves some dimensionality reduction.
– Divide the remaining principal components by the square roots of their
eigenvalues. For a linear neuron, this converts an axis aligned elliptical
error surface into a circular one.
• For a circular error surface, the gradient points straight towards the minimum.
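A sketch of PCA-based decorrelation and rescaling as described above (illustrative function name; eigendecomposition of the covariance matrix is one of several ways to compute the principal components):

```python
import numpy as np

def pca_whiten(X, n_components=None, eps=1e-8):
    """Decorrelate the inputs with PCA, optionally drop the smallest
    components, and divide each remaining component by sqrt(eigenvalue)."""
    Xc = X - X.mean(axis=0)                     # PCA works on centred data
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if n_components is not None:                # dimensionality reduction
        eigvals, eigvecs = eigvals[:n_components], eigvecs[:, :n_components]
    return (Xc @ eigvecs) / np.sqrt(eigvals + eps)
```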
Common problems that occur in multilayer networks
• If we start with a very big learning
rate, the weights of each hidden
unit will all become very big and
positive or very big and negative.
– The error derivatives for the
hidden units will all become
tiny and the error will not
decrease.
– This is usually a plateau, but
people often mistake it for a
local minimum.
• In classification networks that use
a squared error or a cross-entropy
error, the best guessing strategy is
to make each output unit always
produce an output equal to the
proportion of time it should be a
1.
– The network finds this strategy
quickly and may take a long
time to improve on it by
making use of the input.
– This is another plateau that
looks like a local minimum.
Four ways to speed up mini-batch learning
• Use “momentum”
– Instead of using the gradient
to change the position of the
weight “particle”, use it to
change the velocity.
• Use separate adaptive learning
rates for each parameter
– Slowly adjust the rate using
the consistency of the
gradient for that parameter.
• rmsprop: Divide the learning rate for a
weight by a running average of the
magnitudes of recent gradients for that
weight.
– This is the mini-batch version of just
using the sign of the gradient.
• Take a fancy method from the
optimization literature that makes use of
curvature information (not this lecture)
– Adapt it to work for neural nets
– Adapt it to work for mini-batches.
The intuition behind the momentum method
Imagine a ball on the error surface. The
location of the ball in the horizontal
plane represents the weight vector.
– The ball starts off by following the
gradient, but once it has velocity, it
no longer does steepest descent.
– Its momentum makes it keep going
in the previous direction.
• It damps oscillations in directions of
high curvature by combining
gradients with opposite signs.
• It builds up speed in directions with
a gentle but consistent gradient.
The equations of the momentum method
The effect of the gradient is to increment the previous velocity; the velocity also decays by a factor $\alpha$ which is slightly less than 1:

$$v(t) = \alpha\, v(t-1) - \varepsilon \frac{\partial E}{\partial w}(t)$$

The weight change is equal to the current velocity:

$$\Delta w(t) = v(t) = \alpha\, v(t-1) - \varepsilon \frac{\partial E}{\partial w}(t)$$

So the weight change can also be expressed in terms of the previous weight change and the current gradient:

$$\Delta w(t) = \alpha\, \Delta w(t-1) - \varepsilon \frac{\partial E}{\partial w}(t)$$
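In code, one momentum update is just two lines; a sketch of the equations above (the learning rate and alpha values are placeholders):

```python
def momentum_step(w, v, grad, lr=0.01, alpha=0.9):
    """v(t) = alpha*v(t-1) - lr*dE/dw;  w(t) = w(t-1) + v(t)."""
    v = alpha * v - lr * grad
    return w + v, v
```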
The behavior of the momentum method
• If the error surface is a tilted plane,
the ball reaches a terminal velocity.
– If the momentum is close to 1,
this is much faster than simple
gradient descent.
• At the beginning of learning there may
be very large gradients.
– So it pays to use a small
momentum (e.g. 0.5).
– Once the large gradients have
disappeared and the weights are
stuck in a ravine the momentum
can be smoothly raised to its final
value (e.g. 0.9 or even 0.99)
• This allows us to learn at a rate that
would cause divergent oscillations
without the momentum.
$$v(\infty) = \frac{1}{1-\alpha}\left(-\varepsilon \frac{\partial E}{\partial w}\right)$$
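The terminal-velocity formula follows from setting $v(t) = v(t-1) = v(\infty)$ in the update rule (a short derivation, not on the slide):

$$v(\infty) = \alpha\, v(\infty) - \varepsilon \frac{\partial E}{\partial w}
\;\Longrightarrow\;
(1-\alpha)\, v(\infty) = -\varepsilon \frac{\partial E}{\partial w}
\;\Longrightarrow\;
v(\infty) = \frac{1}{1-\alpha}\left(-\varepsilon \frac{\partial E}{\partial w}\right)$$

With $\alpha = 0.99$ this is 100 times the plain gradient step $-\varepsilon\,\partial E/\partial w$, which is why a momentum close to 1 is much faster on a tilted plane.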
A better type of momentum (Nesterov 1983)
• The standard momentum method
first computes the gradient at the
current location and then takes a big
jump in the direction of the updated
accumulated gradient.
• Ilya Sutskever (2012 unpublished)
suggested a new form of momentum
that often works better.
– Inspired by the Nesterov method
for optimizing convex functions.
• First make a big jump in the
direction of the previous
accumulated gradient.
• Then measure the gradient
where you end up and make a
correction.
– It's better to correct a
mistake after you have
made it!
A picture of the Nesterov method
• First make a big jump in the direction of the previous accumulated gradient.
• Then measure the gradient where you end up and make a correction.
brown vector = jump, red vector = correction, green vector = accumulated gradient
blue vectors = standard momentum
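A sketch of the Nesterov-style update described above (the grad_fn callable that evaluates the gradient at an arbitrary point is an assumption of this illustration):

```python
def nesterov_momentum_step(w, v, grad_fn, lr=0.01, alpha=0.9):
    """First jump along the previous accumulated velocity, then measure the
    gradient where you end up and fold the correction into the velocity."""
    lookahead = w + alpha * v        # the big jump
    grad = grad_fn(lookahead)        # gradient measured at the look-ahead point
    v = alpha * v - lr * grad        # correction
    return w + v, v
```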
The intuition behind separate adaptive learning rates
• In a multilayer net, the appropriate learning rates
can vary widely between weights:
– The magnitudes of the gradients are often very
different for different layers, especially if the initial
weights are small.
– The fan-in of a unit determines the size of the
“overshoot” effects caused by simultaneously
changing many of the incoming weights of a unit to
correct the same error.
• So use a global learning rate (set by hand)
multiplied by an appropriate local gain that is
determined empirically for each weight.
Gradients can get very
small in the early layers of
very deep nets.
The fan-in often varies
widely between layers.
One way to determine the individual learning rates
• Start with a local gain of 1 for every weight.
• Increase the local gain if the gradient for
that weight does not change sign.
• Use small additive increases and
multiplicative decreases (for mini-batch)
– This ensures that big gains decay rapidly
when oscillations start.
– If the gradient is totally random, the gain
will hover around 1: we increase it
additively by $\delta$ half the time and
decrease it multiplicatively by a factor of
$(1-\delta)$ half the time.
The gain $g_{ij}$ multiplies the learning rate for each weight:

$$\Delta w_{ij} = -\varepsilon\, g_{ij}\, \frac{\partial E}{\partial w_{ij}}$$

$$\text{if } \frac{\partial E}{\partial w_{ij}}(t)\, \frac{\partial E}{\partial w_{ij}}(t-1) > 0
\text{ then } g_{ij}(t) = g_{ij}(t-1) + 0.05,
\text{ else } g_{ij}(t) = g_{ij}(t-1) \times 0.95$$
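A vectorized sketch of this gain-adaptation rule, also applying the gain limits suggested on the next slide (the function name and per-step bookkeeping are illustrative):

```python
import numpy as np

def adaptive_gain_step(w, gains, grad, prev_grad, lr=0.01,
                       delta=0.05, gain_min=0.1, gain_max=10.0):
    """Additive increase when the gradient keeps its sign, multiplicative
    decrease when it flips; the gain scales the learning rate per weight."""
    same_sign = grad * prev_grad > 0
    gains = np.where(same_sign, gains + delta, gains * (1.0 - delta))
    gains = np.clip(gains, gain_min, gain_max)   # keep gains in a sane range
    w = w - lr * gains * grad
    return w, gains
```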
Tricks for making adaptive learning rates work better
• Limit the gains to lie in some
reasonable range
– e.g. [0.1, 10] or [.01, 100]
• Use full batch learning or big mini-
batches
– This ensures that changes in
the sign of the gradient are
not mainly due to the
sampling error of a mini-
batch.
• Adaptive learning rates can be
combined with momentum.
– Use the agreement in sign
between the current gradient for a
weight and the velocity for that
weight (Jacobs, 1989).
• Adaptive learning rates only deal with
axis-aligned effects.
– Momentum does not care about
the alignment of the axes.
Neural Networks for Machine Learning
Lecture 6e
rmsprop: Divide the gradient by a running average
of its recent magnitude
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky
rprop: Using only the sign of the gradient
• The magnitude of the gradient can be
very different for different weights
and can change during learning.
– This makes it hard to choose a
single global learning rate.
• For full batch learning, we can deal
with this variation by only using the
sign of the gradient.
– The weight updates are all of the
same magnitude.
– This escapes from plateaus with
tiny gradients quickly.
• rprop: This combines the idea of only
using the sign of the gradient with the
idea of adapting the step size separately
for each weight.
– Increase the step size for a weight
multiplicatively (e.g. times 1.2) if the
signs of its last two gradients agree.
– Otherwise decrease the step size
multiplicatively (e.g. times 0.5).
– Limit the step sizes to be less than 50
and more than a millionth (Mike
Shuster’s advice).
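A simplified sketch of rprop as described above (no weight-backtracking; the initial per-weight step sizes are left to the caller):

```python
import numpy as np

def rprop_step(w, step, grad, prev_grad,
               up=1.2, down=0.5, step_min=1e-6, step_max=50.0):
    """Full-batch rprop: use only the sign of the gradient and adapt a
    per-weight step size depending on whether the last two gradients agree."""
    agreement = grad * prev_grad
    step = np.where(agreement > 0, step * up, step)      # signs agree: grow
    step = np.where(agreement < 0, step * down, step)    # signs flip: shrink
    step = np.clip(step, step_min, step_max)             # between 1e-6 and 50
    w = w - step * np.sign(grad)
    return w, step
```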
Why rprop does not work with mini-batches
• The idea behind stochastic gradient
descent is that when the learning
rate is small, it averages the
gradients over successive mini-
batches.
– Consider a weight that gets a
gradient of +0.1 on nine mini-
batches and a gradient of -0.9
on the tenth mini-batch.
– We want this weight to stay
roughly where it is.
• rprop would increment the weight
nine times and decrement it once by
about the same amount (assuming
any adaptation of the step sizes is
small on this time-scale).
– So the weight would grow a lot.
• Is there a way to combine:
– The robustness of rprop.
– The efficiency of mini-batches.
– The effective averaging of
gradients over mini-batches.
rmsprop: A mini-batch version of rprop
• rprop is equivalent to using the gradient but also dividing by the size of the
gradient.
– The problem with mini-batch rprop is that we divide by a different number
for each mini-batch. So why not force the number we divide by to be very
similar for adjacent mini-batches?
• rmsprop: Keep a moving average of the squared gradient for each weight:

$$\text{MeanSquare}(w, t) = 0.9\, \text{MeanSquare}(w, t-1) + 0.1 \left(\frac{\partial E}{\partial w}(t)\right)^{2}$$

• Dividing the gradient by $\sqrt{\text{MeanSquare}(w, t)}$ makes the learning work much
better (Tijmen Tieleman, unpublished).
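A sketch of one rmsprop update (the small eps added before dividing and the default learning rate are assumptions for numerical stability, not part of the slide):

```python
import numpy as np

def rmsprop_step(w, mean_square, grad, lr=0.001, decay=0.9, eps=1e-8):
    """Keep a running average of the squared gradient for each weight and
    divide the gradient by its square root."""
    mean_square = decay * mean_square + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(mean_square) + eps)
    return w, mean_square
```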
Further developments of rmsprop
• Combining rmsprop with standard momentum
– Momentum does not help as much as it normally does. Needs more
investigation.
• Combining rmsprop with Nesterov momentum (Sutskever 2012)
– It works best if the RMS of the recent gradients is used to divide the
correction rather than the jump in the direction of accumulated corrections.
• Combining rmsprop with adaptive learning rates for each connection
– Needs more investigation.
• Other methods related to rmsprop
– Yann LeCun’s group has a fancy version in “No more pesky learning rates”
Summary of learning methods for neural networks
• For small datasets (e.g. 10,000 cases)
or bigger datasets without much
redundancy, use a full-batch method.
– Conjugate gradient, LBFGS ...
– adaptive learning rates, rprop ...
• For big, redundant datasets use mini-
batches.
– Try gradient descent with
momentum.
– Try rmsprop (with momentum ?)
– Try LeCun’s latest recipe.
• Why there is no simple recipe:
Neural nets differ a lot:
– Very deep nets (especially ones
with narrow bottlenecks).
– Recurrent nets.
– Wide shallow nets.
Tasks differ a lot:
– Some require very accurate
weights, some don’t.
– Some have many very rare
cases (e.g. words).