An Idiot’s guide to Support vector
machines (SVMs)
R. Berwick, Village Idiot
SVMs: A New
Generation of Learning Algorithms
• Pre 1980:
– Almost all learning methods learned linear decision surfaces.
– Linear learning methods have nice theoretical properties
• 1980’s
– Decision trees and NNs allowed efficient learning of non-
linear decision surfaces
– Little theoretical basis and all suffer from local minima
• 1990’s
– Efficient learning algorithms for non-linear functions based
on computational learning theory developed
– Nice theoretical properties.
Key Ideas
• Two independent developments within last decade
– New efficient methods for separating non-linear regions using
“kernel functions”: a generalization of ‘similarity’ to
new kinds of similarity measures based on dot products
– Use of quadratic optimization problem to avoid ‘local
minimum’ issues with neural nets
– The resulting learning algorithm is an optimization
algorithm rather than a greedy search
Organization
• Basic idea of support vector machines: just like 1-
layer or multi-layer neural nets
– Optimal hyperplane for linearly separable
patterns
– Extend to patterns that are not linearly
separable by transformations of original data to
map into new space – the Kernel function
• SVM algorithm for pattern recognition
Support Vectors
• Support vectors are the data points that lie closest
to the decision surface (or hyperplane)
• They are the data points most difficult to classify
• They have direct bearing on the optimum location
of the decision surface
• We can show that the optimal hyperplane stems
from the function class with the lowest
“capacity”= # of independent features/parameters
we can twiddle [note this is ‘extra’ material not
covered in the lectures… you don’t have to know
this]
Recall from 1-layer nets : Which Separating
Hyperplane?
• In general, lots of possible
solutions for a,b,c (an
infinite number!)
• Support Vector Machine
(SVM) finds an optimal
solution
Support Vector Machine (SVM)
Support vectors
Maximize
margin
• SVMs maximize the margin
(Winston terminology: the ‘street’)
around the separating hyperplane.
• The decision function is fully
specified by a (usually very small)
subset of training samples, the
support vectors.
• This becomes a Quadratic
programming problem that is easy
to solve by standard methods
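As a concrete illustration (not part of the original slides), here is a minimal sketch using scikit-learn’s SVC with a linear kernel on made-up 2-D data; a very large C approximates the hard-margin case described here, and the fitted model exposes the small set of support vectors that fully specify the decision function.

```python
# Minimal sketch (made-up data): fit a linear SVM and inspect the handful of
# support vectors that fully specify the decision function.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [2.5, 2.5],        # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-2.5, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)    # very large C ~ the hard-margin case
clf.fit(X, y)

print(clf.support_vectors_)          # only a few training points define the boundary
print(clf.coef_, clf.intercept_)     # w and b of the separating hyperplane
```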
Separation by Hyperplanes
• Assume linear separability for now (we will relax this
later)
• in 2 dimensions, can separate by a line
– in higher dimensions, need hyperplanes
General input/output for SVMs just like for
neural nets, but for one important addition…
Input: set of (input, output) training pair samples; call the
input sample features x1, x2…xn, and the output result y.
Typically, there can be lots of input features xi.
Output: set of weights w (or wi), one for each feature,
whose linear combination predicts the value of y. (So far,
just like neural nets…)
Important difference: we use the optimization of maximizing
the margin (‘street width’) to reduce the number of weights
that are nonzero to just a few that correspond to the
important features that ‘matter’ in deciding the separating
line(hyperplane)…these nonzero weights correspond to the
support vectors (because they ‘support’ the separating
hyperplane)
2-D Case
Find a,b,c, such that
ax + by ≥ c for red points
ax + by ≤ (or < ) c for green
points.
Which Hyperplane to pick?
• Lots of possible solutions for a,b,c.
• Some methods find a separating
hyperplane, but not the optimal one (e.g.,
neural net)
• But: Which points should influence
optimality?
– All points?
• Linear regression
• Neural nets
– Or only “difficult points” close to
decision boundary
• Support vector machines
Support Vectors again for linearly separable case
• Support vectors are the elements of the training set that
would change the position of the dividing hyperplane if
removed.
• Support vectors are the critical elements of the training set
• The problem of finding the optimal hyperplane is an
optimization problem and can be solved by optimization
techniques (we use Lagrange multipliers to get this
problem into a form that can be solved analytically).
Support Vectors: Input vectors that just touch the boundary of the
margin (street) – circled in the figure, there are 3 of them (or, rather, the
‘tips’ of the vectors). These points satisfy w0Tx + b0 = 1 or w0Tx + b0 = –1.
Here, we have shown the actual support vectors, v1, v2, v3, instead of
just the 3 circled points at the tail ends of the support vectors. d
denotes 1/2 of the street ‘width’.
Definitions
Define the hyperplanes H such that:
w•xi+b ≥ +1 when yi =+1
w•xi+b ≤ -1 when yi = –1
d+ = the shortest distance to the closest positive point
d- = the shortest distance to the closest negative point
The margin (gutter) of a separating hyperplane is d+ + d–.
H1 and H2 are the planes:
H1: w•xi+b = +1
H2: w•xi+b = –1
The points on the planes H1 and
H2 are the tips of the Support
Vectors
The plane H0 is the median in
between, where w•xi+b =0
Moving a support vector
moves the decision
boundary
Moving the other vectors
has no effect
The optimization algorithm to generate the weights proceeds in such a
way that only the support vectors determine the weights and thus the
boundary
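A small sketch of this point, assuming scikit-learn and made-up data: refitting after dropping a point that is not a support vector should leave w and b unchanged.

```python
# Sketch (made-up data): dropping a point that is NOT a support vector leaves
# the fitted hyperplane unchanged.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 3.0], [3.0, 2.0],
              [-1.0, -1.0], [-2.0, -3.0], [-3.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

base = SVC(kernel="linear", C=1e6).fit(X, y)
print(base.support_)                          # indices of the support vectors

keep = [i for i in range(len(X)) if i != 1]   # drop [2, 3], which sits well off the margin
refit = SVC(kernel="linear", C=1e6).fit(X[keep], y[keep])

# w and b should be (numerically) identical with or without the dropped point
print(np.allclose(base.coef_, refit.coef_), np.allclose(base.intercept_, refit.intercept_))
```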
Maximizing the margin (aka street width)
We want a classifier (linear separator)
with as big a margin as possible.
Recall the distance from a point(x0,y0) to a line:
Ax+By+c = 0 is: |Ax0 +By0 +c|/sqrt(A2+B2), so,
The distance between H0 and H1 is then:
|w•x+b|/||w||=1/||w||, so
The total distance between H1 and H2 is thus: 2/||w||
In order to maximize the margin, we thus need to minimize ||w||. With the
condition that there are no datapoints between H1 and H2:
xi•w+b ≥ +1 when yi =+1
xi•w+b ≤ –1 when yi =–1  These can be combined into: yi(xi•w + b) ≥ 1
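A quick sketch (scikit-learn, toy data, not from the slides) of reading the street width 2/||w|| off a fitted linear SVM:

```python
# Sketch: the geometric margin (street width) of a fitted linear SVM is 2/||w||.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.5], [-2.0, -2.0], [-3.0, -3.5]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
w = clf.coef_[0]
print(2.0 / np.linalg.norm(w))   # distance between the gutters H1 and H2
```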
We now must solve a quadratic
programming problem
• Problem is: minimize ||w||, s.t. discrimination boundary is
obeyed, i.e., min f(x) s.t. g(x)=0, which we can rewrite as:
min f: ½ ||w||2 (Note this is a quadratic function)
s.t. g: yi(w•xi + b) = 1 or [yi(w•xi + b)] – 1 = 0
This is a constrained optimization problem
It can be solved by the Lagrangian multiplier method
Because it is quadratic, the surface is a paraboloid, with just a
single global minimum (thus avoiding a problem we had
with neural nets!)
Example: paraboloid f(x,y) = 2 + x2 + 2y2, minimized s.t. x + y = 1
Intuition: find the intersection of the two functions f, g at
a tangent point (intersection = both constraints
satisfied; tangent = derivative is 0); this will be a
min (or max) for f s.t. the constraint g is satisfied
Flattened (contour) view of the paraboloid f = 2 + x2 + 2y2 with the superimposed
constraint g: x + y = 1
We minimize where the constraint line g (shown in green)
is tangent to the inner ellipse contour lines of f (shown in red) –
note the direction of the gradient arrows.
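For illustration, a minimal sketch that solves this exact toy problem numerically with SciPy’s SLSQP solver (which handles the equality constraint via Lagrange-multiplier machinery internally); the expected answer is x = 2/3, y = 1/3.

```python
# Sketch: minimize f(x, y) = 2 + x^2 + 2y^2 subject to g(x, y) = x + y - 1 = 0.
import numpy as np
from scipy.optimize import minimize

f = lambda v: 2 + v[0]**2 + 2 * v[1]**2
g = {"type": "eq", "fun": lambda v: v[0] + v[1] - 1}

res = minimize(f, x0=np.zeros(2), constraints=[g], method="SLSQP")
print(res.x)   # ~ [2/3, 1/3], the point where a contour of f is tangent to x + y = 1
```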
Flattened (contour) view of the paraboloid f = 2 + x2 + 2y2 with the superimposed constraint
g: x + y = 1; at the tangent solution p, the gradient vectors of f and g are
parallel (no possible move that increments f also keeps you in the
region g)
We minimize where the constraint line g is tangent to the inner ellipse
contour line of f
Two constraints
1. Parallel normal constraint (= gradient constraint
on f, g s.t. solution is a max, or a min)
2. g(x)=0 (solution is on the constraint line as well)
We now recast these by combining f, g as the new
Lagrangian function by introducing new ‘slack
variables’ denoted a or (more usually, denoted α
in the literature)
Redescribing these conditions
• Want to look for a solution point p where
  ∇f(p) = a ∇g(p)
  g(x) = 0
• Or, combining these two as the Lagrangian L and requiring the derivative of L to be zero:
  L(x,a) = f(x) – a g(x)
  ∇L(x,a) = 0
At a solution p
• The constraint line g and the contour lines of f must be tangent
• If they are tangent, their gradient vectors (their perpendiculars) are parallel
• The gradient of g is perpendicular to the constraint line g(x)=0 – it points in the
direction of steepest ascent of g
• The gradient of f must therefore point in the same (or exactly opposite) direction as
the gradient of g
How the Lagrangian solves constrained
optimization
L(x,a) = f(x) – a g(x), where ∇L(x,a) = 0
Partial derivatives wrt x recover the parallel normal (gradient)
constraint
Partial derivatives wrt the multiplier a recover the constraint g(x) = 0
In general,
L(x,a) = f(x) + Σi ai gi(x)
This is a function of n + m variables: n for the x’s, m for the a’s.
Differentiating gives n + m equations, each set to 0. The n equations
differentiated wrt each xi give the gradient conditions; the m equations
differentiated wrt each ai recover the constraints gi.
(The f term supplies the gradient/minimum condition; the g terms supply
the constraint conditions.)
In our case, f(x): ½|| w||2 ; g(x): yi(w•xi +b)–1=0 so Lagrangian is:
min L= ½|| w||2 – Σai[yi(w•xi +b)–1] wrt w, b
We expand the last to get the following L form:
min L= ½|| w||2 – Σaiyi(w•xi +b) +Σai wrt w, b
Lagrangian Formulation
• So in the SVM problem the Lagrangian is
  min LP = ½||w||2 – Σi=1..l ai yi(xi•w + b) + Σi=1..l ai
  s.t. ∀i, ai ≥ 0, where l is the # of training points
• From the property that the derivatives at the min are 0 we get:
  ∂LP/∂w = w – Σi=1..l ai yi xi = 0  and  ∂LP/∂b = Σi=1..l ai yi = 0, so
  w = Σi=1..l ai yi xi   and   Σi=1..l ai yi = 0
What’s with this Lp business?
• This indicates that this is the primal form of the
optimization problem
• We will actually solve the optimization problem
by solving for the dual of this original problem
instead
• What is this dual formulation?
The Lagrangian Dual Problem: instead of minimizing over w, b,
subject to constraints involving a’s, we can maximize over a (the
dual variable) subject to the relations obtained previously for w
and b
Our solution must satisfy these two relations:
w = Σi=1..l ai yi xi   and   Σi=1..l ai yi = 0
By substituting for w and b back in the original eqn we can get rid of the
dependence on w and b.
Note first that we already have our answer for what the weights w
must be: they are a linear combination of the training inputs and the
training outputs, xi and yi and the values of a. We will now solve for the
a’s by differentiating the dual problem wrt a, and setting it to zero. Most
of the a’s will turn out to have the value zero. The non-zero a’s will
correspond to the support vectors
Primal problem:
min LP = ½||w||2 – Σi=1..l ai yi(xi•w + b) + Σi=1..l ai
s.t. ∀i, ai ≥ 0
with w = Σi=1..l ai yi xi   and   Σi=1..l ai yi = 0

Dual problem:
max LD(ai) = Σi=1..l ai – ½ Σi,j ai aj yi yj (xi•xj)
s.t. Σi=1..l ai yi = 0   and   ai ≥ 0
(note that we have removed the dependence on w and b)
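For intuition only, here is a sketch that feeds this dual directly to SciPy’s general-purpose SLSQP solver on a tiny made-up data set (a real SVM implementation would use SMO or a dedicated QP routine); the nonzero ai it returns should pick out the support vectors.

```python
# Sketch: maximize L_D by minimizing its negative under the two dual constraints.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T        # G[i, j] = yi yj (xi . xj)

neg_LD = lambda a: 0.5 * a @ G @ a - a.sum()     # -L_D(a)
cons = [{"type": "eq", "fun": lambda a: a @ y}]  # sum_i ai yi = 0
bnds = [(0, None)] * len(y)                      # ai >= 0 (hard margin)

res = minimize(neg_LD, np.zeros(len(y)), bounds=bnds, constraints=cons, method="SLSQP")
a = res.x
print(np.round(a, 4))      # nonzero ai should pick out the two points nearest the boundary
print((a * y) @ X)         # w = sum_i ai yi xi, recovered from the dual solution
```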
The Dual problem
• Kuhn-Tucker theorem: the solution we find here will
be the same as the solution to the original problem
• Q: But why are we doing this???? (why not just
solve the original problem????)
• Ans: Because this will let us solve the problem by
computing just the inner products of xi, xj (which
will be very important later on when we want to
solve non-linearly separable classification problems)
The Dual Problem
Dual problem:
max LD(ai) = Σi=1..l ai – ½ Σi,j ai aj yi yj (xi•xj)
s.t. Σi=1..l ai yi = 0   and   ai ≥ 0
Notice that all we have are the dot products of xi, xj.
If we take the derivative wrt a and set it equal to zero, we get the
conditions that the solution must satisfy, and we can then solve for the ai:
Σi=1..l ai yi = 0
0 ≤ ai ≤ C
Now knowing the ai we can find the
weights w for the maximal margin
separating hyperplane:
w = Σi=1..l ai yi xi
And now, after training and finding the w by this method,
given an unknown point u measured on features xi we
can classify it by looking at the sign of:
f(u) = w•u + b = Σi=1..l ai yi (xi•u) + b
Remember: most of the weights wi, i.e., the a’s, will be zero
Only the support vectors (on the gutters or margin) will have nonzero
weights or a’s – this reduces the dimensionality of the solution
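A short sketch of this recipe with scikit-learn on made-up data: dual_coef_ stores ai·yi for the support vectors only, so w, b and the sign test can be reconstructed directly.

```python
# Sketch: recover w from the support vectors and classify a new point by sign(w.u + b).
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 1.0], [3.0, 3.0], [-1.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.dual_coef_ @ clf.support_vectors_   # = sum_i ai yi xi (support vectors only)
b = clf.intercept_

u = np.array([1.5, 0.5])                    # an unseen point
print(np.sign(w @ u + b))                   # agrees with clf.predict([u])
```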
Why should inner product kernels be involved in pattern
recognition using SVMs, or at all?
– Intuition is that inner products provide some measure of
‘similarity’
– Inner product in 2D between 2 vectors of unit length returns the
cosine of the angle between them = how ‘far apart’ they are
e.g. for unit vectors in 2D such as x = [1, 0]T, y = [0, 1]T:
if two unit vectors are parallel, their inner product is 1 (completely similar):
xTx = x•x = 1
if they are perpendicular (completely unlike), their inner product is 0, so they
should not contribute to the correct classifier:
xTy = x•y = 0
Inner products, similarity, and SVMs
Insight into inner products
Consider that we are trying to maximize the form:
LD(ai) = Σi=1..l ai – ½ Σi,j ai aj yi yj (xi•xj)
s.t. Σi=1..l ai yi = 0   and   ai ≥ 0
The claim is that this function will be maximized if we give nonzero values to a’s that
correspond to the support vectors, ie, those that ‘matter’ in fixing the maximum width
margin (‘street’). Well, consider what this looks like. Note first from the constraint
condition that all the a’s are positive. Now let’s think about a few cases.
Case 1. If two features xi , xj are completely dissimilar, their dot product is 0, and they don’t
contribute to L.
Case 2. If two features xi, xj are completely alike, their dot product is 1 (for unit-length
vectors; large and positive in general). There are 2 subcases.
Subcase 1: both xi and xj predict the same output value yi (either +1 or –1). Then
yi yj is always 1, and the value of ai aj yi yj xi•xj will be positive. But this would decrease the
value of L (since it would subtract from the first term sum). So, the algorithm downgrades
similar feature vectors that make the same prediction.
Subcase 2: xi and xj make opposite predictions about the output value yi (i.e., one is
+1, the other –1), but are otherwise very closely similar: then the product ai aj yi yj xi•xj is
negative and we are subtracting it, so this adds to the sum, maximizing it. These are precisely
the examples we are looking for: the critical ones that tell the two classes apart.
Insight into inner products, graphically: 2 very
similar xi, xj vectors that predict different
classes tend to maximize the margin width
2 vectors that are similar but predict the
same class are redundant
2 dissimilar (orthogonal) vectors don’t
count at all
But…are we done???
Not Linearly Separable!
Find a line that penalizes
points on “the wrong side”
[Figure: input space X (x’s and o’s, not linearly separable) mapped by ϕ into feature
space F, where the images ϕ(x), ϕ(o) become linearly separable]
Transformation to separate
Non–Linear SVMs
• The idea is to gain linear separation by mapping the data to
a higher dimensional space
– The following 1-D set (sketched in the slide with two reference points a, b) can’t be
separated by a linear function, but can be separated by a quadratic one:
(x – a)(x – b) = x2 – (a + b)x + ab
– So if we map x → {x, x2} we gain linear separation (a minimal sketch of this
mapping follows below)
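Here is that sketch, assuming NumPy and scikit-learn and a made-up 1-D data set with the “inner vs. outer” labeling described above:

```python
# Minimal sketch: the raw 1-D points are not linearly separable, but after the
# map x -> (x, x^2) a linear SVM separates them with a straight line.
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, 1, -1, -1])     # inner points vs. outer points

phi = np.column_stack([x, x**2])            # explicit feature map x -> {x, x^2}
clf = SVC(kernel="linear", C=1e6).fit(phi, y)
print(clf.score(phi, y))                    # 1.0: linearly separable in the new space
```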
Problems with linear SVM
What if the decision function is not linear? What transform would separate these?
Ans: polar coordinates!
Non-linear SVM
The Kernel trick
Imagine a function φ that maps the data into another space H:
φ: Radial → Η
Remember the function we want to optimize: Ld = ∑ai – ½∑ai ajyiyj (xi•xj) where (xi•xj) is the
dot product of the two feature vectors. If we now transform to φ, instead of computing this
dot product (xi•xj) we will have to compute (φ (xi)• φ (xj)). But how can we do this? This is
expensive and time consuming (suppose φ is a quartic polynomial… or, worse, we don’t know the
function explicitly). Well, here is the neat thing:
If there is a ”kernel function” K such that K(xi,xj) = φ (xi)• φ (xj), then we do not need to know
or compute φ at all!! That is, the kernel function defines inner products in the transformed space.
Or, it defines similarity in the transformed space.
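A tiny sketch of this identity for the degree-2 polynomial kernel in 2-D; the explicit map φ below is one standard choice, written out only to verify the claim.

```python
# Sketch: K(x, z) = (x.z + 1)^2 equals phi(x).phi(z) for an explicit 6-D map phi,
# so the kernel computes the same number without ever forming phi.
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print((x @ z + 1) ** 2)    # kernel: one dot product, one add, one square
print(phi(x) @ phi(z))     # identical value via the explicit feature map
```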
Non-linear SVMs
So, the function we end up optimizing is:
Ld = ∑ai – ½∑aiaj yiyj K(xi,xj),
Kernel example: The polynomial kernel
K(xi,xj) = (xi•xj + 1)p, where p is a tunable parameter
Note: Evaluating K only requires one addition and one exponentiation
more than the original dot product
Examples for Non Linear SVMs
K(x, y) = (x•y + 1)p
K(x, y) = exp(–||x – y||2 / (2σ2))
K(x, y) = tanh(κ x•y – δ)
1st is polynomial (includes x•x as special case)
2nd is radial basis function (gaussians)
3rd is sigmoid (neural net activation function)
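For reference, a sketch of these three kernels as plain NumPy functions; p, sigma, kappa and delta stand in for the user-chosen parameters (and the sigmoid is a valid kernel only for some kappa, delta).

```python
# Sketch of the three kernels above as plain functions.
import numpy as np

def poly_kernel(x, z, p=2):
    return (x @ z + 1) ** p

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, z, kappa=1.0, delta=1.0):
    return np.tanh(kappa * (x @ z) - delta)
```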
We’ve already seen such nonlinear
transforms…
• What is it???
• tanh(β0xTxi + β1)
• It’s the sigmoid
transform (for neural
nets)
• So, SVMs subsume
neural nets! (but w/o
their problems…)
Inner Product Kernels
Type of Support Vector Machine | Inner Product Kernel K(x, xi), i = 1, 2, …, N | Comments
Polynomial learning machine | (xTxi + 1)p | Power p is specified a priori by the user
Radial-basis function (RBF) | exp(–(1/(2σ2)) ||x – xi||2) | The width σ2 is specified a priori
Two-layer neural net | tanh(β0xTxi + β1) | Actually works only for some values of β0 and β1
(xTxi denotes the usual inner product)
Kernels generalize the notion of ‘inner
product similarity’
Note that one can define kernels over more than just
vectors: strings, trees, structures, … in fact, just about
anything
A very powerful idea: used in comparing DNA, protein
structure, sentence structures, etc.
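As one illustration of a non-vector kernel, here is a toy k-spectrum string kernel: the “inner product” is a count of shared length-k substrings (the function name and parameters are illustrative, not from the slides).

```python
# Sketch: compare two strings by the dot product of their k-mer count vectors,
# without ever building those vectors by hand for a fixed feature space.
from collections import Counter

def spectrum_kernel(s, t, k=3):
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[sub] * ct[sub] for sub in cs)   # dot product of k-mer counts

print(spectrum_kernel("GATTACA", "GATTTACA"))     # number of shared 3-mer pairs
```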
Examples for Non Linear SVMs 2 –
Gaussian Kernel
[Figure: decision boundaries for a Gaussian kernel vs. a linear kernel]
Nonlinear RBF kernel
Admiral’s delight with different kernel
functions
Overfitting by SVM
Every point is a support vector… too much freedom to bend to fit the
training data – no generalization.
In fact, SVMs have an ‘automatic’ way to avoid such issues, but we
won’t cover it here… see the book by Vapnik, 1995. (We add a
penalty function for mistakes made after training by over-fitting: recall
that if one over-fits, then one will tend to make errors on new data.
This penalty fn can be put into the quadratic programming problem
directly. You don’t need to know this for this course.)
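To make the role of that penalty concrete, here is a sketch using scikit-learn’s soft-margin parameter C on noisy made-up data; a larger C bends the boundary toward every training label, which is exactly the over-fitting risk described above.

```python
# Sketch: the soft-margin penalty C trades margin width against training mistakes.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
flip = rng.choice(len(y), size=10, replace=False)   # inject some label noise
y[flip] = -y[flip]

for C in (0.1, 1e4):
    clf = SVC(kernel="rbf", C=C).fit(X[:150], y[:150])
    # typically: larger C -> higher training score, no better (often worse) held-out score
    print(C, clf.score(X[:150], y[:150]), clf.score(X[150:], y[150:]))
```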