1. Support vector machines (SVMs) are a type of machine learning algorithm that learn nonlinear decision boundaries using kernel functions to transform data into higher dimensions.
2. SVMs find the optimal separating hyperplane that maximizes the margin between positive and negative examples. This hyperplane is determined by the support vectors, which are the data points closest to the decision boundary.
3. The SVM optimization problem involves minimizing a loss function subject to constraints. This can be solved using Lagrangian duality, which transforms the problem into an equivalent maximization problem over dual variables instead of the original weights and biases.
Support vector machine in data mining.pdf
1. An Idiot's Guide to Support Vector Machines (SVMs)
R. Berwick, Village Idiot

SVMs: A New Generation of Learning Algorithms
• Pre-1980:
– Almost all learning methods learned linear decision surfaces.
– Linear learning methods have nice theoretical properties.
• 1980s:
– Decision trees and NNs allowed efficient learning of non-linear decision surfaces.
– Little theoretical basis, and all suffer from local minima.
• 1990s:
– Efficient learning algorithms for non-linear functions based on computational learning theory developed.
– Nice theoretical properties.
2. Key Ideas
• Two independent developments within the last decade:
– New efficient separability of non-linear regions using "kernel functions": a generalization of 'similarity' to new kinds of similarity measures based on dot products
– Use of a quadratic optimization problem to avoid the 'local minimum' issues of neural nets
– The resulting learning algorithm is an optimization algorithm rather than a greedy search

Organization
• Basic idea of support vector machines: just like 1-layer or multi-layer neural nets
– Optimal hyperplane for linearly separable patterns
– Extend to patterns that are not linearly separable by transformations of the original data that map it into a new space – the kernel function
• SVM algorithm for pattern recognition
3. Support Vectors
• Support vectors are the data points that lie closest to the decision surface (or hyperplane)
• They are the data points most difficult to classify
• They have direct bearing on the optimum location of the decision surface
• We can show that the optimal hyperplane stems from the function class with the lowest "capacity" = # of independent features/parameters we can twiddle [note this is 'extra' material not covered in the lectures… you don't have to know this]

Recall from 1-layer nets: Which Separating Hyperplane?
• In general, there are lots of possible solutions for a, b, c (an infinite number!)
• The Support Vector Machine (SVM) finds an optimal solution
4. Support Vector Machine (SVM)
[Figure: support vectors and the maximized margin around the separating hyperplane.]
• SVMs maximize the margin (Winston terminology: the 'street') around the separating hyperplane.
• The decision function is fully specified by a (usually very small) subset of training samples, the support vectors.
• This becomes a quadratic programming problem that is easy to solve by standard methods.

Separation by Hyperplanes
• Assume linear separability for now (we will relax this later)
• In 2 dimensions, we can separate by a line
– In higher dimensions, we need hyperplanes
5. General input/output for SVMs: just like for neural nets, but with one important addition…
Input: a set of (input, output) training pair samples; call the input sample features x1, x2, …, xn, and the output result y. Typically, there can be lots of input features xi.
Output: a set of weights w (or wi), one for each feature, whose linear combination predicts the value of y. (So far, just like neural nets…)
Important difference: we use the optimization of maximizing the margin ('street width') to reduce the number of nonzero weights to just a few that correspond to the important features that 'matter' in deciding the separating line (hyperplane). These nonzero weights correspond to the support vectors (because they 'support' the separating hyperplane).

2-D Case
Find a, b, c such that
ax + by ≥ c for red points
ax + by ≤ c (or <) for green points.
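To make the 2-D case concrete, here is a minimal sketch (not from the slides) of such a linear decision rule; the coefficients a, b, c are arbitrary illustrative values, not learned ones.

```python
import numpy as np

# Hypothetical coefficients of a candidate separating line ax + by = c
a, b, c = 1.0, 2.0, 3.0

def classify(points):
    """Return +1 where ax + by >= c ('red' side), -1 otherwise ('green' side)."""
    x, y = points[:, 0], points[:, 1]
    return np.where(a * x + b * y >= c, 1, -1)

print(classify(np.array([[4.0, 1.0], [0.0, 0.0]])))  # -> [ 1 -1]
```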
6. Which Hyperplane to Pick?
• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one (e.g., neural nets)
• But: which points should influence optimality?
– All points?
• Linear regression
• Neural nets
– Or only the "difficult points" close to the decision boundary?
• Support vector machines

Support Vectors, again, for the linearly separable case
• Support vectors are the elements of the training set that would change the position of the dividing hyperplane if removed.
• Support vectors are the critical elements of the training set.
• The problem of finding the optimal hyperplane is an optimization problem and can be solved by optimization techniques (we use Lagrange multipliers to get this problem into a form that can be solved analytically).
7. Support Vectors: input vectors that just touch the boundary of the margin (street) – circled in the figure, there are 3 of them (or, rather, the 'tips' of those vectors). The margin hyperplanes are w0•x + b0 = 1 and w0•x + b0 = –1.
[Figure: the same data shown with the actual support vectors v1, v2, v3 drawn in, instead of just the 3 circled points at the tail ends of the support vectors; d denotes 1/2 of the street 'width'.]
8. Definitions
Define the hyperplanes H such that:
w•xi + b ≥ +1 when yi = +1
w•xi + b ≤ –1 when yi = –1
d+ = the shortest distance to the closest positive point
d– = the shortest distance to the closest negative point
The margin (gutter) of a separating hyperplane is d+ + d–.

H1 and H2 are the planes:
H1: w•xi + b = +1
H2: w•xi + b = –1
The points on the planes H1 and H2 are the tips of the support vectors.
The plane H0 is the median in between, where w•xi + b = 0.
Moving a support vector moves the decision boundary; moving the other vectors has no effect.
The optimization algorithm to generate the weights proceeds in such a way that only the support vectors determine the weights, and thus the boundary.
9. Maximizing the margin (aka street width)
We want a classifier (linear separator) with as big a margin as possible.
Recall that the distance from a point (x0, y0) to the line Ax + By + c = 0 is |Ax0 + By0 + c| / sqrt(A² + B²). So:
– The distance between H0 and H1 is |w•x + b| / ||w|| = 1 / ||w||
– The total distance between H1 and H2 is thus 2 / ||w||
In order to maximize the margin, we thus need to minimize ||w||, with the condition that there are no data points between H1 and H2:
xi•w + b ≥ +1 when yi = +1
xi•w + b ≤ –1 when yi = –1
These can be combined into: yi(xi•w + b) ≥ 1

We now must solve a quadratic programming problem
• The problem is: minimize ||w||, s.t. the discrimination boundary is obeyed, i.e., min f(x) s.t. g(x) = 0, which we can rewrite as:
min f: ½||w||²  (note this is a quadratic function)
s.t. g: yi(w•xi + b) = 1, i.e. [yi(w•xi + b)] – 1 = 0
This is a constrained optimization problem.
It can be solved by the Lagrangian multiplier method.
Because it is quadratic, the surface is a paraboloid, with just a single global minimum (thus avoiding a problem we had with neural nets!)
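As a quick sanity check on these definitions, here is a small sketch (not part of the slides; the data and hyperplane are made-up illustrative values) that verifies the constraint yi(w•xi + b) ≥ 1 and computes the street width 2/||w||.

```python
import numpy as np

# Toy, hypothetical linearly separable data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])

# A candidate separating hyperplane w.x + b = 0 (illustrative, not the optimum)
w = np.array([0.5, 0.5])
b = 0.0

# The constraint y_i (w.x_i + b) >= 1 must hold for every training point
functional_margins = y * (X @ w + b)
print("constraints satisfied:", bool(np.all(functional_margins >= 1)))

# The street width (distance between H1 and H2) is 2 / ||w||
print("margin width:", 2.0 / np.linalg.norm(w))
```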
10. Example: the paraboloid f = 2 + x² + 2y², s.t. x + y = 1
Intuition: find the intersection of the two functions f, g at a tangent point (intersection = both constraints satisfied; tangent = derivative is 0); this will be a min (or max) for f s.t. the constraint g is satisfied.
[Figure: the flattened paraboloid f = 2 + x² + 2y² with the superimposed constraint g: x + y = 1. We minimize when the constraint line g (shown in green) is tangent to the inner ellipse contour lines of f (shown in red) – note the direction of the gradient arrows.]

11. [Figure: the flattened paraboloid f = 2 + x² + 2y² with the superimposed constraint g: x + y = 1; at the tangent solution p, the gradient vectors of f and g are parallel (there is no possible move to increment f that also keeps you in the region g). We minimize when the constraint line g is tangent to the inner ellipse contour line of f.]
Two constraints
1. Parallel normal constraint (= gradient constraint on f, g s.t. the solution is a max, or a min)
2. g(x) = 0 (the solution is on the constraint line as well)
We now recast these by combining f, g into a new function, the Lagrangian, by introducing new Lagrange multipliers denoted a (more usually denoted α in the literature).
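To make the tangency picture concrete, here is a small sketch (added here, not in the original slides) that solves the slide's toy problem numerically with SciPy and checks it against the hand-derived Lagrange-multiplier solution.

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem from the slide: minimize f = 2 + x^2 + 2y^2 subject to x + y = 1
f = lambda v: 2 + v[0] ** 2 + 2 * v[1] ** 2
constraint = {"type": "eq", "fun": lambda v: v[0] + v[1] - 1}

res = minimize(f, x0=np.zeros(2), constraints=[constraint])
print(res.x)  # approximately [2/3, 1/3]

# Analytic check via the Lagrangian L = f - a*g: setting grad f = a * grad g
# gives 2x = a and 4y = a, so x = 2y; with x + y = 1 this yields x = 2/3, y = 1/3.
```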
12. Redescribing these conditions
• We want to look for a solution point p where
∇f(p) = a ∇g(p) and g(x) = 0
• Or, combining these two as the Lagrangian L and requiring the derivative of L to be zero:
L(x, a) = f(x) – a g(x), with ∂L/∂x (x, a) = 0

At a solution p
• The constraint line g and the contour lines of f must be tangent
• If they are tangent, their gradient vectors (perpendiculars) are parallel
• The gradient of g points along the direction of steepest ascent, perpendicular to the constraint line g
• The gradient of f must be in the same direction as the gradient of g
13. How the Lagrangian solves constrained optimization
L(x, a) = f(x) – a g(x), where ∂L/∂x (x, a) = 0
Partial derivatives wrt x recover the parallel-normal (gradient) constraint.
Partial derivatives wrt the multiplier a recover the constraint g(x, y) = 0.
In general,
L(x, a) = f(x) + Σi ai gi(x)
is a function of n + m variables: n for the x's, m for the a's. Differentiating gives n + m equations, each set to 0. The n equations differentiated wrt each xi give the gradient conditions; the m equations differentiated wrt each ai recover the constraints gi.
In our case, f(x) is ½||w||² and g(x) is yi(w•xi + b) – 1 = 0, so the Lagrangian is:
min L = ½||w||² – Σ ai [yi(w•xi + b) – 1]  wrt w, b
We expand the last term to get the following form of L:
min L = ½||w||² – Σ ai yi(w•xi + b) + Σ ai  wrt w, b
14. Lagrangian Formulation
• So in the SVM problem the Lagrangian is
min L_P = ½||w||² – Σ(i=1..l) ai yi (xi•w + b) + Σ(i=1..l) ai
s.t. ∀i, ai ≥ 0, where l is the # of training points
• From the property that the derivatives at the min are 0 we get
∂L_P/∂w = w – Σ(i=1..l) ai yi xi = 0 and ∂L_P/∂b = Σ(i=1..l) ai yi = 0, so
w = Σ(i=1..l) ai yi xi and Σ(i=1..l) ai yi = 0

What's with this L_P business?
• This indicates that this is the primal form of the optimization problem
• We will actually solve the optimization problem by solving for the dual of this original problem
• What is this dual formulation?
15. The Lagrangian Dual Problem: instead of minimizing over w, b, subject to constraints involving the a's, we can maximize over a (the dual variable) subject to the relations obtained previously for w and b.
Our solution must satisfy these two relations:
w = Σ(i=1..l) ai yi xi and Σ(i=1..l) ai yi = 0
By substituting for w and b back into the original equation we can get rid of the dependence on w and b.
Note first that we already have our answer for what the weights w must be: they are a linear combination of the training inputs xi, the training outputs yi, and the values of a. We will now solve for the a's by differentiating the dual problem wrt a and setting it to zero. Most of the a's will turn out to have the value zero; the non-zero a's will correspond to the support vectors.

Primal problem:
min L_P = ½||w||² – Σ(i=1..l) ai yi (xi•w + b) + Σ(i=1..l) ai
s.t. ∀i, ai ≥ 0
with w = Σ(i=1..l) ai yi xi and Σ(i=1..l) ai yi = 0

Dual problem:
max L_D(ai) = Σ(i=1..l) ai – ½ Σ(i=1..l) Σ(j=1..l) ai aj yi yj (xi•xj)
s.t. Σ(i=1..l) ai yi = 0 and ai ≥ 0
(note that we have removed the dependence on w and b)
16. The Dual Problem
• Kuhn-Tucker theorem: the solution we find here will be the same as the solution to the original problem
• Q: But why are we doing this? (Why not just solve the original problem?)
• Ans: Because this will let us solve the problem by computing just the inner products of xi, xj (which will be very important later on, when we want to solve non-linearly separable classification problems)

The Dual Problem
Dual problem:
max L_D(ai) = Σ(i=1..l) ai – ½ Σ(i=1..l) Σ(j=1..l) ai aj yi yj (xi•xj)
s.t. Σ(i=1..l) ai yi = 0 and ai ≥ 0
Notice that all we have are the dot products of xi, xj.
If we take the derivative wrt a and set it equal to zero, we get the following conditions, from which we can solve for the ai:
Σ(i=1..l) ai yi = 0 and 0 ≤ ai ≤ C
17. Now, knowing the ai, we can find the weights w for the maximal-margin separating hyperplane:
w = Σ(i=1..l) ai yi xi
And now, after training and finding w by this method, given an unknown point u measured on the features xi, we can classify it by looking at the sign of:
f(u) = w•u + b = Σ(i=1..l) ai yi (xi•u) + b
Remember: most of the weights, i.e., the a's, will be zero.
Only the support vectors (on the gutters or margin) will have nonzero weights or a's – this reduces the dimensionality of the solution.

Inner products, similarity, and SVMs
Why should inner product kernels be involved in pattern recognition using SVMs, or at all?
– The intuition is that inner products provide some measure of 'similarity'
– The inner product in 2D between 2 vectors of unit length returns the cosine of the angle between them = how 'far apart' they are
If they are parallel, their inner product is 1 (completely similar): xᵀy = x•y = 1
If they are perpendicular (completely unlike), e.g. x = [1, 0]ᵀ, y = [0, 1]ᵀ, their inner product is 0 (so they should not contribute to the correct classifier): xᵀy = x•y = 0
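A minimal sketch of this decision rule (continuing the made-up data from the dual-solver sketch above; the values of a and b are illustrative, not canonical):

```python
import numpy as np

# Hypothetical training data, multipliers a, and bias b (only support vectors
# carry nonzero a_i); these match the toy dual-solver example above.
X = np.array([[2.0, 2.0], [4.0, 4.0], [-2.0, -2.0], [-4.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
a = np.array([0.0625, 0.0, 0.0625, 0.0])
b = 0.0

def predict(u):
    # f(u) = sum_i a_i y_i (x_i . u) + b; only support vectors contribute
    return np.sign(np.sum(a * y * (X @ u)) + b)

print(predict(np.array([3.0, 1.0])))    # +1.0
print(predict(np.array([-1.0, -2.0])))  # -1.0
```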
18. Insight into inner products
Consider that we are trying to maximize the form:
L_D(ai) = Σ(i=1..l) ai – ½ Σ(i=1..l) Σ(j=1..l) ai aj yi yj (xi•xj)
s.t. Σ(i=1..l) ai yi = 0 and ai ≥ 0
The claim is that this function will be maximized if we give nonzero values to the a's that correspond to the support vectors, i.e., those that 'matter' in fixing the maximum-width margin ('street'). Consider what this looks like. Note first from the constraint condition that all the a's are non-negative. Now let's think about a few cases.
Case 1. If two feature vectors xi, xj are completely dissimilar, their dot product is 0, and they don't contribute to L.
Case 2. If two feature vectors xi, xj are completely alike, their dot product is maximal (1 for unit vectors). There are 2 subcases.
Subcase 1: both xi and xj predict the same output value yi (either +1 or –1). Then yi yj is always 1, and the value of ai aj yi yj (xi•xj) will be positive. But this would decrease the value of L (since it is subtracted from the first-term sum). So the algorithm downgrades similar feature vectors that make the same prediction.
Subcase 2: xi and xj make opposite predictions about the output value (i.e., one is +1, the other –1), but are otherwise very closely similar: then the product ai aj yi yj (xi•xj) is negative, and since we are subtracting it, this adds to the sum, maximizing it. These are precisely the examples we are looking for: the critical ones that tell the two classes apart.
[Figure: insight into inner products, graphically: two very similar xi, xj vectors that predict different classes tend to maximize the margin width.]
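A tiny numeric illustration of these cases (the vectors are arbitrary, with ai = aj = 1 for clarity; only the sign of the contribution matters):

```python
import numpy as np

# Contribution of a pair (i, j) to the subtracted term a_i a_j y_i y_j (x_i . x_j),
# with illustrative a_i = a_j = 1.
xi = np.array([1.0, 0.0])
similar_same_class = (1 * 1) * (+1 * +1) * np.dot(xi, np.array([0.9, 0.1]))  # > 0: lowers L_D
similar_diff_class = (1 * 1) * (+1 * -1) * np.dot(xi, np.array([0.9, 0.1]))  # < 0: raises L_D
orthogonal_pair    = (1 * 1) * (+1 * -1) * np.dot(xi, np.array([0.0, 1.0]))  # = 0: no effect
print(similar_same_class, similar_diff_class, orthogonal_pair)
```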
19. [Figure: two vectors that are similar but predict the same class are redundant.]
[Figure: two dissimilar (orthogonal) vectors don't count at all.]
21. Transformation to separate
[Figure: a mapping ϕ from the input space X to a feature space F, sending the x points to ϕ(x) and the o points to ϕ(o), where the two classes become linearly separable.]

Non-Linear SVMs
• The idea is to gain linear separation by mapping the data to a higher-dimensional space
– The following set can't be separated by a linear function, but can be separated by a quadratic one:
(x – a)(x – b) = x² – (a + b)x + ab
– So if we map x → {x², x}, we gain linear separation
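A short sketch of that explicit 1-D map (the interval endpoints a, b and the sample points are made-up values):

```python
import numpy as np

# Points between a and b form one class; the map phi(x) = (x, x^2) makes the
# two classes linearly separable, because (x - a)(x - b) = x^2 - (a + b)x + ab
# is a linear function of the features (x, x^2).
a_pt, b_pt = -1.0, 1.0
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.where((x > a_pt) & (x < b_pt), 1, -1)      # +1 inside the interval

phi = np.column_stack([x, x ** 2])                # explicit feature map

# Linear score in feature space; its sign separates the classes exactly
scores = phi[:, 1] - (a_pt + b_pt) * phi[:, 0] + a_pt * b_pt
print(bool(np.all(np.sign(-scores) == y)))        # True
```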
22. Problems with linear SVM
[Figure: points labeled +1 and –1 arranged so that no linear decision function separates them.]
What if the decision function is not linear? What transform would separate these? Ans: polar coordinates!

Non-linear SVM: the kernel trick
Imagine a function φ that maps the data into another space, e.g. φ: Radial → Η.
Remember the function we want to optimize: L_D = Σ ai – ½ Σ ai aj yi yj (xi•xj), where (xi•xj) is the dot product of the two feature vectors. If we now transform by φ, then instead of computing this dot product (xi•xj) we will have to compute (φ(xi)•φ(xj)). But how can we do this? This is expensive and time-consuming (suppose φ is a quartic polynomial… or worse, we don't know the function explicitly). Well, here is the neat thing:
If there is a "kernel function" K such that K(xi, xj) = φ(xi)•φ(xj), then we do not need to know or compute φ at all! That is, the kernel function defines inner products in the transformed space. Or: it defines similarity in the transformed space.
23. Non-linear SVMs
So, the function we end up optimizing is:
L_D = Σ ai – ½ Σ ai aj yi yj K(xi, xj)
Kernel example: the polynomial kernel
K(xi, xj) = (xi•xj + 1)^p, where p is a tunable parameter
Note: evaluating K only requires one addition and one exponentiation more than the original dot product.

Examples for Non-Linear SVMs
K(x, y) = (x•y + 1)^p
K(x, y) = exp(–||x – y||² / (2σ²))
K(x, y) = tanh(κ x•y – δ)
The 1st is the polynomial kernel (includes the plain dot product x•y as a special case); the 2nd is the radial basis function (Gaussians); the 3rd is the sigmoid (the neural-net activation function).
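A minimal sketch of these three kernels (the parameter values p, σ, κ, δ below are arbitrary illustrative choices):

```python
import numpy as np

def poly_kernel(x, y, p=2):
    # (x.y + 1)^p
    return (np.dot(x, y) + 1.0) ** p

def rbf_kernel(x, y, sigma=1.0):
    # exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, delta=1.0):
    # tanh(kappa x.y - delta)
    return np.tanh(kappa * np.dot(x, y) - delta)

# With a kernel, the decision rule becomes f(u) = sum_i a_i y_i K(x_i, u) + b,
# and the dual objective swaps (x_i . x_j) for K(x_i, x_j).
x1, x2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(poly_kernel(x1, x2), rbf_kernel(x1, x2), sigmoid_kernel(x1, x2))
```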
24. We've already seen such nonlinear transforms…
• What is it? tanh(β0 xᵀxi + β1)
• It's the sigmoid transform (for neural nets)
• So, SVMs subsume neural nets! (but w/o their problems…)

Inner Product Kernels
Type of Support Vector Machine | Inner Product Kernel K(x, xi), i = 1, 2, …, N | Comments
Polynomial learning machine | (xᵀxi + 1)^p | Power p is specified a priori by the user
Radial-basis function (RBF) | exp(–||x – xi||² / (2σ²)) | The width σ² is specified a priori
Two-layer neural net | tanh(β0 xᵀxi + β1) | Actually works only for some values of β0 and β1
25. Kernels generalize the notion of 'inner product similarity'
Note that one can define kernels over more than just vectors: strings, trees, structures, … in fact, just about anything.
A very powerful idea: used in comparing DNA, protein structure, sentence structures, etc.

Examples for Non-Linear SVMs 2 – Gaussian Kernel
[Figure: decision boundaries obtained with a Gaussian kernel vs. a linear kernel.]
27. Overfitting by SVM
Every point is a support vector… too much freedom to bend to fit the training data – no generalization.
In fact, SVMs have an 'automatic' way to avoid such issues, but we won't cover it here… see the book by Vapnik, 1995. (We add a penalty function for mistakes made after training by over-fitting: recall that if one over-fits, then one will tend to make errors on new data. This penalty function can be put into the quadratic programming problem directly. You don't need to know this for this course.)
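For readers who want to see that penalty in practice: the box constraint 0 ≤ ai ≤ C shown earlier corresponds to the C parameter of off-the-shelf SVM libraries. The sketch below (using scikit-learn on synthetic data; the specific values of C and gamma are arbitrary) simply reports how the set of support vectors changes with C, small C tolerating margin violations and large C bending the boundary to fit the training data.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(+1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# C is the soft-margin penalty: the box constraint 0 <= a_i <= C from the dual
for C in (0.01, 1.0, 100.0):
    model = SVC(C=C, kernel="rbf", gamma=1.0).fit(X, y)
    print(f"C={C}: {len(model.support_)} support vectors")
```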