
Explanation of how a radial basis function works in support vector machines

I am having trouble grasping exactly how an SVM works when using an RBF kernel. My knowledge of mathematics is OK, but so far every explanation I have come across is too terse for me. My current understanding is as follows. Let's assume I'm using an SVM as a binary classifier for a dataset that is not linearly separable (so an RBF is the correct choice?). When the SVM is trained it will plot a hyperplane (which I think is like a plane in 3D but with more dimensions?) that best separates the data.

When tuning, changing the value of gamma changes the surface of the hyperplane (also called the decision boundary?).

This is where I start getting properly confused...

So an increase in the value of gamma results in a narrower Gaussian. Is this like saying that the bumps on the plane (if plotted in 3D) are allowed to be narrower to fit the training data better? Or in 2D is this like saying gamma defines how bendy the line that separates the data can be?

I'm also very confused about how this can lead to an infinite-dimensional representation from a finite number of features. Any good analogies would help me greatly.

(so an RBF is the correct choice?)

It depends. RBF is a very simple, generic kernel that is often used, but there are dozens of others. Take a look, for example, at the ones included in pykernels: https://github.com/gmum/pykernels
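For instance, with scikit-learn (my own sketch; the answer itself only points to pykernels) the kernel is just a constructor argument, so you can compare RBF against other stock kernels on a toy non-linearly-separable dataset:

# Minimal sketch (assumes scikit-learn; dataset and parameters are arbitrary choices).
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# A 2D dataset that no straight line can separate.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, gamma="scale").fit(X, y)
    print(kernel, clf.score(X, y))  # expect the RBF kernel to fit this data best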

When the SVM is trained it will plot a hyperplane (which I think is like a plane in 3D but with more dimensions?) that best separates the data.

Let's avoid some confusion: nothing is plotted here. The SVM will look for a hyperplane defined by v (a normal vector) and b (a bias, the distance from the origin), which is simply the set of points x such that <v, x> = b. In 2D a hyperplane is a line, in 3D it is a plane; in d+1 dimensions it is a d-dimensional object, always one dimension lower than the space it lives in (a line is 1D, a plane is 2D).
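As a tiny illustration (my own sketch, with made-up values for v and b): classifying a point against such a hyperplane just means checking which side of the set {x : <v, x> = b} it falls on.

import numpy as np

v = np.array([1.0, -2.0, 0.5])  # normal vector of the hyperplane (made-up values)
b = 0.3                         # bias / offset from the origin

def side(x):
    # +1 on one side of the hyperplane, -1 on the other, 0 exactly on it
    return np.sign(v @ x - b)

print(side(np.array([1.0, 0.0, 0.0])))  # +1, since <v, x> = 1.0 > 0.3
print(side(np.array([0.0, 1.0, 0.0])))  # -1, since <v, x> = -2.0 < 0.3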

When tuning, changing the value of gamma changes the surface of the hyperplane (also called the decision boundary?).

Now this is a common mistake. The decision boundary is not the hyperplane. The decision boundary is the projection of the hyperplane onto the input space. You cannot observe the actual hyperplane, as it is often of very high dimension; you can express it as a functional equation, but nothing more. The decision boundary, on the other hand, "lives" in your input space, and if the input is low-dimensional you can even plot this object. But it is not a hyperplane, it is just the way the hyperplane intersects with your input space. This is why the decision boundary is often curved or even discontinuous even though the hyperplane is always linear and continuous - you are just seeing a nonlinear section through it.

Now what is gamma doing? The RBF kernel leads to optimization in the space of continuous functions. There are plenty of these (there is a continuum of such objects). However, the SVM can express only a tiny fraction of them - linear combinations of kernel values at the training points. Fixing a particular gamma limits the set of functions to consider: the bigger the gamma, the narrower the kernels, thus the functions being considered consist of linear combinations of such "spiky" bumps. So gamma itself does not change a surface; it changes the space of considered hypotheses.
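To see that effect numerically, here is a hedged sketch (my own, using scikit-learn's SVC on an arbitrary toy dataset): as gamma grows, the hypothesis space gets "spikier", which shows up as a tighter fit to the training data and a growing number of support vectors.

# Sketch only: dataset, gamma values and C are arbitrary choices.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.15, factor=0.4, random_state=0)

for gamma in (0.1, 1.0, 100.0):
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print(f"gamma={gamma:>6}: train accuracy={clf.score(X, y):.3f}, "
          f"support vectors={len(clf.support_vectors_)}")
# Expect the training accuracy to climb toward 1.0 and the support-vector count
# to grow as gamma increases: the boundary ends up hugging individual points.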

So an increase in the value of gamma results in a narrower Gaussian. Is this like saying that the bumps on the plane (if plotted in 3D) are allowed to be narrower to fit the training data better? Or in 2D is this like saying gamma defines how bendy the line that separates the data can be?

I think I answered this with the previous point - a high gamma means that you only consider hyperplanes of the form

<v, x> - b = SUM_i alpha_i K_gamma(x_i, x) - b

where K_gamma(x_i, x) = exp(-gamma ||x_i - x||^2), thus you will get very "spiky" basis elements. This will lead to a very tight fit to your training data. The exact shape of the decision boundary is hard to predict, as it depends on the optimal Lagrange multipliers alpha_i selected during training.
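To tie this formula to something you can run, here is a sketch (my own, assuming scikit-learn's SVC) that evaluates SUM_i alpha_i K_gamma(x_i, x) plus the bias by hand from the fitted support vectors and checks it against the library's own decision_function:

import numpy as np
from sklearn.svm import SVC

# Arbitrary toy data: the label depends non-linearly on the inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

# Evaluate the sum by hand: clf.dual_coef_ holds the alpha_i with the labels
# already folded in, clf.support_vectors_ holds the x_i, and clf.intercept_
# plays the role of the bias term.
x_new = np.array([0.3, -0.2])
K = np.exp(-gamma * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
manual = clf.dual_coef_[0] @ K + clf.intercept_[0]

print(manual, clf.decision_function(x_new[None, :])[0])  # the two values agree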

I'm also very confused about how this can lead to an infinite-dimensional representation from a finite number of features. Any good analogies would help me greatly.

The "infinite representation" comes from the fact, that in order to work with vectors and hyperplanes, each of your point is actually mapped to a continuous function . So SVM, internally, is not really working with d-dimensional points anymore, it is working with functions. Consider 2d case, you have points [0,0] and [1,1]. This is a simple 2d problem. When you apply SVM with rbf kernel here - you will instead work with an unnormalized gaussian distribution centered in [0, 0] and another one in [1,1]. Each such gaussian is a function from R^2 to R, which expresses its probability density function (pdf). It is a bit confusing because kernel looks like a gaussian too, but this is only because dot product of two functions is usually defined as an integral of their product, and integral of product of two gaussians is .... a gaussian too! So where is this infinity? Remember that you are supposed to work with vectors. How to write down a function as a vector? You would have to list all its values, thus if you have a function f(x) = 1/sqrt(2*pi(sigma^2) exp(-||xm||^2 / (2*sigma^2)) you will have to list infinite number of such values to fully define it. And this is this concept of infinite dimension - you are mapping points to functions, functions are infinite dimensional in terms of vector spaces, thus your representation is infinitely dimensional.

One good example might be a different mapping. Consider a 1D dataset of the numbers 1,2,3,4,5,6,7,8,9,10. Let's assign the odd numbers a different label than the even ones. You cannot linearly separate these. But you can instead map each point (number) to a kind of characteristic function, a function of the form

f_x(y) = 1 iff x ∈ [y - 0.5, y + 0.5], and 0 otherwise

Now, in the space of all such functions, I can easily linearly separate the ones created from odd x's from the rest, simply by building a hyperplane with the equation

<v, x> = SUM_[v_odd] <f_[v_odd](y), f_x(y)> = INTEGRAL (f_v * f_x) (y) dy

And this will equal 1 iff x is odd, as only then will the integral be non-zero. Obviously I am only using a finite number of training points (the odd v's here), but the representation itself is infinite dimensional. Where is this additional "information" coming from? From my assumptions - the way I defined the mapping introduces a particular structure in the space I am considering. Similarly with RBF - you get an infinite dimension, but it does not mean you are actually considering every continuous function; you are limiting yourself to linear combinations of Gaussians centered at the training points. Similarly, you could use a sinusoidal kernel, which limits you to combinations of sinusoidal functions. The choice of a particular "best" kernel is a whole other story, complex and without clear answers. Hope this helps a bit.
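To make the toy example above concrete, here is a small numerical sketch (my own; the grid discretisation of the integral is an arbitrary choice):

import numpy as np

grid = np.linspace(0.0, 11.0, 11_001)
dy = grid[1] - grid[0]

def f(x):
    # characteristic function of the interval [x - 0.5, x + 0.5]
    return ((grid >= x - 0.5) & (grid <= x + 0.5)).astype(float)

odd = [1, 3, 5, 7, 9]
v = sum(f(x) for x in odd)  # the "normal vector" (a function) built from the odd training points

for x in range(1, 11):
    score = np.sum(v * f(x)) * dy  # <v, f_x> as a discretised integral
    print(x, round(score, 2))      # ~1.0 for odd x, ~0.0 for even x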
