Support Vector Machines Explained


What is a Support Vector Machine?
In machine learning, support vector machines (SVMs, also called support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.

An SVM performs classification by finding the hyperplane that maximizes the margin between the two classes. The points closest to the hyperplane, which determine where it sits, are called "support vectors".
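As a minimal sketch of this idea, assuming scikit-learn is available, the snippet below fits a linear SVM on two small made-up clusters and classifies a couple of new points:

```python
# A minimal sketch of a linear SVM, assuming scikit-learn is installed.
# The two small clusters below are made up purely for illustration.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 1.0],    # class 0
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel searches for the maximum-margin separating hyperplane.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.predict([[3.0, 2.0], [7.0, 6.0]]))  # expected: [0 1]
```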

Why is it important to find the optimal hyperplane?
There can be many possible hyperplanes that separate the different classes.
If we compare the 3 hyperplanes in the graph, hyperplanes 1 & 2 are very close to the points and the margin/gap between the classes is very small. In that case, points can easily be misclassified. The optimal hyperplane, on the other hand, is positioned right in the middle, with a larger margin from the support vectors on both sides.
It is important to have the largest possible margin/gap when separating different groups of data.

NOTE: Here, only the support vectors matter, as these points decide where the hyperplane sits. The other points can be ignored.
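Continuing the sketch above, a fitted scikit-learn SVC exposes exactly which training points ended up as support vectors:

```python
# Reusing `clf` from the sketch above: these attributes show the support
# vectors, the only points that pin down where the hyperplane sits.
print(clf.support_vectors_)   # coordinates of the support vectors
print(clf.n_support_)         # number of support vectors per class
```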


Non-Linear Support Vector Machines
Let's look at the example below:
Say we are presented with data that lies on a straight line (1-dimensional) and it is impossible to separate the 2 classes. We can apply a function that maps our data into a higher-dimensional space (2-dimensional), where it becomes easy to draw a separating hyperplane.

Let's look at another example:

The data is inseparable in 2-dimensional space, whereas when it is projected into a 3-dimensional feature space, there is a clear separation between the groups.
And thus, the 2 groups can be separated as shown below.
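As a rough illustration of this projection, with made-up data and an illustration-only feature map (x1, x2) -> (x1, x2, x1^2 + x2^2), two concentric circles that no line can separate in 2-D become separable by a flat plane once a third coordinate is added:

```python
# Illustration of the projection described above, with made-up data and an
# illustration-only feature map (x1, x2) -> (x1, x2, x1^2 + x2^2).
import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=50)

inner = np.c_[1.0 * np.cos(angles), 1.0 * np.sin(angles)]   # class 0, radius 1
outer = np.c_[3.0 * np.cos(angles), 3.0 * np.sin(angles)]   # class 1, radius 3

def lift(points):
    """Map 2-D points into 3-D by appending the squared distance from the origin."""
    return np.c_[points, (points ** 2).sum(axis=1)]

# In the lifted space class 0 sits at z = 1 and class 1 at z = 9,
# so the flat plane z = 5 separates the two groups perfectly.
print(lift(inner)[:, 2].max())   # ~1.0
print(lift(outer)[:, 2].min())   # ~9.0
```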

Problem:
The only problem with transforming data into a higher-dimensional feature space is that doing the mapping explicitly is computationally very expensive.

Solution:
The "KERNEL TRICK", which reduces the computational cost.

A function that takes input vectors in the original space and returns the dot product of those vectors in the feature space is called a "kernel function".
Using a kernel function, we can compute these dot products directly, as if every point had been mapped to the higher-dimensional space via some transformation, without ever performing the mapping explicitly.
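A small numeric check of this claim, using the degree-2 polynomial kernel K(x, z) = (x . z)^2 and its illustration-only explicit feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2): the kernel returns the same value as the dot product in feature space without ever constructing phi(x):

```python
# Numeric check of the statement above, using the degree-2 polynomial kernel
# K(x, z) = (x . z)^2 and its illustration-only explicit feature map
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
import numpy as np

def phi(v):
    """Explicit feature map into 3-D space (built here only to verify)."""
    return np.array([v[0] ** 2, np.sqrt(2.0) * v[0] * v[1], v[1] ** 2])

def poly_kernel(u, v):
    """Kernel evaluated directly on the original 2-D vectors."""
    return np.dot(u, v) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(np.dot(phi(x), phi(z)))   # dot product in feature space: 16.0
print(poly_kernel(x, z))        # same value, no explicit mapping: 16.0
```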

Advantages:
1. Effective in high-dimensional spaces.
2. Different kernels can be used for different decision functions.
3. Kernel functions can be added together to achieve more complex hyperplanes (see the sketch after this list).
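One hedged way to do this in practice, assuming scikit-learn: because a sum of valid kernels is itself a valid kernel, a callable that adds an RBF kernel and a polynomial kernel can be passed directly to SVC (the data and parameter values below are arbitrary):

```python
# A sketch of combining kernels, assuming scikit-learn. Because a sum of
# valid kernels is itself a valid kernel, a callable that adds an RBF kernel
# and a polynomial kernel can be passed directly to SVC.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

def combined_kernel(A, B):
    """Gram matrix of an RBF kernel plus a degree-2 polynomial kernel."""
    return rbf_kernel(A, B, gamma=0.5) + polynomial_kernel(A, B, degree=2)

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel=combined_kernel).fit(X, y)
print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))  # expected: [0 1]
```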

Disadvantages:
1. If the number of features is much larger than the number of samples, performance can be poor.
2. SVMs do not directly provide probability estimates.

