May 12, 2024

Machine Learning: Understanding Support Vector Machines (SVM)

In practice we usually just use the ready-made libsvm library directly, and although the principles behind SVM are hard to grasp, it is still worth trying to understand them: to know not only how, but why. This article is mainly a conceptual introduction to SVM.
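For the "just use the library" route, a minimal sketch with scikit-learn (whose SVC class wraps libsvm internally) might look like the following; the toy data is generated purely for illustration.

```python
# Minimal "use the ready-made library" workflow.
# scikit-learn's SVC is a wrapper around libsvm; the data here is made up.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)  # toy 2-class data
clf = SVC(kernel="linear")   # linear SVM, solved internally by libsvm
clf.fit(X, y)
print(clf.predict(X[:5]))    # predicted labels for the first five samples
```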

Basic concepts

The Support Vector Machine (SVM) was first proposed by Corinna Cortes and Vladimir Vapnik in 1995. It shows unique advantages in small-sample, nonlinear, and high-dimensional pattern recognition, and it can also be applied to other machine learning problems such as function fitting.

The name "support vector machine" can be read literally: "machine" denotes an algorithm and can be understood here as a classifier, and the classifier is built from the so-called support vectors.

What is a support vector? In the course of solving the problem, it turns out that the classifier can be determined from only a subset of the data; those samples are called support vectors.

See the figure below. In a two-dimensional setting, the points R, S, G and the other points near the middle black line can be regarded as support vectors: they alone determine the specific parameters of the classifier, that is, the black line.
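This can be checked empirically: a fitted linear SVM exposes its support vectors, and refitting on the support vectors alone recovers essentially the same line. A small sketch, assuming scikit-learn and made-up data:

```python
# Sketch: the fitted classifier exposes its support vectors, and refitting
# on the support vectors alone reproduces (essentially) the same line.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1e6).fit(X, y)    # large C approximates a hard margin

sv_idx = clf.support_                          # indices of the support vectors
print(f"{len(sv_idx)} of {len(X)} points are support vectors")

# Refit using only the support vectors: w and b barely change.
clf2 = SVC(kernel="linear", C=1e6).fit(X[sv_idx], y[sv_idx])
print(clf.coef_, clf.intercept_)
print(clf2.coef_, clf2.intercept_)
```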

Support vector machines are similar to neural networks in that both are learning machines, but unlike neural networks, SVM is grounded in explicit mathematical analysis and optimization techniques.

To understand SVM better, the discussion below starts from logistic regression and derives the SVM from it, which not only reveals the connection between the two models but also makes the transition more natural.

Revisiting logistic regression

The purpose of logistic regression is to learn a 0/1 classification model from the features. This model takes a linear combination of the features as its argument; since the linear combination ranges from negative infinity to positive infinity, it is mapped into (0, 1) by the logistic (sigmoid) function, and the mapped value is interpreted as the probability that y = 1.

Formally, the hypothesis function is

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

where x is the n-dimensional feature vector and the function g is the logistic function:

$$g(z) = \frac{1}{1 + e^{-z}}$$

[Figure: the S-shaped graph of g(z)]

As the graph shows, the whole real line (−∞, +∞) is mapped into the interval (0, 1).

The hypothesis function gives the probability that the features belong to the class y = 1:

$$P(y=1 \mid x; \theta) = h_\theta(x), \qquad P(y=0 \mid x; \theta) = 1 - h_\theta(x)$$

When we need to decide which class a new sample x belongs to, we only need to compute hθ(x): if it is greater than 0.5, the sample belongs to the class y = 1, and otherwise to y = 0.
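As a quick sketch, the hypothesis and the 0.5 decision rule are only a few lines of Python; the values of θ and x below are made up for illustration:

```python
import numpy as np

def g(z):
    """Logistic (sigmoid) function: maps (-inf, inf) into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x): probability that y = 1."""
    return g(theta @ x)

theta = np.array([1.0, -2.0, 0.5])   # made-up parameters
x = np.array([1.0, 0.3, 2.0])        # made-up feature vector
p = h(theta, x)
y_hat = 1 if p > 0.5 else 0          # decide y = 1 exactly when theta^T x > 0
print(p, y_hat)
```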

Looking at hθ(x) again, notice that it depends on x only through θᵀx: hθ(x) > 0.5 exactly when θᵀx > 0. The function g(z) is merely a mapping; the real class decision still lies in θᵀx. Moreover, when θᵀx ≫ 0, hθ(x) ≈ 1, and when θᵀx ≪ 0, hθ(x) ≈ 0. If we start from θᵀx directly, the model's goal is nothing more than making θᵀx ≫ 0 for the training examples with y = 1 and θᵀx ≪ 0 for those with y = 0. Logistic regression learns θ so that the positive examples score far above 0 and the negative examples far below 0, and it pursues this goal on all training examples.

The graphical representation is as follows:

[Figure: two classes of points (× and ○) separated by a middle line θᵀx = 0, with three × points A, B, and C at decreasing distances from the line]

The middle line is θᵀx = 0, and logistic regression pushes all points to be as far as possible from this line; the learned result is that line. Consider the three points A, B, and C above. From the figure we can say with certainty that A belongs to the × class; C we are not sure about; B is somewhere in between. From this we can conclude that we should care most about the points close to the dividing line and push those as far from it as possible, rather than worrying about all points. Caring about all points would let some points drift closer to the line in exchange for pushing other, already distant points even farther away. I think this is the difference between the support vector machine idea and logistic regression: one looks at the local picture (ignoring points already known to be far away), the other at the global picture (points that are already far away may still push the line around as it is adjusted). This is my personal intuitive understanding.

Formal representation: we change the labels from y ∈ {0, 1} to y ∈ {−1, 1}, separate out the intercept b from θ, and write the classifier as

$$h_{w,b}(x) = g(w^T x + b), \qquad g(z) = \begin{cases} 1, & z \ge 0 \\ -1, & z < 0 \end{cases}$$
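To make the ±1 convention concrete, here is a tiny sketch of this sign-based classifier; the weight vector and intercept below are made up for illustration:

```python
import numpy as np

def g(z):
    """SVM-style decision function: +1 if z >= 0, otherwise -1."""
    return 1 if z >= 0 else -1

def h(w, b, x):
    """Classifier h_{w,b}(x) = g(w^T x + b), outputting a label in {-1, 1}."""
    return g(w @ x + b)

w = np.array([2.0, -1.0])                # made-up weight vector
b = 0.5                                  # made-up intercept
print(h(w, b, np.array([1.0, 1.0])))     # -> 1  (w^T x + b = 1.5 >= 0)
print(h(w, b, np.array([-1.0, 1.0])))    # -> -1 (w^T x + b = -2.5 < 0)
```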

Functional margin and geometric margin

For the i-th training example (x⁽ⁱ⁾, y⁽ⁱ⁾), its functional margin with respect to (w, b) is defined as

$$\hat{\gamma}^{(i)} = y^{(i)} \left( w^T x^{(i)} + b \right)$$

When y⁽ⁱ⁾ = 1, a confident, correct prediction means that wᵀx⁽ⁱ⁾ + b is a large positive number; when y⁽ⁱ⁾ = −1, a large negative one.

The functional margin just defined is for a single sample; we now define the functional margin over the whole training set:

$$\hat{\gamma} = \min_{i=1,\dots,m} \hat{\gamma}^{(i)}$$

Plainly, it is the smallest functional margin over all the positive and negative examples in the training set.

Next we define the geometric margin. First look at the figure:

[Figure: the separating hyperplane wᵀx + b = 0 with normal vector w; a point A = x⁽ⁱ⁾ lies off the plane, and B is its projection onto the plane, so the length of AB is the geometric margin γ⁽ⁱ⁾]

Since w is normal to the hyperplane, the unit normal is w/||w||, and the projection of A onto the plane is B = x⁽ⁱ⁾ − γ⁽ⁱ⁾ w/||w||. Because B lies on the hyperplane it satisfies wᵀB + b = 0; substituting and solving for γ⁽ⁱ⁾ gives

$$\gamma^{(i)} = \frac{w^T x^{(i)} + b}{\|w\|}$$

To handle points on either side of the plane, the geometric margin of (x⁽ⁱ⁾, y⁽ⁱ⁾) is defined with the label folded in:

$$\gamma^{(i)} = y^{(i)} \left( \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|} \right)$$

Note that when ||w|| = 1 the geometric margin coincides with the functional margin, and in general the geometric margin is the functional margin divided by ||w||. The geometric margin over the whole training set is again the minimum over all samples.
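The two margin definitions are easy to compute directly. Below is a small sketch with made-up separable data and a made-up separating line; the function names are just for illustration:

```python
import numpy as np

def functional_margin(w, b, X, y):
    """Per-sample functional margins y_i (w^T x_i + b); the training-set
    functional margin is their minimum."""
    return y * (X @ w + b)

def geometric_margin(w, b, X, y):
    """Geometric margin = functional margin / ||w||, i.e. signed distance."""
    return functional_margin(w, b, X, y) / np.linalg.norm(w)

# Made-up separable data and a made-up separating line.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

print(functional_margin(w, b, X, y))        # per-sample functional margins
print(geometric_margin(w, b, X, y).min())   # geometric margin of the whole set
```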

Optimal margin classifier

Recall that our goal is to find a hyperplane such that the points closest to it have as large a margin as possible. That is, we do not insist that all points be far from the hyperplane; we only care that the hyperplane maximizes the distance to the points closest to it. Intuitively, regard the figure above as a piece of paper: we want to find a line such that, if we fold the paper along it, the points closest to the crease are farther from it than with any other fold. Formally:

$$\max_{\gamma, w, b} \ \gamma \qquad \text{s.t.} \quad y^{(i)} \left( w^T x^{(i)} + b \right) \ge \gamma, \ i = 1, \dots, m; \qquad \|w\| = 1$$

Here the constraint ||w|| = 1 normalizes w so that wᵀx + b is exactly the geometric margin.

At this point the model is defined: once we obtain w and b, we can classify any new feature vector x. This is called the optimal margin classifier. The next question is how to solve for w and b.

Since the constraint ||w|| = 1 is not convex, we first transform the problem. The relationship between the geometric margin and the functional margin is:

$$\gamma = \frac{\hat{\gamma}}{\|w\|}$$

Using it, we rewrite the problem as:

$$\max_{\hat{\gamma}, w, b} \ \frac{\hat{\gamma}}{\|w\|} \qquad \text{s.t.} \quad y^{(i)} \left( w^T x^{(i)} + b \right) \ge \hat{\gamma}, \ i = 1, \dots, m$$

The quantity being maximized is still the geometric margin; the difference is that w is no longer constrained by ||w|| = 1. However, the objective function is still not convex and cannot be handed directly to optimization software, so we must rewrite it once more. As mentioned earlier, scaling w and b simultaneously has no effect on the classifier, and what we ultimately want are particular values of w and b, not the whole family of their multiples.

We are therefore free to rescale w and b so that the functional margin is fixed at γ̂ = 1. Maximizing 1/||w|| is then equivalent to minimizing ||w||² (the factor of ½ below is for later convenience), so the problem becomes

$$\min_{w, b} \ \frac{1}{2} \|w\|^2 \qquad \text{s.t.} \quad y^{(i)} \left( w^T x^{(i)} + b \right) \ge 1, \ i = 1, \dots, m$$

This is much better: the constraints are all linear and the objective is a quadratic function of the variables, so this is a typical quadratic programming problem that standard optimization software can solve. Notice that although we did not start by drawing the separating hyperplane and reading the margin off a picture, where everything is intuitive, each step here is well-founded: the objective function and constraints were derived through a coherent chain of reasoning.
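To see that the final form really is solvable by off-the-shelf software, here is a hedged sketch that feeds the primal QP to scipy.optimize.minimize on made-up separable data. Real solvers such as libsvm work on the dual problem instead; this only illustrates that the formulation above is directly optimizable:

```python
# Sketch: solve the primal QP
#   min 1/2 ||w||^2  s.t.  y_i (w^T x_i + b) >= 1
# with a general-purpose solver (SLSQP is chosen automatically
# when constraints are present).
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])  # made-up
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(wb):
    w = wb[:-1]                       # last entry of wb is the intercept b
    return 0.5 * w @ w

constraints = [
    {"type": "ineq",
     "fun": lambda wb, i=i: y[i] * (X[i] @ wb[:-1] + wb[-1]) - 1.0}
    for i in range(len(X))
]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print("w =", w, "b =", b, "margin =", 1.0 / np.linalg.norm(w))
```

On separable data like this, the recovered w and b define the maximum-margin line, and 1/||w|| is its geometric margin.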
