x <- matrix(rnorm(20 * 2), ncol = 2)   # 20 observations on 2 predictors
y <- c(rep(-1, 10), rep(1, 10))        # two classes, labelled -1 and 1
x[y == 1, ] <- x[y == 1, ] + 1         # shift the second class so the groups partially separate
plot(x, col = (3 - y))                 # colour the points by class
The Maximal Margin Classifier separates the data with a hyperplane: given a vector of predictor values, an observation is classified according to which side of the hyperplane it falls on.
In a \(p\)-dimensional space, a hyperplane is a flat affine subspace of dimension \(p - 1\). It is defined mathematically as:
\[ \beta_0 + \beta_1X_1 + \beta_2 X_2 + \cdots+\beta_pX_p = 0 \]
The hyperplane is constructed by maximizing the margin \(M\), the distance from the hyperplane to the training observations that lie closest to it. The data points that sit on the edge of the margin, and therefore determine where it falls, are known as support vectors.
\[
\begin{aligned}
&\underset{\beta_0, \beta_1, \ldots, \beta_p,\, M}{\text{maximize}} \quad M \\
&\text{subject to} \quad \sum^p_{j=1}\beta_j^2 = 1, \\
&\phantom{\text{subject to}} \quad y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \geq M \quad \forall \ i = 1, \ldots, n
\end{aligned}
\]
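As a minimal sketch (not part of the original lab), a fit that behaves like the maximal margin classifier can be obtained from e1071::svm() by using a linear kernel with a very large cost, assuming the x and y simulated above plus an extra shift so the two classes are (almost certainly) separable:

library(e1071)                                      # svm() and tune() used throughout this section

x_sep <- x
x_sep[y == 1, ] <- x_sep[y == 1, ] + 3              # extra shift: pushes the classes apart
dat_sep <- data.frame(x = x_sep, y = as.factor(y))

# A very large cost leaves essentially no room for margin violations,
# so the fitted boundary approximates the maximal margin hyperplane.
mmfit <- svm(y ~ ., data = dat_sep, kernel = "linear", cost = 1e5, scale = FALSE)
plot(mmfit, dat_sep)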
Maximal Margin Classifiers have one fatal defect: every data point must lie strictly on the correct side of the margin, which means the classes have to be perfectly separable. This leaves no room for error.
A Support Vector Classifier allows data points to violate the margin, or even be misclassified, if need be.
It achieves this by introducing a slack variable \(\epsilon_i\) for each observation, together with a budget \(C\) that limits the total amount of error.
\[
\begin{aligned}
&\underset{\beta_0, \beta_1, \ldots, \beta_p,\, \epsilon_1, \ldots, \epsilon_n,\, M}{\text{maximize}} \quad M \\
&\text{subject to} \quad \sum^p_{j=1}\beta_j^2 = 1, \\
&\phantom{\text{subject to}} \quad y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \geq M(1 - \epsilon_i) \quad \forall \ i = 1, \ldots, n, \\
&\phantom{\text{subject to}} \quad \epsilon_i \geq 0, \quad \sum^n_{i=1} \epsilon_i \leq C
\end{aligned}
\]
The tuning parameter \(C\) acts as a budget for error. When an observation lies on the correct side of the margin, its slack is \(\epsilon_i = 0\). When it lies on the wrong side of the margin but still on the correct side of the hyperplane, \(0 < \epsilon_i < 1\). When it lies on the wrong side of the hyperplane, \(\epsilon_i > 1\). Any such violations are allowed as long as the sum of the errors is less than or equal to \(C\).
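A quick sketch of how the budget shows up in practice, assuming e1071 is loaded as above. Note that the cost argument of svm() penalizes margin violations, so it works inversely to the budget \(C\) described here: a large cost corresponds to a small budget and yields a narrow margin with few support vectors.

dat <- data.frame(x = x, y = as.factor(y))          # the simulated data from the top of the section

svc_small_cost <- svm(y ~ ., data = dat, kernel = "linear", cost = 0.1, scale = FALSE)
svc_large_cost <- svm(y ~ ., data = dat, kernel = "linear", cost = 10,  scale = FALSE)

svc_small_cost$tot.nSV   # more support vectors: wide margin, many violations tolerated
svc_large_cost$tot.nSV   # fewer support vectors: narrow margin, violations heavily penalized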
A Support Vector Machine extends the support vector classifier by allowing a nonlinear decision boundary instead of a straight line (or hyperplane).
It does so by incorporating a kernel function that computes the similarity between two observations.
The choice of kernel can loosely be thought of as specifying how that similarity, and hence the shape of the decision boundary, is modeled. Common kernels include:

- Linear
- Polynomial
- Radial
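As a rough illustration (a sketch, not part of the original lab), the kernel is chosen through the kernel argument of e1071::svm(); the degree and gamma values below are arbitrary and the dat data frame is the one built earlier:

svm_poly <- svm(y ~ ., data = dat, kernel = "polynomial", degree = 3, cost = 1)   # polynomial boundary
svm_rad  <- svm(y ~ ., data = dat, kernel = "radial", gamma = 1, cost = 1)        # radial boundary
plot(svm_rad, dat)                                                                # visualize the nonlinear decision region

The output below comes from cross-validating the cost of a linear-kernel fit. A plausible sketch of the calls that produce output of this form is shown here (the tune.out name is hypothetical, and the exact numbers depend on the random seed):

tune.out <- tune(svm, y ~ ., data = dat, kernel = "linear",
                 ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
summary(tune.out)              # 10-fold CV error for each candidate cost
summary(tune.out$best.model)   # the model refit with the best cost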
#>
#> Parameter tuning of 'svm':
#>
#> - sampling method: 10-fold cross validation
#>
#> - best parameters:
#> cost
#> 0.1
#>
#> - best performance: 0.3
#>
#> - Detailed performance results:
#> cost error dispersion
#> 1 1e-03 0.55 0.4377975
#> 2 1e-02 0.55 0.4377975
#> 3 1e-01 0.30 0.2581989
#> 4 1e+00 0.40 0.3162278
#> 5 5e+00 0.40 0.3162278
#> 6 1e+01 0.35 0.3374743
#> 7 1e+02 0.35 0.3374743
#>
#> Call:
#> best.tune(METHOD = svm, train.x = y ~ ., data = dat, ranges = list(cost = c(0.001,
#> 0.01, 0.1, 1, 5, 10, 100)), kernel = "linear")
#>
#>
#> Parameters:
#> SVM-Type: C-classification
#> SVM-Kernel: linear
#> cost: 0.1
#>
#> Number of Support Vectors: 16
#>
#> ( 8 8 )
#>
#>
#> Number of Classes: 2
#>
#> Levels:
#> -1 1