This article discusses the role of hinge loss in hard margin and soft margin SVM classifiers, the optimization process behind them, and the kernel trick.

**Support Vector Machine(SVM)**

Support Vector Machine (SVM) is a supervised machine learning algorithm for classification and regression. Let us use the binary classification case to understand hinge loss. In binary classification, the objective of SVM is to construct a hyperplane that divides the input data so that all points belonging to class 1 lie on one side of the plane and all points belonging to class -1 lie on the other side.

The objective function of the SVM classifier is (for derivation please refer to the link of the SVM article shared above)

Minimize: [Tex]\frac{1}{2}* {||w||}^2[/Tex]

Subject to: [Tex] yᵢ * (w·xᵢ + b) [/Tex] ≥ 1 for all training examples (xᵢ)

Here,

- “w” represents the weight vector of the hyperplane, “b” is the bias term,
- “xᵢ” are the training data points, and “yᵢ” is the class label of the data point (either -1 or 1 for a binary classification problem).
- The objective is to minimize the L2-norm of the weight vector, which corresponds to maximizing the margin between the two classes.

The inequality constraint [Tex](yᵢ * (w·xᵢ + b) ≥ 1)[/Tex] ensures that all data points are correctly classified, and those closest to the decision boundary (support vectors) satisfy the margin constraint.

There are two types of SVM:

- **Hard Margin:** Here we aim to classify all the points correctly. We assume the data is linearly separable by a hyperplane.
- **Soft Margin:** In the real world, datasets are typically not linearly separable. In soft margin SVM we allow the model to misclassify some data points, and we penalize the model for such misclassifications.

We will study hard margin and soft margin SVM in detail later. Let us first understand hinge loss.

## Hinge Loss

Hinge loss is used in binary classification problems where the objective is to separate the data points in two classes typically labeled as +1 and -1.

Mathematically, Hinge loss for a data point can be represented as :

[Tex]L(y, f(x)) = \max(0, 1 - y \cdot f(x))[/Tex]

Here,

- y: the actual class (-1 or 1)
- f(x): the output of the classifier for the data point
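The formula above can be evaluated directly. The following is a minimal sketch using NumPy; the function name `hinge_loss` and the sample labels and scores are made up for illustration and are not part of any library.

```python
import numpy as np

def hinge_loss(y_true, scores):
    """Element-wise hinge loss max(0, 1 - y * f(x)) for labels in {-1, +1}."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    return np.maximum(0.0, 1.0 - y_true * scores)

# Illustrative labels and classifier outputs (made-up values).
labels = np.array([1, 1, -1, -1])
scores = np.array([2.0, 0.4, -3.0, 0.5])

losses = hinge_loss(labels, scores)
print(losses)         # per-example loss
print(losses.mean())  # average loss over the four examples
```

Only the second and fourth examples contribute to the loss: the second is on the correct side but has a score below 1, and the fourth is on the wrong side.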

Let's understand it with the help of the graph below.

Hinge Loss

**Case 1: Correct classification and |f(x)| ≥ 1**

In this case the product y·f(x) is positive and at least 1, so the value of 1 - y·f(x) is zero or negative. The loss max(0, 1 - y·f(x)) is therefore always zero. This is indicated by the green region in the graph above. There is no penalty, because the model classifies the data point correctly and with sufficient confidence.

**Case 2: Correct classification and |f(x)| < 1**

In this case the product y·f(x) is positive but less than 1, so the value of 1 - y·f(x) is positive and lies between 0 and 1. The loss is the value of 1 - y·f(x). This is indicated by the yellow region in the graph above. Although the model has classified the point correctly, we penalize it because it has not classified the point with enough confidence: the classification score is less than 1. We want the model to have a classification score of at least 1 for all the points.

**Case 3: Incorrect classification**

In this case y and f(x) have opposite signs, so the product y·f(x) is negative and the value of 1 - y·f(x) is positive and greater than 1. The loss max(0, 1 - y·f(x)) is therefore 1 - y·f(x), and it increases linearly with the magnitude of the score |f(x)|. This is indicated by the red region in the graph above.
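The three cases can be checked numerically. The sketch below is illustrative only: `hinge_case` is a hypothetical helper, and the (label, score) pairs are made-up values chosen to land in one case each. A point exactly on the decision boundary (product equal to 0) is grouped with case 3 here, which is a choice made for this sketch.

```python
def hinge_case(label, score):
    """Return (case description, loss) for one example with label in {-1, +1}."""
    margin = label * score              # the product of true label and classifier score
    loss = max(0.0, 1.0 - margin)
    if margin >= 1:
        case = "case 1: correct, outside the margin"
    elif margin > 0:
        case = "case 2: correct, inside the margin"
    else:
        case = "case 3: misclassified"
    return case, loss

for label, score in [(1, 2.5), (-1, -0.3), (1, -1.2)]:
    print(label, score, hinge_case(label, score))
```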

**Relationship Between Hinge Loss and SVM**

Let us understand the relationship between hinge loss and SVM mathematically.

### Hard Margin and Hinge Loss

A hard margin SVM is a type of SVM which aims to find a hyperplane that perfectly separates the two classes without any misclassification. It assumes that the data is linearly separable, and the objective is to maximize the margin while ensuring that all training data points are correctly classified.

So, mathematically speaking, for a hard margin we want the model to classify all the points in such a way that [Tex]y * (w.x +b) [/Tex] is at least 1 while minimizing the norm of the weight vector w. A good classifier, i.e. a good hyperplane, is one that gives a large positive value of [Tex]y*(w.x+b) [/Tex] for all the points. This encourages the SVM to find a hyperplane that not only separates the two classes but also maximizes the margin between them. Mathematically:

Minimize: [Tex]\frac{1}{2}* {||w||}^2[/Tex]

Subject to: [Tex] yᵢ * (w·xᵢ + b)[/Tex] ≥ 1 for all training examples (xᵢ)

If we look at the mathematical formulation, the *hinge loss is effectively present in the constraints* of the hard margin problem: requiring [Tex]yᵢ * (w·xᵢ + b) ≥ 1[/Tex] for every point is the same as requiring the hinge loss max(0, 1 - yᵢ * (w·xᵢ + b)) to be zero for every point. This ensures that the decision boundary (the hyperplane) is positioned so that it maximizes the margin without allowing any data point to fall inside the margin or on the wrong side of it.
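To make the link between the hard margin constraints and hinge loss concrete, here is a small sketch. The data points and the candidate hyperplane (`w`, `b`) are made-up values chosen by hand for illustration. It checks the constraints, sums the hinge loss, and computes the margin width 2/||w||.

```python
import numpy as np

# Toy linearly separable data (made-up values), labels in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])

# A candidate hyperplane, chosen by hand for this data.
w = np.array([0.5, 0.5])
b = 0.0

margins = y * (X @ w + b)                    # y_i * (w . x_i + b) for every point
feasible = bool(np.all(margins >= 1.0))      # do the hard margin constraints hold?
total_hinge = float(np.maximum(0.0, 1.0 - margins).sum())
margin_width = 2.0 / np.linalg.norm(w)       # distance between the two margin planes

print(margins, feasible, total_hinge, margin_width)
```

The constraints hold exactly when the total hinge loss is zero, which is the case for this hyperplane.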

Solving the hard margin problem now amounts to solving a constrained nonlinear programming (NLP) problem. To solve it, we need the Lagrange function and the KKT conditions.

**Lagrange Function and KKT Equation**

The Lagrange function incorporates the original objective function along with terms that account for the constraints. The Lagrange multipliers λ and μ are introduced to measure the impact of the constraints on the objective function. Let us briefly look at the generalized Lagrange function.

Given an objective function f(x) to be minimized or maximized, subject to a set of constraints [Tex]g_i(x) = 0[/Tex] (equality constraints) and [Tex]h_j(x) \le 0[/Tex] (inequality constraints), the Lagrange function L(x, λ, μ) is defined as:

[Tex]L(x, \lambda, \mu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \mu_j h_j(x)[/Tex]

Where:

- L(x, λ, μ) is the Lagrange function.
- x is the vector of decision variables.
- λ (lambda) is a vector of Lagrange multipliers associated with the equality constraints [Tex]g_i(x) = 0[/Tex].
- μ (mu) is a vector of Lagrange multipliers associated with the inequality constraints [Tex]h_j(x) \le 0[/Tex].

To find the solution to the constrained optimization problem, we set the gradient of the Lagrange function with respect to the decision variables (x) to zero. In addition, we require that the constraints hold, that the multipliers μ of the inequality constraints are non-negative, and that [Tex]\mu_j h_j(x) = 0[/Tex] for every inequality constraint (complementary slackness). Together these are known as the Karush-Kuhn-Tucker (KKT) conditions, and they help identify the optimal solution that satisfies the constraints.

The above formulation is called the primal form. When the Lagrange function cannot be solved easily in its primal form, we write the dual form and solve that instead.

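The dual form of the Lagrange function is obtained by converting it to a max-min form:

[Tex]\max_{\lambda,\ \mu \ge 0} D(\lambda, \mu), \quad D(\lambda, \mu) = \min_x L(x, \lambda, \mu)[/Tex]

where D(λ, μ) is the dual function. Only the multipliers μ of the inequality constraints are required to be non-negative.

In the dual form, the constraints g(x) and h(x) no longer appear explicitly; they are replaced by simple conditions on the multipliers. In SVM we will use the dual form.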
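A tiny example may help to see how the dual works. The sketch below uses a made-up one-dimensional problem that is not part of the SVM derivation: minimise x² subject to 1 - x ≤ 0. The inner minimisation over x is done by hand (setting 2x - μ = 0), and the outer maximisation over μ ≥ 0 is done on a grid.

```python
import numpy as np

# Toy problem: minimise f(x) = x^2 subject to h(x) = 1 - x <= 0 (i.e. x >= 1).
# Lagrange function: L(x, mu) = x^2 + mu * (1 - x), with mu >= 0.

def dual_function(mu):
    """D(mu) = min_x L(x, mu); the inner minimum is attained at x = mu / 2."""
    x_star = mu / 2.0
    return x_star**2 + mu * (1.0 - x_star)

mus = np.linspace(0.0, 4.0, 401)          # grid of candidate multipliers mu >= 0
dual_values = dual_function(mus)
best_mu = mus[np.argmax(dual_values)]     # maximise the dual over the grid
best_x = best_mu / 2.0                    # primal solution recovered from mu

print(best_mu, best_x, dual_values.max())
```

The dual maximum is attained at μ = 2, which recovers x = 1 and an objective value of 1, the same as solving the constrained problem directly.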

### Solving SVM using Lagrange and KKT

Using the above method the SVM problem can be formulated as :

Minimize: [Tex]L(w, b, \lambda) = \frac{1}{2} \|w\|^2 - \sum_i \lambda_i \left[ y_i (w \cdot x_i + b) - 1 \right][/Tex] (Eq. 1)

- [Tex]\lambda_i[/Tex] are the Lagrange multipliers.
- Here we have converted the greater-than-or-equal constraint to a less-than-or-equal one, i.e. [Tex]y_i (w \cdot x_i + b) \ge 1 \Rightarrow -\left[ y_i (w \cdot x_i + b) - 1 \right] \le 0[/Tex], to incorporate it into the Lagrange function.
- The summation is over all data points.

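Compute the partial derivatives of the Lagrange function (Eq. 1) with respect to w and b and set them to zero to find the critical points. For the multipliers [Tex]\lambda_i[/Tex], the KKT conditions give the complementary slackness condition in Eq. 3.

[Tex]\frac{\partial L}{\partial w} = w - \sum_i \lambda_i y_i x_i = 0[/Tex] (Eq. 2)

[Tex]\lambda_i \left[ y_i (w \cdot x_i + b) - 1 \right] = 0 \text{ for all } i[/Tex] (Eq. 3)

[Tex]\frac{\partial L}{\partial b} = -\sum_i \lambda_i y_i = 0[/Tex] (Eq. 4)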

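From Eq. 2 we get

[Tex]w = \sum_i \lambda_i y_i x_i[/Tex] (Eq. 5)

Substituting the value of w from Eq. 5 into Eq. 1:

[Tex]L(w, b, \lambda) = \frac{1}{2} \left( \sum_i \lambda_i y_i x_i \right) \cdot \left( \sum_j \lambda_j y_j x_j \right) - \sum_j \lambda_j \left[ y_j \left( \sum_i \lambda_i y_i \, x_i \cdot x_j + b \right) - 1 \right][/Tex]

[Tex]L(w, b, \lambda) = \frac{1}{2} \sum_i \sum_j \lambda_i y_i \lambda_j y_j \, x_i \cdot x_j - \sum_i \sum_j \lambda_i y_i \lambda_j y_j \, x_i \cdot x_j - b \sum_i \lambda_i y_i + \sum_j \lambda_j[/Tex] (Eq. 6)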

The term b∑iλiyᵢ = 0 if we consider Eq(4). Thus Eq (6) can be rearranged to

[Tex]L(w,b,λ) = ∑jλj – \frac{1}{2} * ∑i∑jλiyᵢλjyj xᵢ·xj [/Tex]

Using the dual form we can write:

[Tex]D(λ) = max(λ)min(w,b) L(w,b,λ)[/Tex]

Since we have obtained L(w, b, λ) in terms of λ only, we can simply write the dual as

[Tex]D(\lambda) = \max_{\lambda} \left[ \sum_j \lambda_j - \frac{1}{2} \sum_i \sum_j \lambda_i y_i \lambda_j y_j \, x_i \cdot x_j \right][/Tex] (Eq. 7)

subject to [Tex]\lambda_i \ge 0[/Tex] for all i and [Tex]\sum_i \lambda_i y_i = 0[/Tex].

Thus we have converted the original problem into one that depends only on the multipliers λ and has much simpler constraints. This is a special case of nonlinear programming known as quadratic programming. The maximization can be solved with the SMO (Sequential Minimal Optimization) algorithm, which was designed to solve the quadratic programming problem that arises when training an SVM.
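For a very small dataset the dual can be solved with a general-purpose optimizer. The sketch below is an illustration only: it uses SciPy's SLSQP solver rather than SMO, the data points are made up, and the multipliers are kept non-negative and constrained to satisfy the sum condition from Eq(4). The weight vector is then recovered from Eq(5).

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (made-up values), labels in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Q[i, j] = y_i * y_j * (x_i . x_j), the matrix in the quadratic term of the dual.
Q = (y[:, None] * X) @ (y[:, None] * X).T

def negative_dual(lam):
    """Negated dual objective, so that a minimiser can be used to maximise it."""
    return -(lam.sum() - 0.5 * lam @ Q @ lam)

result = minimize(
    negative_dual,
    x0=np.zeros(len(y)),
    method="SLSQP",
    bounds=[(0.0, None)] * len(y),                             # lambda_i >= 0
    constraints=[{"type": "eq", "fun": lambda lam: lam @ y}],  # sum_i lambda_i y_i = 0
)
lam = result.x

w = (lam * y) @ X          # w = sum_i lambda_i y_i x_i
sv = int(np.argmax(lam))   # index of one support vector (largest multiplier)
b = y[sv] - X[sv] @ w      # from y_sv * (w . x_sv + b) = 1

print(np.round(lam, 3), np.round(w, 3), round(float(b), 3))
```

Only the two points closest to the boundary end up with non-zero multipliers; the other two play no role in the solution.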

## Understanding the Kernel Trick

It may be that the data is not linearly separable in the given dimension but is linearly separable in a higher dimension. To find a hyperplane in the higher dimension, one approach would be to convert our data to the higher-dimensional space and proceed with the optimization process there.

However, calculating features in a higher dimension is computationally expensive. If we look closely at equation 7, we observe that all we need to solve our optimization problem is the dot product of each observation with every other observation, not the individual vectors. The kernel trick exploits this fact. We use a kernel, i.e. a function which takes two vectors as input and gives us the inner product of the transformed feature vectors in the higher-dimensional space without explicitly calculating the individual features!

Some commonly used kernels are

- Linear Kernel
- Polynomial Kernel
- RBF Kernel

The choice of kernel function depends on the nature of the data and the problem at hand. SVMs with different kernels can capture different types of decision boundaries. For example, the RBF kernel is commonly used for capturing complex, nonlinear decision boundaries, while the linear kernel is suitable for linearly separable data.
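The kernel trick can be verified on a small case. The sketch below assumes a homogeneous polynomial kernel of degree 2 on two-dimensional inputs, for which the explicit feature map is known in closed form; the names `phi` and `poly2_kernel` and the input vectors are illustrative.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

def poly2_kernel(u, v):
    """Homogeneous polynomial kernel of degree 2: k(u, v) = (u . v)^2."""
    return float(np.dot(u, v)) ** 2

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

explicit = float(np.dot(phi(u), phi(v)))  # inner product in the 3-D feature space
implicit = poly2_kernel(u, v)             # same value, computed in the 2-D input space

print(explicit, implicit)
```

Both routes give the same number, but the kernel never builds the three-dimensional feature vectors. For kernels such as RBF the corresponding feature space is infinite-dimensional, so the explicit route is not available at all.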

A hard margin is an idealized form of SVM which assumes perfect separability. However, most datasets are not perfectly linearly separable. This is where soft margin SVM comes in.

**Support Vector in SVM**

Most of the Lagrange multipliers will be zero, except for those of the data points which lie on the boundary of the margin. These points are called support vectors: they guide the optimization, since the hyperplane has to maximize its distance from them. Hence the name support vector machines.
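Support vectors can be inspected on a fitted model. The sketch below uses scikit-learn's `SVC` with a linear kernel on made-up data; using a very large `C` to approximate a hard margin is an assumption of this sketch.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made-up values), labels in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])

# A very large C approximates a hard margin (an assumption of this sketch).
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.support_)               # indices of the support vectors in X
print(clf.support_vectors_)       # the support vectors themselves
print(clf.dual_coef_)             # y_i * lambda_i for each support vector
print(clf.coef_, clf.intercept_)  # w and b of the separating hyperplane
```

Only the two points nearest the boundary are reported as support vectors; removing either of the other two points would leave the hyperplane unchanged.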

### Soft Margin and Hinge Loss

Since many real-world datasets contain points that are not linearly separable, we need a hyperplane that allows a minimum of misclassified data points. For this we have soft margin SVM, an extension of hard margin SVM that allows some data points to fall within the margin or to be misclassified.

The soft margin SVM can be mathematically represented as

Minimize: [Tex]\frac{1}{2} \|w\|^2 + C \sum_i \epsilon_i[/Tex]

Subject to: [Tex]y_i (w \cdot x_i + b) \ge 1 - \epsilon_i[/Tex] and [Tex]\epsilon_i \ge 0[/Tex] for all training examples

Here we introduce a slack variable [Tex]\epsilon_i[/Tex] for each data point.

- If the confidence score [Tex]y_i (w \cdot x_i + b)[/Tex] is less than 1, the point violates the margin and incurs a linear penalty of [Tex]\epsilon_i = 1 - y_i (w \cdot x_i + b)[/Tex].
- If [Tex]0 < \epsilon_i < 1[/Tex], the point is correctly classified but lies between the hyperplane and the margin plane.
- If [Tex]\epsilon_i > 1[/Tex], the point is on the wrong side of the hyperplane.
- C is a regularization parameter that balances the trade-off between maximizing the margin and minimizing classification errors.

Here the **hinge loss component is part of the objective function itself, through the slack variables**: at the optimum, [Tex]\epsilon_i = \max(0, 1 - y_i (w \cdot x_i + b))[/Tex], which is exactly the hinge loss of point i. It encourages the SVM to classify each training data point correctly with a margin of at least 1. The term [Tex]\frac{1}{2} \|w\|^2[/Tex] encourages a simple and effective separating hyperplane. The SVM aims to find the optimal values of w and b that minimize this combined objective. The parameter C controls the trade-off between maximizing the margin and tolerating misclassification. As C becomes very large, margin violations become prohibitively expensive and we effectively train a hard margin classifier; a small C tolerates more violations in exchange for a wider margin.
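The soft margin objective can be evaluated for any candidate hyperplane by taking each slack variable as small as the constraints allow, which makes it equal to the hinge loss of that point. The sketch below is illustrative: `soft_margin_objective` is a hypothetical helper, and the data, `w`, `b` and `C` are made-up values.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) for 0.5 * ||w||^2 + C * sum_i eps_i."""
    margins = y * (X @ w + b)
    slacks = np.maximum(0.0, 1.0 - margins)   # smallest feasible eps_i = hinge loss of point i
    objective = 0.5 * np.dot(w, w) + C * slacks.sum()
    return objective, slacks

# Made-up data: the third point sits inside the margin, the fourth is misclassified.
X = np.array([[2.0, 2.0], [-2.0, -2.0], [0.5, 0.5], [1.0, 1.0]])
y = np.array([1, -1, 1, -1])
w = np.array([0.5, 0.5])
b = 0.0

objective, slacks = soft_margin_objective(w, b, X, y, C=1.0)
print(slacks)      # 0 -> outside margin, (0, 1) -> inside margin, > 1 -> wrong side
print(objective)
```

Raising `C` in this sketch increases the cost of the two violating points relative to the margin term, which is the trade-off described above.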

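The Lagrange function becomes:

[Tex]L(w, b, \epsilon, \lambda, \mu) = \frac{1}{2} \|w\|^2 + C \sum_i \epsilon_i - \sum_i \lambda_i \left[ y_i (w \cdot x_i + b) - 1 + \epsilon_i \right] - \sum_i \mu_i \epsilon_i[/Tex]

where [Tex]\lambda_i \ge 0[/Tex] and [Tex]\mu_i \ge 0[/Tex] are the multipliers for the margin constraints and for the constraints [Tex]\epsilon_i \ge 0[/Tex] respectively.

This can be solved in the same way as the hard margin classifier. The resulting dual has the same objective as equation 7, with the additional constraint [Tex]0 \le \lambda_i \le C[/Tex].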

## Implementing Hinge Loss Functions in Python

We will use the iris dataset to construct an SVM classifier using hinge loss. The notebook is available at SVM Hinge Loss

#### 1. Import Necessary Libraries

    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix, precision_score, recall_score

#### 2. Load the IRIS Dataset

    iris = datasets.load_iris()

#### 3. Split in Train and Test Set

    X_train, X_test, y_train, y_test = train_test_split(
        iris['data'], iris['target'], test_size=0.33, random_state=42)

#### 4. Model training using Hinge Loss

- We have imported SGD Classifier from scikit-learn and specified the loss function as ‘hinge’.
- Model uses the training data and corresponding labels to classify data based on hinge loss function.

    # Using hinge loss
    from sklearn.linear_model import SGDClassifier
    clf_hinge = SGDClassifier(loss="hinge", max_iter=1000)
    clf_hinge.fit(X_train, y_train)
    y_test_pred_hinge = clf_hinge.predict(X_test)

#### 5. Model Evaluation when, loss = ‘hinge’

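    # scikit-learn metrics take (y_true, y_pred). The predictions are passed first here,
    # so the rows of the confusion matrix are predicted classes and the columns true classes.
    print('\033[1m' + "Hinge Loss" + '\033[0m')
    print(f"Precision score : {precision_score(y_test_pred_hinge, y_test, average='weighted')}")
    print(f"Recall score : {recall_score(y_test_pred_hinge, y_test, average='weighted')}")
    print("Confusion Matrix")
    confusion_matrix(y_test_pred_hinge, y_test)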

**Output :**

    Hinge Loss
    Precision score : 0.95125
    Recall score : 0.94
    Confusion Matrix
    array([[19,  0,  0],
           [ 0, 15,  3],
           [ 0,  0, 13]])
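The model above is a multiclass classifier. To connect it back to the binary hinge loss formula discussed earlier, the following sketch restricts iris to two classes, relabels them as -1 and +1, and compares a manual computation of the average hinge loss with scikit-learn's `hinge_loss` metric. The variable names are illustrative, `random_state=0` is fixed only to make the sketch reproducible, and the loss is measured on the training data purely to compare the two computations.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import hinge_loss

iris = datasets.load_iris()
mask = iris["target"] < 2                            # keep classes 0 and 1 only
X_bin = iris["data"][mask]
y_bin = np.where(iris["target"][mask] == 1, 1, -1)   # relabel as -1 / +1

# random_state is fixed here only to make the sketch reproducible.
clf = SGDClassifier(loss="hinge", max_iter=1000, random_state=0).fit(X_bin, y_bin)
scores = clf.decision_function(X_bin)                # f(x) = w . x + b

manual = np.mean(np.maximum(0.0, 1.0 - y_bin * scores))  # average of max(0, 1 - y * f(x))
library = hinge_loss(y_bin, scores)                      # scikit-learn's average hinge loss

print(manual, library)
```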

### Advantages of using hinge loss for SVMs

There are several advantages to using hinge loss for SVMs:

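- Hinge loss is convex and piecewise linear, which makes it a simple and efficient loss function to optimize.
- Hinge loss is zero for every point classified correctly beyond the margin, so small perturbations of those points do not change the solution.
- Hinge loss encourages SVMs to find hyperplanes with a large margin.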

### Disadvantages of using hinge loss for SVMs

There are a few disadvantages to using hinge loss for SVMs:

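- Hinge loss is not differentiable at y·f(x) = 1, the point where the loss switches from its linear part to zero. This can make it difficult to optimize using some gradient-based methods.
- Hinge loss can be sensitive to outliers, because the penalty for a misclassified point grows linearly and without bound as the point moves further to the wrong side.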
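One common way around the non-differentiable point is to use a subgradient, which agrees with the gradient wherever the gradient exists. The sketch below shows one valid choice for a single example; `hinge_subgradient` is a hypothetical helper and the values are made up, so this is not a description of how any particular library implements its updates.

```python
import numpy as np

def hinge_subgradient(w, b, x, y):
    """One valid subgradient of max(0, 1 - y * (w . x + b)) with respect to (w, b)."""
    if y * (np.dot(w, x) + b) < 1.0:
        return -y * x, -float(y)       # active branch: gradient of 1 - y * (w . x + b)
    return np.zeros_like(x), 0.0       # flat branch; at the kink, 0 is a valid choice

w = np.array([0.5, 0.5])
b = 0.0

print(hinge_subgradient(w, b, np.array([0.5, 0.5]), 1))   # point inside the margin
print(hinge_subgradient(w, b, np.array([2.0, 2.0]), 1))   # point outside the margin
```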

## Conclusion

Hinge loss is a popular loss function for training SVMs. It is simple and efficient to optimize and encourages a large margin. However, it is not differentiable at y·f(x) = 1 and can be sensitive to outliers.

Overall, hinge loss is a good choice for training SVMs for classification tasks.
