Lesson 3.6: Introduction to Support Vector Machines (SVMs)

We now explore a completely new and powerful way of thinking about classification. Instead of fitting probabilities (like Logistic Regression) or making splits (like Decision Trees), an SVM is a geometric algorithm: its goal is to find the 'best' possible street that separates two classes of data, maximizing the 'cushion', or margin, between them.

Part 1: A New Objective - Maximum Margin

Imagine you have data for two classes of loans: 'Safe' (blue dots) and 'Risky' (red dots). The data is linearly separable, meaning you can draw a straight line that perfectly separates the two classes.

The problem is, there are an **infinite** number of lines you could draw. Which one is the "best"?

The Core Analogy: Finding the Widest 'Street'

Think of the separating line as the center yellow line of a street. An SVM's goal is to find the line that allows for the **widest possible street** to be built between the two classes of data, without any data points (cars) being on the street itself.

Imagine a scatter plot of red and blue dots. A thick 'street' is drawn between them, with the center line perfectly separating the two classes. The 'curbs' of the street touch the closest points of each class.

  • The center line is called the **hyperplane**.
  • The "width" of the street is called the **margin**.
  • The data points that lie on the edge of the street (the "curbs") are called the **support vectors**. They are the most important data points.

An SVM is a **Maximum Margin Classifier**. It finds the hyperplane that maximizes this margin. This is a very intuitive goal, as it creates the most robust possible separation between the classes.
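
To make the picture concrete, here is a minimal sketch, assuming scikit-learn and a tiny made-up dataset (both are illustrative choices, not part of the lesson), that fits a linear SVM and reads off the fitted hyperplane and its support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny, linearly separable toy data: class -1 near the origin, class +1 up and to the right.
X = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 0.0],
              [3.0, 3.0], [3.5, 2.5], [4.0, 3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# kernel="linear" keeps the straight-line picture; a very large C approximates a hard margin.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("w (normal to the hyperplane):", clf.coef_[0])
print("b (intercept):", clf.intercept_[0])
print("support vectors:\n", clf.support_vectors_)
```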

Part 2: The Mathematics of the Margin

To turn this geometric idea into an optimization problem, we need to describe the street mathematically. A hyperplane is defined by the equation $\mathbf{w}^T\mathbf{x} + b = 0$, where $\mathbf{w}$ is a vector perpendicular to the hyperplane.

The "curbs" of our street are two parallel hyperplanes:

  • $\mathbf{w}^T\mathbf{x} + b = 1$ (The positive curb, touching the support vectors of the positive class)
  • $\mathbf{w}^T\mathbf{x} + b = -1$ (The negative curb, touching the support vectors of the negative class)

It can be shown that the distance between these two curbs, the margin, is $2 / \|\mathbf{w}\|$.
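
Briefly, here is where that comes from. Take any point $\mathbf{x}_+$ on the positive curb and any point $\mathbf{x}_-$ on the negative curb, and subtract their curb equations to get $\mathbf{w}^T(\mathbf{x}_+ - \mathbf{x}_-) = 2$. The distance between the two parallel curbs is the component of $\mathbf{x}_+ - \mathbf{x}_-$ along the unit normal $\mathbf{w}/\|\mathbf{w}\|$:

$$\frac{\mathbf{w}^T(\mathbf{x}_+ - \mathbf{x}_-)}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}$$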

The SVM Optimization Problem (Hard Margin)

To maximize the margin $2 / \|\mathbf{w}\|$, we must **minimize** the magnitude of the weight vector, $\|\mathbf{w}\|$. For mathematical convenience, we minimize $\frac{1}{2}\|\mathbf{w}\|^2$ instead: it has the same minimizer, and the squared form is smoother and easier to differentiate.

This gives us the formal optimization problem:

$$\min_{\mathbf{w},\, b} \; \frac{1}{2}\|\mathbf{w}\|^2$$

subject to the constraint that all data points are on the correct side of the street:

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 \quad \text{for all } i$$

(where $y_i$ is either $+1$ or $-1$).
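
As a quick sanity check of both the margin formula and the constraint, the sketch below (again assuming scikit-learn and toy data; a very large $C$ is used to approximate the hard margin) prints $2/\|\mathbf{w}\|$ and the quantity $y_i(\mathbf{w}^T\mathbf{x}_i + b)$ for every training point:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data; labels are -1 and +1, matching the constraint above.
X = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 0.0],
              [3.0, 3.0], [3.5, 2.5], [4.0, 3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # huge C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]

# Margin width predicted by the formula 2 / ||w||.
print("margin width:", 2.0 / np.linalg.norm(w))

# Every point should satisfy y_i (w^T x_i + b) >= 1; support vectors sit at ~1.
print("y_i (w^T x_i + b):", np.round(y * (X @ w + b), 3))
```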

Part 3: Dealing with the Real World - The Soft Margin Classifier

The "hard margin" SVM we just defined has two major problems: it only works if the data is perfectly linearly separable, and it is very sensitive to outliers.

The solution is to allow for a "softer" street. We will allow some data points to be on the wrong side of the curb, or even on the wrong side of the street entirely, but we will add a **penalty** for each mistake.

The Soft Margin SVM Optimization Problem

We introduce a "slack" variable $\xi_i \ge 0$ for each data point, which measures how much it violates the margin. We then modify the objective function to be a tradeoff:

$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^n \xi_i$$

subject to the new, softer constraint:

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i \quad \text{for all } i$$
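
To see the slack variables in action, the sketch below (assuming scikit-learn and a made-up toy set with one deliberate outlier) recovers them from a fitted soft-margin model, using the fact that at the optimum $\xi_i = \max(0,\ 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b))$:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: the fourth point is a deliberate -1 outlier sitting inside the +1 cluster.
X = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 0.0], [3.2, 2.8],
              [3.0, 3.0], [3.5, 2.5], [4.0, 3.0]])
y = np.array([-1, -1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# xi_i = 0: respects the margin; 0 < xi_i <= 1: inside the street;
# xi_i > 1: on the wrong side of the hyperplane (misclassified).
slack = np.maximum(0.0, 1.0 - y * (X @ w + b))
print("slack values xi_i:", np.round(slack, 3))
```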

The 'C' Hyperparameter: The Margin Regularizer

The new hyperparameter, $C$, controls the bias-variance tradeoff for the SVM:

  • Large $C$: We place a high penalty on margin violations. The model tries very hard to fit every data point correctly, leading to a **narrow margin** and potentially **overfitting** (high variance).
  • Small $C$: We place a low penalty on margin violations. The model is happy to misclassify a few points in exchange for creating a **wider, more robust margin**. This leads to a simpler model that may generalize better (higher bias, lower variance). The sketch after this list shows the tradeoff numerically.
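
Here is that tradeoff made numeric, assuming scikit-learn and a synthetic two-blob dataset: as $C$ shrinks, the margin $2/\|\mathbf{w}\|$ widens and more points end up as support vectors; as $C$ grows, the margin narrows.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping Gaussian blobs, so some margin violations are unavoidable.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:<6}  margin width={margin:.3f}  "
          f"support vectors={clf.support_vectors_.shape[0]}")
```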

SVMs: Strengths and Weaknesses

Strengths

  • Very effective in high-dimensional spaces.
  • Memory efficient, as it only uses the support vectors for its decision boundary.
  • Versatile: can be adapted for non-linear problems using the Kernel Trick.

Weaknesses

  • Can be slow to train on very large datasets.
  • Performance is highly sensitive to the choice of the $C$ hyperparameter.
  • Does not directly provide probability estimates (see the sketch after this list).
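
On that last weakness: what an SVM natively outputs is a signed distance to the hyperplane (`decision_function` in scikit-learn), not a probability. A common workaround, sketched below under the same illustrative assumptions, is Platt scaling, which scikit-learn enables with `probability=True` at the cost of an extra internal cross-validation during fitting:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# probability=True fits a sigmoid (Platt scaling) on top of the raw SVM scores.
clf = SVC(kernel="linear", probability=True, random_state=0).fit(X, y)

print("decision_function (signed distances):", np.round(clf.decision_function(X[:3]), 3))
print("predict_proba (calibrated):\n", np.round(clf.predict_proba(X[:3]), 3))
```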

What's Next? The SVM's Magic Trick

We've built a powerful linear classifier. But what if our data is not linearly separable? What if the optimal decision boundary is a circle?

This is where the true power of SVMs comes from. Instead of creating a more complex model, the SVM uses a mathematical "hack" to make the *data* itself look simpler.

In the next lesson, we will explore the **Kernel Trick**, a method that allows the SVM to implicitly project the data into a much higher dimension where it *becomes* linearly separable, allowing it to create incredibly complex, non-linear decision boundaries with ease.