Lesson 3.6: Introduction to Support Vector Machines (SVMs)
We now explore a completely new and powerful way of thinking about classification. Instead of fitting probabilities (like Logistic Regression) or making splits (like Decision Trees), SVMs take a geometric approach: their goal is to find the 'best' possible street separating two classes of data, maximizing the 'cushion', or margin, between them.
Part 1: A New Objective - Maximum Margin
Imagine you have data for two classes of loans: 'Safe' (blue dots) and 'Risky' (red dots). The data is linearly separable, meaning you can draw a straight line that perfectly separates the two classes.
The problem is, there are an **infinite** number of lines you could draw. Which one is the "best"?
The Core Analogy: Finding the Widest 'Street'
Think of the separating line as the center yellow line of a street. An SVM's goal is to find the line that allows for the **widest possible street** to be built between the two classes of data, without any data points (cars) being on the street itself.
Imagine a scatter plot of red and blue dots. A thick 'street' is drawn between them, with the center line perfectly separating the two classes. The 'curbs' of the street touch the closest points of each class.
- The center line is called the **hyperplane**.
- The "width" of the street is called the **margin**.
- The data points that lie on the edge of the street (the "curbs") are called the **support vectors**. They are the most important data points.
An SVM is a **Maximum Margin Classifier**. It finds the hyperplane that maximizes this margin. This is a very intuitive goal, as it creates the most robust possible separation between the classes.
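To make this concrete, here is a minimal sketch of fitting a maximum-margin classifier with scikit-learn. The dataset, the use of `make_blobs`, and the variable names are illustrative assumptions, not part of the lesson; the very large `C` value (a hyperparameter introduced in Part 3) simply forces the fit to behave like the hard-margin classifier described here.

```python
# Minimal sketch: a (near) hard-margin linear SVM on toy, separable data.
# The dataset and names here are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs stand in for the 'Safe' and 'Risky' loans.
X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.0, random_state=42)

# A very large C leaves almost no tolerance for margin violations,
# so the fitted model approximates the hard-margin SVM described in this lesson.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# Only the points sitting on the 'curbs' matter: the support vectors.
print("Support vectors:\n", clf.support_vectors_)
print("Support vectors per class:", clf.n_support_)
```

Note that deleting any point that is *not* a support vector and refitting would leave the decision boundary unchanged; only the points on the curbs define the street.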
Part 2: The Mathematics of the Margin
To turn this geometric idea into an optimization problem, we need to describe the street mathematically. A hyperplane is defined by the equation $\mathbf{w} \cdot \mathbf{x} + b = 0$, where $\mathbf{w}$ is a vector perpendicular to the hyperplane and $b$ is an offset term.
The "curbs" of our street are two parallel hyperplanes:
- $\mathbf{w} \cdot \mathbf{x} + b = +1$ (the positive curb, touching the support vectors of the positive class)
- $\mathbf{w} \cdot \mathbf{x} + b = -1$ (the negative curb, touching the support vectors of the negative class)
It can be shown that the distance between these two curbs (the margin) is $\frac{2}{\|\mathbf{w}\|}$.
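For the curious, here is a short sketch of where that formula comes from, using a point $\mathbf{x}^{+}$ on the positive curb and a point $\mathbf{x}^{-}$ on the negative curb, chosen so that the segment between them is parallel to $\mathbf{w}$:
$$\mathbf{w} \cdot \mathbf{x}^{+} + b = +1, \qquad \mathbf{w} \cdot \mathbf{x}^{-} + b = -1$$
Subtracting the second equation from the first gives $\mathbf{w} \cdot (\mathbf{x}^{+} - \mathbf{x}^{-}) = 2$. Because $\mathbf{x}^{+} - \mathbf{x}^{-}$ is parallel to $\mathbf{w}$, its length is exactly the width of the street:
$$\text{margin} = \|\mathbf{x}^{+} - \mathbf{x}^{-}\| = \frac{2}{\|\mathbf{w}\|}$$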
The SVM Optimization Problem (Hard Margin)
To maximize the margin $\frac{2}{\|\mathbf{w}\|}$, we must **minimize** the magnitude of the weight vector, $\|\mathbf{w}\|$. For mathematical convenience, we minimize $\frac{1}{2}\|\mathbf{w}\|^2$ instead.
This gives us the formal optimization problem:
$$\min_{\mathbf{w},\, b} \; \frac{1}{2}\|\mathbf{w}\|^2$$
subject to the constraint that all data points are on the correct side of the street:
$$y_i \left( \mathbf{w} \cdot \mathbf{x}_i + b \right) \ge 1 \quad \text{for all } i$$
(where the label $y_i$ is either $+1$ or $-1$).
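Continuing the earlier sketch (same assumed imports and the fitted `clf` from Part 1), we can pull $\mathbf{w}$ and $b$ out of the fitted model and check these quantities numerically:

```python
# Illustrative check of the hard-margin quantities, using the 'clf' fitted above.
w = clf.coef_[0]          # weight vector, perpendicular to the hyperplane
b = clf.intercept_[0]     # bias (offset) term

# The street's width: 2 / ||w||.
print("Margin width:", 2.0 / np.linalg.norm(w))

# The constraint y_i (w . x_i + b) >= 1, with labels mapped to {-1, +1}.
# A small numerical tolerance accounts for the solver's finite precision.
y_signed = np.where(y == 1, 1, -1)
print("All constraints satisfied:",
      np.all(y_signed * (X @ w + b) >= 1 - 1e-3))
```

On separable data, every point should satisfy the constraint, and the support vectors are exactly the points where it holds with (near) equality.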
Part 3: Dealing with the Real World - The Soft Margin Classifier
The "hard margin" SVM we just defined has two major problems: it only works if the data is perfectly linearly separable, and it is very sensitive to outliers.
The solution is to allow for a "softer" street. We will allow some data points to be on the wrong side of the curb, or even on the wrong side of the street entirely, but we will add a **penalty** for each mistake.
The Soft Margin SVM Optimization Problem
We introduce a "slack" variable $\xi_i$ for each data point, which measures how much it violates the margin. We then modify the objective function to be a tradeoff:
$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$
subject to the new, softer constraint:
$$y_i \left( \mathbf{w} \cdot \mathbf{x}_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0 \quad \text{for all } i$$
The 'C' Hyperparameter: The Margin Regularizer
The new hyperparameter, $C$, controls the bias-variance tradeoff for the SVM (illustrated in the sketch after this list):
- Large C: We place a high penalty on margin violations. The model tries very hard to fit every data point correctly, leading to a **narrow margin** and potentially **overfitting** (high variance).
- Small C: We place a low penalty on margin violations. The model is happy to misclassify a few points in exchange for creating a **wider, more robust margin**. This leads to a simpler model that may generalize better (higher bias, lower variance).
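The sketch below illustrates this tradeoff. It is again an illustrative example rather than part of the lesson: noisier blobs so the classes overlap, compared at one small and one large value of `C`.

```python
# Illustrative comparison of soft-margin behaviour for small vs. large C.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes: no perfect separating line exists.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: margin width = {width:.3f}, "
          f"support vectors = {len(clf.support_vectors_)}")

# Expect the small-C model to report a wider street (and more support vectors),
# while the large-C model narrows the street to fit the training points harder.
```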
Strengths
- Very effective in high-dimensional spaces.
- Memory efficient, as it only uses the support vectors for its decision boundary.
- Versatile: can be adapted for non-linear problems using the Kernel Trick.
Weaknesses
- Can be slow to train on very large datasets.
- Performance is highly sensitive to the choice of the `C` hyperparameter.
- Does not directly provide probability estimates.
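On the last point: a common workaround (not covered further in this lesson) is scikit-learn's `probability=True` option, which fits an additional calibration step on top of the SVM's decision function. A brief, illustrative sketch, reusing the `X, y` from the previous example:

```python
# Illustrative: obtaining probability-like outputs from an SVC.
# probability=True adds an internal cross-validated calibration step,
# which makes training noticeably slower.
from sklearn.svm import SVC

clf = SVC(kernel="linear", probability=True).fit(X, y)
print(clf.predict_proba(X[:5]))  # rows of class probabilities
```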
What's Next? The SVM's Magic Trick
We've built a powerful linear classifier. But what if our data is not linearly separable? What if the optimal decision boundary is a circle?
This is where the true power of SVMs comes from. Instead of creating a more complex model, the SVM uses a mathematical "hack" to make the *data* itself look simpler.
In the next lesson, we will explore the **Kernel Trick**, a method that allows the SVM to implicitly project the data into a much higher dimension where it *becomes* linearly separable, allowing it to create incredibly complex, non-linear decision boundaries with ease.