Lesson 2.5: From Regression to Classification: The Logic of Logistic Regression

We've mastered predicting numbers. But what about predicting categories (Yes/No, Buy/Sell)? This lesson bridges the gap, showing how we can adapt our linear model using the elegant Sigmoid function to output probabilities. We'll explore why OLS fails for classification and how Maximum Likelihood Estimation provides the correct framework.

Part 1: The Problem with a Straight Line

Let's say we want to predict if a loan applicant will default (1) or not (0) based on their credit score. This is a **binary classification** problem.

What happens if we naively try to use our trusty Linear Regression model? The model will try to fit a straight line through data points whose target values are all either 0 or 1.

The 'Speedometer for a Light Switch' Problem

Using linear regression for classification is like trying to use a car's speedometer to describe a light switch. A light switch is either ON or OFF. A speedometer can read any number.

  • The Output is Unconstrained: Our linear model $\hat{y} = \mathbf{X}\bm{\beta}$ can predict any value: 0.5, 1.2, -0.3, etc. Values like 1.2 or -0.3 are nonsensical as probabilities. We need our output to be a **probability**, strictly between 0 and 1.
  • The Errors are Not Normal: The error term ($\epsilon$) can only take on two values for any given $\mathbf{X}$, violating the normality assumption needed for valid OLS inference.

We need a new tool. We need a function that takes the unbounded output of our linear model and "squashes" it into the [0, 1] range to represent a probability.
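To see the first problem concretely, here is a minimal sketch that fits scikit-learn's LinearRegression to 0/1 labels. The tiny dataset is made up purely for illustration; the point is that the fitted line happily produces "probabilities" outside the [0, 1] range.

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: low credit scores default (1), high credit scores do not (0)
credit_score = np.array([[400], [450], [500], [700], [750], [800]])
default = np.array([1, 1, 1, 0, 0, 0])

lin_reg = LinearRegression().fit(credit_score, default)

# Predictions at the extremes fall outside [0, 1] -- impossible as probabilities
print(lin_reg.predict(np.array([[300], [850]])))  # roughly [1.43, -0.28]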

Part 2: The Solution - The Sigmoid (Logistic) Function

The **Sigmoid function** is the perfect tool for this job. It's an S-shaped curve that takes any real number and maps it to a value between 0 and 1.

Definition: The Sigmoid Function $\sigma(z)$

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z$ is the output of our linear model, $z = \mathbf{X}\bm{\beta}$.
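To build intuition, here is a minimal NumPy sketch of the sigmoid; the sample inputs are arbitrary. Notice how even extreme inputs get squashed into the open interval (0, 1).

import numpy as np

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(z))  # approx. [0.00005, 0.12, 0.5, 0.88, 0.99995]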

The Logistic Regression Model

The Logistic Regression model predicts the **probability** of the outcome being 1 ($Y=1$) by feeding the linear model's output through the sigmoid function:

$$\hat{p} = P(Y=1 | \mathbf{X}) = \sigma(\mathbf{X}\bm{\beta}) = \frac{1}{1 + e^{-\mathbf{X}\bm{\beta}}}$$

We then make our final classification based on a threshold (usually 0.5): If $\hat{p} > 0.5$, we predict 1 (Default). If $\hat{p} \le 0.5$, we predict 0 (No Default).
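Putting the two pieces together, here is a small sketch of the full prediction step. The design matrix and the coefficient vector are made-up values for illustration, not estimates from real data.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up design matrix (intercept column + one feature) and coefficients
X = np.array([[1.0, -2.0],
              [1.0,  0.5],
              [1.0,  3.0]])
beta = np.array([0.2, 1.1])

p_hat = sigmoid(X @ beta)          # predicted P(Y=1 | X), always in (0, 1)
y_hat = (p_hat > 0.5).astype(int)  # apply the 0.5 threshold

print(p_hat)  # approx. [0.12, 0.68, 0.97]
print(y_hat)  # [0, 1, 1]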

Part 3: A New Objective - Maximizing Likelihood

Since our output is a probability, not a continuous value, minimizing the Sum of Squared Residuals (SSR) is no longer the right objective. Instead, we use **Maximum Likelihood Estimation (MLE)**.

The goal of MLE is to find the parameters ($\bm{\beta}$) that **maximize the probability (the likelihood) of observing the actual data we collected**.

Deriving the Log-Likelihood for Logistic Regression

For a single observation $y_i$ (which is either 0 or 1), the probability is:

$$P(y_i | \mathbf{x}_i) = \hat{p}_i^{y_i} (1 - \hat{p}_i)^{1 - y_i}$$

The total likelihood for all $n$ independent observations is the product:

$$L(\bm{\beta}) = \prod_{i=1}^n \hat{p}_i^{y_i} (1 - \hat{p}_i)^{1 - y_i}$$

Taking the log gives the **Log-Likelihood**, which is easier to work with:

$$\ell(\bm{\beta}) = \sum_{i=1}^n \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]$$
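As a quick sanity check of the formula, here is the log-likelihood computed for a handful of made-up labels and predicted probabilities:

import numpy as np

# Made-up labels and predicted probabilities for five observations
y = np.array([1, 0, 1, 1, 0])
p_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Sum of y*log(p_hat) + (1-y)*log(1 - p_hat)
log_likelihood = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
print(log_likelihood)  # approx. -1.30; closer to 0 means a better fit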

Unlike OLS, this problem has no closed-form solution, so the algorithm uses gradient ascent (or similar numerical methods) to find the $\bm{\beta}$ that maximizes this log-likelihood function. In practice, optimizers *minimize* a loss function, so they simply minimize the **negative** of the log-likelihood. This is called the **Log Loss** or **Binary Cross-Entropy Loss**.
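To make the optimization concrete, here is a bare-bones gradient descent on the log loss for a one-feature toy problem. The data, learning rate, and iteration count are arbitrary choices for illustration; real libraries use more sophisticated solvers, but the idea is the same.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy data generated from a known model: intercept 0, slope 2
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = (rng.random(1000) < sigmoid(2 * x)).astype(int)
X = np.column_stack([np.ones_like(x), x])  # add an intercept column

beta = np.zeros(2)
learning_rate = 0.1

for _ in range(5000):
    p_hat = sigmoid(X @ beta)
    # Gradient of the mean negative log-likelihood (log loss) w.r.t. beta
    gradient = X.T @ (p_hat - y) / len(y)
    beta -= learning_rate * gradient  # step downhill on the loss

print(beta)  # should land near the true values [0, 2], up to sampling noise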

Part 4: Interpreting the Coefficients - The Log-Odds

Unlike in linear regression, the coefficients in logistic regression are not as straightforward to interpret. A one-unit change in $X_j$ does not lead to a $\beta_j$ change in the probability.

To find a linear relationship, we must transform our probability using the **logit** function (the inverse of the sigmoid).

$$\text{logit}(p) = \ln\left(\frac{p}{1-p}\right)$$

The term $p/(1-p)$ is the **odds** of the event occurring. The logit is therefore the **log-odds**.
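For example, a probability of $p = 0.75$ corresponds to odds of $0.75 / 0.25 = 3$ (the event is three times as likely to happen as not), and therefore to log-odds of $\ln(3) \approx 1.10$.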

The magic of logistic regression is that the log-odds are linear in the parameters:

$$\ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = \mathbf{X}\bm{\beta}$$

Interpretation: A one-unit change in $X_j$ is associated with a $\beta_j$ change in the **log-odds** of the event. To make this more intuitive, we often exponentiate the coefficient, $e^{\beta_j}$, which gives us the **odds ratio**.
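As a quick numeric illustration with a made-up coefficient (not a value estimated from any dataset):

import numpy as np

beta_j = 0.4                 # hypothetical coefficient for feature X_j
odds_ratio = np.exp(beta_j)
print(odds_ratio)            # approx. 1.49

# Holding the other features fixed, a one-unit increase in X_j multiplies
# the odds of the event by about 1.49 -- a ~49% increase in the odds.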

Part 5: Python Implementation

Logistic Regression in Python

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# 1. Generate some sample data
np.random.seed(42)
n_samples = 500
credit_score = np.random.randint(300, 850, n_samples)
income = np.random.randint(20000, 150000, n_samples)
# The probability of default is higher for lower credit scores and lower incomes
prob_default = 1 / (1 + np.exp(-(5 - 0.01*credit_score - 0.00002*income)))
# Generate the binary outcome
default = (np.random.rand(n_samples) < prob_default).astype(int)

df = pd.DataFrame({'credit_score': credit_score, 'income': income, 'default': default})

# 2. Prepare Data
X = df[['credit_score', 'income']]
y = df['default']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# 3. Scale Features (Important!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Fit the Logistic Regression Model
# Note: sklearn applies L2 regularization by default; C=1.0 is the inverse of the regularization strength (lambda).
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

# 5. Make Predictions
y_pred = log_reg.predict(X_test_scaled)
y_pred_proba = log_reg.predict_proba(X_test_scaled)[:, 1] # Get probability of class 1

# 6. Evaluate the Model
print("--- Model Evaluation ---")
print(classification_report(y_test, y_pred))

# 7. Interpret Coefficients (Odds Ratios)
coefs = pd.DataFrame(log_reg.coef_[0], index=X.columns, columns=['Coefficient'])
coefs['Odds Ratio'] = np.exp(coefs['Coefficient'])
print("\n--- Coefficient Interpretation ---")
print(coefs)
# An odds ratio > 1 means the feature increases the odds of default.
# An odds ratio < 1 means the feature decreases the odds of default.
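As a small extension of the script above, the predicted probabilities let us re-classify at any cutoff we like. The 0.3 below is an arbitrary choice; picking a good threshold is exactly what the next lesson is about.

# 8. (Optional) Re-classify using a custom probability threshold
custom_threshold = 0.3  # arbitrary; stricter than the default 0.5
y_pred_strict = (y_pred_proba >= custom_threshold).astype(int)
print("\n--- Confusion Matrix at threshold 0.3 ---")
print(confusion_matrix(y_test, y_pred_strict))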

What's Next? Advanced Classification Metrics

We've successfully built our first classification model and seen how to interpret its coefficients. The `classification_report` gives us key metrics like Precision and Recall.

But how do we evaluate the model's performance across all possible probability thresholds, not just 0.5? How can we visualize the tradeoff between catching true positives and avoiding false alarms?

In the next lesson, we will dive into advanced classification metrics, exploring the **Precision-Recall Tradeoff** and the all-important **Receiver Operating Characteristic (ROC) Curve** and its associated **Area Under the Curve (AUC)** score.