Lesson 5.8: A Tour of Other Clustering Methods
K-Means is a powerful and fast clustering algorithm, but its assumptions can be limiting. This lesson provides a brief overview of two popular alternatives: DBSCAN, which excels at finding clusters of arbitrary shapes and identifying noise, and Hierarchical Clustering, which builds a 'family tree' of data points.
Part 1: The Limitations of K-Means
Before exploring alternatives, let's recap the two major weaknesses of K-Means that motivate the search for other methods:
- It assumes spherical clusters: Because K-Means is based on minimizing the distance to a central point (the centroid), it implicitly assumes that clusters are convex and roughly spherical. It fails badly on complex shapes like rings, moons, or long, thin structures (the sketch after this list shows this failure on two interleaved "moons").
- It assigns every point to a cluster: K-Means forces every single point, including outliers, into a cluster. A few distant outliers can significantly drag a centroid away from the true center of its cluster.
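To make the first limitation concrete, here is a minimal sketch, assuming scikit-learn is installed, that runs K-Means on two crescent-shaped clusters generated with `make_moons`. The specific parameter values are illustrative, not prescriptive.

```python
# A minimal sketch (assuming scikit-learn is installed) of K-Means
# struggling on non-spherical clusters: two interleaved "moons".
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two crescent-shaped clusters -- clearly separated, but not spherical.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

# Even with the "correct" K, K-Means tends to cut each crescent in half,
# because it can only draw convex, centroid-based boundaries.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# A score near 1.0 would mean perfect agreement with the true moons;
# K-Means typically lands far below that on this data.
print("Adjusted Rand Index:", adjusted_rand_score(y_true, kmeans_labels))
```

DBSCAN, introduced next, handles exactly this kind of shape.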
Part 2: DBSCAN - Density-Based Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) takes a completely different approach. It defines clusters as continuous regions of high point density, separated by regions of low density.
The Core Analogy: Finding 'Islands' of Data
Imagine your data points are islands in an ocean.
- Hyperparameters: Instead of choosing 'K', you choose two other parameters:
- `eps` (epsilon): The maximum distance between two points for them to be considered "neighbors" (like a short ferry ride).
- `min_samples` (MinPts): The minimum number of points required to form a dense "core" region.
- The Algorithm (a short code sketch follows this list):
- It starts at an arbitrary unvisited point. If there are at least `min_samples` points (including the point itself) within its `eps` radius, it marks this as a **core point** and starts a new cluster.
- It then expands this cluster by finding all reachable neighbors, and the neighbors of those neighbors, until it can't find any more core points.
- Points that are reachable from a core point but don't have enough neighbors themselves are called **border points**.
- Any point that is not a core point and not a border point is classified as **noise**.
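In code, the whole procedure is a few lines with scikit-learn's `DBSCAN`. The sketch below reuses the two-moons data from Part 1; the `eps` and `min_samples` values are illustrative and would need tuning on real data.

```python
# A minimal sketch of DBSCAN on the two-moons data (assumes scikit-learn).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps: neighborhood radius; min_samples: points needed to form a core point.
# These values are illustrative -- both need tuning on real data sets.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                    # cluster id per point; -1 means noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = np.sum(labels == -1)         # points DBSCAN refused to assign
n_core = len(db.core_sample_indices_)  # dense "core" points

print(f"clusters: {n_clusters}, core points: {n_core}, noise points: {n_noise}")
```

Note that, unlike K-Means, we never told it how many clusters to find: with these settings it typically recovers both crescents and flags stray points with the label `-1`.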
Pros
- Does not require you to specify the number of clusters.
- Can find arbitrarily shaped clusters.
- Is robust to outliers, which it identifies as noise.
Cons
- Can struggle with clusters of widely varying densities, since a single `eps` value has to work for all of them.
- Performance can be sensitive to the choice of `eps` and `min_samples`.
- The concept of "density" becomes less meaningful in high-dimensional spaces.
Part 3: Hierarchical Clustering - Building a 'Family Tree'
Hierarchical clustering doesn't produce a single flat partitioning of the data. Instead, it builds a tree of clusters, known as a **dendrogram**. This is useful when you believe there is a nested structure in your data.
There are two main approaches:
- Agglomerative (Bottom-up): The most common approach. It starts with each data point as its own cluster and iteratively merges the two closest clusters until only one cluster (containing all the data) remains.
- Divisive (Top-down): Starts with all data in one cluster and recursively splits it until each data point is its own cluster.
The Core Analogy: Building a Corporate Org Chart
Imagine you have data on all employees in a company.
Agglomerative hierarchical clustering would work like this:
- Start with every employee as their own individual "cluster."
- Find the two "closest" employees (e.g., two junior analysts who work together) and merge them into a "team" cluster.
- Find the next two closest clusters (this could be two other employees, or it could be merging a single employee with the team you just formed) and merge them.
- Continue this process. Teams merge into departments. Departments merge into divisions. Divisions merge into the whole company.
The result is a dendrogram, a tree diagram that shows the entire hierarchy of merges. The user can then "cut" the dendrogram at a certain height to get a specific number of flat clusters, if desired.
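SciPy's `scipy.cluster.hierarchy` module performs the merges, draws the dendrogram, and lets you cut it. The sketch below is a minimal example on toy blob data; the choice of three flat clusters is arbitrary and purely for illustration.

```python
# A minimal sketch of agglomerative clustering with SciPy
# (assumes scipy, matplotlib, and scikit-learn are installed).
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

# Toy data: three blobs of points standing in for the "employees".
X, _ = make_blobs(n_samples=60, centers=3, random_state=42)

# Perform the full bottom-up merge sequence (Ward linkage, discussed below).
Z = linkage(X, method="ward")

# Draw the dendrogram: every U-shaped link is one merge, and its height
# is the distance at which that merge happened.
dendrogram(Z)
plt.title("Agglomerative clustering dendrogram")
plt.show()

# "Cut" the tree to get a flat clustering with exactly 3 clusters.
flat_labels = fcluster(Z, t=3, criterion="maxclust")
print(flat_labels[:10])
```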
The key choice in this algorithm is how to measure the distance between two clusters, known as the **linkage** criterion. Common methods, compared in the sketch after this list, include:
- Single Linkage: The distance is the minimum distance between any two points in the two clusters.
- Complete Linkage: The distance is the maximum distance between any two points in the two clusters.
- Average Linkage: The distance is the average distance between all pairs of points across the two clusters.
- Ward's Method: Merges the two clusters that result in the minimum increase in the total within-cluster variance. This is often the default choice as it tends to produce well-balanced clusters.
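scikit-learn's `AgglomerativeClustering` exposes this choice directly through its `linkage` parameter. The sketch below is an illustrative comparison on toy blob data, not a recommendation of any one criterion.

```python
# A minimal sketch comparing linkage criteria with scikit-learn's
# AgglomerativeClustering (assumes scikit-learn is installed).
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=1.5, random_state=7)

# Each linkage criterion defines "distance between clusters" differently,
# so the resulting flat clusterings can differ on the same data.
for linkage in ["single", "complete", "average", "ward"]:
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    print(f"{linkage:>8}: silhouette = {silhouette_score(X, labels):.3f}")
```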
Congratulations! You Have Completed Module 5
You have now mastered the two fundamental tasks of Unsupervised Learning. You know how to find groups in data using clustering algorithms like K-Means, and you know how to simplify complex, high-dimensional data using dimensionality reduction techniques like PCA.
This completes a critical part of your machine learning toolkit, giving you the ability to find structure and meaning in data even when there is no "right answer" to guide you.
What's Next in Your Journey?
In all our modules so far, we have largely ignored one of the most important features of financial and economic data: the element of **time**. We have treated our data points as independent observations. But what if the order of the data matters?
In **Module 6: The Rhythm of the Market: Time-Series Forecasting**, we will enter the world of sequential data. We will learn about stationarity, autocorrelation, and the classic ARIMA and GARCH models used to forecast the future.