Lesson 1.0: The ML Landscape - Thinking in Data
Welcome to your journey into Machine Learning. Before we write a single line of code, we must learn to see the world like a data scientist. This foundational lesson introduces the absolute core vocabulary (Features, Labels) and the three main 'paradigms' of learning: Supervised, Unsupervised, and Reinforcement Learning.
Part 1: The Anatomy of a Machine Learning Problem
Let's start with a concrete, relatable problem. Imagine you are a manager at a quantitative hedge fund. At the end of the year, you have to decide on bonuses. You have a spreadsheet of data for all your junior quants from the past year.
| Employee ID | Performance Score (1-10) | Projects Completed | Team Rating (1-5) | Bonus ($) |
|---|---|---|---|---|
| 101 | 8.5 | 6 | 4.2 | 50,000 |
| 102 | 6.2 | 3 | 4.8 | 25,000 |
| ... | ... | ... | ... | ... |
This simple spreadsheet contains all the core components of an ML problem.
Core Vocabulary
- Data / Dataset: The entire spreadsheet. It's our collection of observations.
- Instance / Sample / Observation: A single row in the spreadsheet (e.g., the data for employee 101).
- Features (X): The input columns we use to make a prediction. They are the 'clues'. Here, our features are `Performance Score`, `Projects Completed`, and `Team Rating`. Also called predictors or independent variables.
- Label (y): The output column we are trying to predict. It is the 'answer'. Here, our label is the `Bonus ($)`. Also called the target or dependent variable.
The fundamental goal of most ML models is to learn a function, $f$, that maps the features ($X$) to the label ($y$). We want to find a function such that $y \approx f(X)$.
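To make the vocabulary concrete, here is a minimal sketch in plain Python that splits the two rows shown in the table into features and a label. The dictionary keys (`performance_score`, `bonus`, and so on) are names chosen for this illustration, not part of any library.

```python
# The two rows visible in the spreadsheet above.
# Key names are hypothetical, chosen only for this sketch.
dataset = [
    {"employee_id": 101, "performance_score": 8.5,
     "projects_completed": 6, "team_rating": 4.2, "bonus": 50_000},
    {"employee_id": 102, "performance_score": 6.2,
     "projects_completed": 3, "team_rating": 4.8, "bonus": 25_000},
]

FEATURES = ["performance_score", "projects_completed", "team_rating"]
LABEL = "bonus"

# X holds one list of feature values per instance (row).
X = [[row[name] for name in FEATURES] for row in dataset]
# y holds the matching label for each instance.
y = [row[LABEL] for row in dataset]

print(X)  # [[8.5, 6, 4.2], [6.2, 3, 4.8]]
print(y)  # [50000, 25000]
```

Note that `employee_id` is left out of `X`: it identifies a row but is not a clue about the bonus.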
Part 2: The Three Flavors of Learning
Machine Learning is a vast field, but almost every algorithm falls into one of three main categories, defined by the type of problem they solve and the data they use.
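Supervised Learning
Learning from labeled data: every instance comes with the answer (the label) we want to predict.
Our Quant Bonus Example: Since we have the `Bonus ($)` column (the label), we can use supervised learning to build a model that predicts the bonus for a new employee based on their performance metrics.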
There are two main types of Supervised Learning:
Regression
Predicting a continuous numerical value.
Question: "What will the *exact bonus amount* be?"
Examples: Predicting a stock price, forecasting demand, estimating the temperature.
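As a rough sketch of what a regression model looks like, the snippet below fits a straight line from `Performance Score` to `Bonus ($)` using only the two rows shown in the table. With two points the line passes through both exactly, so this only illustrates the idea of a function that outputs a continuous number; it is not a realistic model. The function names `fit_line` and `predict_bonus` are made up for this example.

```python
# (performance score, bonus) for employees 101 and 102, taken from the table.
scores = [8.5, 6.2]
bonuses = [50_000, 25_000]

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(scores, bonuses)

def predict_bonus(score):
    """Return a continuous bonus estimate for a performance score."""
    return slope * score + intercept

# The prediction can be any number, not one of a fixed set of categories.
print(round(predict_bonus(7.0), 2))
```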
Classification
Predicting a discrete category or class.
Question: "Will the bonus be 'High' or 'Low'?" (We would create these categories from the bonus amount).
Examples: Spam vs. Not Spam, Stock Up vs. Down, Fraudulent vs. Legitimate transaction.
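Turning the bonus amount into categories can be sketched in a few lines. The cutoff of $40,000 below is an arbitrary value assumed for illustration; the lesson does not specify where "High" begins.

```python
# Bonus amounts for employees 101 and 102, taken from the table.
bonuses = [50_000, 25_000]

# Hypothetical cutoff: an assumption made for this sketch only.
HIGH_BONUS_THRESHOLD = 40_000

def to_class(bonus):
    """Map a continuous bonus amount to a discrete class label."""
    return "High" if bonus >= HIGH_BONUS_THRESHOLD else "Low"

y_class = [to_class(b) for b in bonuses]
print(y_class)  # ['High', 'Low']
```

The features stay the same; only the label changes from a number to a category, which is what turns the regression problem into a classification problem.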
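Unsupervised Learning
Finding hidden structure in unlabeled data: there is no answer column to learn from.
Our Quant Bonus Example: Imagine we lost the `Bonus ($)` column. What could we still do? We could ask the algorithm: "Based on the performance metrics, are there natural groups or 'types' of employees in my data?"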
The two main types of Unsupervised Learning are:
Clustering
Grouping similar data points together.
Question: "Can you find 3 distinct groups of employees based on their work styles?" The result might be "The Grinders," "The Team Players," and "The High Potentials."
Examples: Customer segmentation, grouping similar stocks, identifying gene families.
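One common way to do this is k-means clustering. The sketch below is a bare-bones version in plain Python, run on six toy employees described by (performance score, team rating). Only the first and third points mirror rows from the table; the other four are made up for illustration, as is the choice of starting centroids.

```python
import math

# Toy (performance score, team rating) pairs. Points 0 and 2 mirror
# employees 101 and 102; the rest are invented for this sketch.
points = [
    (8.5, 4.2), (8.8, 4.0),
    (6.2, 4.8), (5.9, 4.6),
    (9.4, 2.1), (9.1, 2.4),
]

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and moving centroids."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster in place
        centroids = new_centroids
    return labels, centroids

# Assumed starting centroids: one point from each visually distinct group.
labels, centroids = kmeans(points, [points[0], points[2], points[4]])
print(labels)  # [0, 0, 1, 1, 2, 2]
```

The algorithm only returns group numbers. Names like "The Grinders" are an interpretation a human adds afterwards by inspecting each group.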
Dimensionality Reduction
Simplifying the data by reducing the number of features.
Question: "Can you combine `Performance Score` and `Projects Completed` into a single, new 'Productivity' feature?"
Examples: Data compression, feature extraction for visualization, noise reduction.
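As a hand-rolled stand-in for this idea, the sketch below puts `Performance Score` and `Projects Completed` on a common scale and averages them into one 'Productivity' number. This is not a real dimensionality reduction algorithm: techniques such as PCA learn the combination weights from the data, whereas the equal weights here are simply assumed.

```python
import math

# Values for employees 101 and 102, taken from the table.
performance = [8.5, 6.2]
projects = [6, 3]

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

# Two features in, one feature out. Equal weights are an assumption.
productivity = [
    (p + q) / 2
    for p, q in zip(standardize(performance), standardize(projects))
]
print(productivity)
```

Standardizing first matters because the two columns use different units; without it, the column with the larger range would dominate the combined feature.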
Reinforcement Learning
Learning to take actions through trial and error, guided by rewards.
Our Quant Bonus Example: The problem is no longer about prediction, but about **action**. An RL agent would be a 'manager'.
- Agent: The manager.
- Environment: The team of quants and the available projects.
- Action: Assigning a specific project to a specific quant.
- Reward: The increase in the quant's future performance (and thus bonus).
The agent's goal is to learn a **policy** (a decision-making strategy) for assigning projects that maximizes the total reward (team performance) over time.
Classic Examples: Training an AI to play chess, a robot to walk, or an algorithm to optimally execute a large stock trade to minimize market impact.
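The trial-and-error loop can be sketched with an epsilon-greedy strategy on a drastically simplified version of the manager problem: one quant, three project types, and no changing state (a "multi-armed bandit"). The project names and reward values are hypothetical, and the rewards are noise-free to keep the example short.

```python
import random

# Hypothetical performance gain from assigning each project type.
# The agent never reads this table directly; it only observes rewards.
TRUE_REWARD = {"research": 0.2, "backtesting": 0.5, "execution": 0.9}

def run_epsilon_greedy(steps=500, epsilon=0.2, seed=0):
    """Learn a value estimate per action by trial and error."""
    rng = random.Random(seed)
    actions = list(TRUE_REWARD)
    value = {a: 0.0 for a in actions}  # estimated reward per action
    count = {a: 0 for a in actions}    # times each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(actions)          # explore: try something
        else:
            action = max(actions, key=value.get)  # exploit: use best guess
        reward = TRUE_REWARD[action]              # feedback from environment
        count[action] += 1
        # Incremental mean of the rewards seen so far for this action.
        value[action] += (reward - value[action]) / count[action]
    return value

value = run_epsilon_greedy()
best_action = max(value, key=value.get)
print(best_action)
```

The learned policy here is simply "pick the action with the highest estimated value". Real RL problems add state, so the best action depends on the current situation.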
You can identify the type of ML problem by the question it asks:
- Supervised Learning: "Given this data, predict this specific value/category." (You have the answer key).
- Unsupervised Learning: "Given this data, find its hidden structure." (You don't have an answer key).
- Reinforcement Learning: "Given this situation, what is the best action to take next?" (You learn the answers through trial and error).
What's Next? The Most Important Problem in ML
We've now established the core vocabulary and the three paradigms of learning. We have a way to think about data and a way to diagnose which kind of problem we are facing.
It's time to stop talking and start doing. In the next lesson, we will get our hands dirty with our first two intuitive models. We will explore how K-Nearest Neighbors (KNN) works for classification and take a closer look at the mechanics of Simple Linear Regression, setting the stage for learning how to actually *train* them in the lessons that follow.