Lesson 3.10: The Verdict: p-values and Critical Regions

We have our hypotheses and a test statistic. How do we make the final call? This lesson covers the two equivalent methods for reaching a statistical verdict: the classical Critical Value approach and the modern, more informative p-value approach. Mastering this is the key to reading any statistical output.

Part 1: The Setup for the Verdict

Let's review our situation from the courtroom analogy. We have:

  1. A **Null Hypothesis** to challenge (e.g., $H_0: \beta_1 = 0$, "the suspect is innocent").
  2. A **Significance Level**, $\alpha$ (e.g., 0.05), which is our standard for "beyond a reasonable doubt."
  3. A **Test Statistic** calculated from our evidence (e.g., $t_{stat} = 2.5$).
  4. The known probability distribution of that statistic *assuming $H_0$ is true* (e.g., $t \sim t_{(n-k-1)}$).

We need to answer: "Is our evidence ($t_{stat} = 2.5$) extreme enough to reject the presumption of innocence ($H_0$)?"

There are two ways to answer this.

Part 2: Method 1: The Critical Value Approach

This is the classical, visual approach to hypothesis testing.

The Analogy: A Line in the Sand

Before looking at the evidence ($t_{stat}$), the judge draws a "line in the sand" on the probability distribution. This line is the **Critical Value**.

The area beyond this line is the **Rejection Region**. If our evidence falls into this region, it's considered "extreme enough" to reject the null hypothesis.

The Decision Process
  1. Choose $\alpha$: Let's use $\alpha = 0.05$ for a two-sided test.
  2. Find Critical Value: We look up the t-value that leaves $\alpha/2 = 0.025$ in each tail of the t-distribution (e.g., with 30 df). This gives us our "lines in the sand": $t_{crit} = \pm 2.042$.
  3. Calculate Test Statistic: We compute our statistic from the data, e.g., $t_{stat} = 2.5$.
  4. Make the Decision: We check if our statistic crosses the line (see the code sketch after this list).
    • Since $|2.5| > 2.042$, our statistic falls in the rejection region.
    • **Verdict: Reject H₀.**
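Here is a minimal sketch of these four steps in Python, assuming SciPy is available; the numbers match the worked example above.

```python
# Critical value approach: a minimal sketch using SciPy's t-distribution.
# alpha, df, and t_stat match the worked example above.
from scipy import stats

alpha = 0.05   # significance level
df = 30        # degrees of freedom (n - k - 1)
t_stat = 2.5   # test statistic computed from the data

# Critical value leaving alpha/2 = 0.025 in the upper tail
t_crit = stats.t.ppf(1 - alpha / 2, df)   # approx. 2.042

if abs(t_stat) > t_crit:
    print(f"|t| = {abs(t_stat)} > {t_crit:.3f}: reject H0")
else:
    print(f"|t| = {abs(t_stat)} <= {t_crit:.3f}: fail to reject H0")
```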

Part 3: Method 2: The p-value Approach

The critical value method works, but it only gives a yes-or-no answer tied to one pre-chosen $\alpha$. The modern, more informative approach is to calculate a **p-value**.

Definition: The p-value

The **p-value** is the probability of observing a test statistic **at least as extreme** as the one you actually calculated, *assuming the null hypothesis is true*.
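For our two-sided t-test, this definition translates into:

$$p\text{-value} = P\big(|T| \ge |t_{stat}| \mid H_0 \text{ is true}\big), \qquad T \sim t_{(n-k-1)}$$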

The Analogy: The Surprise-o-Meter

Think of the p-value as a "surprise index" that ranges from 0 to 1.

  • **High p-value (e.g., 0.80):** "Not surprising at all. If the null were true, we'd see evidence at least this extreme 80% of the time." → We don't doubt H₀.
  • **Low p-value (e.g., 0.01):** "Very surprising! If the null were true, data like this would be a 1-in-100 long shot. It's more plausible that the null is wrong." → We doubt H₀.

The Decision Process

The p-value Decision Rule

If the p-value is low, the null must go.

  1. Choose $\alpha$: Let's use $\alpha = 0.05$.
  2. Calculate Test Statistic: We get $t_{stat} = 2.5$.
  3. Calculate p-value: We find the probability of being "more extreme" than our statistic. For a two-sided test, this is the area in the tails beyond $\pm 2.5$:
    $p\text{-value} = P(|t_{df=30}| > 2.5) \approx 0.018$
  4. Make the Decision: We compare our "surprise level" to our "doubt threshold" (see the code sketch after this list).
    • Since the p-value $(0.018)$ is below $\alpha$ $(0.05)$, our data is "too surprising" to be consistent with the null hypothesis.
    • **Verdict: Reject H₀.** (The same verdict as before.)
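As before, here is a minimal sketch of these steps assuming SciPy; it reproduces the $\approx 0.018$ figure from the example above.

```python
# p-value approach: a minimal sketch using SciPy's t-distribution.
# df and t_stat match the worked example above.
from scipy import stats

alpha = 0.05
df = 30
t_stat = 2.5

# Two-sided p-value: the probability mass in both tails beyond |t_stat|
p_value = 2 * stats.t.sf(abs(t_stat), df)   # approx. 0.018

if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject H0")
```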

Part 4: Critical Misinterpretations of the p-value

What the p-value is NOT

The p-value is the most misinterpreted number in all of science. Do not make these mistakes.

  • FALLACY #1: The Prosecutor's Fallacy. A p-value of 0.02 does NOT mean "there is a 2% chance the null hypothesis is true." It is $P(\text{Data} \mid H_0)$, not $P(H_0 \mid \text{Data})$.
  • FALLACY #2: The Evidence of Absence. A large p-value (e.g., 0.70) does NOT "prove the null hypothesis." It simply means you failed to find sufficient evidence against it. Your test may have just been weak (low power).
  • FALLACY #3: The Effect Size Fallacy. A tiny p-value (e.g., $< 0.001$) does NOT mean the effect is large or important. With enough data, even a minuscule, practically useless effect can be "statistically significant" (the short simulation after this list illustrates this).
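To see Fallacy #3 concretely, here is a quick, hypothetical simulation (the sample size and effect size are made up for illustration): a true mean of just 0.01 standard deviations, measured on a million observations, produces a vanishingly small p-value even though the effect is practically negligible.

```python
# Illustrating the Effect Size Fallacy: a huge sample makes a trivial
# effect "statistically significant." All numbers here are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(loc=0.01, scale=1.0, size=n)   # true mean is a negligible 0.01

# One-sample t-test of H0: mean = 0
t_stat, p_value = stats.ttest_1samp(x, popmean=0.0)
print(f"t = {t_stat:.1f}, p = {p_value:.1e}")  # p is tiny; the effect is not
```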

What's Next? The Theory of the 'Best' Test

We've mastered the practical mechanics of reaching a verdict. But this raises a deeper question: How do we know that the t-test or the F-test is the *best possible* test we could have used?

Is there a way to find the "most powerful" test for a given hypothesis—the test that has the highest probability of correctly convicting a guilty party (Power = 1-β) for a fixed level of false positives (α)?

In the next lesson, we will explore the elegant theory behind this question with the **Neyman-Pearson Lemma**.

Up Next: The Theory of Optimal Tests: Neyman-Pearson Lemma