Lesson 3.10: The Verdict: p-values and Critical Regions

We have our hypotheses and a test statistic. How do we make the final call? This lesson covers the two equivalent methods for reaching a statistical verdict: the classical Critical Value approach and the modern, more informative p-value approach. Mastering this is the key to reading any statistical output.

Part 1: The Setup for the Verdict

Let's review our situation from the courtroom analogy. We have:

  1. A **Null Hypothesis** to challenge (e.g., $H_0: \beta_1 = 0$, "the suspect is innocent").
  2. A **Significance Level**, $\alpha$ (e.g., 0.05), which is our standard for "beyond a reasonable doubt."
  3. A **Test Statistic** calculated from our evidence (e.g., $t_{stat} = 2.5$).
  4. The known probability distribution of that statistic *assuming $H_0$ is true* (e.g., $t \sim t_{(n-k-1)}$).

We need to answer: "Is our evidence ($t_{stat} = 2.5$) extreme enough to reject the presumption of innocence ($H_0$)?"

There are two ways to answer this.

Part 2: Method 1: The Critical Value Approach

This is the classical, visual approach to hypothesis testing.

The Analogy: A Line in the Sand

Before looking at the evidence ($t_{stat}$), the judge draws a "line in the sand" on the probability distribution. This line is the **Critical Value**.

The area beyond this line is the **Rejection Region**. If our evidence falls into this region, it's considered "extreme enough" to reject the null hypothesis.

The Decision Process
  1. Choose $\alpha$: Let's use $\alpha = 0.05$ for a two-sided test.
  2. Find Critical Value: We look up the t-value that leaves $\alpha/2 = 0.025$ in each tail of the t-distribution (e.g., with 30 df). This gives us our "lines in the sand": $t_{crit} = \pm 2.042$.
  3. Calculate Test Statistic: We compute our statistic from the data, e.g., $t_{stat} = 2.5$.
  4. Make the Decision: We check if our statistic crosses the line (see the code sketch after this list).
    • Since $|2.5| > 2.042$, our statistic falls in the rejection region.
    • **Verdict: Reject H₀.**
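Here is a minimal sketch of these four steps in Python, assuming SciPy is available; the numbers match the worked example above.

```python
# Critical value approach: a minimal sketch using SciPy's t-distribution.
# alpha, df, and t_stat match the worked example above.
from scipy import stats

alpha = 0.05   # significance level
df = 30        # degrees of freedom (n - k - 1)
t_stat = 2.5   # test statistic computed from the data

# Critical value leaving alpha/2 = 0.025 in the upper tail
t_crit = stats.t.ppf(1 - alpha / 2, df)   # approx. 2.042

if abs(t_stat) > t_crit:
    print(f"|t| = {abs(t_stat)} > {t_crit:.3f}: reject H0")
else:
    print(f"|t| = {abs(t_stat)} <= {t_crit:.3f}: fail to reject H0")
```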

Part 3: Method 2: The p-value Approach

The critical value method works, but it only gives a yes-or-no answer tied to one pre-chosen $\alpha$. The modern, more informative approach is to calculate a **p-value**.

Definition: The p-value

The **p-value** is the probability of observing a test statistic **at least as extreme** as the one you actually calculated, *assuming the null hypothesis is true*.
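For our two-sided t-test, this definition translates into:

$$p\text{-value} = P\big(|T| \ge |t_{stat}| \mid H_0 \text{ is true}\big), \qquad T \sim t_{(n-k-1)}$$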

The Analogy: The Surprise-o-Meter

Think of the p-value as a "surprise index" that ranges from 0 to 1.

  • **High p-value (e.g., 0.80):** "Not surprising at all. If the null were true, we'd see evidence at least this extreme 80% of the time." → We don't doubt H₀.
  • **Low p-value (e.g., 0.01):** "Very surprising! If the null were true, data like this would be a 1-in-100 long shot. It's more plausible that the null is wrong." → We doubt H₀.

The Decision Process

The p-value Decision Rule

If the p-value is low, the null must go.

  1. Choose $\alpha$: Let's use $\alpha = 0.05$.
  2. Calculate Test Statistic: We get $t_{stat} = 2.5$.
  3. Calculate p-value: We find the probability of being "more extreme" than our statistic. For a two-sided test, this is the area in the tails beyond $\pm 2.5$:
    $p\text{-value} = P(|t_{df=30}| > 2.5) \approx 0.018$
  4. Make the Decision: We compare our "surprise level" to our "doubt threshold" (see the code sketch after this list).
    • Since the p-value $(0.018)$ is below $\alpha$ $(0.05)$, our data is "too surprising" to be consistent with the null hypothesis.
    • **Verdict: Reject H₀.** (The same verdict as before.)
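As before, here is a minimal sketch of these steps assuming SciPy; it reproduces the $\approx 0.018$ figure from the example above.

```python
# p-value approach: a minimal sketch using SciPy's t-distribution.
# df and t_stat match the worked example above.
from scipy import stats

alpha = 0.05
df = 30
t_stat = 2.5

# Two-sided p-value: the probability mass in both tails beyond |t_stat|
p_value = 2 * stats.t.sf(abs(t_stat), df)   # approx. 0.018

if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject H0")
```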

Part 4: Critical Misinterpretations of the p-value

What the p-value is NOT

The p-value is the most misinterpreted number in all of science. Do not make these mistakes.

  • FALLACY #1: The Prosecutor's Fallacy. A p-value of 0.02 does NOT mean "there is a 2% chance the null hypothesis is true." It is $P(\text{Data} \mid H_0)$, not $P(H_0 \mid \text{Data})$.
  • FALLACY #2: The Evidence of Absence. A large p-value (e.g., 0.70) does NOT "prove the null hypothesis." It simply means you failed to find sufficient evidence against it. Your test may have just been weak (low power).
  • FALLACY #3: The Effect Size Fallacy. A tiny p-value (e.g., $< 0.001$) does NOT mean the effect is large or important. With enough data, even a minuscule, practically useless effect can be "statistically significant" (the short simulation after this list illustrates this).
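To see Fallacy #3 concretely, here is a quick, hypothetical simulation (the sample size and effect size are made up for illustration): a true mean of just 0.01 standard deviations, measured on a million observations, produces a vanishingly small p-value even though the effect is practically negligible.

```python
# Illustrating the Effect Size Fallacy: a huge sample makes a trivial
# effect "statistically significant." All numbers here are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(loc=0.01, scale=1.0, size=n)   # true mean is a negligible 0.01

# One-sample t-test of H0: mean = 0
t_stat, p_value = stats.ttest_1samp(x, popmean=0.0)
print(f"t = {t_stat:.1f}, p = {p_value:.1e}")  # p is tiny; the effect is not
```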

What's Next? The Theory of the 'Best' Test

We've mastered the practical mechanics of reaching a verdict. But this raises a deeper question: How do we know that the t-test or the F-test is the *best possible* test we could have used?

Is there a way to find the "most powerful" test for a given hypothesis—the test that has the highest probability of correctly convicting a guilty party (Power = 1-β) for a fixed level of false positives (α)?

In the next lesson, we will explore the elegant theory behind this question with the **Neyman-Pearson Lemma**.

Up Next: The Theory of Optimal Tests: Neyman-Pearson Lemma