Chapter 13 Hypothesis Testing
13.1 Concepts
13.1.1 Significance Level (α)
Definition: The significance level (α) is the probability threshold used in hypothesis testing to determine whether to reject the null hypothesis. It represents the maximum probability of incorrectly rejecting the null hypothesis when it is actually true.
Commonly Used Values:
Typical Value: α is commonly set at 0.05, meaning there is a 5% chance of incorrectly rejecting the null hypothesis.
Other Values: Researchers may choose α levels such as 0.01 or 0.10 depending on the study’s requirements and the desired balance between Type I and Type II errors.
Interpretation:
If the calculated p-value is less than or equal to α, the results are considered statistically significant.
If the p-value is greater than α, the results are not statistically significant, suggesting that the null hypothesis cannot be rejected.
13.1.2 P-Value
Definition: The p-value is the probability of obtaining test results at least as extreme as the observed results, under the assumption that the null hypothesis is true. It measures the strength of evidence against the null hypothesis.
Key Points:
Lower p-value: Indicates stronger evidence against the null hypothesis. A small p-value (typically ≤ α) suggests that the observed results are unlikely to occur if the null hypothesis is true, leading to rejection of the null hypothesis.
Higher p-value: Indicates weaker evidence against the null hypothesis. A larger p-value (typically > α) suggests that the observed results are reasonably likely to occur even if the null hypothesis is true, leading to failure to reject the null hypothesis.
Interpretation:
p ≤ α: The results are statistically significant, suggesting that the observed effect is unlikely due to chance alone.
p > α: The results are not statistically significant, suggesting that the observed effect could reasonably occur due to chance.
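As a concrete illustration, the sketch below applies this decision rule with a one-sample t-test (assuming Python with numpy and scipy available; the data are simulated purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05  # significance level

# Simulated sample whose true mean is 0.5, tested against H0: mean = 0
sample = rng.normal(loc=0.5, scale=1.0, size=30)

t_stat, p_value = stats.ttest_1samp(sample, popmean=0)

if p_value <= alpha:
    print(f"p = {p_value:.4f} <= {alpha}: reject H0 (statistically significant)")
else:
    print(f"p = {p_value:.4f} > {alpha}: fail to reject H0")
```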
Type I and Type II errors are concepts used in hypothesis testing and statistical decision-making, describing the errors that can occur when making conclusions about the population based on sample data. Here’s an explanation of each:
13.1.3 Type I Error
Definition: A Type I error occurs when the null hypothesis (H₀) is incorrectly rejected, even though it is actually true. In other words, it represents the situation where the researcher concludes there is an effect or relationship when, in fact, there is no such effect or relationship in the population.
Probability and Significance Level:
- The probability of committing a Type I error is equal to the significance level (α) chosen for the hypothesis test.
- For example, if α is set at 0.05, there is a 5% chance of mistakenly rejecting the null hypothesis when it is true.
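A quick way to see that the Type I error rate equals α is by simulation: generate many samples for which the null hypothesis is true and count how often the test rejects it. A minimal sketch (numpy and scipy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_sims = 0.05, 10_000

# H0 is true: every sample really comes from a distribution with mean 0
rejections = 0
for _ in range(n_sims):
    sample = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p = stats.ttest_1samp(sample, popmean=0)
    rejections += p <= alpha

# The empirical rejection rate should be close to alpha (0.05)
print(f"Empirical Type I error rate: {rejections / n_sims:.3f}")
```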
13.1.4 Type II Error
Definition: A Type II error occurs when the null hypothesis (H₀) is incorrectly not rejected, even though it is false. It represents the situation where the researcher fails to detect an effect or relationship that actually exists in the population.
Probability and Power:
- The probability of committing a Type II error is denoted as β (beta).
- Power (1 - β) represents the probability of correctly rejecting the null hypothesis when it is false.
- The Type II error rate is influenced by factors such as sample size, effect size, and variability in the data.
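The Type II error rate can be estimated the same way: simulate data under a specific alternative (so that the null hypothesis is false) and count how often the test fails to reject. A minimal sketch, assuming a true mean of 0.5 as the illustrative alternative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n_sims = 0.05, 10_000

# H0 is false: the true mean is 0.5, but we still test H0: mean = 0
failures = 0
for _ in range(n_sims):
    sample = rng.normal(loc=0.5, scale=1.0, size=30)
    _, p = stats.ttest_1samp(sample, popmean=0)
    failures += p > alpha  # failing to reject a false H0

beta = failures / n_sims
print(f"Estimated Type II error rate (beta): {beta:.3f}")
print(f"Estimated power (1 - beta): {1 - beta:.3f}")
```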
13.1.5 Relationship Between Type I and Type II Errors
- Inverse Relationship: As the significance level (α) decreases (making Type I errors less likely), the probability of committing a Type II error (β) tends to increase, and vice versa.
- Balancing Act: Researchers aim to strike a balance between Type I and Type II errors depending on the context and consequences of each error type.
- Power Analysis: Conducted to determine an appropriate sample size to minimize both types of errors and maximize the likelihood of detecting real effects.
13.1.6 Importance in Research and Decision Making
- Statistical Rigor: Understanding and controlling Type I and Type II error rates are essential for maintaining the integrity and reliability of research findings.
- Impact: Errors can have significant implications in fields such as medicine, psychology, economics, and policy-making, influencing decisions based on study results.
In summary, Type I and Type II errors are critical concepts in hypothesis testing, highlighting the trade-offs and risks involved in drawing conclusions from sample data about the population. Proper consideration and calculation of these errors are vital for ensuring valid and meaningful research outcomes.
13.1.7 Statistical power
Statistical power refers to the probability that a hypothesis test correctly rejects the null hypothesis when it should be rejected. It is denoted as \(1 - \beta\), where \(\beta\) is the probability of a Type II error (failing to reject the null hypothesis when it is false).
A commonly accepted level of statistical power is 0.80, which corresponds to a Type II error rate (β) of 0.20.
Achieving sufficient statistical power is crucial for obtaining reliable and meaningful results in research. Sample size plays a critical role in determining statistical power.
For instance, when comparing two means, the calculation of statistical power can be based on factors such as the significance level (α), effect size (δ), and sample size (n).
In summary, statistical power measures the ability of a study to detect a true effect or relationship, and it depends on various factors that should be carefully considered during the design and interpretation of experiments.
The formula for calculating statistical power is given by:
\[ \text{Power} = 1 - \beta = 1 - P(\text{Type II Error}) = P(\text{Reject } H_0 | H_0 \text{ is false}) \]
where:
- \(\beta\) is the probability of committing a Type II error;
- \(H_0\) is the null hypothesis.
Increasing the sample size or the effect size generally increases statistical power, while lowering the significance level (α) makes a Type II error more likely and therefore reduces power.
As sample size increases, the statistical power increases. Therefore, for our test to have desirable statistical power (usually 0.80), we want to estimate the minimum sample size required.
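As a sketch of such a calculation (assuming the statsmodels package is available; the effect size of 0.5 is illustrative), the minimum per-group sample size for a two-sample t-test can be solved for directly:

```python
import math
from statsmodels.stats.power import TTestIndPower

# Solve for the per-group sample size needed to detect a standardized
# effect size (Cohen's d) of 0.5 at alpha = 0.05 with power = 0.80
analysis = TTestIndPower()
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                         alternative='two-sided')
print(f"Minimum sample size per group: {math.ceil(n)}")
```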
13.2 Null Hypothesis
The debate between Fisher and Neyman on the null hypothesis in causal inference revolves around different interpretations and implications of the null hypothesis in the context of randomized experiments.
13.2.1 Fisher’s Sharp Null Hypothesis
- Sharp Null Hypothesis: Fisher’s sharp null hypothesis is the assertion that every single unit in the population has a treatment effect of zero.
- Implication: Under this null hypothesis, the treatment has no effect whatsoever on any unit. This is a very strong statement, implying that the treatment effect is exactly zero for all individuals.
- Testing: This allows for a precise and exact test of the null hypothesis because it specifies the potential outcomes for every unit. In this scenario, you can use randomization tests to calculate exact p-values by comparing observed outcomes to those expected under complete nullification of treatment effects.
13.2.2 Neyman’s Null Hypothesis
- Average Treatment Effect (ATE) Null Hypothesis: Neyman, in contrast, proposed a weaker form of the null hypothesis, which asserts that the average treatment effect (ATE) across all units is zero.
- Implication: This means that, on average, the treatment does not have an effect, but it allows for the possibility that some units could have positive treatment effects while others have negative effects, as long as they cancel out in the aggregate.
- Testing: Testing this hypothesis typically involves estimating the average treatment effect and assessing its statistical significance. This approach does not specify the exact potential outcomes for each unit, making it more general but less powerful for exact testing.
13.2.3 Key Differences
- Stringency:
- Fisher’s sharp null hypothesis is more stringent because it makes a precise statement about every unit’s treatment effect being zero.
- Neyman’s null hypothesis is less stringent because it only concerns the average effect across the population.
- Testing Methodologies:
- Under Fisher’s sharp null, one can use permutation or randomization tests to obtain exact p-values, as the null hypothesis specifies the exact distribution of outcomes under no treatment effect.
- Under Neyman’s null, one typically relies on estimations and asymptotic properties to test the significance of the estimated ATE. This involves confidence intervals and standard errors.
- Implications for Experimental Design:
- Fisher’s approach allows for stronger conclusions in terms of causality for each unit but requires stronger assumptions.
- Neyman’s approach provides a broader inference about the average effect, which is often more realistic in practical scenarios where treatment effects can vary across units.
13.2.4 Example to Illustrate the Difference
Suppose we conduct an experiment to test the effect of a new drug on blood pressure. We have two groups: a treatment group and a control group.
Fisher’s Sharp Null Hypothesis: The null hypothesis here would state that the new drug has no effect on blood pressure for every individual in the treatment group. This means if an individual’s blood pressure would be 120 without the drug, it remains 120 with the drug. If we observe a difference between the treatment and control groups, we can use randomization tests to see if this difference is likely to occur by chance under the sharp null.
Neyman’s ATE Null Hypothesis: The null hypothesis here would state that the average change in blood pressure due to the drug across all individuals is zero. This allows for some individuals to experience a decrease in blood pressure and others an increase, as long as these changes average out to zero. Here, we would estimate the ATE and test if it is significantly different from zero using statistical inference methods.
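The contrast can also be made concrete in code. In the sketch below (numpy and scipy assumed; the blood-pressure changes are invented for illustration), a randomization test addresses Fisher's sharp null, while a two-sample t-test targets Neyman's ATE null:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical changes in blood pressure (negative = reduction)
treated = np.array([-8.0, -5.0, -12.0, -3.0, -9.0, -6.0])
control = np.array([-1.0,  2.0,  -2.0,  0.0,  1.0, -3.0])

obs_diff = treated.mean() - control.mean()

# Fisher: randomization test of the sharp null (zero effect for every unit)
pooled = np.concatenate([treated, control])
n_treat, n_perm = len(treated), 10_000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:n_treat].mean() - shuffled[n_treat:].mean()
p_fisher = np.mean(np.abs(perm_diffs) >= abs(obs_diff))

# Neyman: t-test of the null that the average treatment effect is zero
t_stat, p_neyman = stats.ttest_ind(treated, control, equal_var=False)

print(f"Randomization-test p-value (sharp null): {p_fisher:.4f}")
print(f"t-test p-value (ATE null):               {p_neyman:.4f}")
```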
13.2.5 Conclusion
The debate between Fisher and Neyman highlights a fundamental difference in how causal effects are conceptualized and tested in statistics. Fisher’s sharp null hypothesis allows for precise testing with exact p-values but requires stronger assumptions, while Neyman’s ATE null hypothesis is more flexible and realistic but relies on estimation and inference methods that are less precise in defining individual treatment effects. Understanding both approaches provides a comprehensive view of hypothesis testing in the context of causal inference.
13.3 Permutation Tests
A permutation test (also known as a randomization test or re-randomization test) is a method used in econometrics and statistics to test the null hypothesis when the assumptions required for traditional parametric tests (such as the t-test) may not hold, particularly with small sample sizes or when the distribution of the test statistic under the null hypothesis is complex or unknown.
13.3.1 When and Why to Use Permutation Tests
- Non-Parametric Nature:
- When: Permutation tests are useful when the data do not necessarily meet the assumptions of parametric tests, such as normality or homogeneity of variance.
- Why: Because they are non-parametric, permutation tests do not rely on the underlying distribution of the data, making them more robust in certain situations.
- Small Sample Sizes:
- When: They are particularly valuable when dealing with small sample sizes where the Central Limit Theorem may not apply, and thus the sampling distribution of the test statistic under the null hypothesis is not well approximated by a normal distribution.
- Why: In small samples, traditional methods like the t-test may not be reliable. Permutation tests use the actual data to construct the distribution of the test statistic, which can provide a more accurate p-value.
- Exact Test:
- When: When you need an exact test rather than an approximate one.
- Why: Permutation tests generate the exact distribution of the test statistic under the null hypothesis by considering all possible reassignments of treatment labels, ensuring an accurate p-value.
- Complex Experimental Designs:
- When: In complex experimental designs where the structure of the data or the treatment assignment mechanism does not fit well with the assumptions of standard tests.
- Why: Permutation tests are flexible and can be adapted to a wide variety of experimental designs and data structures.
13.3.2 How to Perform a Permutation Test
Here’s a step-by-step outline of the permutation test procedure:
- Calculate the Original Test Statistic:
- Compute the test statistic (e.g., the difference in means between treatment and control groups) for the observed data.
- Drop the Treatment Variable:
- Remove the treatment labels from the data, essentially pooling all the data together.
- Reassign Treatment Labels:
- Randomly reassign the treatment labels to the data, ensuring the same number of treatment and control units as in the original data.
- Calculate the Test Statistic for the New Assignment:
- Compute the test statistic for this new random assignment of treatment labels.
- Repeat the Process:
- Repeat the re-randomization and calculation of the test statistic many times (e.g., 1,000 or more) to build a distribution of the test statistic under the null hypothesis.
- Create the Empirical Distribution:
- Collect all the computed test statistics from the repeated random assignments to form an empirical distribution of the test statistic under the null hypothesis.
- Calculate the Empirical P-Value:
- Compare the original test statistic to this empirical distribution. The p-value is the proportion of test statistics in the empirical distribution that are as extreme as, or more extreme than, the original test statistic.
13.3.3 Example Calculation
1. Observed Data: Suppose you have two groups, treatment (\(Y_T\)) and control (\(Y_C\)), and you calculate the observed difference in means, \(\Delta_{obs} = \bar{Y}_T - \bar{Y}_C\).
2. Reassign Labels: Randomly shuffle the combined data and reassign the treatment and control labels.
3. Compute New Statistic: Calculate the new difference in means for this re-randomized data, \(\Delta_{rand}\).
4. Repeat: Perform steps 2 and 3, say 1,000 times, to get a distribution of \(\Delta_{rand}\).
5. Compare: Determine where \(\Delta_{obs}\) falls within the distribution of \(\Delta_{rand}\). If \(\Delta_{obs}\) is in the extreme tails of this distribution, it suggests that the observed effect is unlikely to have occurred by random chance.
6. P-Value: Calculate the p-value as the proportion of \(\Delta_{rand}\) values that are as extreme as, or more extreme than, \(\Delta_{obs}\).
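This example maps directly to code. A minimal implementation (numpy assumed; the group data are placeholders):

```python
import numpy as np

def permutation_test(y_treat, y_ctrl, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = np.random.default_rng(seed)
    obs = y_treat.mean() - y_ctrl.mean()        # observed statistic
    pooled = np.concatenate([y_treat, y_ctrl])  # pool the data (drop labels)
    n_t = len(y_treat)
    perm_stats = np.empty(n_perm)
    for i in range(n_perm):                     # re-randomize repeatedly
        shuffled = rng.permutation(pooled)
        perm_stats[i] = shuffled[:n_t].mean() - shuffled[n_t:].mean()
    # Empirical p-value: share of permuted statistics at least as extreme
    return np.mean(np.abs(perm_stats) >= abs(obs))

# Illustrative data
y_t = np.array([5.1, 6.2, 7.0, 5.8, 6.5])
y_c = np.array([4.2, 5.0, 4.8, 5.3, 4.6])
print(f"Empirical p-value: {permutation_test(y_t, y_c):.4f}")
```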
13.3.4 Conclusion
Permutation tests are a powerful tool for testing hypotheses in situations where traditional assumptions may not hold. By using the actual data to generate the null distribution of the test statistic, permutation tests provide a robust, non-parametric method for hypothesis testing, ensuring accurate p-values even in complex or small-sample scenarios.
13.4 Fisher’s Exact Test
Fisher’s Exact Test is a statistical test used to determine if there are nonrandom associations between two categorical variables. It is particularly useful when sample sizes are small, and the assumptions of the chi-square test (like the expected frequency in each cell being at least 5) are not met. The test is named after the famous statistician Ronald A. Fisher.
13.4.1 When to Use Fisher’s Exact Test
- Small Sample Sizes: Fisher’s Exact Test is often used when dealing with small sample sizes, where the chi-square test may not be appropriate.
- Categorical Data: It is used for categorical data to test for independence between two variables.
- 2x2 Contingency Tables: While the test can be extended to larger tables, it is most commonly applied to 2x2 contingency tables.
13.4.2 How Fisher’s Exact Test Works
13.4.2.1 Example
Consider the following 2x2 contingency table:
|         | Treatment | Control | Total |
|---------|-----------|---------|-------|
| Success | a         | b       | a + b |
| Failure | c         | d       | c + d |
| Total   | a + c     | b + d   | n     |
Where:
- \(a\): Number of successes in the treatment group
- \(b\): Number of successes in the control group
- \(c\): Number of failures in the treatment group
- \(d\): Number of failures in the control group
- \(n\): Total number of observations
13.4.2.2 Calculating the P-Value
Fisher’s Exact Test calculates the exact probability of obtaining a distribution of values in the contingency table given the observed marginal totals. The p-value is computed by summing the probabilities of all possible tables that have the same row and column totals as the observed table and have a test statistic as extreme as, or more extreme than, the observed one.
The probability of any particular table is given by the hypergeometric distribution:
\[ P = \frac{\binom{a+b}{a} \binom{c+d}{c}}{\binom{n}{a+c}} \]
Where \(\binom{n}{k}\) is the binomial coefficient, the number of ways to choose \(k\) items from \(n\).
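This probability is straightforward to compute directly, as in the following sketch using Python's built-in math.comb:

```python
from math import comb

def table_probability(a, b, c, d):
    """Hypergeometric probability of a 2x2 table with fixed margins."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

# For example, the table a=8, b=2, c=1, d=5 has probability ~0.0236
print(f"{table_probability(8, 2, 1, 5):.4f}")
```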
13.4.3 Step-by-Step Example
Let’s say we have the following data from a clinical trial:
|              | Treatment | Control |
|--------------|-----------|---------|
| Improved     | 8         | 2       |
| Not Improved | 1         | 5       |
We want to test if there is a significant association between the treatment and the improvement.
Construct the Contingency Table:
|              | Treatment | Control | Total |
|--------------|-----------|---------|-------|
| Improved     | 8         | 2       | 10    |
| Not Improved | 1         | 5       | 6     |
| Total        | 9         | 7       | 16    |

Calculate the P-Value:
The p-value for Fisher’s Exact Test is the sum of the probabilities of all tables that are as extreme as or more extreme than the observed table, given the marginal totals.
For the given table:
\[ P = \frac{\binom{10}{8} \binom{6}{1}}{\binom{16}{9}} = \frac{45 \times 6}{11440} \approx 0.0236 \]
This is the probability of the observed table alone. The only table more extreme in the same direction, given the margins, has 9 improved in the treatment group; its probability is \(\binom{10}{9}\binom{6}{0}/\binom{16}{9} = 10/11440 \approx 0.0009\). The one-sided p-value is therefore \(0.0236 + 0.0009 \approx 0.0245\).
Interpret the Result:
Since the p-value (about 0.0245) is less than the chosen significance level (commonly 0.05), we reject the null hypothesis and conclude that there is a significant association between the treatment and the improvement.
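The hand calculation can be checked with scipy (assuming it is installed). The one-sided p-value matches the sum above, while the default two-sided test sums the probabilities of all tables as improbable as the observed one:

```python
from scipy.stats import fisher_exact

table = [[8, 2],
         [1, 5]]

# One-sided test: odds ratio > 1 (treatment improves outcomes)
odds_ratio, p_one_sided = fisher_exact(table, alternative='greater')
print(f"One-sided p-value: {p_one_sided:.4f}")  # ~0.0245

# Two-sided test (the default)
odds_ratio, p_two_sided = fisher_exact(table)
print(f"Two-sided p-value: {p_two_sided:.4f}")
```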
13.4.4 Conclusion
Fisher’s Exact Test is a powerful tool for analyzing contingency tables, especially when sample sizes are small. It provides an exact p-value for the test of independence between two categorical variables, making it a preferred choice when the assumptions of the chi-square test are not satisfied. By using this test, researchers can make accurate inferences about the relationships between categorical variables even in studies with limited data.