Chapter 5 A/B Testing
It is now widely accepted that the gold-standard technique for estimating the causal effect of a treatment (a drug, an ad, a product, …) on an outcome of interest (a disease, firm revenue, customer satisfaction, …) is A/B testing, a.k.a. randomized experiments.
We randomly split a set of subjects (patients, users, customers, …) into a treatment and a control group and give the treatment to the treatment group. This procedure ensures that, ex ante, the only expected difference between the two groups is the one caused by the treatment.
5.2 Concepts
Contamination:
One of the key assumptions in A/B testing is that there is no contamination between the treatment and control groups.
Giving a drug to one patient in the treatment group does not affect the health of patients in the control group.
This might not be the case, for example, if we are trying to cure a contagious disease and the two groups are not isolated.
In industry, frequent violations of the no-contamination assumption are network effects (my utility from using a social network increases with the number of my friends on the network) and general equilibrium effects (if I improve one product, it might decrease the sales of another similar product).
For this reason, experiments are often carried out at a scale large enough that there is no contamination across groups, using units such as cities, states, or even countries. The larger scale then creates another problem: the treatment becomes more expensive. Giving a drug to 50% of patients in a hospital is much less expensive than giving a drug to 50% of cities in a country. Therefore, often only a few units are treated, but over a longer period of time.
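To make the cluster-level design concrete, here is a minimal sketch of randomizing at the city level rather than the user level; the city names, the 50/50 split, and the fixed seed are illustrative assumptions, not a prescription.

```python
import random

# Hypothetical city names; in practice these are the experimental units.
cities = ["Austin", "Boston", "Chicago", "Denver", "El Paso", "Fresno"]

random.seed(42)  # fixed seed so the assignment is reproducible
shuffled = random.sample(cities, k=len(cities))

# Treat half of the cities; the other half serves as the control group.
treated = set(shuffled[: len(cities) // 2])

for city in cities:
    group = "treatment" if city in treated else "control"
    print(f"{city}: {group}")
```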
In these settings, a very powerful method emerged around ten years ago: Synthetic Control. The idea of synthetic control is to exploit the temporal variation in the data instead of the cross-sectional variation (comparisons across time instead of across units).
This method is extremely popular in industry (e.g., at companies like Google, Uber, Facebook, Microsoft, and Amazon) because it is easy to interpret and handles a setting that often arises at large scale.
5.3 AI Summary
Studying A/B testing (also known as randomized controlled trials or RCTs) is a great way to understand experimental design in causal inference, particularly in contexts where random assignment is feasible and ethical considerations allow for manipulation of variables to observe their effects. Here’s a detailed overview to get you started:
5.3.1 A/B Testing (Randomized Controlled Trials)
A/B testing, or randomized controlled trials (RCTs), is an experimental design where participants or subjects are randomly assigned to different groups (treatment and control) to test the effectiveness of a particular intervention or treatment.
5.3.1.1 Key Concepts
- Random Assignment:
- Participants are assigned to either the treatment group (exposed to the intervention) or the control group (not exposed) through randomization. This helps ensure that any differences observed between the groups are due to the intervention rather than pre-existing differences.
- Controlled Environment:
- The experiment is conducted in a controlled environment where researchers can manipulate variables and minimize external influences that could affect the outcomes.
- Causal Inference:
- A/B testing allows researchers to make causal inferences about the effect of the intervention on the outcome variable. By comparing outcomes between the treatment and control groups, researchers can estimate the causal impact of the intervention.
5.3.1.2 Steps in A/B Testing
- Hypothesis Formulation:
- Define a clear hypothesis about the expected effect of the intervention on the outcome variable.
- Random Assignment:
- Randomly assign participants or subjects to the treatment and control groups. Randomization helps ensure that the groups are comparable on average, reducing the risk of bias.
- Implementation of Intervention:
- Implement the intervention or treatment with the treatment group while keeping conditions unchanged for the control group.
- Outcome Measurement:
- Measure the outcome of interest for both the treatment and control groups after the intervention. This could be metrics like conversion rates, satisfaction scores, or health outcomes.
- Statistical Analysis:
- Compare the outcomes between the treatment and control groups using statistical methods (e.g., t-tests, regression analysis) to determine whether there is a significant difference attributable to the intervention; a code sketch follows this list.
- Interpretation of Results:
- Interpret the results to determine whether the intervention had a causal effect on the outcome variable. Consider factors such as statistical significance, effect size, and practical significance.
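As a minimal sketch of the statistical-analysis step above, the following simulates a continuous outcome (e.g., satisfaction scores) and compares the two groups with a two-sample t-test; the group sizes, means, and standard deviation are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated outcomes: the treatment group gets a small positive shift.
control = rng.normal(loc=70.0, scale=10.0, size=500)
treatment = rng.normal(loc=72.0, scale=10.0, size=500)

# Welch's t-test (does not assume equal variances across groups).
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

print(f"mean difference: {treatment.mean() - control.mean():.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```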
5.3.1.3 Advantages of A/B Testing
Causal Inference: Allows for strong causal claims about the impact of interventions.
Control Over Variables: Researchers have control over experimental conditions, minimizing confounding factors.
Versatility: Applicable across various fields including marketing, healthcare, education, and technology.
5.3.1.4 Example Application
Imagine a company wants to test the effectiveness of two different website layouts (A and B) on user engagement:
Hypothesis: Layout B will lead to higher user engagement compared to Layout A.
Random Assignment: Users visiting the website are randomly assigned to either see Layout A or Layout B.
Outcome Measurement: Engagement metrics such as click-through rates or time spent on the website are measured for both groups.
Analysis: Statistical tests are conducted to compare engagement metrics between Layout A and Layout B (see the sketch below).
Conclusion: If Layout B shows significantly higher engagement metrics, the company may decide to implement Layout B on their website.
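A minimal sketch of the analysis step for this example, testing the difference in click-through rates with a two-sample proportions z-test from statsmodels; the click and visitor counts are made-up numbers.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical click-through data for the two layouts.
clicks = [480, 560]          # users who clicked on Layout A, Layout B
visitors = [10_000, 10_000]  # users shown each layout

# Two-sided z-test for a difference in click-through rates.
z_stat, p_value = proportions_ztest(count=clicks, nobs=visitors)

print(f"CTR A: {clicks[0] / visitors[0]:.3%}")
print(f"CTR B: {clicks[1] / visitors[1]:.3%}")
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```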
5.3.2 Concepts
5.3.2.1 Effect Size
Definition: Effect size quantifies the magnitude of the difference or relationship between variables in a study. It provides a standardized measure of the strength of an effect or phenomenon being studied, independent of sample size.
Key Points:
Standardized Measure: Effect size is expressed in standard deviation units or other standardized metrics, making it comparable across different studies.
Interpretation: A larger effect size indicates a stronger relationship or more substantial difference between groups or conditions.
Example: In an A/B test measuring website conversion rates:
- If Group A (control) has a conversion rate of 5% and Group B (treatment) has a conversion rate of 7%, the effect size could be quantified using metrics like Cohen’s d or relative risk increase to indicate the practical significance of the difference.
Cohen’s d and relative risk are commonly used effect size measures in different contexts, providing standardized ways to quantify and compare the magnitude of effects between groups or conditions in research studies. Here’s an explanation of each:
5.3.2.2 Cohen’s d
Definition: Cohen’s d is a standardized measure of effect size that indicates the difference between two means (e.g., treatment group mean and control group mean) in terms of standard deviation units.
Formula:
\[ d = \frac{\bar{X}_1 - \bar{X}_2}{s} \]
where:
- \(\bar{X}_1\) and \(\bar{X}_2\) are the means of the two groups,
- \(s\) is the pooled standard deviation of the two groups.
Interpretation: Cohen’s d values are conventionally interpreted as follows:
- Small effect: d = 0.2
- Medium effect: d = 0.5
- Large effect: d = 0.8
Example: In a study comparing the effectiveness of two teaching methods on exam scores, if the mean exam score under Method A (treatment) is 80, the mean under Method B (control) is 75, and the pooled standard deviation is 10, then Cohen’s d is \(\frac{80 - 75}{10} = 0.5\), indicating a medium effect size.
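A small sketch of this computation, assuming equal-variance groups so the pooled standard deviation comes out to 10; the group sizes are hypothetical.

```python
import math

def cohens_d(mean1: float, mean2: float, sd1: float, sd2: float,
             n1: int, n2: int) -> float:
    """Cohen's d using the pooled standard deviation of the two groups."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2)
                          / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

# Teaching-methods example: means 80 vs. 75; with sd = 10 in both groups
# the pooled sd is also 10, so d = (80 - 75) / 10 = 0.5.
print(cohens_d(80, 75, 10, 10, n1=30, n2=30))  # -> 0.5
```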
5.3.2.3 Relative Risk
Definition: Relative risk (RR) is a measure of the strength of association between a risk factor (or exposure) and an outcome in epidemiology and medical research. It compares the risk of an outcome occurring in the exposed group versus the unexposed (or control) group.
Formula: \[ RR = \frac{\text{Risk in exposed group}}{\text{Risk in unexposed group}} \]
Interpretation:
- RR = 1: indicates no association between exposure and outcome.
- RR > 1: indicates higher risk in the exposed group compared to the unexposed group.
- RR < 1: indicates lower risk in the exposed group compared to the unexposed group.
Example: In a clinical trial evaluating a new drug for heart disease, if the risk of heart attack among patients taking the new drug is 10% and among patients not taking the drug (control group) is 20%, then the relative risk is \(\frac{0.10}{0.20} = 0.5\). This means patients taking the drug have half the risk of experiencing a heart attack compared to those not taking it.
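The same calculation as a tiny helper; the risks are taken from the example above.

```python
def relative_risk(risk_exposed: float, risk_unexposed: float) -> float:
    """Ratio of outcome risk in the exposed vs. unexposed group."""
    return risk_exposed / risk_unexposed

# Heart-disease example: 10% risk on the drug vs. 20% in the control group.
print(relative_risk(0.10, 0.20))  # -> 0.5, i.e., half the risk
```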
5.3.3 Comparison and Usage
Cohen’s d: Typically used in studies comparing means of continuous variables (e.g., exam scores, reaction times) between two groups.
Relative Risk: Primarily used in studies of binary outcomes (e.g., disease incidence, event occurrence) to compare the risk of an outcome between exposed and unexposed groups.
Both measures provide valuable insights into the strength and direction of effects in research studies. The choice between Cohen’s d and relative risk depends on the nature of the data (continuous or binary) and the specific research question being addressed. Researchers often use these effect size measures alongside significance testing to provide a comprehensive assessment of the findings’ practical and statistical significance.
5.3.4 Significance
Definition: Statistical significance determines whether the observed results in a study are likely to be due to the intervention (or other factors being studied) rather than random chance. It is typically assessed through hypothesis testing.
Key Points:
- Hypothesis Testing: Statistical tests (e.g., t-tests, ANOVA, chi-square tests) evaluate whether the observed differences between groups are statistically significant.
- Threshold: Results are deemed statistically significant if the probability (p-value) of observing differences at least as large under chance alone is below a predefined significance level (commonly set at 0.05).
- Does Not Equal Importance: Statistical significance does not necessarily equate to practical or clinical significance; it only indicates the reliability of the observed effect.
Example: In a clinical trial evaluating a new drug: - If the treatment group shows a significantly lower incidence of adverse effects compared to the control group (p < 0.05), it suggests that the drug may have a beneficial effect on reducing adverse reactions.
5.3.5 Group Size
Definition: Group size refers to the number of participants or subjects included in each experimental group or condition in a study. It directly influences the statistical power and precision of the study’s results.
Key Points:
- Statistical Power: Larger group sizes generally increase the statistical power of a study, making it more likely to detect a true effect if one exists.
- Precision: Larger group sizes reduce sampling variability and increase the precision of estimates (e.g., mean values, effect sizes).
- Resource Allocation: Group size is often determined by practical considerations such as budget, time constraints, ethical considerations, and expected effect size.
Example: In an A/B test comparing two marketing strategies, if Group A consists of 1000 customers and Group B of 500 customers, the study’s power to detect differences between the groups will be influenced by the unequal group sizes.
5.3.6 Relationship Between Effect Size, Significance, and Group Size
- Effect Size and Significance: A larger effect size increases the likelihood of achieving statistical significance with smaller group sizes. Conversely, smaller effect sizes may require larger group sizes to achieve statistical significance.
- Group Size and Precision: Larger group sizes generally provide more precise estimates of effects and reduce the impact of random variability in the data.
- Balancing Factors: Researchers often balance effect size, significance level, and group size to achieve meaningful and reliable results within practical constraints.
5.3.7 Conclusion
Understanding effect size, significance, and group size is crucial for interpreting and evaluating research findings accurately. Effect size measures the magnitude of effects, significance assesses the likelihood of results being due to chance, and group size influences the study’s statistical power and precision. Together, these concepts help researchers draw meaningful conclusions and inform decision-making based on empirical evidence.
5.3.7.1 Baseline conversion rate
The baseline conversion rate is the current conversion rate for the page you are testing. Conversion rate is the number of conversions divided by the total number of visitors.
5.3.7.2 Minimum detectable effect (MDE)
After setting the baseline conversion rate, you need to decide how much change from the baseline (how big or small a lift) you want to detect. You will need less traffic to detect big changes and more traffic to detect small changes.
To demonstrate, let us use an example with a 20% baseline conversion rate and a 5% MDE. Based on these values, your experiment will be able to detect, 80% of the time, when a variation’s underlying conversion rate is actually 19% or 21% (20% ± 5% × 20%). If you try to detect differences smaller than 5%, your test is considered underpowered.
Power is a measure of how well you can distinguish the difference you are detecting from no difference at all. So running an underpowered test is the equivalent of not being able to strongly declare whether your variations are winning or losing.
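To put numbers on this example (20% baseline, 5% relative MDE, 80% power), here is a sketch using statsmodels’ power utilities; the two-sided test at α = 0.05 and the 1:1 allocation are assumptions.

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.20
mde = 0.05  # relative lift: we want to detect 20% -> 21%
target = baseline * (1 + mde)

# Cohen's h, a standardized effect size for comparing two proportions.
effect_size = proportion_effectsize(target, baseline)

# Visitors needed per variation at alpha = 0.05 and 80% power.
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80,
    ratio=1.0, alternative="two-sided",
)
print(f"visitors per variation: {n_per_group:,.0f}")
```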
5.4 Size of the Control Group
Calculating the appropriate size of the control group in an experiment involves several important factors to ensure the study has sufficient power to detect a true effect if one exists. Here are the key factors that influence control group size calculation:
- Effect Size (δ)
Effect size is a measure of the magnitude of the difference between groups or the strength of the relationship between variables. It quantifies the practical significance of the treatment effect.
Influence on Sample Size:
- A larger effect size requires a smaller sample size to detect a significant difference.
- A smaller effect size requires a larger sample size to achieve the same level of power.
- Significance Level (α)
The significance level (α) is the threshold for determining whether the observed effect is statistically significant. It represents the probability of committing a Type I error (rejecting a true null hypothesis).
Common Values:
- α is typically set at 0.05, meaning there is a 5% chance of rejecting the null hypothesis when it is true.
Influence on Sample Size:
- A lower α (e.g., 0.01) requires a larger sample size to maintain the same power, as the test becomes more stringent.
- A higher α (e.g., 0.10) allows for a smaller sample size but increases the risk of Type I errors.
- Power (1 - β)
Power is the probability of correctly rejecting a false null hypothesis. It reflects the study’s ability to detect an effect if one exists.
Common Values:
- A common target for power is 0.80, indicating an 80% chance of detecting a true effect.
Influence on Sample Size:
- Higher power (e.g., 0.90) requires a larger sample size.
- Lower power (e.g., 0.70) allows for a smaller sample size but increases the risk of Type II errors (failing to detect a true effect).
- Variability (σ)
Variability refers to the spread or dispersion of data points within a population, often measured by the standard deviation (σ).
Influence on Sample Size:
- Higher variability (greater standard deviation) requires a larger sample size to detect a significant difference.
- Lower variability allows for a smaller sample size, as the effect is easier to detect against a less noisy background.
- Allocation Ratio
Definition: The allocation ratio determines the proportion of participants assigned to the treatment group versus the control group. An equal allocation ratio (1:1) means equal numbers in both groups.
Influence on Sample Size:
- Unequal allocation ratios (e.g., 2:1 or 3:1) may be used based on study design or practical considerations but can affect the total sample size required to achieve the desired power.
- An equal allocation ratio generally provides the most statistically efficient design, minimizing the total sample size required.
- Dropout Rate
Definition: The dropout rate accounts for participants who may leave the study before its completion, affecting the effective sample size.
Influence on Sample Size:
- Anticipated dropouts should be factored into the initial sample size calculation to ensure the study retains adequate power despite participant loss.
5.4.1 Sample Size Calculation Formula
For comparing two means across two equally sized groups, the sample size for each group can be calculated using:
\[ n = 2 \left( \frac{(Z_{\alpha/2} + Z_{\beta}) \cdot \sigma}{\delta} \right)^2 \]
where:
- \(n\) is the sample size per group,
- \(Z_{\alpha/2}\) is the critical value for the desired significance level,
- \(Z_{\beta}\) is the critical value for the desired power,
- \(\sigma\) is the (common) standard deviation of the outcome,
- \(\delta\) is the minimum detectable difference between the two group means.
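A direct translation of this formula into code, using scipy for the normal critical values; σ = 10 and δ = 2 are illustrative values.

```python
from math import ceil
from scipy.stats import norm

def n_per_group(sigma: float, delta: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group sample size for comparing two means (equal group sizes)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for two-sided alpha
    z_beta = norm.ppf(power)           # critical value for desired power
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# Illustrative numbers: sd of 10 points, minimum detectable difference of 2.
print(n_per_group(sigma=10, delta=2))  # ~393 per group
```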
5.4.2 Conclusion
Determining the control group size involves considering the desired effect size, significance level, power, variability in the data, allocation ratio, and potential dropout rates. Properly calculating the control group size ensures that the study is adequately powered to detect meaningful effects, thereby enhancing the validity and reliability of the research findings.
5.4.3 Statistical Assumptions for Randomized Controlled Trials (RCTs)
Randomized Controlled Trials (RCTs) are considered the gold standard in experimental research due to their ability to minimize bias and establish causality. However, the validity of RCT results depends on several key statistical assumptions:
Randomization:
- Assumption: Participants are randomly assigned to treatment and control groups.
- Purpose: Ensures that the groups are comparable on average, reducing selection bias and balancing both known and unknown confounders.
Independence:
- Assumption: Observations are independent of each other.
- Purpose: Ensures that the outcome of one participant does not influence the outcome of another, which is crucial for valid statistical inference.
Consistency:
- Assumption: Each participant’s observed outcome equals their potential outcome under the treatment they actually received; the treatment is well-defined and delivered the same way to everyone.
- Purpose: Links the observed data to the potential outcomes, so that the comparison of groups identifies the effect of the treatment itself.
Exclusion of Confounders:
- Assumption: No confounding variables influence the treatment-outcome relationship.
- Purpose: Ensures that the observed effect is due to the treatment and not due to other external factors.
Stable Unit Treatment Value Assumption (SUTVA):
- Assumption: The potential outcomes for any participant are not affected by the treatment assignment of other participants.
- Purpose: Prevents interference between participants, ensuring that each participant’s outcome is solely a result of their treatment assignment.
No Systematic Differences in Measurement:
- Assumption: Measurement of outcomes is consistent and unbiased across treatment and control groups.
- Purpose: Ensures that outcome measures are not systematically biased by the treatment assignment.
5.4.4 Robustness Checks
Robustness checks involve testing the stability and reliability of the study’s findings under various assumptions and conditions. They help to confirm that the results are not sensitive to specific assumptions or potential biases. Key robustness checks for RCTs include:
- Sensitivity Analysis:
- Purpose: Evaluates how the results change with different assumptions or parameters.
- Method: Adjusting key assumptions or parameters (e.g., different definitions of the outcome variable) to see if the results remain consistent.
- Subgroup Analysis:
- Purpose: Examines the effect of the treatment within different subgroups of the sample.
- Method: Dividing the sample into subgroups (e.g., by age, gender, or baseline risk) and checking if the treatment effect is consistent across these groups.
- Placebo Tests:
- Purpose: Tests whether the results hold when using a placebo treatment.
- Method: Using a placebo group to confirm that the observed effects are specifically due to the treatment and not to other factors; an A/A-test sketch follows this list.
- Alternative Specifications:
- Purpose: Tests the robustness of the results to different model specifications.
- Method: Using alternative statistical models or different functional forms to ensure results are not model-dependent.
- Attrition Analysis:
- Purpose: Examines the impact of participant dropout on the study results.
- Method: Analyzing the characteristics of dropouts and conducting analyses to understand if and how attrition might bias the results.
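One common way to operationalize a placebo test in an online setting is an A/A test: analyze two groups that received identical treatment and check that “effects” appear only at the nominal false-positive rate. Below is a minimal simulation sketch; the outcome distribution, sample sizes, and number of simulations are all illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_simulations = 1_000
false_positives = 0

for _ in range(n_simulations):
    # Both "arms" are drawn from the same distribution: no true effect.
    arm_a = rng.normal(loc=50.0, scale=8.0, size=400)
    arm_b = rng.normal(loc=50.0, scale=8.0, size=400)
    _, p_value = stats.ttest_ind(arm_a, arm_b, equal_var=False)
    false_positives += p_value < alpha

# Should land near the nominal 5% rate if the analysis pipeline is sound.
print(f"false-positive rate: {false_positives / n_simulations:.1%}")
```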
5.4.5 Validation Methods
Validation methods are used to confirm the internal and external validity of the study’s findings. These methods help to ensure that the results are credible and can be generalized to other settings or populations.
- Internal Validity Checks:
- Balance Checks:
- Purpose: Ensures that randomization created comparable groups.
- Method: Comparing baseline characteristics between the treatment and control groups to check for balance (a code sketch follows at the end of this section).
- Compliance Checks:
- Purpose: Ensures participants adhere to the assigned treatment.
- Method: Analyzing adherence rates and conducting per-protocol analyses if necessary.
- External Validity Checks:
- Population Representativeness:
- Purpose: Ensures that the study sample is representative of the broader population.
- Method: Comparing sample characteristics to the target population and discussing potential generalizability limitations.
- Replication Studies:
- Purpose: Confirms the findings by replicating the study in different settings or with different populations.
- Method: Conducting similar studies in various contexts to see if the results hold.
- Model Validation:
- Cross-Validation:
- Purpose: Assesses the predictive accuracy of the statistical model.
- Method: Using techniques like k-fold cross-validation to test the model’s performance on different subsets of the data.
- Out-of-Sample Validation:
- Purpose: Ensures the model performs well on new, unseen data.
- Method: Validating the model on a separate dataset that was not used for model training.
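A minimal sketch of a balance check on simulated baseline covariates, comparing group means with t-tests; the covariate names, distributions, and group size are illustrative assumptions. Under true randomization, roughly 5% of such comparisons will come out “significant” at the 0.05 level by chance alone, so imbalance flags should be read with that in mind.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 500  # participants per group

# Simulated baseline covariates; randomization should balance these
# on average across the treatment and control groups.
covariates = {
    "age": (rng.normal(40, 12, n), rng.normal(40, 12, n)),
    "baseline_spend": (rng.normal(100, 30, n), rng.normal(100, 30, n)),
}

for name, (treated, control) in covariates.items():
    t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)
    print(f"{name}: diff = {treated.mean() - control.mean():+.2f}, "
          f"p = {p_value:.3f}")
```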
5.4.6 Conclusion
The validity of RCT findings hinges on several key assumptions, and the credibility of these results is reinforced through robustness checks and validation methods. Robustness checks test the stability of findings under different conditions, while validation methods confirm the internal and external validity of the results, ensuring they are generalizable and reliable. Understanding and addressing these factors is crucial for conducting and interpreting high-quality research.