Chapter 3 Matching

3.1 Subclassification

Subclassification is a method for satisfying the backdoor criterion by adjusting differences in means with stratum-specific weights. These weights are chosen so that the strata distribution used in the comparison matches the counterfactual's strata distribution.

This method applies when treatment assignment is random conditional on observables. In mathematical notation, the assumption is:

\[ [Y^0, Y^1] \perp D \mid X \]

This implies that, given covariates \(X\), the treatment assignment \(D\) is independent of the potential outcomes (random).

Treatment can be assigned conditionally on covariates. For example, a state might first choose schools and then randomly assign students to one of three classes within each chosen school.

3.1.1 Example

Consider a study investigating the impact of smoking type on mortality. Without adjusting for age, cigar and pipe smokers appear to have higher mortality rates, which is counterintuitive. However, age is a crucial factor that influences both the type of smoking chosen (treatment) and the mortality rate (outcome), making it a confounder.

Our strategy for addressing this covariate imbalance is to condition on age, so that the age distribution is comparable between the treatment and control groups.

3.1.2 Step-by-Step Example

  1. Bin Age into Groups: Create age groups (e.g., 18-30, 31-45, 46-60, 61+).

  2. Calculate Percent Distribution: Determine the percentage of individuals in each age group for both treatment and control groups.

  3. Weighted Mortality Rate:

    • Calculate the mortality rate for each age group within each treatment group.
    • Use a common set of age-group weights (for example, the treatment group’s age distribution) to compute an age-adjusted weighted average mortality rate for each group.

3.1.2.1 Implementation

Suppose we have the following data:

| Age Group | Treatment Share | Control Share | Treatment Deaths | Control Deaths | Treatment Total | Control Total |
|-----------|-----------------|---------------|------------------|----------------|-----------------|---------------|
| 18-30     | 20%             | 25%           | 10               | 15             | 100             | 150           |
| 31-45     | 30%             | 20%           | 20               | 10             | 150             | 100           |
| 46-60     | 25%             | 30%           | 30               | 20             | 125             | 150           |
| 61+       | 25%             | 25%           | 40               | 30             | 125             | 125           |
  1. Calculate Age-Specific Mortality Rates:

    \[ \text{Mortality Rate (18-30)} = \frac{10}{100} = 10\%, \quad \text{Control Mortality Rate (18-30)} = \frac{15}{150} = 10\% \]

    (Repeat for other age groups.)

  2. Calculate Weighted (Age-Adjusted) Mortality Rates, applying the same weights (here, the treatment group’s age distribution) to both groups’ age-specific rates:

    \[ \text{Weighted Mortality Rate (Treatment)} = 0.20 \times 10\% + 0.30 \times 13.3\% + 0.25 \times 24\% + 0.25 \times 32\% = 20\% \]

    \[ \text{Weighted Mortality Rate (Control)} = 0.20 \times 10\% + 0.30 \times 10\% + 0.25 \times 13.3\% + 0.25 \times 24\% \approx 14.3\% \]

    The difference between these age-adjusted rates (about 5.7 percentage points) is the subclassification estimate of the effect.
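A minimal R sketch of this calculation, using the numbers from the table above:

# age-specific counts from the table
deaths_t <- c(10, 20, 30, 40); total_t <- c(100, 150, 125, 125)
deaths_c <- c(15, 10, 20, 30); total_c <- c(150, 100, 150, 125)
w_t <- c(0.20, 0.30, 0.25, 0.25)      # treatment group's age distribution (common weights)

rate_t <- deaths_t / total_t          # age-specific mortality rates, treatment
rate_c <- deaths_c / total_c          # age-specific mortality rates, control

sum(w_t * rate_t)                     # age-adjusted treatment rate, ~0.20
sum(w_t * rate_c)                     # age-adjusted control rate, ~0.143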

3.1.3 Considerations

  • Choice of Variables: Deciding which variables to use for adjustment can be challenging. Including too many variables can lead to the “curse of dimensionality,” where the data becomes too sparse in higher dimensions.

  • Common Support Assumption: This assumption requires that for each stratum, there exist observations in both the treatment and control groups. If the sample size is small, this assumption may be violated, making it difficult to compare groups effectively.

By carefully choosing variables and ensuring sufficient sample sizes within strata, subclassification can effectively adjust for covariate imbalances and yield more accurate estimates of treatment effects.

3.2 Exact Matching

Exact matching is a method used to estimate causal effects by pairing units in the treatment group with units in the control group that have identical values for certain covariates. This method helps in comparing the outcomes of similar units under different treatments to infer causal effects.

Why?

Because the independence assumption is violated: treatment assignment is not random, so a simple comparison of means between treated and control units would be biased.

3.2.1 Explanation

Suppose we have a treatment group and a control group, and we want to estimate the treatment effect by finding exact matches based on a covariate.

  1. Matching Process:

    • If a unit in the treatment group has a covariate value of 2, we look for a unit in the control group with the same covariate value of 2.

    • If we find such a match, we use the outcome of the control unit to impute the counterfactual outcome for the treatment unit.

  2. Handling Multiple Matches:

    • If there are multiple control units with the same covariate value as the treatment unit, we take the average of those control units’ outcomes to impute the counterfactual for the treatment unit.
  3. Calculating the Average Treatment Effect (ATE):

    • By imputing counterfactual outcomes for each unit in both the control and treatment groups based on matching covariates, we can calculate the ATE.
  4. Calculating the Average Treatment Effect on the Treated (ATT):

    • Typically, we find exact matching control units for treatment units and calculate the ATT. This involves comparing only the matched pairs.

    • In a typical application, the control group is much larger than the treatment group, which makes it easier to find comparable control units for each treated unit within that larger pool.

3.2.2 Example

Let’s consider an example where we are studying the effect of a new teaching method on student performance. We have two groups: students who received the new teaching method (treatment group) and students who received the traditional method (control group). We will use the exact matching method based on a covariate, such as prior test scores.

3.2.2.1 Step-by-Step Process

  1. Identify Covariate:

    • Prior test scores are used as the matching covariate.
  2. Exact Matching:

    • Suppose a student in the treatment group has a prior test score of 85.
    • We look for students in the control group with a prior test score of 85.
    • If we find multiple students in the control group with a prior test score of 85, we average their outcomes.
  3. Impute Counterfactuals:

    • For the treatment student with a prior test score of 85, we use the average outcome of the matched control students to impute the counterfactual outcome.
  4. Create Matched Sample:

    • The matched sample consists of pairs of treatment and control units with the same covariate value.

    • For example, if we have another treatment student with a prior test score of 90, we find control students with a prior test score of 90 and repeat the process.

  5. Calculate ATT:

    • For each matched pair, we calculate the difference in outcomes.

    • Average these differences to obtain the ATT.

3.2.2.2 Example Data

| Student | Group     | Prior Test Score | Outcome (Final Score) |
|---------|-----------|------------------|-----------------------|
| A       | Treatment | 85               | 90                    |
| B       | Treatment | 90               | 88                    |
| C       | Control   | 85               | 85                    |
| D       | Control   | 90               | 86                    |
| E       | Control   | 85               | 87                    |
  • For Student A (Treatment, 85): Match with Students C and E (Control, 85). Average outcome: (85 + 87) / 2 = 86. Imputed counterfactual for A: 86.
  • For Student B (Treatment, 90): Match with Student D (Control, 90). Imputed counterfactual for B: 86.

3.2.2.3 ATT Calculation

  • Difference for Student A: 90 - 86 = 4
  • Difference for Student B: 88 - 86 = 2

\[ \text{ATT} = \frac{4 + 2}{2} = 3 \]
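A minimal R sketch of this exact-matching ATT calculation on the toy data above:

# toy data from the table in 3.2.2.2
scores <- data.frame(
  student = c("A", "B", "C", "D", "E"),
  treat   = c(1, 1, 0, 0, 0),
  prior   = c(85, 90, 85, 90, 85),
  final   = c(90, 88, 85, 86, 87)
)

treated  <- subset(scores, treat == 1)
controls <- subset(scores, treat == 0)

# impute each treated student's counterfactual as the mean final score of
# control students with exactly the same prior score
cf <- sapply(treated$prior, function(p) mean(controls$final[controls$prior == p]))

mean(treated$final - cf)              # ATT = (4 + 2) / 2 = 3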

3.2.3 Conclusion

Exact matching helps in creating a comparable control group for each treatment unit based on covariates. By doing so, we can more accurately estimate the causal effect of the treatment.

However, this method requires sufficient overlap between the covariate distributions of the treatment and control groups, and the common support assumption must be satisfied.

3.3 Approximate Matching Methods

When you have multiple covariates to match or do not have exact matches, you can use approximate matching methods to find the best possible matches.

3.3.1 Nearest Neighbor Covariate Matching

When the number of matching covariates exceeds one, we need a new definition of distance to measure closeness between units. Multiple covariates not only introduce the curse-of-dimensionality problem but also complicate the measurement of distance. This poses challenges for finding a good match in the data and demands a large sample size for the matching discrepancies to be trivially small.

3.3.1.1 Euclidean Distance

  • Definition: Euclidean distance is a common measure of distance between two points in a multi-dimensional space.

  • Problem: The distance measure depends on the scale of the variables, which can distort the true closeness between points.

3.3.1.2 Normalized Euclidean Distance

  • Definition: This is the Euclidean distance normalized by the variance of the variables.

  • Advantage: Normalizing by variance adjusts for differences in scale among the covariates, making the distance measure more accurate.

3.3.1.3 Mahalanobis Distance

  • Definition: Mahalanobis distance is a scale-invariant distance metric that takes into account the correlations between variables.

  • Advantage: It adjusts for the scale and correlations of the covariates, providing a more accurate measure of distance.

3.3.2 Example

Suppose we are studying the impact of a job training program on employment outcomes. We have multiple covariates, such as age, education level, and prior work experience. We want to use approximate matching to find control units that are similar to the treatment units based on these covariates.

3.3.2.1 Step-by-Step Process

  1. Identify Covariates:
    • Age
    • Education Level
    • Prior Work Experience
  2. Calculate Distances:
    • Euclidean Distance:

      \[\text{Distance} = \sqrt{(X_1 - Y_1)^2 + (X_2 - Y_2)^2 + \cdots + (X_n - Y_n)^2}\]

    • Normalized Euclidean Distance:

      \[\text{Distance} = \sqrt{\left(\frac{X_1 - Y_1}{\sigma_1}\right)^2 + \left(\frac{X_2 - Y_2}{\sigma_2}\right)^2 + \cdots + \left(\frac{X_n - Y_n}{\sigma_n}\right)^2}\]

    • Mahalanobis Distance:

      \[\text{Distance} = \sqrt{(X - Y)^T S^{-1} (X - Y)}\]

where \(S\) is the covariance matrix of the covariates.

  3. Find Nearest Neighbors:

    • For each treatment unit, calculate the distance to all control units using the chosen distance metric.

    • Select the control unit with the smallest distance as the match for the treatment unit.

  4. Calculate Treatment Effect:

    • Compare the outcomes of matched pairs to estimate the treatment effect.

3.3.3 Hypothetical Data

| Unit | Group     | Age | Education Level | Prior Work Experience | Outcome    |
|------|-----------|-----|-----------------|-----------------------|------------|
| 1    | Treatment | 25  | Bachelor’s      | 2 years               | Employed   |
| 2    | Treatment | 30  | Master’s        | 5 years               | Employed   |
| 3    | Control   | 26  | Bachelor’s      | 1 year                | Unemployed |
| 4    | Control   | 29  | Master’s        | 6 years               | Employed   |

3.3.3.1 Matching Process

  1. Calculate Normalized Euclidean Distances:

    • Normalize the covariates by their variances.

    • Compute distances between each treatment unit and all control units.

  2. Match Units:

    • Match Unit 1 (Treatment) with Unit 3 (Control) based on the smallest normalized Euclidean distance.

    • Match Unit 2 (Treatment) with Unit 4 (Control) based on the smallest normalized Euclidean distance.

  3. Estimate Treatment Effect:

    • Compare outcomes of matched pairs:
      • Unit 1 (Treatment) vs. Unit 3 (Control): Employed vs. Unemployed
      • Unit 2 (Treatment) vs. Unit 4 (Control): Employed vs. Employed

By using approximate matching methods like nearest neighbor matching with normalized Euclidean or Mahalanobis distances, we can more accurately estimate the treatment effect even when dealing with multiple covariates and the lack of exact matches.
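As an illustration, here is a minimal R sketch of the normalized Euclidean distance step on the hypothetical data above, with education recoded numerically (Bachelor’s = 1, Master’s = 2) purely as an assumption for the example:

# covariates for Units 1-4 (Units 1-2 treated, Units 3-4 control)
X <- data.frame(
  age   = c(25, 30, 26, 29),
  educ  = c(1, 2, 1, 2),
  exper = c(2, 5, 1, 6)
)
treat <- c(1, 1, 0, 0)

Z <- scale(X, center = FALSE, scale = apply(X, 2, sd))  # divide each covariate by its SD
D <- as.matrix(dist(Z))                                 # pairwise normalized Euclidean distances

D[treat == 1, treat == 0]                               # treated-to-control distances

With these values, Unit 1 is closest to Unit 3 and Unit 2 is closest to Unit 4, which reproduces the matches described above.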

3.4 Propensity Score Methods

Propensity score methods are approximate matching techniques that use propensity scores as distance metrics. These methods offer several ways to perform matching based on propensity scores.

Propensity score matching (PSM) is a widely used method, particularly in medical sciences, for addressing selection on observables. However, it has not been as widely adopted among economists as other non-experimental methods like regression discontinuity or difference-in-differences. This reluctance is largely due to skepticism about the conditional independence assumption (CIA) being achievable in any dataset. Economists are often more concerned about selection on unobservables than selection on observables, making them less likely to use matching methods.

3.4.1 Concept

Propensity score matching is used when a conditioning strategy can satisfy the backdoor criterion.

The method involves estimating a model (usually logit or probit) to predict the conditional probability of treatment based on covariates.

The predicted values from this estimation, called propensity scores, collapse the covariates into a single scalar. Comparisons between the treatment and control groups are then based on these propensity scores.

3.4.2 Steps

  1. Estimate Propensity Scores: Use a logit or probit model to estimate the probability of receiving the treatment based on observed covariates.

  2. Match Units: Match treatment units with control units that have similar propensity scores.

  3. Assess Overlap: Ensure that there is common support, meaning there are units in both treatment and control groups across the range of propensity scores.

  4. Calculate Treatment Effect: Compare outcomes between matched treatment and control units to estimate the treatment effect.

3.4.3 Example

Suppose we are studying the effect of a job training program on employment outcomes. We have the following covariates: age, education level, and prior work experience. We will use propensity score matching to estimate the effect of the program.

3.4.3.1 Step-by-Step Process

  1. Estimate Propensity Scores:
    • Fit a logit model to predict the probability of receiving the job training based on age, education level, and prior work experience.

Example logit model: \[P(Treatment) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot Age + \beta_2 \cdot Education + \beta_3 \cdot Experience)}} \]

  2. Calculate Propensity Scores:

Use the fitted logit model to calculate the propensity score for each unit.

  3. Match Units:
  • Match each treatment unit with one or more control units that have similar propensity scores.

  • Example matching method: Nearest neighbor matching.

NN matching is greedy in the sense that each pairing occurs without reference to how other units will be or have been paired, and therefore does not aim to optimize any criterion. Nearest neighbor matching is the most common form of matching used.

For large datasets (i.e., in 10,000s to millions), some matching methods will be too slow to be used at scale. Instead, users should consider generalized full matching, subclassification, or coarsened exact matching, which are all very fast and designed to work with large datasets.

[Nice article on MatchIt](https://cran.r-project.org/web/packages/MatchIt/vignettes/matching-methods.html)

  4. Assess Overlap:
  • Check for common support to ensure there is overlap in propensity scores between the treatment and control groups.

  • If there is no overlap, adjust the model or consider other methods.

  5. Calculate Treatment Effect:
  • Compare employment outcomes between matched treatment and control units.

  • Calculate the average treatment effect on the treated (ATT).

3.4.3.2 Example Data

| Unit | Group     | Age | Education Level | Prior Work Experience | Outcome    | Propensity Score |
|------|-----------|-----|-----------------|-----------------------|------------|------------------|
| 1    | Treatment | 25  | Bachelor’s      | 2 years               | Employed   | 0.70             |
| 2    | Treatment | 30  | Master’s        | 5 years               | Employed   | 0.85             |
| 3    | Control   | 26  | Bachelor’s      | 1 year                | Unemployed | 0.65             |
| 4    | Control   | 29  | Master’s        | 6 years               | Employed   | 0.80             |

Match Units:

  • Match Unit 1 (Treatment) with Unit 3 (Control) based on similar propensity scores (0.70 vs. 0.65).

  • Match Unit 2 (Treatment) with Unit 4 (Control) based on similar propensity scores (0.85 vs. 0.80).

  • A caveat: this step can feel unintuitive or even confusing, because there are many ways to use the propensity score for matching.

Calculate ATT:

Compare outcomes of matched pairs:

  • Unit 1 (Treatment) vs. Unit 3 (Control): Employed vs. Unemployed
  • Unit 2 (Treatment) vs. Unit 4 (Control): Employed vs. Employed
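As a rough sketch of how these steps might look in R (not the text’s own code), assuming a data frame jobs with a binary treatment treat, covariates age, educ, and exper, and a 0/1 outcome employed (hypothetical names):

library(MatchIt)

# Steps 1-2: estimate propensity scores with a logit model
ps_fit <- glm(treat ~ age + educ + exper,
              family = binomial(link = "logit"), data = jobs)
jobs$pscore <- predict(ps_fit, type = "response")

# Step 3: 1:1 nearest-neighbor matching on the logit propensity score
m_out  <- matchit(treat ~ age + educ + exper, data = jobs,
                  method = "nearest", distance = "logit")
m_data <- match.data(m_out)

# Step 4: summary(m_out) reports covariate balance and helps assess common support
summary(m_out)

# Step 5: ATT as the difference in mean outcomes within the matched sample
with(m_data, mean(employed[treat == 1]) - mean(employed[treat == 0]))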

3.4.4 Assumptions and Considerations

  1. Conditional Independence Assumption (CIA):
    • Treatment assignment is independent of potential outcomes given the observed covariates.
    • This assumption is crucial for PSM to provide unbiased estimates.
  2. Common Support:
    • There must be overlap in the distribution of propensity scores between the treatment and control groups.
    • Lack of common support can lead to biased estimates as some treatment units may have no comparable control units.
  3. Model Specification:
    • The logit or probit model must be correctly specified to accurately estimate propensity scores.
    • Including relevant covariates and interactions is important for achieving balance between groups.
  4. Sample Size:
    • PSM requires a sufficiently large sample size to find good matches for each treatment unit.
    • Smaller samples may lead to poor matches and biased estimates.

Propensity score matching is a powerful tool for estimating causal effects in observational studies, but it relies heavily on the assumptions of CIA and common support. Proper model specification and adequate sample size are essential for obtaining reliable estimates.

3.5 Inverse Probability Weighting (Weighting on the propensity score)

Inverse probability weighting weights treatment and control units by the inverse of their estimated propensity scores. Units with propensity scores very close to zero (or one) receive extremely large weights and become unusually influential in the calculation of the ATT, so the data typically need to be trimmed.

A good rule of thumb is to keep only observations whose estimated propensity scores lie in the interval [0.1, 0.9], applying this trimming after the propensity scores have been estimated.

We still need to calculate standard errors, for example with a bootstrap.

The sensitivity of inverse probability weighting to extreme values of the propensity score has led some researchers to propose an alternative that handles extremes a bit better: normalized weights, which are rescaled to sum to one within each group. Most software packages estimate the sample analog of these inverse-probability-weighted parameters using this normalized-weights version.
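A minimal sketch of normalized-weight IPW for the ATT with trimming, assuming a data frame df with a binary treatment treat, an outcome y, and covariates x1 and x2 (hypothetical names):

# 1. estimate propensity scores with a logit model
ps_model <- glm(treat ~ x1 + x2, family = binomial(link = "logit"), data = df)
df$pscore <- predict(ps_model, type = "response")

# 2. trim to the rule-of-thumb interval [0.1, 0.9]
df_trim <- subset(df, pscore >= 0.1 & pscore <= 0.9)

# 3. ATT weights: treated units get weight 1, controls get p / (1 - p)
df_trim$w <- ifelse(df_trim$treat == 1, 1,
                    df_trim$pscore / (1 - df_trim$pscore))

# 4. normalized-weight ATT: weighted.mean() rescales the weights to sum to one
with(df_trim,
     weighted.mean(y[treat == 1], w[treat == 1]) -
     weighted.mean(y[treat == 0], w[treat == 0]))

Standard errors for this estimate could then be obtained by bootstrapping the whole procedure, re-estimating the propensity score in each resample.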

3.6 Nearest-neighbor matching

An alternative, very popular approach to inverse probability weighting is matching on the propensity score. This is often done by finding a few units from the control-unit donor pool whose propensity scores fall within some ad hoc chosen radius of the treated unit’s own propensity score. The researcher then averages those control units’ outcomes and assigns that average to the treated unit as an imputation of its potential outcome under control. Common support is then enforced through trimming.

Nevertheless, nearest-neighbor matching, along with inverse probability weighting, is perhaps the most common way of using an estimated propensity score model.

Nearest-neighbor matching using the propensity score pairs each treatment unit with one or more comparable control group units, where comparability is measured by distance on the propensity score. The matched control units’ outcomes are collected into a matched sample. Once we have the matched sample, we can calculate the ATT as

\[ \widehat{ATT} = \frac{1}{N_T} \sum_{D_i = 1} \left( Y_i - Y_{j(i)} \right) \]

where \(N_T\) is the number of treated units and \(Y_{j(i)}\) is the (average) outcome of the control unit(s) matched to treated unit \(i\).

We will focus on the ATT because of the problems with overlap that we discussed earlier.

We can choose to match using the K nearest neighbors: for each treated unit, we find the K nearest units in the control group, where “nearest” means closest on the propensity score itself. Unlike covariate matching, distance here is straightforward because of the dimension reduction afforded by the propensity score. We then average the actual outcomes of those K control units and assign that average to the treated unit as its matched control. Finally, we subtract each treated unit’s matched control outcome from its observed outcome and divide by \(N_T\), the number of treatment units.

3.6.1 Example in R

library(MatchIt)
library(Zelig)   # note: the Zelig package has since been archived on CRAN; used here for post-matching simulation

# 1:5 nearest-neighbor matching on the logit propensity score
m_out <- matchit(treat ~ age + agesq + agecube + educ +
                 educsq + marr + nodegree +
                 black + hisp + re74 + re75 + u74 + u75 + interaction1,
                 data = nsw_dw_cpscontrol, method = "nearest",
                 distance = "logit", ratio = 5)

# extract the matched sample (includes matching weights)
m_data <- match.data(m_out)

# least-squares outcome model fit on the matched sample
z_out <- zelig(re78 ~ treat + age + agesq + agecube + educ +
               educsq + marr + nodegree +
               black + hisp + re74 + re75 + u74 + u75 + interaction1,
               model = "ls", data = m_data)

# simulate expected outcomes under control (treat = 0) and treatment (treat = 1)
x_out <- setx(z_out, treat = 0)
x1_out <- setx(z_out, treat = 1)

s_out <- sim(z_out, x = x_out, x1 = x1_out)

summary(s_out)

3.7 Coarsened Exact Matching

Coarsened Exact Matching (CEM) is a method based on the idea that exact matching can often be achieved by coarsening the data. Coarsening involves creating categorical variables (e.g., 0- to 10-year-olds, 11- to 20-year-olds), making it easier to find exact matches. Once exact matches are found, weights are calculated based on where a person fits within certain strata, and these weights are used in a simple weighted regression.

This method can be implemented using the MatchIt library in R.

3.7.1 Example

Consider the variable “schooling,” which can be coarsened into the following categories:

  • Less than high school
  • High school only
  • Some college
  • College graduate
  • Post-college

Each observation is placed into one of these categories, creating strata for each unique combination of the coarsened covariate values. Assign these strata to the original (uncoarsened) data, and drop any observation whose stratum doesn’t contain at least one treated and one control unit. Then, add weights based on stratum size and analyze the data without further matching.

3.7.2 Steps in Coarsened Exact Matching

  1. Coarsen the Data: Transform continuous or detailed categorical variables into coarser categories.

    • Example: Age groups (0-10, 11-20, etc.) or education levels (less than high school, high school, etc.).
  2. Create Strata: For each unique combination of the coarsened variables, create a stratum.

    • Each observation is assigned to a stratum based on its coarsened characteristics.
  3. Assign Weights: Calculate weights for each observation based on the stratum size.

    • Weights reflect the relative numbers of treated and control units in each stratum, so that the reweighted control group mirrors the treated group’s distribution across strata.
  4. Drop Unmatched Observations: Remove any strata that do not contain both treated and control units.

  5. Weighted Regression: Use the weights in a regression analysis to account for the matching process.

3.7.3 Considerations

  • Balance: Coarsening can improve balance between treated and control groups but may result in some loss of information.

  • Weight Calculation: Weights should reflect the relative sizes of the strata to ensure accurate representation in the analysis.

  • Implementation: Ensure the coarsening process does not oversimplify the data, potentially masking important variations.

By carefully coarsening the data and using appropriate weights, CEM allows for more accurate and reliable estimation of treatment effects, even when exact matching on the original variables is not feasible.
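As a minimal sketch, CEM via MatchIt might look like the following, assuming a data frame dat with a binary treat, an outcome y, and covariates age and schooling (hypothetical names):

library(MatchIt)

# coarsened exact matching; MatchIt coarsens the covariates and keeps only
# strata containing both treated and control units
cem_out <- matchit(treat ~ age + schooling, data = dat, method = "cem")
cem_dat <- match.data(cem_out)        # matched data with CEM weights

# weighted regression using the CEM weights
lm(y ~ treat, data = cem_dat, weights = weights)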

3.8 A/B Test article from Medium

3.8.1 Example: Conversion Rate of an E-Commerce Website

Article Source

Suppose an e-commerce website wants to test if implementing a new feature (e.g., layout or button) will significantly improve conversion rate.

conversion rate: number of purchases divided by number of sessions/visits

We can randomly show the new webpage to 50% of the users. Then we have a test group and a control group. Once we have enough data points, we can test whether the conversion rate in the treatment group is significantly higher (a one-sided test) than that in the control group.

The null hypothesis is that the conversion rates are not significantly different in the two groups.

Sample Size for Comparing Two Means.

One way to perform the test is to calculate daily conversion rates for both the treatment and the control groups.

Since the conversion rate in a group on a certain day represents a single data point, the sample size is actually the number of days.

Thus, we will be testing the difference between the means of daily conversion rates in the two groups across the testing period. The formula for estimating the minimum sample size is as follows:

Sample Size Estimate for A/B Test

In an A/B test, the sample size (\(n\)) required for each group can be estimated using the formula:

\[n = \frac{{2 \cdot (Z_{\alpha/2} + Z_{\beta})^2 \cdot \sigma^2}}{{\delta^2}}\]

where:

\(n : \text{ Sample size per group}\)
\(Z_{\alpha/2} : \text{ Critical value for the significance level}\)
\(Z_{\beta} : \text{ Critical value for the desired power}\)
\(\sigma^2 : \text{ Variance of the daily conversion rate}\)
\(\delta : \text{ Minimum detectable difference in means}\)

This formula helps in determining the sample size needed to achieve desired levels of significance and power in an A/B test.

For our example, let’s assume that the mean daily conversion rate for the past 6 months is 0.15 and the sample standard deviation is 0.05.

With the new feature, we expect to see a 3% absolute increase in conversion rate. Thus, the expected conversion rate for the treatment group is 0.18. We also assume that the sample standard deviations are the same for the two groups.

Our parameters are as follows.

\(\mu_1 = 0.15\) \(\mu_2 = 0.18\) \(\sigma_1 = \sigma_2 = 0.05\)

Assuming \(\alpha = 0.05\) and \(\beta = 0.20\) (\(power = 0.80\)) with a one-sided test (so \(Z_{\alpha} = 1.645\) is used in place of \(Z_{\alpha/2}\)), applying the formula gives a required minimum sample size of about 35 days per group.

This is consistent with the result from this web calculator.
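As a rough check, base R’s power.t.test gives a similar answer under the same assumptions (equal group sizes, one-sided test):

power.t.test(delta = 0.03, sd = 0.05, sig.level = 0.05,
             power = 0.80, type = "two.sample", alternative = "one.sided")
# n per group comes out in the mid-30s, close to the 35 days above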

Sample Size for Comparing Two Proportions

The two-means approach considers each day+group as a data point. But what if we focus on individual users and visits?

What if we want to know how many visits/sessions are required for the testing? In this case, the conversion rate for a group is basically all purchases divided by all sessions in that group. If each session is a Bernoulli trial (convert or not), each group follows a binomial distribution.

To test the difference in conversion rate between the treatment and control groups, we need a test of two proportions. The formula for estimating the minimum required sample size is as follows.

Summary: Sample Size Estimate for Comparing Proportions

When comparing proportions in two independent groups, the sample size (\(n\)) required for each group can be estimated using the formula:

\[n = \frac{{2 \cdot (Z_{\alpha/2} + Z_{\beta})^2 \cdot (p(1-p))}}{{\delta^2}}\]

where:

\(n : \text{ Sample size per group}\)
\(Z_{\alpha/2} : \text{ Critical value for significance level}\)
\(Z_{\beta} : \text{ Critical value for desired power}\)
\(p : \text{ Pooled (average) expected proportion across the two groups}\)
\(\delta : \text{ Minimum detectable difference in proportions}\)

This formula helps in determining the sample size needed to detect a specified difference in proportions between two groups with desired levels of significance and power.

Assuming 50–50 split, we have the following parameters:

\(p_1 = 0.15\) \(p_2 = 0.18\)
\(k = 1\)

Using \(\alpha = 0.05\) and \(\beta = 0.20\) with a one-sided test, applying the formula gives a required sample size of approximately \(1,892\) sessions per group.
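As a rough check, base R’s power.prop.test gives a similar per-group sample size under the same assumptions (equal group sizes, one-sided test):

power.prop.test(p1 = 0.15, p2 = 0.18, sig.level = 0.05,
                power = 0.80, alternative = "one.sided")
# n per group comes out close to the 1,892 sessions above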

3.8.2 Example: A/B Test

A/B testing is an experiment where two or more variants are evaluated using statistical analysis to determine which variation performs better for a given conversion goal.

A/B testing is widely used by digital marketing agencies as it is the most effective method to determine the best content for converting visits into sign-ups and purchases.

In this scenario, you will set up hypothesis testing to advise a digital marketing agency on whether to adopt a new ad they designed for their client.

Assume you are hired by a digital marketing agency to conduct an A/B test on a new ad hosted on a website. Your task is to determine whether the new ad outperforms the existing one.

The agency has run an experiment where one group of users was shown the new ad design, while another group was shown the old ad design. The users’ interactions with the ads were observed, specifically whether they clicked on the ad or not.

3.9 Task 1: Load the data

In this task, we will import our libraries and then load our dataset

library(tidyverse)
## ── Attaching core tidyverse packages ────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ──────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
df <- read_excel('/Users/deayan/Desktop/GITHUB/10_Causal_Notes/__repo/data/AB_Test.xlsx')
glimpse(df)
## Rows: 3,757
## Columns: 2
## $ group  <chr> "experiment", "control", "control", "control", "control", "cont…
## $ action <chr> "view", "view", "view and click", "view and click", "view", "vi…
df%>%
  group_by(group, action)%>%
  summarise(n = n(),
            .groups = "drop")
## # A tibble: 4 × 3
##   group      action             n
##   <chr>      <chr>          <int>
## 1 control    view            1513
## 2 control    view and click   363
## 3 experiment view            1569
## 4 experiment view and click   312

3.10 Task 2: Set up Hypothesis

Experiment group: the group involved in the new experiment, i.e., the group that received the new ad.

Control group: the second group, which did not receive the new ad.

Click-through rate (CTR): the number of clicks advertisers receive on their ads per number of impressions.

table(df$group)
## 
##    control experiment 
##       1876       1881
df%>%count(group)
## # A tibble: 2 × 2
##   group          n
##   <chr>      <int>
## 1 control     1876
## 2 experiment  1881
table(df)
##             action
## group        view view and click
##   control    1513            363
##   experiment 1569            312
prop.table(table(df), 1)
##             action
## group             view view and click
##   control    0.8065032      0.1934968
##   experiment 0.8341308      0.1658692
df %>%
  count(group) %>%
  mutate(prop = n / sum(n))
## # A tibble: 2 × 3
##   group          n  prop
##   <chr>      <int> <dbl>
## 1 control     1876 0.499
## 2 experiment  1881 0.501
x <- df %>%
  group_by(group, action)%>%
  summarise(n = n(), .groups = 'drop')%>%
  pivot_wider(names_from = action, values_from = n, values_fill = list(n = 0))
  
names(x) <- c("group", "view", "view_click")

x%>%
  group_by(group)%>%
  transmute(view1 = view/(view+view_click),
         view_click1 = view_click/(view+view_click))
## # A tibble: 2 × 3
## # Groups:   group [2]
##   group      view1 view_click1
##   <chr>      <dbl>       <dbl>
## 1 control    0.807       0.193
## 2 experiment 0.834       0.166

The null hypothesis is what we assume to be true before we collect the data, and the alternative hypothesis is usually what we want to try and prove to be true.

So in our experiment, the null hypothesis assumes that the old ad is better than the new one.

Then we set the significance level \(\alpha\).

The significance level is the probability of rejecting the null hypothesis when it’s true. (Type I error rate)

For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference.

Lower significance levels indicate that you require stronger evidence before you reject the null hypothesis.

So we will set our significance level to 0.05. If we reject the null hypothesis as a result of our experiment, then with a significance level of 0.05 we can be 95% confident in that rejection.

Setting the significance level is therefore about how confident you want to be when rejecting the null hypothesis. The fourth step is calculating the corresponding p-value.

The p-value is the probability of observing a statistic at least as extreme as the one observed, assuming the null hypothesis is true. Based on the p-value, we then draw a conclusion about whether to go with the new ad or not.

Hypothesis Testing steps:

  1. Specify the Null Hypothesis.
  2. Specify the Alternative Hypothesis.
  3. Set the Significance Level (α).
  4. Calculate the Corresponding P-Value.
  5. Draw a Conclusion.

Our Hypothesis

Our null hypothesis is that the click-through rate associated with the new ad is less than or equal to that associated with the old ad, which means that the old ad is at least as good as the new one.

The alternative hypothesis is the opposite: the new ad has a higher click-through rate.

3.11 Task 3: Compute the difference in the click-through rate

In this task, we will compute the difference in the click-through rate between the control and experiment groups.

control_df <- df[df$group == "control", ]
experiment_df <- df[df$group == "experiment", ]
control_ctr <- 
    mean(ifelse(control_df$action=="view and click", 1, 0))

experiment_ctr <-
    mean(ifelse(experiment_df$action=="view and click", 1, 0))

diff <- experiment_ctr - control_ctr
diff
## [1] -0.02762758
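One possible way to carry out the remaining steps (computing the p-value and drawing a conclusion) would be a one-sided two-proportion test; a minimal sketch using the counts from Task 2:

# clicks ("view and click") and total impressions per group, from the Task 2 tables
clicks <- c(experiment = 312, control = 363)
views  <- c(experiment = 1881, control = 1876)

# tests whether the experiment CTR is greater than the control CTR
prop.test(clicks, views, alternative = "greater")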