Chapter 6 Difference-in-Differences (DiD) Methods

Difference-in-Differences (DiD) is a quasi-experimental technique used in econometrics to estimate causal relationships. It compares the changes in outcomes over time between a treatment group and a control group.

Some resource links:

  • Comprehensive resource

  • Extra reading 2

  • Mixtape

  • YouTube series

Books:

  • The Effect

  • What if?

Extra reading (Python): [Matheus Facure - Difference-in-Differences](https://matheusfacure.github.io/python-causality-handbook/13-Difference-in-Differences.htm)

6.1 Simple Difference-in-Differences (DiD)

  • Basic Idea: DiD is a quasi-experimental design that estimates causal effects by comparing the changes in outcomes over time between a treatment group and a control group.

  • Treatment assignment is not random, but we observe both treated and untreated units before and after treatment.

  • Under certain structural assumptions, especially parallel outcome trends in the absence of treatment, we can recover the average treatment effect.

  • Formula: The basic DiD estimator is:

\[ \text{DiD} = (\bar{Y}^{\text{post}}_{\text{treat}} - \bar{Y}^{\text{pre}}_{\text{treat}}) - (\bar{Y}^{\text{post}}_{\text{control}} - \bar{Y}^{\text{pre}}_{\text{control}}) \]
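As a quick worked example with hypothetical numbers: if average employment rises from 10 to 15 in the treatment group and from 8 to 10 in the control group, then

\[ \text{DiD} = (15 - 10) - (10 - 8) = 3, \]

so under parallel trends the treatment raised the outcome by 3 units.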

Concept:

DiD is used when we have data from before and after a treatment is applied to a treatment group, and we also have a control group that does not receive the treatment.

The key assumption is that in the absence of treatment, the difference between the treatment and control groups would have remained constant over time (parallel trends assumption).

  • The simple 2x2 DD estimator recovers the true ATT when the parallel trends assumption holds.

  • The ATT can be calculated by simply differencing outcomes, but regression can be used instead if we want to control for additional covariates.

  • If you need to avoid omitted variable bias by controlling for covariates that vary over time, then you may want to use regression. Such strategies are another way of saying that you will need to close a known, critical backdoor path.

  • Another reason to use regression is that controlling for appropriate covariates can reduce residual variance and improve the precision of your DD estimate.

Model:

\[ Y_{it} = \alpha + \beta_1 \text{Post}_t + \beta_2 \text{Treated}_i + \beta_3 (\text{Post}_t \times \text{Treated}_i) + \epsilon_{it} \]

where:

  • \(Y_{it}\) is the outcome variable for entity \(i\) at time \(t\).

  • \(\text{Post}_t\) is a dummy variable equal to 1 for periods after the treatment and 0 otherwise.

  • \(\text{Treated}_i\) is a dummy variable equal to 1 for the treatment group and 0 for the control group.

  • \(\beta_3\) is the DiD estimator, representing the treatment effect (ATT).
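To make this concrete, here is a minimal sketch in R that simulates a 2x2 DiD and recovers the ATT by regression; all names and numbers are hypothetical:

# Minimal 2x2 DiD on simulated data; all names and numbers are hypothetical
set.seed(1)
n  <- 1000
df <- data.frame(
  treated = rep(c(0, 1), each = n / 2),   # group dummy
  post    = rep(c(0, 1), times = n / 2)   # period dummy
)
# Outcome with a group effect, a period effect, and a true ATT of 3
df$y <- 2 + 1.5 * df$treated + 0.5 * df$post +
  3 * df$treated * df$post + rnorm(n)

# The coefficient on treated:post is the DiD estimate of the ATT
did <- lm(y ~ treated * post, data = df)
summary(did)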

6.2 Controversial Note

The variables of interest in many of these setups only vary at a group level, such as the state, and outcome variables are often serially correlated. In Card and Krueger (1994), it is very likely for instance that employment in each state is not only correlated within the state but also serially correlated.

Bertrand, Duflo, and Mullainathan (2004) point out that the conventional standard errors often severely understate the standard deviation of the estimators, and so standard errors are biased downward, “too small,” and therefore overreject the null hypothesis. Bertrand, Duflo, and Mullainathan (2004) propose the following solutions:

  • Block bootstrapping standard errors.

  • Aggregating the data into one pre and one post period.

This approach ignores the time-series dimensions altogether, and if there is only one pre and post period and one untreated group, it’s as simple as it sounds.

  • Clustering standard errors at the group level.

You simply adjust standard errors by clustering at the group level, as we discussed in the earlier chapter, or the level of treatment. For state-level panels, that would mean clustering at the state level, which allows for arbitrary serial correlation in errors within a state over time. This is the most common solution employed.

If the number of groups is small, you may use the wild cluster bootstrap or randomization inference.
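As an illustration, clustering at the group level is a one-line change in most R estimation commands. A sketch with the fixest package, where df and its columns (y, treated, post, state) are hypothetical:

library(fixest)

# DiD with standard errors clustered at the group (state) level
est <- feols(y ~ treated * post, data = df, cluster = ~ state)
summary(est)

# With a small number of groups, consider a wild cluster bootstrap
# (e.g. the fwildclusterboot package) or randomization inference instead.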

6.4 Two-Way Fixed Effects Model

Concept:

The two-way fixed effects model extends the simple DiD approach by controlling for time-invariant characteristics of the entities and common shocks over time.

It adds fixed effects for both entities and time periods to control for unobserved heterogeneity.

Model:

\[ Y_{it} = \alpha_0 + \beta_1\text{Treat}_i + \beta_2\text{Post}_t + \beta_3 (\text{Post}_t \times \text{Treat}_i) + \epsilon_{it} \]

where:

  • \(\beta_1\) is the coefficient on the treated-group dummy, absorbing time-invariant differences between the groups (the role played by entity fixed effects in the general model).

  • \(\beta_2\) is the coefficient on the post-period dummy, absorbing shocks common to both groups (the role played by time fixed effects in the general model).

  • \(\beta_3\) remains the DiD estimator.

Example: Using the job training program example, this model would account for fixed characteristics of individuals (such as inherent employability) and time-specific effects (such as economic conditions).

\[ Y_{it} = \alpha_i + \gamma_t + \beta_3 (\text{Post}_t \times \text{Treated}_i) + \epsilon_{it} \]

This controls for both individual-specific and time-specific unobserved heterogeneity, providing a more robust estimate of the treatment effect.
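A sketch of this two-way fixed effects specification in R with the fixest package; panel_df and its column names are hypothetical:

library(fixest)

# Two-way fixed effects DiD: the unit and period fixed effects absorb
# alpha_i and gamma_t; panel_df and its columns are hypothetical
panel_df$treated_post <- panel_df$treated * panel_df$post
twfe <- feols(y ~ treated_post | id + period, data = panel_df,
              cluster = ~ id)
summary(twfe)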

6.5 Event Study Methods

Concept:

Event studies extend DiD by examining the dynamics of the treatment effect over multiple periods before and after the treatment.

  • They allow for the estimation of treatment effects at different time points relative to the treatment event.

  • As with many contemporary DD designs, Miller et al. (2019) evaluate the pre-treatment leads instead of plotting the raw data by treatment and control. Post-estimation, they plotted regression coefficients with 95% confidence intervals on their treatment leads and lags. Including leads and lags into the DD model allowed the reader to check both the degree to which the post-treatment treatment effects were dynamic, and whether the two groups were comparable on outcome dynamics pre-treatment.

Typical Model:

\[ Y_{ist} = \alpha_s + \gamma_t + \sum_{x=-q}^{-1} \beta_x D_{sx} + \sum_{x=0}^{m} \delta_x D_{sx} + X_{ist}'\theta + \epsilon_{ist} \]

You include \(q\) leads or anticipatory effects and \(m\) lags or post-treatment effects.
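A minimal event-study sketch in R with the fixest package; panel_df, time_to_treat (period minus treatment period), and treated are hypothetical names:

library(fixest)

# Event-study regression: the last pre-treatment period (-1) is the
# omitted reference. For never-treated units set treated = 0; the
# interaction terms then drop out regardless of time_to_treat.
es <- feols(y ~ i(time_to_treat, treated, ref = -1) | id + period,
            data = panel_df, cluster = ~ id)

# Plot lead and lag coefficients with 95% confidence intervals
iplot(es)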

6.6 Importance of Placebos in DD

It is a simple idea. For the minimum wage study, one candidate placebo falsification might simply be to use data on an alternative type of worker whose wages would not be affected by the binding minimum wage. This reasoning might lead us to consider higher-wage workers as a placebo group.

Many people like to be straightforward and simply fit the same DD design using high-wage employment as the outcome. If the coefficient on minimum wages is zero when using high-wage worker employment as the outcome, but negative when using low-wage worker employment, then we have provided stronger evidence complementing the earlier analysis of low-wage workers.

Another way to perform placebo falsification is triple differences (DDD).

6.6.1 Triple Differences

\[ Y_{ijt} = \alpha + \beta_0 X_{ijt} + \beta_1 \gamma_t + \beta_2 \delta_j + \beta_3 D_i + \beta_4 (\delta_j \times \gamma_t) + \beta_5 (\gamma_t \times D_i) + \beta_6 (\delta_j \times D_i) + \beta_7 (\delta_j \times \gamma_t \times D_i) + \epsilon_{ijt} \]

where the parameter of interest is \(\beta_7\).

  • First, this requires stacking the data into a panel structure by group as well as state. Second, the DDD model requires that you include all possible interactions across the group dummy \(\delta_j\), the post-treatment dummy \(\gamma_t\), and the treatment-state dummy \(D_i\).

  • The regression must include each dummy independently, each individual interaction, and the triple differences interaction. One of these will be dropped due to multicollinearity, but I include them in the equation so that you can visualize all the factors used in the product of these terms.
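To make the stacking and interactions concrete, here is a minimal sketch in R; the data frame df and its column names (group, post, treat_state, state) are hypothetical, and the `*` operator generates all the dummies and interactions automatically:

library(fixest)

# Triple differences: group (e.g. low- vs high-wage workers), post period,
# and treated state; `*` expands to all main effects and interactions
ddd <- feols(y ~ group * post * treat_state, data = df,
             cluster = ~ state)
# The coefficient on group:post:treat_state is beta_7, the DDD estimate
summary(ddd)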

6.7 Compositional Changes

DD can be applied to repeated cross-sections, as well as panel data. But one of the risks of working with the repeated cross-sections is that unlike panel data (e.g., individual-level panel data), repeated cross-sections run the risk of compositional changes.

This kind of compositional change is like an omitted variable bias built into the sample itself, caused by time-variant unobservables. For example, diffusion of the Internet appears to be related to changing samples, as younger music fans are early adopters. Identification of causal effects would require the treatment itself to be exogenous to such changes in composition.

6.8 Key Assumptions

  • Parallel Trends Assumption: The treatment and control groups would have followed the same trend over time in the absence of the treatment. This is the most critical assumption.

  • Common Shocks: Both groups are assumed to be subject to the same external factors over time.

6.8.1 Implementation Steps

  1. Identify Treatment and Control Groups: Clearly define which units are exposed to the treatment and which are not.

  2. Collect Data: Obtain data on the outcome of interest for both groups before and after the treatment.

  3. Check Parallel Trends: Visualize and statistically test whether the pre-treatment trends of the groups are parallel (see the sketch after this list).

  4. Estimate the Model: Use regression analysis to estimate the DiD effect. The basic regression model is: \[ Y_{it} = \alpha + \beta_1 \text{Post}_t + \beta_2 \text{Treatment}_i + \beta_3 (\text{Post}_t \times \text{Treatment}_i) + \epsilon_{it} \] where \(\beta_3\) is the DiD estimator.
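A minimal sketch of the visual check in step 3, assuming a data frame df with columns period, treated, and y, and a hypothetical treatment_period:

library(ggplot2)
library(dplyr)

# Visual parallel-trends check: group means by period,
# with the treatment date marked by a dashed line
df %>%
  group_by(period, treated) %>%
  summarise(mean_y = mean(y), .groups = "drop") %>%
  ggplot(aes(period, mean_y, colour = factor(treated))) +
  geom_line() +
  geom_vline(xintercept = treatment_period, linetype = "dashed") +
  labs(colour = "Treated", y = "Mean outcome")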

6.8.2 Advantages

  • Controls for Time-Invariant Differences: Differences between the treatment and control groups that do not change over time are accounted for.

  • Simple and Intuitive: The method is straightforward to understand and implement.

6.8.3 Limitations

  • Violation of Parallel Trends: If the parallel trends assumption is violated, the DiD estimate can be biased.

  • External Validity: The results are only valid for the sample and period studied.

  • Simultaneous Interventions: Other changes occurring simultaneously with the treatment can confound the results.

6.9 Notes

  • Bertrand, Duflo, and Mullainathan (2004) point out that conventional robust standard errors usually overestimate the actual standard deviation of the estimator. The authors recommend clustering the standard errors at the level of randomization (e.g. classes, counties, villages, …).

  • Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania (1994) by Card and Krueger.

6.9.1 Example: Business

A firm wants to test the impact of a TV advertisement campaign on revenue. The firm releases the ad in a random sample of municipalities and tracks revenue over time, before and after the ad campaign.

6.10 Extra Considerations

  • The Two-way Fixed Effects (TWFE) model can give wrong estimates. This is especially likely when treatments are heterogeneous (differential treatment timing, different treatment sizes, treatment statuses that change over time), which can contaminate the estimated treatment effects. “Bad” comparisons can bias the average treatment effect estimate to the point of even reversing its sign.

The new DiD methods “correct” for these TWFE biases by combining various estimation techniques, such as bootstrapping, inverse probability weighting, matching, influence functions, and imputation, to handle parallel trends violations, negative weights, covariates, and controls.
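As one example of these newer estimators, here is a minimal sketch using the did package of Callaway and Sant'Anna; the data frame panel_df and its column names are hypothetical:

library(did)

# Group-time ATTs; gname is the period in which a unit is first treated
# (0 for never-treated units)
gt <- att_gt(yname = "y", tname = "period", idname = "id",
             gname = "first_treat", data = panel_df)

# Aggregate to an event-study profile and to an overall ATT
summary(aggte(gt, type = "dynamic"))
summary(aggte(gt, type = "simple"))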

6.11 Synthetic Difference-in-Differences (SynthDiD) method

Reading

SynthDiD is a generalization of the Synthetic Control Method (SCM) and DiD that combines the strengths of both methods. It enables causal inference with large panels, even with a short pretreatment period.

Synthetic DiD combines the synthetic control method with the difference-in-differences approach [1]. In this method, a synthetic control group is constructed using the same approach as in the synthetic control method. However, the treatment effect is estimated by comparing the change in outcomes between the treated unit and the synthetic control group before and after the treatment is introduced. This allows for a more robust estimation of the treatment effect by accounting for pre-existing differences between the treatment and control groups.

In summary, while both methods use a synthetic control group, the synthetic control method estimates treatment effects by comparing the post-treatment outcomes of the treated unit to those of the synthetic control group, while synthetic DiD estimates treatment effects by comparing the change in outcomes between the treated unit and the synthetic control group before and after the treatment is introduced.

It constructs a counterfactual for the treated group by optimally weighting the control group units to minimize the difference between the treated and control groups in the pretreatment period, as in SCM.

Then, the treatment effect is estimated by comparing the outcome changes in the treated unit and the synthetic control group pre- and post-intervention, as in DiD.

6.11.0.1 An Example:

Suppose that we are a company that sells plant-based food products, such as soy milk or soy yogurt, and we operate in multiple countries. Some countries implement new legislation that prohibits us from marketing our plant-based products as ‘milk’ or ‘yogurt’ because it is claimed that only animal products can be marketed as ‘milk’ or ‘yogurt’. Thus, due to this new regulation in some countries, we have to market soy milk as soy drink instead of soy milk, etc. We want to know the impact of this legislation on our revenue as this might help guide our lobbying efforts and marketing activities in different countries.

I simulated a balanced panel dataset that shows the revenue of our company in 30 different countries over 30 periods. Three of the countries implement this legislation in period 20. In the figure below, you can see a snapshot of the data. treat is a dummy variable indicating whether a country has implemented the legislation in a given period. revenue is the revenue in millions of EUR. You can find the simulation and estimation code in this Gist.

# Install and load the required packages
# devtools::install_github("synth-inference/synthdid")
library(synthdid)
library(ggplot2)
library(fixest) # Fixed-effects regression
library(data.table)
# Set seed for reproducibility
set.seed(12345)

source('sim_data.R') # Import simulation function and some utilities

dt <- sim_data()
head(dt)

The data contain 30 units (3 treated) and 30 periods (10 treated periods); all treated units are treated at the same time.

Next, we convert our panel data into the matrix format required by the synthdid package. Given the outcome, the treatment and control units, and the pretreatment periods, a synthetic control is created and the treatment effect is estimated with the synthdid_estimate function.

# Convert the data into a matrix
setup = panel.matrices(dt, unit = 'country', time = 'period', 
                       outcome = 'revenue', treatment = 'treat')

# Estimate treatment effect using SynthDiD
tau.hat = synthdid_estimate(setup$Y, 
                            setup$N0,
                            setup$T0)
print(summary(tau.hat))

To make inference, we also need to calculate the standard errors. I use the jackknife method as I have more than one treated unit. The placebo method is the only option if you have a single treated unit. Given the standard errors, I also calculate the 95% confidence interval for the treatment effect. I will report these in the figure below.

When there are multiple treated units (more than one unit that received the treatment or intervention), one common approach to estimating standard errors is using the jackknife method. The jackknife method is a resampling technique where each observation (in this case, each treated unit) is systematically omitted from the dataset, and the analysis is repeated each time to estimate the variance of the treatment effect. This provides a robust estimate of the standard errors that accounts for the potential variability across different treated units.

On the other hand, if there is only one treated unit (a single unit that received the treatment), using the jackknife method becomes impractical because there are not enough units to systematically leave out and still perform meaningful resampling. In such cases, the placebo method becomes a viable option.

The placebo method involves creating placebo or synthetic treated units that mimic the characteristics of the treated unit but did not actually receive the treatment. By comparing the outcomes of the actual treated unit with those of the synthetic placebo units, researchers can estimate the variability and potential impact of the treatment effect more accurately.

Therefore, the choice between the jackknife method and the placebo method depends on the number of treated units available for analysis within the synthetic control framework. Multiple treated units allow for the application of the jackknife method, whereas a single treated unit necessitates the use of the placebo method to estimate standard errors and make reliable inferences about the treatment effect.

# Calculate standard errors 
se = sqrt(vcov(tau.hat, method='jackknife'))
te_est <- sprintf('Point estimate for the treatment effect: %1.2f', tau.hat)
CI <- sprintf('95%% CI (%1.2f, %1.2f)', tau.hat - 1.96 * se, tau.hat + 1.96 * se)
# Plot treatment effect estimates
plot(tau.hat)
plot(tau.hat, se.method='jackknife')

In the image below, the estimation results are displayed. Observe how the treated countries and the synthetic control exhibit fairly parallel trends on average (they may not look perfectly parallel, but that is not necessary for the sake of this example). The average for the treated countries is more variable, primarily due to the presence of only three such countries, resulting in less smooth trends. Transparent gray lines represent the different control countries. Following the treatment in period 20, a decline in revenue is observed in the treated countries, estimated at 0.51 million EUR as indicated in the graph. This means that the new regulation has a negative impact on our company’s revenues, and actions should be taken to prevent further declines.

# Check the number of treatment and control countries to report
num_treated <- length(unique(dt[treat==1]$country))
num_control <- length(unique(dt$country))-num_treated

# Create spaghetti plot with top 10 control units
top.controls = synthdid_controls(tau.hat)[1:10, , drop=FALSE]
plot(tau.hat, spaghetti.units=rownames(top.controls),
     trajectory.linetype = 1, line.width=.75, 
     trajectory.alpha=.9, effect.alpha=.9,
     diagram.alpha=1, onset.alpha=.9, ci.alpha = .3, spaghetti.line.alpha =.2,
     spaghetti.label.alpha = .1, overlay = 1) + 
  labs(x = 'Period', y = 'Revenue', title = 'Estimation Results', 
       subtitle = paste0(te_est, ', ', CI, '.'), 
       caption = paste0('The number of treatment and control units: ', num_treated, ' and ', num_control, '.'))

Let’s plot the weights used to estimate the synthetic control.

# Plot control unit contributions
synthdid_units_plot(tau.hat, se.method='jackknife') +
  labs(x = 'Country', y = 'Treatment effect', 
       caption = 'The black horizontal line shows the actual effect; 
       the gray ones show the endpoints of a 95% confidence interval.')
ggsave('../figures/unit_weights.png')

In the image below, you can observe how each country is weighted to construct the synthetic control. The treatment effects differ based on the untreated country selected as the control unit.

# Check for pre-treatment parallel trends
plot(tau.hat, overlay=1, se.method='jackknife')
ggsave('../figures/results_simple.png')

# Compare with a simple TWFE regression estimate of the treatment effect
fe <- feols(revenue ~ treat | country + period, dt, cluster = 'country')
summary(fe)

Now that we understand more about SynthDiD, let’s talk about the pros and cons of this method. Like every method, SynthDiD has advantages and disadvantages to keep in mind when getting started.

Advantages of the SynthDiD method:

  • The synthetic control method is usually used with a few treated and control units and needs a long, balanced pre-treatment panel. SynthDiD, by contrast, works well even with a short pre-treatment period [4].

  • It is preferred especially because it does not impose a strict parallel trends assumption (PTA) the way DiD does.

  • SynthDiD guarantees a suitable quantity of control units, considers possible pre-intervention patterns, and may accommodate a degree of endogenous treatment timing [4].

Disadvantages of the SynthDiD method:

  • It can be computationally expensive (even with only one treated group/block).

  • It requires a balanced panel (i.e., you can only use units observed for all time periods) and identical treatment timing for all treated units.

  • It requires enough pre-treatment periods for good estimation, so if you do not have enough pre-treatment periods it might be better to use regular DiD.

  • Computing and comparing average treatment effects for subgroups is tricky. One option is to split the sample into subgroups and compute the average treatment effect for each subgroup.

  • Implementing SynthDiD where the treatment timing varies can be tricky. With staggered treatment timing, one solution is to estimate the average treatment effect for each treatment cohort and then aggregate the cohort-specific effects into an overall average treatment effect.

Here are also some other points that you might want to know when getting started.

Things to note:

  • SynthDiD employs regularized ridge regression (L2) while ensuring that the resulting weights sum to one.

  • In pretreatment matching, SynthDiD tries to determine the average treatment effect across the entire sample. This can make individual time-period estimates less precise; nonetheless, the overall average yields an unbiased estimate.

  • The standard errors for the treatment effects are estimated with the jackknife or, if a cohort has only one treated unit, with the placebo method.

  • The estimator is consistent and asymptotically normal, provided that the number of control units and pretreatment periods is sufficiently large relative to the number of treated units and posttreatment periods.

  • In practice, pre-treatment variables play a minor role in SynthDiD, as lagged outcomes hold more predictive power, making the treatment of these variables less critical.

Conclusion: In this section, I introduced the SynthDiD method and discussed its relationship with traditional DiD and SCM. SynthDiD combines the strengths of both SCM and DiD, allowing for causal inference with large panels even when the pretreatment period is short. I demonstrated the method using the synthdid package in R. Although it has several advantages, such as not requiring a strict parallel trends assumption, it also has drawbacks, like being computationally expensive and requiring a balanced panel. Overall, SynthDiD is a valuable tool for researchers interested in estimating causal effects from observational data, providing an alternative to traditional DiD and SCM methods.

6.12 Doubly Robust Models in Econometrics

Doubly Robust (DR) Models are a class of estimators used to estimate causal effects, providing robustness against model misspecification. The key feature of DR models is that they combine elements of both outcome regression and propensity score methods. This dual approach ensures that the estimator remains consistent if at least one of the two models (outcome or treatment model) is correctly specified.

DRDID website

DRDID

The DRDID package provides estimators of the Average Treatment Effect on the Treated (ATT) in Difference-in-Differences (DiD) setups where the parallel trends assumption holds only after conditioning on a vector of pre-treatment covariates.
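A sketch of a call to the package's drdid function for panel data, with a hypothetical dataset and covariate formula (see the package documentation for the exact interface):

library(DRDID)

# Doubly robust DiD estimate of the ATT, with parallel trends assumed
# to hold conditional on the covariates in xformla;
# panel_df and its column names are hypothetical
out <- drdid(yname = "y", tname = "period", idname = "id",
             dname = "treated", xformla = ~ age + educ,
             data = panel_df, panel = TRUE)
summary(out)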

6.12.1 Key Concepts

  1. Outcome Model:
    • This involves modeling the outcome \(Y\) as a function of covariates \(X\) and treatment \(D\).
    • Example: Using a regression model \(E[Y | X, D]\).
  2. Treatment Model (Propensity Score Model):
    • This involves modeling the treatment assignment \(D\) as a function of covariates \(X\).
    • Example: Using logistic regression to estimate the propensity score \(P(D = 1 | X)\).
  3. Doubly Robust Estimator:
    • Combines the predictions from both the outcome and treatment models to estimate the average treatment effect (ATE).
    • The estimator is “doubly robust” because it remains unbiased if either the outcome model or the treatment model is correctly specified, but not necessarily both.

6.12.2 Steps in Doubly Robust Estimation

  1. Estimate the Propensity Score:
    • Use a logistic regression (or other suitable model) to estimate the probability of treatment given the covariates \(X\): \[ \hat{p}(X) = P(D = 1 | X) \]
  2. Estimate the Outcome Model:
    • Fit a regression model to estimate the expected outcome given covariates \(X\) and treatment \(D\): \[ \hat{E}[Y | X, D] \]
  3. Compute the Inverse Probability Weights (IPW):
    • Calculate the weights based on the estimated propensity scores: \[ W = \frac{D}{\hat{p}(X)} + \frac{1 - D}{1 - \hat{p}(X)} \]
  4. Calculate the Doubly Robust Estimator:
    • Combine the outcome models and the inverse probability weights to adjust the outcomes. Writing \(\hat{\mu}_1(X) = \hat{E}[Y | X, D = 1]\) and \(\hat{\mu}_0(X) = \hat{E}[Y | X, D = 0]\): \[ \hat{\theta}_{DR} = \frac{1}{n} \sum_{i=1}^n \left( \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{D_i (Y_i - \hat{\mu}_1(X_i))}{\hat{p}(X_i)} - \frac{(1 - D_i) (Y_i - \hat{\mu}_0(X_i))}{1 - \hat{p}(X_i)} \right) \]
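Below is a minimal sketch of these four steps in R on simulated cross-sectional data (this is the generic AIPW estimator of the ATE, not the DRDID panel estimator; all names and numbers are hypothetical):

set.seed(42)
n <- 5000
X <- rnorm(n)
p <- plogis(0.5 * X)                   # true propensity score
D <- rbinom(n, 1, p)
Y <- 1 + 2 * D + X + rnorm(n)          # true ATE = 2

# Step 1: propensity score model
ps   <- glm(D ~ X, family = binomial)
phat <- fitted(ps)

# Step 2: outcome models for the treated and control arms
m1  <- lm(Y ~ X, subset = D == 1)
m0  <- lm(Y ~ X, subset = D == 0)
mu1 <- predict(m1, newdata = data.frame(X = X))
mu0 <- predict(m0, newdata = data.frame(X = X))

# Steps 3-4: combine into the doubly robust (AIPW) estimator
theta_dr <- mean(mu1 - mu0 +
                 D * (Y - mu1) / phat -
                 (1 - D) * (Y - mu0) / (1 - phat))
theta_dr  # should be close to 2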

6.12.3 Advantages

  1. Robustness:
    • The estimator is consistent if either the outcome model or the propensity score model is correctly specified.
  2. Efficiency:
    • It often has lower variance compared to using either the outcome model or propensity score model alone.
  3. Flexibility:
    • Can be applied in various settings, including observational studies and randomized experiments with imperfect compliance.

6.12.4 Examples and Applications

  1. Healthcare:
    • Estimating the effect of a new treatment on patient outcomes, where treatment assignment may depend on patient characteristics.
  2. Economics:
    • Evaluating the impact of job training programs on employment, accounting for non-random selection into the program.
  3. Education:
    • Assessing the effect of educational interventions, such as after-school tutoring programs, on student performance, considering potential confounding factors.

6.12.5 Assumptions and Considerations

  1. Consistency:
    • The estimator is consistent provided that at least one of the treatment and outcome models is correctly specified; if both are misspecified, it can be biased.
  2. Overlap:
    • Requires that for every value of covariates \(X\), there is a positive probability of receiving both treatment and control (common support assumption).
  3. No Unmeasured Confounding:
    • Assumes that all confounders affecting both treatment and outcome are observed and correctly included in the models.

6.12.6 Conclusion

Doubly Robust models provide a powerful and flexible approach for causal inference in econometrics, offering robustness against model misspecification and improving efficiency. They are particularly useful in observational studies where the treatment assignment is not random, ensuring more reliable and credible estimates of causal effects.

6.13 Twoway Fixed Effects with Differential Timing

\(y_{it} = \alpha_0 + \delta D_{it} + X_{it}'\theta + \alpha_i + \alpha_t + \epsilon_{it}\)

When researchers estimate this regression these days, they usually use the linear fixed-effects model. These linear panel models have gotten the nickname “twoway fixed effects” because they include both time fixed effects and unit fixed effects.

6.14 Bacon Decomposition

The punchline of the Bacon decomposition theorem is that the twoway fixed effects estimator is a weighted average of all potential 2x2 DD estimates, where the weights depend on both group sizes and the variance in treatment.

6.14.1 Overview

Bacon Decomposition is a method introduced by Goodman-Bacon (2018) for decomposing the overall treatment effect estimated by a Two-Way Fixed Effects (TWFE) regression model in the context of Difference-in-Differences (DiD) settings with variation in treatment timing. The key insight from this decomposition is that the TWFE estimate in such settings can be understood as a weighted average of all possible 2x2 DiD estimates that can be constructed from the data. This decomposition helps identify the sources of bias, especially when treatment effects are heterogeneous or when there are differential pre-treatment trends.

6.14.2 Key Concepts

  1. Two-Way Fixed Effects (TWFE) Models:
    • TWFE models are commonly used in DiD analyses to account for time-invariant differences between units and common shocks over time by including unit and time fixed effects.
    • The model typically looks like: \[ Y_{it} = \alpha_i + \lambda_t + \beta D_{it} + \epsilon_{it} \] where \(Y_{it}\) is the outcome for unit \(i\) at time \(t\), \(\alpha_i\) are unit fixed effects, \(\lambda_t\) are time fixed effects, \(D_{it}\) is the treatment indicator, and \(\beta\) is the treatment effect.
  2. Variation in Treatment Timing:
    • In many DiD applications, units receive treatment at different times rather than simultaneously. This leads to multiple possible comparisons between treated and control units at different points in time.
  3. Bacon Decomposition:
    • The decomposition breaks down the overall TWFE estimate into a weighted average of all possible 2x2 DiD estimates. Each of these estimates compares treated and untreated units in specific periods.
    • The decomposition reveals that the overall estimate is influenced by:
      • Comparisons between early-treated and late-treated units.
      • Comparisons between treated and untreated units at different times.
      • Comparisons within treated units (pre- and post-treatment).

6.14.3 Components of Bacon Decomposition

  1. Early vs. Late Treated Units:
    • Comparing units treated early with those treated later. This can introduce bias if there are differential trends among these groups.
  2. Treated vs. Untreated Units:
    • Standard DiD comparison where treated units are compared to untreated ones, assuming common trends between them.
  3. Within-Unit Comparisons:
    • Comparing outcomes within the same unit before and after treatment.

6.14.4 Formula for Decomposition

The overall TWFE estimate \(\hat{\beta}_{TWFE}\) can be decomposed as: \[ \hat{\beta}_{TWFE} = \sum_{k} w_k \hat{\beta}_k \] where \(\hat{\beta}_k\) are the 2x2 DiD estimates, and \(w_k\) are the weights that depend on the relative timing of treatment and the distribution of the treated and control units over time.
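The decomposition can be computed directly with the bacondecomp package in R. A sketch using the package's bundled castle-doctrine example data (the column names below come from that dataset):

library(bacondecomp)

# Decompose the TWFE DiD estimate into all 2x2 comparisons and weights
df_bacon <- bacon(l_homicide ~ post,
                  data = bacondecomp::castle,
                  id_var = "state",
                  time_var = "year")

# Each row is one 2x2 comparison with its estimate and weight;
# the TWFE coefficient equals the weighted average of the estimates
sum(df_bacon$estimate * df_bacon$weight)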

6.14.5 Implications and Interpretation

  1. Heterogeneous Treatment Effects:
    • When treatment effects vary over time or across units, the TWFE estimate can be biased. Bacon decomposition helps identify how much of the TWFE estimate is driven by comparisons that might be invalid due to treatment effect heterogeneity.
  2. Differential Pre-treatment Trends:
    • If treated and control units follow different pre-treatment trends, this can also bias the TWFE estimate. Bacon decomposition highlights which comparisons are most affected by such trends.
  3. Policy Implications:
    • Understanding the sources of bias through Bacon decomposition can inform better policy evaluations by revealing the need for more appropriate methods or robustness checks in the presence of staggered treatment adoption.

6.14.6 Example

Consider a study evaluating the impact of a new education policy implemented in different schools at different times. Using a TWFE model, the overall treatment effect might be estimated as: \[ \hat{\beta}_{TWFE} = 0.5 \]

Applying Bacon decomposition, we might find that:

  • Comparisons between schools treated in 2018 and those treated in 2020 contribute \(0.3\) to the estimate.

  • Comparisons between treated schools and untreated schools contribute \(0.1\).

  • Comparisons within schools before and after treatment contribute \(0.1\).

If early-treated schools experienced a different trend in outcomes compared to late-treated schools, this could explain the significant contribution from early vs. late comparisons, highlighting potential bias in the overall estimate.

6.14.7 Conclusion

Bacon decomposition provides a nuanced understanding of the TWFE estimates in DiD settings with staggered treatment adoption. By breaking down the overall estimate into its constituent comparisons, researchers can identify and address potential biases due to heterogeneous treatment effects and differential trends, leading to more accurate and reliable causal inferences.

6.14.7.1 Self Driving Cars Experiment

[Source](https://matteocourthoud.github.io/post/synth/)

Suppose you were a ride-sharing platform and you wanted to test the effect of self-driving cars in your fleet.

As you can imagine, there are many limitations to running an A/B test for this type of feature. First of all, it’s complicated to randomize individual rides. Second, it’s a very expensive intervention. Third, and statistically most important, you cannot run this intervention at the ride level. The problem is that there are spillover effects from treated to control units: if self-driving cars are indeed more efficient, they can serve more customers in the same amount of time, reducing the customers available to normal drivers (the control group). This spillover contaminates the experiment and prevents a causal interpretation of the results.

For all these reasons, we select only one city. Given the synthetic vibe of the article we cannot but select… (drum roll)… Miami!

We have information on the largest 46 U.S. cities for the period 2002-2019. The panel is balanced, which means that we observe all cities for all time periods. Self-driving cars were introduced in 2013.

As expected, the groups are not balanced: Miami is more densely populated, poorer, larger, and has a lower employment rate than the other US cities in our sample.

We are interested in understanding the impact of the introduction of self-driving cars on revenue.

One initial idea could be to analyze the data as we would an A/B test, comparing the control and treatment groups. We can estimate the treatment effect as the difference in mean revenue between the treatment and control groups after the introduction of self-driving cars.
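A sketch of that naive comparison in R; the data frame df and its column names (year, revenue, treated) are hypothetical, and, as the imbalance noted above suggests, the resulting estimate is confounded by pre-existing differences such as Miami’s:

library(dplyr)

# Naive post-period difference in means between treated and control cities
df %>%
  filter(year >= 2013) %>%
  group_by(treated) %>%
  summarise(mean_revenue = mean(revenue), .groups = 'drop') %>%
  summarise(diff_in_means = diff(mean_revenue))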