Chapter 6 Difference-in-Differences (DiD) Methods
Difference-in-Differences (DiD) is a quasi-experimental technique used in econometrics to estimate causal relationships. It compares the changes in outcomes over time between a treatment group and a control group.
Some resource links:
Books:
- [Matheus Facure, Causal Inference for the Brave and True: Difference-in-Differences (Python)](https://matheusfacure.github.io/python-causality-handbook/13-Difference-in-Differences.htm)
6.1 Simple Difference-in-Differences (DiD)
Basic Idea: Difference-in-Differences (DiD) is a quasi-experimental design used in econometrics to estimate causal relationships. It compares the changes in outcomes over time between a treatment group and a control group.
Treatment assignment is not random, but we observe both treated and untreated units before and after treatment.
- Under certain structural assumptions, especially parallel outcome trends in the absence of treatment, we can recover the average treatment effect on the treated (ATT).
- Formula: The basic DiD estimator is:
\[ \text{DiD} = (Y_{\text{treat, post}} - Y_{\text{treat, pre}}) - (Y_{\text{control, post}} - Y_{\text{control, pre}}) \]
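As a toy numeric illustration (the numbers are made up): suppose the treated group's mean outcome rises from 10 to 20 across the treatment date, while the control group's rises from 12 to 15. Then
\[ \text{DiD} = (20 - 10) - (15 - 12) = 10 - 3 = 7, \]
so the control group's change (3) stands in for what would have happened to the treated group without treatment, and the remaining 7 is attributed to the treatment under parallel trends.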
Concept:
DiD is used when we have data from before and after a treatment is applied to a treatment group, and we also have a control group that does not receive the treatment.
The key assumption is that in the absence of treatment, the difference between the treatment and control groups would have remained constant over time (parallel trends assumption).
The simple 2x2 DD collapses to the true ATT when the parallel trends assumption holds.
The ATT can be calculated by differencing sample means, but regression can be used instead if we want to control for additional covariates.
If you need to avoid omitted variable bias by controlling for covariates that vary over time, then you may want to use regression. Such strategies are another way of saying that you will need to close a known critical backdoor.
Another reason for the regression formulation is that controlling for appropriate covariates can reduce residual variance and improve the precision of the DD estimate.
Model:
\[ Y_{it} = \alpha + \beta_1 \text{Post}_t + \beta_2 \text{Treated}_i + \beta_3 (\text{Post}_t \times \text{Treated}_i) + \epsilon_{it} \]
where:
\(Y_{it}\) is the outcome variable for entity \(i\) at time \(t\).
\(\text{Post}_t\) is a dummy variable equal to 1 for periods after the treatment and 0 otherwise.
\(\text{Treated}_i\) is a dummy variable equal to 1 for the treatment group and 0 for the control group.
\(\beta_3\) is the DiD estimator, representing the treatment effect (ATT).
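A minimal sketch of this regression in R on simulated data (the data-generating process and all names are illustrative assumptions, with a true ATT of 2):
# Simulate a 2x2 DiD setting and estimate it by OLS.
set.seed(42)
n <- 1000
treated <- rbinom(n, 1, 0.5)   # treatment-group dummy
post <- rbinom(n, 1, 0.5)      # post-period dummy
y <- 1 + 0.5 * treated + 1.5 * post + 2 * treated * post + rnorm(n)
df <- data.frame(y, treated, post)
# The coefficient on post:treated is the DiD estimate of the ATT.
summary(lm(y ~ post * treated, data = df))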
6.2 A Cautionary Note on Inference
The variables of interest in many of these setups only vary at a group level, such as the state, and outcome variables are often serially correlated. In Card and Krueger (1994), it is very likely for instance that employment in each state is not only correlated within the state but also serially correlated.
Bertrand, Duflo, and Mullainathan (2004) point out that conventional standard errors often severely understate the standard deviation of the estimators: standard errors are biased downward, "too small," and therefore tests overreject the null hypothesis. They propose the following solutions:
- Block bootstrapping standard errors.
- Aggregating the data into one pre- and one post-period. This approach ignores the time-series dimension altogether; if there is only one pre- and one post-period and one untreated group, it is as simple as it sounds.
- Clustering standard errors at the group level. You simply adjust standard errors by clustering at the group level, as discussed in the earlier chapter, or at the level of treatment. For state-level panels, that means clustering at the state level, which allows for arbitrary serial correlation in errors within a state over time. This is the most common solution employed.
If the number of groups is small, you may use the wild cluster bootstrap or randomization inference.
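A minimal sketch of group-level clustering with the fixest package (the data frame df and its columns are assumptions):
# Cluster standard errors at the state level, the level of treatment.
# df is an assumed panel with columns y, treat_post, state, and year.
library(fixest)
m <- feols(y ~ treat_post | state + year, data = df, cluster = ~state)
summary(m)  # clustering allows arbitrary serial correlation within a state
For a small number of clusters, the wild cluster bootstrap (e.g., via the fwildclusterboot package) or randomization inference are the usual alternatives.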
6.3 Placebo tests for parallel trends
We can run placebo tests in the pre-treatment years to show that both groups followed similar trends before treatment. However, this does not prove that the groups would have behaved similarly after the treatment in the absence of treatment.
Just because they were similar before does not logically require they be the same after.
Likewise, we are not obligated to believe that counterfactual trends would be the same post-treatment just because they had been similar pre-treatment, without further assumptions about the predictive power of pre-treatment trends.
But this is a nice attempt anyway.
While the test is important, technically pre-treatment similarities are neither necessary nor sufficient to guarantee parallel counterfactual trends (Kahn-Lang and Lang 2019).
Any DD is a combination of a comparison between the treated and the never treated, an early treated compared to a late treated, and a late treated compared to an early treated. Thus, showing only the comparison with the never treated is a misleading presentation of the underlying mechanics of identification in a twoway fixed-effects model with differential timing.
6.4 Two-Way Fixed Effects Model
Concept:
The two-way fixed effects model extends the simple DiD approach by controlling for time-invariant characteristics of the entities and common shocks over time.
It adds fixed effects for both entities and time periods to control for unobserved heterogeneity.
Model:
\[ Y_{it} = \alpha_i + \gamma_t + \beta_3 (\text{Post}_t \times \text{Treated}_i) + \epsilon_{it} \]
where:
\(\alpha_i\) are entity (unit) fixed effects.
\(\gamma_t\) are time fixed effects.
\(\beta_3\) remains the DiD estimator.
Example: Using the job training program example, this model would account for fixed characteristics of individuals, such as inherent employability, through \(\alpha_i\), and time-specific effects, such as economic conditions, through \(\gamma_t\).
This controls for both individual-specific and time-specific unobserved heterogeneity, providing a more robust estimate of the treatment effect.
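A minimal TWFE sketch with fixest (the data frame and column names are assumptions):
# Unit and time fixed effects absorb the Treated and Post dummies,
# so only the interaction d = Post x Treated enters as a regressor.
library(fixest)
twfe <- feols(y ~ d | id + year, data = df, cluster = ~id)
summary(twfe)  # the coefficient on d is the DiD estimate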
6.5 Event Study Methods
Concept:
Event studies extend DiD by examining the dynamics of the treatment effect over multiple periods before and after the treatment.
They allow for the estimation of treatment effects at different time points relative to the treatment event.
As with many contemporary DD designs, Miller et al. (2019) evaluate the pre-treatment leads instead of plotting the raw data by treatment and control. After estimation, they plotted regression coefficients with 95% confidence intervals on the treatment leads and lags. Including leads and lags in the DD model allowed the reader to check both the degree to which the post-treatment effects were dynamic and whether the two groups were comparable on outcome dynamics pre-treatment.
Typical Model:
\[ Y_{ist} = \alpha_s + \gamma_t + \sum_{x=-q}^{-1} \beta_x D_{sx} + \sum_{x=0}^{m} \delta_x D_{sx} + X_{ist}\Gamma + \epsilon_{ist} \]
You include \(q\) leads or anticipatory effects and \(m\) lags or post-treatment effects.
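A minimal event-study sketch with fixest (df and its columns are assumptions; time_to_treat is the period relative to a common treatment date, treated is the group dummy, and t = -1 is the omitted baseline):
# Estimate lead and lag coefficients and plot them.
library(fixest)
es <- feols(y ~ i(time_to_treat, treated, ref = -1) | id + period,
            data = df, cluster = ~id)
iplot(es)  # lead/lag coefficients with 95% confidence intervals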
6.6 Importance of Placebos in DD
It is a simple idea. For the minimum wage study, one candidate placebo falsification might simply be to use data on an alternative type of worker whose wages would not be affected by a binding minimum wage. This reasoning might lead us to consider higher-wage workers as a placebo group.
Many people like to be straightforward and simply fit the same DD design using high-wage employment as the outcome. If the coefficient on minimum wages is zero when using high-wage worker employment as the outcome, but the coefficient on minimum wages for low-wage workers is negative, then we have provided stronger evidence that complements the earlier analysis of low-wage workers.
Another way to perform placebo falsification is the triple difference (DDD) design.
6.6.1 Triple Differences
\[ Y_{ijt} = \alpha + \beta_0 X_{ijt} + \beta_1\gamma_t + \beta_2\delta_j + \beta_3 D_i + \beta_4 (\delta_j \times \gamma_t) + \beta_5 (\gamma_t \times D_i) + \beta_6 (\delta_j \times D_i) + \beta_7 (\delta_j \times \gamma_t \times D_i) + \epsilon_{ijt} \]
where the parameter of interest is \(\beta_7\).
This requires a stacking of the data into a panel structure by group, as well as state. Second, the DDD model requires that you include all possible interactions across the group dummy \(\delta_j\), post-treatment dummy \(\gamma_t\) and treatment state dummy \(D_i\).
The regression must include each dummy independently, each individual interaction, and the triple differences interaction. One of these will be dropped due to multicollinearity, but I include them in the equation so that you can visualize all the factors used in the product of these terms.
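A minimal DDD sketch in R (the data frame and column names are assumptions); the * operator expands to all main effects and interactions, matching the requirement above.
# treat_state: treated-state dummy; grp: affected-group dummy; post: post dummy;
# state: the state identifier used for clustering.
library(fixest)
ddd <- feols(y ~ treat_state * grp * post, data = df, cluster = ~state)
summary(ddd)  # the coefficient on treat_state:grp:post is the DDD estimate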
6.7 Compositional Changes
DD can be applied to repeated cross-sections as well as panel data. But one of the risks of working with repeated cross-sections is that, unlike panel data (e.g., individual-level panel data), they run the risk of compositional changes.
This kind of compositional change is like an omitted variable bias built into the sample itself, caused by time-variant unobservables. For example, the diffusion of the Internet appears to be related to changing samples, as younger music fans are early adopters. Identification of causal effects would require the treatment itself to be exogenous to such changes in composition.
6.8 Key Assumptions
Parallel Trends Assumption: The treatment and control groups would have followed the same trend over time in the absence of the treatment. This is the most critical assumption.
Common Shocks: Both groups are assumed to be subject to the same external factors over time.
6.8.1 Implementation Steps
Identify Treatment and Control Groups: Clearly define which units are exposed to the treatment and which are not.
Collect Data: Obtain data on the outcome of interest for both groups before and after the treatment.
Check Parallel Trends: Visualize and statistically test if the pre-treatment trends of the groups are parallel.
Estimate the Model: Use regression analysis to estimate the DiD effect. The basic regression model is: \[ Y_{it} = \alpha + \beta_1 \text{Post}_t + \beta_2 \text{Treatment}_i + \beta_3 (\text{Post}_t \times \text{Treatment}_i) + \epsilon_{it} \] where \(\beta_3\) is the DiD estimator.
6.8.2 Advantages
Controls for Time-Invariant Differences: Differences between the treatment and control groups that do not change over time are accounted for.
Simple and Intuitive: The method is straightforward to understand and implement.
6.8.3 Limitations
Violation of Parallel Trends: If the parallel trends assumption is violated, the DiD estimate can be biased.
External Validity: The results are only valid for the sample and period studied.
Simultaneous Interventions: Other changes occurring simultaneously with the treatment can confound the results.
6.8.4 Q: How would you test the parallel trends assumption?
- Visual Inspection: Plot the outcome variable over time for both the treatment and control groups. If the trends are parallel before the intervention, it suggests that the parallel trends assumption holds.
- Statistical Tests: Conduct a regression test to formally check for parallel trends. This involves using only the pre-treatment data and checking whether the interaction between time and treatment is statistically significant (a code sketch follows at the end of this list).
Steps:
- Restrict your data to pre-treatment periods.
- Regress the outcome on time, treatment, and their interaction.
- Check if the coefficient of the interaction term is statistically significant.
- Placebo Tests: Conduct a placebo test by pretending that the treatment happened at a different time and checking whether you find a significant effect where none should exist.
Steps:
- Choose a time period before the actual treatment period as the “placebo treatment period.”
- Perform a DiD analysis as if the treatment happened during this placebo period.
- Check for significant effects; finding none supports the parallel trends assumption.
- Event Study Analysis: An event study involves plotting the estimated treatment effects at different time periods before and after the treatment to visually inspect whether pre-treatment effects are close to zero.
Steps:
- Create a series of dummy variables for each time period relative to the treatment.
- Regress the outcome on these time dummies and the interaction terms.
- Plot the coefficients of these interaction terms.
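A minimal sketch of the pre-trend regression test (df, its columns, and treatment_start are assumptions):
# Restrict to pre-treatment periods and interact a time trend with treatment.
library(fixest)
pre <- subset(df, period < treatment_start)
pt <- feols(y ~ period * treated, data = pre, cluster = ~id)
summary(pt)  # a significant period:treated term flags diverging pre-trends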
6.8.5 Q: How would you address potential violations of the parallel trends assumption?
- Pre-Treatment Trends Analysis
Before conducting the DiD analysis, carefully examine the pre-treatment trends. If the trends are not parallel, you might need to reconsider your groups or the methodology.
Visual Inspection: Plot the pre-treatment trends for the treatment and control groups. If they are not parallel, consider this a red flag.
Statistical Testing: Perform a formal test by regressing the outcome on a time indicator, treatment indicator, and their interaction using only pre-treatment data. A significant interaction term suggests non-parallel trends.
- Control for Covariates
Include control variables in your regression model to account for differences between the treatment and control groups that might affect the outcome variable.
- Collect relevant covariates that could influence the outcome.
- Include these covariates in your regression model: \[ Y_{it} = \alpha + \beta_1 \text{Post}_t + \beta_2 \text{Treatment}_i + \beta_3 (\text{Post}_t \times \text{Treatment}_i) + \gamma X_{it} + \epsilon_{it} \] where \(X_{it}\) represents the covariates.
- Matching
Use matching techniques to create a more comparable control group. Matching ensures that the treatment and control groups are similar in observed characteristics (a code sketch follows at the end of this answer).
Propensity Score Matching (PSM): Match treatment units with control units based on the propensity score, which is the probability of receiving treatment given covariates.
Coarsened Exact Matching (CEM): Match units exactly on certain covariates.
- Synthetic Control Method
Construct a synthetic control group that closely resembles the treatment group in the pre-treatment period. This method is particularly useful when you have one treatment unit and many potential control units.
Select control units to construct a weighted combination (synthetic control) that mirrors the treatment unit’s pre-treatment characteristics.
Compare the post-treatment outcomes of the treatment unit with the synthetic control.
- Difference-in-Differences-in-Differences (DiDiD)
If you have an additional control group or variable, you can use DiDiD to control for potential violations. This method adds another layer of difference to control for unobserved confounders.
- Include a third group or dimension to add another difference. For example: \[ Y_{it} = \alpha + \beta_1 \text{Post}_t + \beta_2 \text{Treatment}_i + \beta_3 (\text{Post}_t \times \text{Treatment}_i) + \beta_4 \text{Group}_i + \beta_5 (\text{Group}_i \times \text{Post}_t) + \beta_6 (\text{Group}_i \times \text{Treatment}_i) + \beta_7 (\text{Group}_i \times \text{Post}_t \times \text{Treatment}_i) + \epsilon_{it} \] where \(\text{Group}_i\) represents the additional dimension.
- Sensitivity Analysis
Conduct sensitivity analyses to check how robust your results are to potential violations of the parallel trends assumption.
Placebo Tests: Perform DiD analysis using periods before the actual treatment to ensure no significant effects are detected.
Alternative Specifications: Use different model specifications or subsets of data to check the consistency of your results.
- Instrumental Variables (IV)
If you have a valid instrument, use it to address endogeneity issues that might violate the parallel trends assumption.
Identify an instrument that affects the treatment but not directly the outcome.
Use Two-Stage Least Squares (2SLS) to estimate the treatment effect.
By applying these strategies, you can address potential violations of the parallel trends assumption, ensuring more robust and credible results from your DiD analysis.
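As one concrete illustration of the matching strategy above, here is a minimal propensity score matching sketch with the MatchIt package (the data frame and covariate names are assumptions):
# Match treated and control units on covariates, then run DiD on the matched sample.
library(MatchIt)
m <- matchit(treated ~ x1 + x2, data = df, method = "nearest")
md <- match.data(m)  # matched sample, including matching weights
did <- lm(y ~ post * treated, data = md, weights = weights)
summary(did)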
6.9 Notes
Bertrand, Duflo, and Mullainathan (2004) point out that conventional standard errors usually understate the actual standard deviation of the estimator in DD settings with serially correlated outcomes. The authors recommend clustering the standard errors at the level of randomization (e.g., classes, counties, villages, …).
Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania (1994) by Card and Krueger.
6.10 Extra Considerations
- The two-way fixed effects (TWFE) model can give wrong estimates. This is especially likely when treatments are heterogeneous (differential treatment timing, different treatment sizes, different treatment statuses over time), which can contaminate the estimated treatment effects. "Bad" treatment comparisons can bias the average treatment effect estimate to the point of even reversing its sign.
- The new DiD methods "correct" for these TWFE biases by combining various estimation techniques, such as bootstrapping, inverse probability weighting, matching, influence functions, and imputation, to handle parallel trends, negative weights, covariates, and controls.
6.11 Synthetic Difference-in-Differences (SynthDiD) method
SynthDiD is a generalized version of Synthetic Control Method (SCM) and DiD that combines the strengths of both methods. It enables causal inference with large panels, even with a short pretreatment period.
Synthetic DiD combines the synthetic control method with the difference-in-differences approach [1]. A synthetic control group is constructed using the same approach as in the synthetic control method; however, the treatment effect is estimated by comparing the change in outcomes between the treated unit and the synthetic control group before and after the treatment is introduced. This allows for a more robust estimation of the treatment effect by accounting for pre-existing differences between the treatment and control groups.
In summary, while both methods use a synthetic control group, the synthetic control method estimates treatment effects by comparing the post-treatment outcomes of the treated unit to those of the synthetic control group, while synthetic DiD estimates treatment effects by comparing the change in outcomes between the treated unit and the synthetic control group before and after the treatment is introduced.
It constructs a counterfactual for the treated group by optimally weighting the control group units to minimize the difference between the treated and control groups in the pretreatment period as in SCM.
Then, the treatment effect is estimated by comparing the outcome changes in the treated unit and synthetic control group pre- and post-intervention as in DiD.
6.11.0.1 An Example:
Suppose that we are a company that sells plant-based food products, such as soy milk or soy yogurt, and we operate in multiple countries. Some countries implement new legislation that prohibits us from marketing our plant-based products as ‘milk’ or ‘yogurt’ because it is claimed that only animal products can be marketed as ‘milk’ or ‘yogurt’. Thus, due to this new regulation in some countries, we have to market soy milk as soy drink instead of soy milk, etc. We want to know the impact of this legislation on our revenue as this might help guide our lobbying efforts and marketing activities in different countries.
I simulated a balanced panel dataset that shows the revenue of our company in 30 different countries over 30 periods. Three of the countries implement this legislation in period 20. In the figure below, you can see a snapshot of the data. treat is a dummy variable indicating whether a country has implemented the legislation in a given period; revenue is the revenue in millions of EUR. You can find the simulation and estimation code in this Gist.
# Install and load the required packages
# devtools::install_github("synth-inference/synthdid")
library(synthdid)
library(ggplot2)
library(fixest) # Fixed-effects regression
library(data.table)
# Set seed for reproducibility
set.seed(12345)
source('sim_data.R') # Import simulation function and some utilities
dt <- sim_data()
head(dt)
The data contain 30 units (3 of them treated) and 30 periods (treatment occurs in the last 10); all treated units are treated at the same time.
Next, we convert our panel data into the matrix form required by the synthdid package. Given the outcome, the treatment and control units, and the pre-treatment periods, a synthetic control is created and the treatment effect is estimated with the synthdid_estimate function.
# Convert the data into a matrix
setup = panel.matrices(dt, unit = 'country', time = 'period',
outcome = 'revenue', treatment = 'treat')
# Estimate treatment effect using SynthDiD
tau.hat = synthdid_estimate(setup$Y,
setup$N0,
setup$T0)
print(summary(tau.hat))
To make inferences, we also need to calculate standard errors. I use the jackknife method since I have more than one treated unit; the placebo method is the only option if you have a single treated unit. Given the standard errors, I also calculate the 95% confidence interval for the treatment effect. I report these in the figure below.
When there are multiple treated units (more than one unit that received the treatment or intervention), one common approach to estimating standard errors is using the jackknife method. The jackknife method is a resampling technique where each observation (in this case, each treated unit) is systematically omitted from the dataset, and the analysis is repeated each time to estimate the variance of the treatment effect. This provides a robust estimate of the standard errors that accounts for the potential variability across different treated units.
On the other hand, if there is only one treated unit (a single unit that received the treatment), using the jackknife method becomes impractical because there are not enough units to systematically leave out and still perform meaningful resampling. In such cases, the placebo method becomes a viable option.
The placebo method involves creating placebo or synthetic treated units that mimic the characteristics of the treated unit but did not actually receive the treatment. By comparing the outcomes of the actual treated unit with those of the synthetic placebo units, researchers can estimate the variability and potential impact of the treatment effect more accurately.
Therefore, the choice between the jackknife method and the placebo method depends on the number of treated units available for analysis within the synthetic control framework. Multiple treated units allow for the application of the jackknife method, whereas a single treated unit necessitates the use of the placebo method to estimate standard errors and make reliable inferences about the treatment effect.
# Calculate standard errors
se = sqrt(vcov(tau.hat, method='jackknife'))
te_est <- sprintf('Point estimate for the treatment effect: %1.2f', tau.hat)
CI <- sprintf('95%% CI (%1.2f, %1.2f)', tau.hat - 1.96 * se, tau.hat + 1.96 * se)
# Plot treatment effect estimates
plot(tau.hat)
plot(tau.hat, se.method='jackknife')
In the image below, the estimation results are displayed. Observe how the treated countries and the synthetic control exhibit fairly parallel trends on average (it might not look like perfect parallel trends, but that is not necessary for this example). The average for the treated countries is more variable, primarily because there are only three such countries, resulting in less smooth trends. Transparent gray lines represent the individual control countries. Following the treatment in period 20, a decline in revenue is observed in the treated countries, estimated at 0.51 million EUR as indicated in the graph. This means the new regulation has a negative impact on our company's revenues, and actions should be taken to prevent further declines.
# Check the number of treatment and control countries to report
num_treated <- length(unique(dt[treat==1]$country))
num_control <- length(unique(dt$country))-num_treated
# Create spaghetti plot with top 10 control units
top.controls = synthdid_controls(tau.hat)[1:10, , drop=FALSE]
plot(tau.hat, spaghetti.units=rownames(top.controls),
trajectory.linetype = 1, line.width=.75,
trajectory.alpha=.9, effect.alpha=.9,
diagram.alpha=1, onset.alpha=.9, ci.alpha = .3, spaghetti.line.alpha =.2,
spaghetti.label.alpha = .1, overlay = 1) +
labs(x = 'Period', y = 'Revenue', title = 'Estimation Results',
subtitle = paste0(te_est, ', ', CI, '.'),
caption = paste0('The number of treatment and control units: ', num_treated, ' and ', num_control, '.'))
Let’s plot the weights used to estimate the synthetic control.
# Plot control unit contributions
synthdid_units_plot(tau.hat, se.method='jackknife') +
labs(x = 'Country', y = 'Treatment effect',
caption = 'The black horizontal line shows the actual effect;
the gray ones show the endpoints of a 95% confidence interval.')
ggsave('../figures/unit_weights.png')
In the image below, you can observe how each country is weighted to construct the synthetic control. The treatment effects differ based on the untreated country selected as the control unit.
# Check for pre-treatment parallel trends
plot(tau.hat, overlay=1, se.method='jackknife')
ggsave('../figures/results_simple.png')
# Benchmark: a simple two-way fixed effects regression
fe <- feols(revenue ~ treat, dt, cluster = 'country', panel.id = 'country',
            fixef = c('country', 'period'))
summary(fe)
Now that we understand more about SynthDiD, let's talk about its pros and cons. Like every method, SynthDiD has advantages and disadvantages; here are some to keep in mind when getting started.
Advantages of the SynthDiD method:
- The synthetic control method is usually applied with a few treated and control units and needs a long, balanced pre-treatment period. SynthDiD, by contrast, works well even with a short pre-treatment period [4].
- It is preferred especially because it does not impose a strict parallel trends assumption (PTA) the way DiD does.
- SynthDiD guarantees a suitable quantity of control units, accounts for possible pre-intervention patterns, and may accommodate a degree of endogenous treatment timing [4].
Disadvantages of the SynthDiD method:
- It can be computationally expensive (even with only one treated group/block).
- It requires a balanced panel (i.e., you can only use units observed for all time periods) and identical treatment timing for all treated units.
- It requires enough pre-treatment periods for good estimation; if you do not have enough, regular DiD may be the better choice.
- Computing and comparing average treatment effects for subgroups is tricky. One option is to split the sample into subgroups and compute the average treatment effect for each subgroup.
- Implementing SynthDiD when treatment timing varies can be tricky. With staggered treatment timing, one solution is to estimate the average treatment effect for each treatment cohort and then aggregate the cohort-specific effects into an overall average treatment effect.
Things to note:
- SynthDiD employs regularized ridge (L2) regression while ensuring that the resulting weights sum to one.
- In the pre-treatment matching step, SynthDiD targets the average treatment effect across the entire sample. This may make individual time-period estimates less precise, but the overall average yields an unbiased evaluation.
- The standard errors for the treatment effects are estimated with the jackknife, or with the placebo method if a cohort has only one treated unit.
- The estimator is considered consistent and asymptotically normal, provided that the number of control units and pre-treatment periods is sufficiently large relative to the number of treated units and post-treatment periods.
- In practice, pre-treatment variables play a minor role in SynthDiD, as lagged outcomes hold more predictive power, making the treatment of these variables less critical.
Conclusion: In this section, I introduced the SynthDiD method and discussed its relationship with traditional DiD and SCM. SynthDiD combines the strengths of both, allowing causal inference with large panels even when the pre-treatment period is short. I demonstrated the method using the synthdid package in R. Although it has several advantages, such as not requiring a strict parallel trends assumption, it also has drawbacks, like being computationally expensive and requiring a balanced panel. Overall, SynthDiD is a valuable tool for estimating causal effects from observational data, providing an alternative to traditional DiD and SCM.
6.12 Doubly Robust Models in Econometrics
Doubly Robust (DR) Models are a class of estimators used to estimate causal effects, providing robustness against model misspecification. The key feature of DR models is that they combine elements of both outcome regression and propensity score methods. This dual approach ensures that the estimator remains consistent if at least one of the two models (outcome or treatment model) is correctly specified.
DRDID (doubly robust DiD) targets the Average Treatment Effect on the Treated (ATT) in Difference-in-Differences setups where the parallel trends assumption holds only after conditioning on a vector of pre-treatment covariates.
6.12.1 Key Concepts
- Outcome Model:
- This involves modeling the outcome \(Y\) as a function of covariates \(X\) and treatment \(D\).
- Example: Using a regression model \(E[Y | X, D]\).
- Treatment Model (Propensity Score Model):
- This involves modeling the treatment assignment \(D\) as a function of covariates \(X\).
- Example: Using logistic regression to estimate the propensity score \(P(D = 1 | X)\).
- Doubly Robust Estimator:
- Combines the predictions from both the outcome and treatment models to estimate the average treatment effect (ATE).
- The estimator is “doubly robust” because it remains unbiased if either the outcome model or the treatment model is correctly specified, but not necessarily both.
6.12.2 Steps in Doubly Robust Estimation
- Estimate the Propensity Score:
- Use a logistic regression (or other suitable model) to estimate the probability of treatment given the covariates \(X\): \[ \hat{p}(X) = P(D = 1 | X) \]
- Estimate the Outcome Model:
- Fit a regression model to estimate the expected outcome given covariates \(X\) and treatment \(D\): \[ \hat{E}[Y | X, D] \]
- Compute the Inverse Probability Weights (IPW):
- Calculate the weights based on the estimated propensity scores: \[ W = \frac{D}{\hat{p}(X)} + \frac{1 - D}{1 - \hat{p}(X)} \]
- Calculate the Doubly Robust Estimator:
- Combine the outcome model and the inverse probability weights to adjust the outcomes: \[ \hat{\theta}_{DR} = \frac{1}{n} \sum_{i=1}^n \left( \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{D_i (Y_i - \hat{\mu}_1(X_i))}{\hat{p}(X_i)} - \frac{(1 - D_i) (Y_i - \hat{\mu}_0(X_i))}{1 - \hat{p}(X_i)} \right) \] where \(\hat{\mu}_d(X) = \hat{E}[Y \mid X, D = d]\) are the arm-specific outcome model predictions.
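A minimal sketch of this estimator in R on simulated data (the data-generating process is an illustrative assumption, with a true effect of 2):
# Doubly robust (AIPW) estimation: propensity model + arm-specific outcome models.
set.seed(1)
n <- 5000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(0.5 * x))        # treatment assignment depends on x
y <- 1 + 2 * d + x + rnorm(n)             # true treatment effect = 2
df <- data.frame(y, d, x)
ps <- fitted(glm(d ~ x, family = binomial, data = df))   # propensity scores
mu1 <- predict(lm(y ~ x, data = df[df$d == 1, ]), newdata = df)
mu0 <- predict(lm(y ~ x, data = df[df$d == 0, ]), newdata = df)
ate_dr <- mean(mu1 - mu0 +
               d * (y - mu1) / ps -
               (1 - d) * (y - mu0) / (1 - ps))
ate_dr  # close to 2 if at least one of the two models is right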
6.12.3 Advantages
- Robustness:
- The estimator is consistent if either the outcome model or the propensity score model is correctly specified.
- Efficiency:
- It often has lower variance compared to using either the outcome model or propensity score model alone.
- Flexibility:
- Can be applied in various settings, including observational studies and randomized experiments with imperfect compliance.
6.12.4 Examples and Applications
- Healthcare:
- Estimating the effect of a new treatment on patient outcomes, where treatment assignment may depend on patient characteristics.
- Economics:
- Evaluating the impact of job training programs on employment, accounting for non-random selection into the program.
- Education:
- Assessing the effect of educational interventions, such as after-school tutoring programs, on student performance, considering potential confounding factors.
6.12.5 Assumptions and Considerations
- Consistency:
- Assumes that the treatment and outcome models are correctly specified for the estimator to be unbiased.
- Overlap:
- Requires that for every value of covariates \(X\), there is a positive probability of receiving both treatment and control (common support assumption).
- No Unmeasured Confounding:
- Assumes that all confounders affecting both treatment and outcome are observed and correctly included in the models.
6.12.6 Conclusion
Doubly Robust models provide a powerful and flexible approach for causal inference in econometrics, offering robustness against model misspecification and improving efficiency. They are particularly useful in observational studies where the treatment assignment is not random, ensuring more reliable and credible estimates of causal effects.
6.13 Twoway Fixed Effects with Differential Timing
\(y_{it} = \alpha_0 + \delta D_{it} + X_{it}\gamma + \alpha_i + \alpha_t + \epsilon_{it}\)
When researchers estimate this regression these days, they usually use the linear fixed-effects model. These linear panel models have gotten the nickname “twoway fixed effects” because they include both time fixed effects and unit fixed effects.
6.14 Bacon Decomposition
The punchline of the Bacon decomposition theorem is that the twoway fixed effects estimator is a weighted average of all possible 2x2 DD estimates, where the weights are based on group sizes and the variance in treatment.
6.14.1 Overview
Bacon Decomposition is a method introduced by Goodman-Bacon (2018) for decomposing the overall treatment effect estimated by a Two-Way Fixed Effects (TWFE) regression model in the context of Difference-in-Differences (DiD) settings with variation in treatment timing. The key insight from this decomposition is that the TWFE estimate in such settings can be understood as a weighted average of all possible 2x2 DiD estimates that can be constructed from the data. This decomposition helps identify the sources of bias, especially when treatment effects are heterogeneous or when there are differential pre-treatment trends.
6.14.2 Key Concepts
- Two-Way Fixed Effects (TWFE) Models:
- TWFE models are commonly used in DiD analyses to account for time-invariant differences between units and common shocks over time by including unit and time fixed effects.
- The model typically looks like: \[ Y_{it} = \alpha_i + \lambda_t + \beta D_{it} + \epsilon_{it} \] where \(Y_{it}\) is the outcome for unit \(i\) at time \(t\), \(\alpha_i\) are unit fixed effects, \(\lambda_t\) are time fixed effects, \(D_{it}\) is the treatment indicator, and \(\beta\) is the treatment effect.
- Variation in Treatment Timing:
- In many DiD applications, units receive treatment at different times rather than simultaneously. This leads to multiple possible comparisons between treated and control units at different points in time.
- Bacon Decomposition:
- The decomposition breaks down the overall TWFE estimate into a weighted average of all possible 2x2 DiD estimates. Each of these estimates compares treated and untreated units in specific periods.
- The decomposition reveals that the overall estimate is influenced by:
- Comparisons between early-treated and late-treated units.
- Comparisons between treated and untreated units at different times.
- Comparisons within treated units (pre- and post-treatment).
6.14.3 Components of Bacon Decomposition
- Early vs. Late Treated Units:
- Comparing units treated early with those treated later. This can introduce bias if there are differential trends among these groups.
- Treated vs. Untreated Units:
- Standard DiD comparison where treated units are compared to untreated ones, assuming common trends between them.
- Within-Unit Comparisons:
- Comparing outcomes within the same unit before and after treatment.
6.14.4 Formula for Decomposition
The overall TWFE estimate \(\hat{\beta}_{TWFE}\) can be decomposed as: \[ \hat{\beta}_{TWFE} = \sum_{k} w_k \hat{\beta}_k \] where \(\hat{\beta}_k\) are the 2x2 DiD estimates, and \(w_k\) are the weights that depend on the relative timing of treatment and the distribution of the treated and control units over time.
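A minimal sketch of the decomposition with the bacondecomp R package, using the castle-doctrine example data bundled with the package (treating the package's documented example as illustrative):
# Decompose a TWFE DiD estimate into its 2x2 components and weights.
library(bacondecomp)
df_bacon <- bacon(l_homicide ~ post,
                  data = bacondecomp::castle,
                  id_var = "state",
                  time_var = "year")
# Each row is a 2x2 DiD estimate with its weight; their weighted average
# recovers the TWFE coefficient.
sum(df_bacon$estimate * df_bacon$weight)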
6.14.5 Implications and Interpretation
- Heterogeneous Treatment Effects:
- When treatment effects vary over time or across units, the TWFE estimate can be biased. Bacon decomposition helps identify how much of the TWFE estimate is driven by comparisons that might be invalid due to treatment effect heterogeneity.
- Differential Pre-treatment Trends:
- If treated and control units follow different pre-treatment trends, this can also bias the TWFE estimate. Bacon decomposition highlights which comparisons are most affected by such trends.
- Policy Implications:
- Understanding the sources of bias through Bacon decomposition can inform better policy evaluations by revealing the need for more appropriate methods or robustness checks in the presence of staggered treatment adoption.
6.14.6 Example
Consider a study evaluating the impact of a new education policy implemented in different schools at different times. Using a TWFE model, the overall treatment effect might be estimated as: \[ \hat{\beta}_{TWFE} = 0.5 \]
Applying Bacon decomposition, we might find that:
- Comparisons between schools treated in 2018 and those treated in 2020 contribute \(0.3\) to the estimate.
- Comparisons between treated schools and untreated schools contribute \(0.1\).
- Comparisons within schools before and after treatment contribute \(0.1\).
If early-treated schools experienced a different trend in outcomes compared to late-treated schools, this could explain the significant contribution from early vs. late comparisons, highlighting potential bias in the overall estimate.
6.14.7 Conclusion
Bacon decomposition provides a nuanced understanding of the TWFE estimates in DiD settings with staggered treatment adoption. By breaking down the overall estimate into its constituent comparisons, researchers can identify and address potential biases due to heterogeneous treatment effects and differential trends, leading to more accurate and reliable causal inferences.
6.14.7.1 Self Driving Cars Experiment
[Source](https://matteocourthoud.github.io/post/synth/)
Suppose you were a ride-sharing platform and you wanted to test the effect of self-driving cars in your fleet.
As you can imagine, there are many limitations to running an A/B test for this type of feature. First, it is complicated to randomize individual rides. Second, it is a very expensive intervention. Third, and statistically most important, you cannot run this intervention at the ride level because of spillover effects from treated to control units: if self-driving cars are indeed more efficient, they can serve more customers in the same amount of time, reducing the customers available to normal drivers (the control group). This spillover contaminates the experiment and prevents a causal interpretation of the results.
For all these reasons, we select only one city. Given the synthetic vibe of the article we cannot but select… (drum roll)… Miami!
We have information on the largest 46 U.S. cities for the period 2002-2019. The panel is balanced, which means that we observe all cities for all time periods. Self-driving cars were introduced in 2013.
As expected, the groups are not balanced: Miami is more densely populated, poorer, larger, and has a lower employment rate than the other U.S. cities in our sample.
We are interested in understanding the impact of the introduction of self-driving cars on revenue.
One initial idea could be to analyze the data as we would an A/B test, comparing the control and treatment groups. We could estimate the treatment effect as the difference in mean revenue between the treatment and control groups after the introduction of self-driving cars.
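A minimal sketch of this naive comparison (the data frame df with columns city, year, revenue, and treated is an assumption based on the narrative; self-driving cars arrive in 2013):
# Post-period difference in means between the treated city (Miami) and controls.
post <- subset(df, year >= 2013)
mean(post$revenue[post$treated == 1]) - mean(post$revenue[post$treated == 0])
# Equivalent regression form with standard errors clustered by city.
library(fixest)
summary(feols(revenue ~ treated, data = post, cluster = ~city))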