Chapter 1 Causal Models

1.1 Concepts

1.1.1 Goodness of Fit

I would encourage you not to fixate on R-squared in research projects where the aim is to estimate a causal effect. It is a useful summary measure, but it tells us nothing about causality. The $R^2$ tells us how much of the variation in $y$ is explained by the explanatory variables. When you are estimating a causal effect, though, you are not trying to explain variation in $y$, so if we are interested in the causal effect of a single variable, $R^2$ is irrelevant.
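The point can be illustrated with a small simulation. The sketch below is not from the original notes: it assumes a randomized binary treatment `d`, a true effect of 1.0, and an outcome dominated by unrelated noise, and it uses only the Python standard library. Under those assumptions the estimated effect should land near the truth even though $R^2$ is close to zero.

```python
import random

def mean(v):
    return sum(v) / len(v)

def ols(x, y):
    """Intercept, slope and R^2 from a simple regression of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    r2 = sxy ** 2 / (sxx * syy)
    return my - slope * mx, slope, r2

rng = random.Random(1)
n = 50_000
true_effect = 1.0  # assumed for illustration

# Randomized binary treatment, so there is no confounding by construction.
d = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]
# The outcome is dominated by noise that is unrelated to the treatment.
y = [true_effect * di + rng.gauss(0, 5) for di in d]

intercept, effect_hat, r2 = ols(d, y)
print(f"estimated effect = {effect_hat:.3f}, R^2 = {r2:.4f}")
```

A low $R^2$ here only says that most of the variation in the outcome comes from sources other than the treatment; it says nothing about whether the treatment coefficient is identified.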

1.1.2 Robustness checks and validation methods

Robustness checks and validation methods are essential aspects of evaluating the reliability and credibility of empirical research findings, including those derived from the Synthetic Control Method (SCM). Although they are related and sometimes overlap, they serve distinct purposes in the research process. Here’s a detailed explanation of the differences between them and why each is important:

1.1.2.1 Robustness Checks

Definition: Robustness checks are procedures used to assess the sensitivity and stability of research findings to various assumptions, model specifications, and data perturbations. The goal is to determine whether the results hold under different conditions and to identify any potential weaknesses in the analysis.

Purpose:

Assess Stability: Ensure that the findings are not unduly influenced by specific choices made in the analysis (e.g., selection of control units, predictor variables).
Identify Key Drivers: Determine which aspects of the model or data are most influential in driving the results.
Evaluate Generalizability: Check whether the results are consistent across different sub-samples or alternative model specifications.

1.1.2.2 Validation Methods

Definition: Validation methods are procedures used to confirm that the analytical approach and findings are credible and correctly specified. The goal is to ensure that the methodology accurately captures the causal relationship of interest and that the results are not artifacts of methodological flaws.

Purpose:

Establish Credibility: Demonstrate that the research design and methods are sound and that the findings are credible.
Detect Biases: Identify and correct any biases or errors in the analysis that could distort the results.
Provide Evidence for Causal Claims: Strengthen the argument that the observed effects are truly caused by the intervention rather than other factors.

1.1.2.3 Differences Between Robustness Checks and Validation Methods

Scope and Focus:

Robustness Checks: Focus on testing the sensitivity of the results to various assumptions and choices within the study. They address questions like “Do the results change if we tweak the model or data in specific ways?”
Validation Methods: Focus on verifying the correctness and credibility of the methodology and findings. They address questions like “Is the methodology sound and are the findings believable?”

When Applied:

Robustness Checks: Often applied after the main analysis to ensure the findings are not artifacts of specific decisions or assumptions.
Validation Methods: Applied throughout the research process to ensure that the approach is valid and the results are credible from the outset.

Why Both Are Important:

Robustness Checks:
- Credibility: Helps build confidence that the results are not fragile or overly dependent on specific conditions.
- Transparency: Provides a clear understanding of how various factors influence the findings.
- Comprehensive Insight: Identifies which components of the analysis are most crucial and robust.
Validation Methods:
- Reliability: Ensures that the methodological approach is correct and that the findings are not due to methodological flaws.
- Accuracy: Confirms that the causal claims are well-founded and not spurious.
- Scientific Rigor: Strengthens the overall validity of the research by providing multiple lines of evidence supporting the findings.

Conclusion

Robustness checks and validation methods are complementary approaches that together enhance the credibility and reliability of research findings. Robustness checks focus on the sensitivity and stability of the results, while validation methods ensure the correctness and credibility of the methodology. Both are crucial for demonstrating that the findings are both reliable and valid, thereby providing a comprehensive evaluation of the research’s strength and integrity.

1.2 Directed Acyclic Graphs (DAGs)

Causality runs in one direction: forward in time.
There are no cycles in a DAG. To show reverse causality, one would need to create multiple nodes, most likely with two versions of the same node separated by a time index.
To handle either simultaneity or reverse causality, it is recommended that you take a completely different approach to the problem than the one presented in this chapter.
DAGs explain causality in terms of counterfactuals. That is, a causal effect is defined as a comparison between two states of the world—one state that actually happened when some intervention took on some value and another state that didn’t happen (the “counterfactual”) under some other intervention.
Arrows represent a causal effect between two random variables moving in the intuitive direction of the arrow. The direction of the arrow captures the direction of causality.
Causal effects can happen in two ways. They can either be direct (e.g., D -> Y), or they can be mediated by a third variable (e.g., D -> X -> Y). When they are mediated by a third variable, we are capturing a sequence of events originating with D, which may or may not be important to you depending on the question you’re asking.
A complete DAG will have all direct causal effects among the variables in the graph as well as all common causes of any pair of variables in the graph.

1.2.1 Confounder

Direct path is causal: D -> Y;

Backdoor path is not causal: X -> D and X -> Y

A backdoor path is a process that creates spurious correlations between D and Y that are driven solely by fluctuations in the X random variable.
Leaving a backdoor open creates bias: not controlling for a variable like X in a regression produces omitted variable bias.
We therefore call X a confounder because it jointly determines D and Y, and so confounds our ability to discern the effect of D on Y in naïve comparisons.

Think of the backdoor path like this: Sometimes when D takes on different values, Y takes on different values because D causes Y. But sometimes D and Y take on different values because X takes on different values, and that bit of the correlation between D and Y is purely spurious. The existence of two causal pathways is contained within the correlation between D and Y.

When X is observed and put in the model, then the backdoor path is closed.
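A minimal simulation sketch of this idea follows. The coefficients (0.8, 2.0, and a true effect of 1.0) are assumptions chosen for illustration, and the two-regressor coefficient is obtained by partialling X out of both D and Y (the Frisch-Waugh-Lovell result), which gives the same number as the coefficient on D in a regression of Y on D and X.

```python
import random

def mean(v):
    return sum(v) / len(v)

def slope(x, y):
    """OLS slope from a simple regression of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residualize(y, x):
    """Residuals of y after a simple regression of y on x."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - a - b * xi for xi, yi in zip(x, y)]

rng = random.Random(0)
n = 20_000
true_effect = 1.0  # assumed causal effect of D on Y

X = [rng.gauss(0, 1) for _ in range(n)]                      # confounder
D = [0.8 * x + rng.gauss(0, 1) for x in X]                   # X -> D
Y = [true_effect * d + 2.0 * x + rng.gauss(0, 1)             # D -> Y and X -> Y
     for d, x in zip(D, X)]

naive = slope(D, Y)                                          # backdoor path left open
adjusted = slope(residualize(D, X), residualize(Y, X))       # backdoor closed via X

print(f"naive = {naive:.3f}, adjusted = {adjusted:.3f}, truth = {true_effect}")
```

In this setup the naive slope mixes the causal path with the backdoor path through X, while the adjusted slope should sit close to the assumed true effect.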

1.2.2 Collider

Direct path is causal: D -> Y;

Backdoor path is not causal: D -> X and Y -> X

Like above, there are two ways to get to Y from D.
X is a collider along this backdoor path because the causal effects of D and Y collide at X.
Colliders are special in part because when they appear along a backdoor path, that backdoor path is closed simply because of their presence.

1.2.3 What to do

Open backdoor paths introduce omitted variable bias, and in some cases the bias is so bad that it flips the sign of the estimate entirely.
Our goal is to close these backdoor paths.
And if we can close all of the open backdoor paths, then we can isolate the causal effect of D on Y using one of the research designs and identification strategies discussed in this book.

1.2.4 How to do

First, if you have a confounder that has created an open backdoor path, then you can close that path by conditioning on the confounder. Conditioning requires holding the variable fixed using something like subclassification, matching, regression, or another method. It is equivalent to “controlling for” the variable in a regression.
The second way to close a backdoor path is the appearance of a collider along that backdoor path. Since colliders always close backdoor paths, and conditioning on a collider always opens a backdoor path, choosing to ignore the colliders is part of your overall strategy to estimate the causal effect itself. By not conditioning on a collider, you will have closed that backdoor path and that takes you closer to your larger ambition to isolate some causal effect.

Backdoor Criterion: if there is a confounder, then control for it; if there is a collider, then keep it outside your model.

Avoid controlling for a collider in your model. To identify a collider, carefully analyze the directions and relationships between variables. If a variable acts as a collider (i.e., it is influenced by both the treatment and the outcome), including it in your model can introduce bias. Therefore, do not include colliders in your analysis.

Sample Selection and collider bias

Let’s assume that talent and beauty are independent, but that both are required to become a star actor.

It can be shown in a simulation that the collider bias has created a negative correlation between talent and beauty in the non-movie-star sample as well. Yet we know that there is in fact no relationship between the two variables. This kind of sample selection creates spurious correlations.

A random sample of the full population would be sufficient to show that there is no relationship between the two variables, but by restricting the sample to movie stars only, we introduce a spurious correlation between the two variables of interest.
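The simulation referred to above can be sketched as follows. The selection rule is an assumption made for illustration: talent and beauty are drawn independently, and the top 15% of their sum are labeled movie stars.

```python
import random

def mean(v):
    return sum(v) / len(v)

def corr(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2)
n = 20_000
talent = [rng.gauss(0, 1) for _ in range(n)]
beauty = [rng.gauss(0, 1) for _ in range(n)]  # drawn independently of talent

# Assumed selection rule: the top 15% of talent + beauty become stars.
score = [t + b for t, b in zip(talent, beauty)]
cutoff = sorted(score)[int(0.85 * n)]
star = [s > cutoff for s in score]

def corr_where(flag):
    """Correlation of talent and beauty within the star / non-star subsample."""
    t = [a for a, s in zip(talent, star) if s == flag]
    b = [a for a, s in zip(beauty, star) if s == flag]
    return corr(t, b)

corr_all = corr(talent, beauty)
corr_stars = corr_where(True)
corr_nonstars = corr_where(False)

print(f"full sample: {corr_all:.3f}")
print(f"stars only: {corr_stars:.3f}")
print(f"non-stars only: {corr_nonstars:.3f}")
```

Stardom is the collider here: conditioning on it, in either subsample, induces a negative correlation that is absent in the full population.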

1.3 Bad Controls

Joshua Angrist, a prominent economist known for his work in econometrics, discusses “bad controls” in the context of causal inference. “Bad controls” are variables that, when included in a regression model, can introduce bias rather than help control for it. Here are the key points on how Angrist addresses “bad controls”:

1.3.1 Key Points on “Bad Controls” by Joshua Angrist:

Definition of Bad Controls:
- Bad controls are variables that are themselves affected by the treatment or are post-treatment variables. Including these in your model can distort the causal relationship between the treatment and the outcome.
- They can also be variables that are endogenous, meaning they are correlated with the error term, leading to biased and inconsistent estimates.
Examples of Bad Controls:
- Variables that are outcomes of the treatment: If a variable is influenced by the treatment, including it as a control can create spurious correlations.
- Colliders: Variables that are influenced by both the treatment and the outcome. Controlling for colliders can open a backdoor path, leading to biased estimates.
Why Bad Controls are Problematic:
- Including bad controls can lead to incorrect inferences about the causal effect of the treatment.
- They can introduce bias by creating or amplifying spurious relationships.
Identifying Good Controls:
- Good controls are variables that help to isolate the causal effect by accounting for confounding factors.
- These are typically pre-treatment variables that influence the outcome but are not influenced by the treatment.
Best Practices:
- Focus on pre-treatment variables that are potential confounders: Variables that affect both the treatment and the outcome but are not affected by the treatment.
- Use robustness checks to ensure that the inclusion of controls does not unduly influence the estimates.

1.3.2 Example from Angrist and Pischke’s “Mostly Harmless Econometrics”:

In “Mostly Harmless Econometrics,” Angrist and Pischke illustrate these concepts with practical examples. For instance, they explain how including a variable like occupation in a regression of earnings on college education can be a bad control. This is because occupation is itself influenced by education (the treatment), so controlling for it can obscure the true effect of education on earnings.

1.3.3 Practical Advice:

When building your regression model, carefully consider whether each control variable is a confounder, a mediator, or a collider.
Avoid including variables that lie on the causal path between the treatment and the outcome (mediators).
Avoid controlling for variables that are outcomes of the treatment or are influenced by both the treatment and the outcome (colliders).
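To see what can go wrong, consider a hedged simulation sketch. All names and coefficients are assumptions for illustration: `d` is a randomized treatment, `m` is a post-treatment variable that also depends on unobserved ability `u`, and the total effect of `d` on `y` is 1.5 (0.5 direct plus 1.0 through `m`). The coefficient on `d` when `m` is included is computed by partialling `m` out of both `d` and `y`.

```python
import random

def mean(v):
    return sum(v) / len(v)

def slope(x, y):
    """OLS slope from a simple regression of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residualize(y, x):
    """Residuals of y after a simple regression of y on x."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - a - b * xi for xi, yi in zip(x, y)]

rng = random.Random(3)
n = 50_000
d = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]   # randomized treatment
u = [rng.gauss(0, 1) for _ in range(n)]                      # unobserved ability
m = [di + ui + rng.gauss(0, 1) for di, ui in zip(d, u)]      # post-treatment variable
y = [0.5 * di + mi + ui + rng.gauss(0, 1)                    # outcome
     for di, mi, ui in zip(d, m, u)]

total_effect = slope(d, y)                                   # y ~ d
with_bad_control = slope(residualize(d, m), residualize(y, m))  # coefficient on d in y ~ d + m

print(f"without the bad control: {total_effect:.3f}")
print(f"with the bad control:    {with_bad_control:.3f}")
```

In this particular parameterization the regression without `m` recovers the total effect, while adding `m` pushes the coefficient on `d` toward zero, which is neither the total effect nor the direct effect. The size and direction of the distortion depend on the assumed coefficients.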

1.3.4 Summary:

Joshua Angrist emphasizes the importance of identifying and avoiding bad controls in regression models to ensure unbiased causal inference. By focusing on appropriate pre-treatment controls and being wary of endogenous variables and post-treatment variables, researchers can make more accurate and reliable causal claims.

1.3.5 Unobserved Variable Affecting Only the Dependent Variable

If you have an unobserved variable that affects only the dependent variable and not the independent variables, the primary concern is increased variability in the error term, but it does not bias the coefficient estimates of the independent variables. Here’s a more detailed explanation:

No Endogeneity Problem: Since the unobserved variable does not affect the independent variables, it does not create a correlation between the independent variables and the error term. Hence, it does not introduce endogeneity, and the OLS estimates of the coefficients remain unbiased and consistent.
Increased Variance in Error Term: The presence of an unobserved variable affecting only the dependent variable will increase the variability (variance) of the error term. This leads to less precise (more variable) estimates of the coefficients, but these estimates are still unbiased.
Standard Errors: Due to the increased variability in the error term, the standard errors of the estimated coefficients will be larger, resulting in wider confidence intervals and potentially less statistical power to detect significant effects.

In summary, while the presence of such an unobserved variable does not introduce bias into the coefficient estimates, it affects the precision of these estimates, leading to larger standard errors.
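A small Monte Carlo sketch makes the distinction between bias and precision visible. It is an illustration under assumed values (true slope 2.0, samples of 200, 1,000 replications): the same regression of `y` on `x` is run in a world where the unobserved `w` does not affect `y` and in a world where it does, and `w` is never given to the regression.

```python
import random

def mean(v):
    return sum(v) / len(v)

def sd(v):
    m = mean(v)
    return (sum((a - m) ** 2 for a in v) / (len(v) - 1)) ** 0.5

def slope(x, y):
    """OLS slope from a simple regression of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(4)
true_beta, n, reps = 2.0, 200, 1000

def simulate(w_scale):
    """Slope estimates across repeated samples; w is omitted from the regression."""
    estimates = []
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        w = [rng.gauss(0, 1) for _ in range(n)]  # unobserved, independent of x
        y = [true_beta * xi + w_scale * wi + rng.gauss(0, 1)
             for xi, wi in zip(x, w)]
        estimates.append(slope(x, y))
    return estimates

without_w = simulate(0.0)  # w has no effect on y
with_w = simulate(3.0)     # w affects y only

print(f"mean estimate: {mean(without_w):.3f} vs {mean(with_w):.3f}")
print(f"sd of estimates: {sd(without_w):.3f} vs {sd(with_w):.3f}")
```

Both sets of estimates should center on the assumed true slope; only the spread across replications grows when `w` enters the outcome.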

1.4 External and Internal Validity in Econometrics

1.4.1 Internal Validity

Internal validity refers to the extent to which a study accurately establishes a causal relationship between the treatment (independent variable) and the outcome (dependent variable) within the context of the study. In other words, it measures how well the study avoids biases and errors that can lead to incorrect conclusions about causal relationships.

1.4.1.1 Key Points on Internal Validity:

Causal Inference: Internal validity ensures that the observed effects on the outcome can be confidently attributed to the treatment and not to other confounding factors.
Elimination of Bias: It involves controlling for confounding variables, avoiding omitted variable bias, ensuring proper randomization, and addressing issues like measurement error and simultaneity bias.
Common Threats:
- Confounding Variables: Variables that are correlated with both the treatment and the outcome.
- Selection Bias: Non-random assignment of subjects to treatment and control groups.
- Measurement Error: Inaccurate measurement of variables.
- Attrition: Loss of participants during the study.
- Reverse Causality: Difficulty in determining the direction of causality.
- Omitted Variable Bias: Failing to include a relevant variable that affects both the treatment and the outcome.

1.4.1.2 Examples in Econometrics:

Randomized Controlled Trials (RCTs): RCTs are considered the gold standard for internal validity because random assignment of treatment ensures that, in expectation, confounding variables are balanced across treatment and control groups.
Instrumental Variables (IV): Using instruments to address endogeneity helps ensure that the treatment effect is not biased by omitted variables or reverse causality.

1.4.2 External Validity

External validity refers to the extent to which the results of a study can be generalized beyond the specific context of the study to other settings, populations, times, and circumstances. It measures the applicability of the study’s findings to real-world scenarios outside the study environment.

1.4.2.1 Key Points on External Validity:

Generalizability: Ensures that the conclusions drawn from the study sample can be applied to the broader population or different contexts.
Population Validity: The degree to which the study sample represents the target population.
Ecological Validity: The extent to which study findings can be generalized to other settings or environments.
Temporal Validity: Whether the results hold over different time periods.

1.4.2.2 Common Threats to External Validity:

Non-representative Samples: If the study sample is not representative of the target population, the findings may not be generalizable.
Specific Contexts: Results from a specific geographic location, industry, or demographic may not apply elsewhere.
Temporal Changes: Changes over time in technology, behavior, or policy can limit the generalizability of findings from past studies.

1.4.2.3 Examples in Econometrics:

Field Experiments: Conducting experiments in real-world settings can enhance external validity compared to laboratory experiments.
Replication Studies: Replicating studies in different contexts and with different populations helps assess the robustness and generalizability of findings.
Heterogeneous Treatment Effects: Analyzing how treatment effects vary across different subgroups can provide insights into the external validity of the findings.

1.4.3 Balancing Internal and External Validity

There is often a trade-off between internal and external validity:

High Internal Validity: Studies with strong internal validity, such as RCTs, often have controlled environments that may limit generalizability.
High External Validity: Observational studies and natural experiments might have higher external validity because they are conducted in real-world settings, but they may suffer from issues related to internal validity due to uncontrolled confounding variables.

1.4.3.1 Practical Considerations:

Study Design: Carefully design studies to address both internal and external validity. For instance, use randomization to enhance internal validity while selecting a representative sample to improve external validity.
Mixed Methods: Combining different methodological approaches, such as RCTs for internal validity and observational studies for external validity, can provide a more comprehensive understanding of causal relationships.
Transparency and Replication: Ensure transparency in research design and analysis, and encourage replication studies to verify findings across different contexts and populations.

By understanding and addressing both internal and external validity, researchers can produce more reliable and applicable econometric analyses that contribute to evidence-based decision-making.

1.5 Endogeneity

Endogeneity refers to a situation in econometrics where an explanatory variable is correlated with the error term in a regression model. This correlation can lead to biased and inconsistent estimates of the coefficients, making it difficult to establish causal relationships.

1.5.1 Sources of Endogeneity:

Omitted Variable Bias: When a relevant variable that affects both the dependent and independent variables is left out of the model, its effect is captured by the error term, leading to endogeneity.
Measurement Error: Errors in measuring the independent variable can cause it to be correlated with the error term.
Simultaneity (Reverse Causality): When the independent variable and the dependent variable mutually influence each other, leading to a two-way causation.

1.5.2 Consequences of Endogeneity:

Biased Estimates: The estimated coefficients do not accurately reflect the true relationship between the variables.
Inconsistent Estimates: As the sample size increases, the estimates do not converge to the true population parameters.

1.5.3 Methods to Address Endogeneity:

Instrumental Variables (IV):
- Instrumental Variables: Use variables (instruments) that are correlated with the endogenous explanatory variable but uncorrelated with the error term.
- Two-Stage Least Squares (2SLS): First stage involves regressing the endogenous variable on the instruments. The second stage uses the predicted values from the first stage as the independent variable in the main regression.
Fixed Effects Models:
- Panel Data: Use fixed effects to control for time-invariant unobserved heterogeneity.
- Difference-in-Differences (DiD): Control for unobserved confounding by comparing changes over time between treatment and control groups.
Control Function Approach:
- Include the residuals from the first stage regression of the endogenous variable on instruments in the second stage regression to account for endogeneity.
Natural Experiments:
- Utilize exogenous variations caused by external events or policies that affect the treatment variable but are unrelated to the error term.
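The two-stage and control function procedures can be sketched on simulated data. Everything in the data-generating process is an assumption for illustration: `z` is a valid instrument, `u` is an unobserved confounder, and the true coefficient on the endogenous regressor `x` is 1.0. With a single endogenous regressor in a linear model, the two procedures give the same point estimate.

```python
import random

def mean(v):
    return sum(v) / len(v)

def slope(x, y):
    """OLS slope from a simple regression of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def fit(x, y):
    """Fitted values and residuals from a simple regression of y on x."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    fitted = [a + b * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, resid

rng = random.Random(5)
n = 50_000
beta = 1.0  # assumed true effect of x on y

z = [rng.gauss(0, 1) for _ in range(n)]                        # instrument
u = [rng.gauss(0, 1) for _ in range(n)]                        # unobserved confounder
x = [0.7 * zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]  # endogenous regressor
y = [beta * xi + 2.0 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

ols = slope(x, y)            # biased: x is correlated with the error through u

x_hat, v_hat = fit(z, x)     # first stage: x on z
tsls = slope(x_hat, y)       # second stage: y on the first-stage fitted values

# Control function: y on x and v_hat; the coefficient on x is obtained by
# partialling v_hat out of both x and y.
_, x_res = fit(v_hat, x)
_, y_res = fit(v_hat, y)
cf = slope(x_res, y_res)

print(f"OLS = {ols:.3f}, 2SLS = {tsls:.3f}, control function = {cf:.3f}")
```

Note that running the second stage by hand gives the right point estimate but not the right standard errors, because the second-stage residuals are computed with the fitted rather than the actual regressor; dedicated IV routines correct for this.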

1.6 Reduced Form Model

Reduced Form Models refer to econometric models where the endogenous variables are expressed solely in terms of exogenous variables and error terms. These models simplify the relationship between variables by avoiding the need to specify the underlying structural model, focusing instead on the observed correlations.

1.6.1 Characteristics of Reduced Form Models:

Simplified Representation: Reduced form models express endogenous variables directly as functions of exogenous variables and error terms.
Focus on Exogeneity: They rely on exogenous variation to identify causal effects, avoiding direct specification of the structural relationships between variables.

1.6.2 Uses of Reduced Form Models:

Policy Evaluation: Reduced form models are often used in policy evaluation to estimate the causal impact of policies by leveraging exogenous variation.
Instrumental Variables: In IV estimation, the first stage regression (predicting the endogenous variable with instruments) is a reduced form model.
Natural Experiments: Reduced form models are frequently used in natural experiments where exogenous shocks provide a source of variation.

1.6.3 Example of a Reduced Form Model:

Suppose we want to estimate the impact of education ($E$) on earnings ($Y$):

Structural Model: \[ Y = \alpha + \beta E + \epsilon \]
Endogeneity Problem:
- Education ($E$) might be endogenous due to omitted variables like ability or family background.
Reduced Form Model:
- Use an instrument $Z$ (e.g., proximity to a college) that affects education but is exogenous with respect to earnings:
$E = \pi_0 + \pi_1 Z + \nu$
- The reduced form equation for earnings in terms of the instrument:
$Y = \gamma_0 + \gamma_1 Z + \eta$

Here, $\gamma_1$ provides an estimate of the causal effect of $Z$ on $Y$, which, under certain conditions (the instrument is relevant and affects $Y$ only through $E$), can be used to infer the effect of $E$ on $Y$.
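One way to make these conditions concrete is to substitute the first-stage equation into the structural model. The derivation below is a sketch that assumes $Z$ is uncorrelated with $\epsilon$ (exogeneity and exclusion) and that $\pi_1 \neq 0$ (relevance):

\[ Y = \alpha + \beta(\pi_0 + \pi_1 Z + \nu) + \epsilon = (\alpha + \beta\pi_0) + \beta\pi_1 Z + (\beta\nu + \epsilon) \]

Matching terms with the reduced form equation for earnings gives $\gamma_0 = \alpha + \beta\pi_0$, $\gamma_1 = \beta\pi_1$, and $\eta = \beta\nu + \epsilon$, so that

\[ \beta = \frac{\gamma_1}{\pi_1} \]

With a single instrument, this ratio of the reduced-form coefficient to the first-stage coefficient is the indirect least squares (Wald) form of the IV estimator.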

In summary, understanding and addressing endogeneity is crucial for accurate causal inference in econometrics. Reduced form models provide a simplified framework to estimate relationships using exogenous variation, often serving as a preliminary step before more complex structural modeling.

1.7 Standard Errors

Homoskedasticity Assumption:

In linear regression, we assume that the variance of the error term is constant across all levels of the independent variables, i.e., $Var(\epsilon | X) = \sigma^2$.

Violation: If there is heteroscedasticity (non-constant variance of errors), the OLS estimates remain unbiased, but they are no longer efficient, and the standard errors are biased, leading to unreliable hypothesis tests. Heteroscedasticity-Robust standard errors or Generalized Least Squares (GLS) can be used to address heteroscedasticity.

Eicker-Huber-White: heteroskedasticity-robust standard errors.
Cluster-robust standard errors (e.g., clustered by geographic unit).
Without the homoskedasticity assumption, the OLS estimator is still unbiased but no longer efficient. Using robust standard errors does not change the OLS coefficient estimates; it changes only the standard errors.
Without constant variance, OLS no longer has the minimum variance among linear unbiased estimators, and the conventional standard error estimates are biased.
In real life, errors will mostly be heteroskedastic.
The usual solution for heteroskedasticity is to use ‘robust’ standard errors.

1.7.1 Heteroskedasticity-Consistent Standard Errors

Heteroskedasticity-consistent standard errors, also known as robust standard errors or sandwich standard errors, are a technique for obtaining valid standard errors in the presence of heteroskedasticity. They are “robust” because they do not assume that the error terms have constant variance (homoscedasticity), which makes them useful for hypothesis testing and confidence intervals when that OLS assumption is violated.

1.7.2 Why Use Sandwich Standard Errors?

In OLS regression, if the assumption of homoscedasticity is violated (i.e., the error variance is not constant), the usual standard errors of the estimated coefficients are biased. This bias can lead to incorrect inferences, such as invalid hypothesis tests and confidence intervals. Sandwich standard errors correct for this bias, providing more reliable inference.

1.7.2.1 How It Works

The sandwich estimator adjusts the standard errors of the OLS estimates to account for heteroscedasticity. The name “sandwich” comes from the structure of the formula, where the “bread” parts are the matrices that involve the model’s design matrix, and the “meat” part is a matrix involving the residuals.
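In matrix form the basic (HC0) version is $(X'X)^{-1} X' \mathrm{diag}(\hat{u}_i^2) X (X'X)^{-1}$, with $(X'X)^{-1}$ as the bread and $X' \mathrm{diag}(\hat{u}_i^2) X$ as the meat. The sketch below computes the slope's standard error both ways for a simple regression, where the slope element reduces to $\sum_i (x_i - \bar{x})^2 \hat{u}_i^2 / (\sum_i (x_i - \bar{x})^2)^2$. The data-generating process (error spread proportional to $|x|$) is an assumption chosen to make the heteroskedasticity obvious, and no small-sample correction is applied.

```python
import random

rng = random.Random(6)
n = 5_000
x = [rng.gauss(0, 1) for _ in range(n)]
# Error spread grows with |x|, so the errors are heteroskedastic by construction.
y = [1.0 + 2.0 * xi + abs(xi) * rng.gauss(0, 1) for xi in x]

mx = sum(x) / n
my = sum(y) / n
dx = [xi - mx for xi in x]                 # regressor with the intercept partialled out
sxx = sum(d * d for d in dx)

b1 = sum(d * (yi - my) for d, yi in zip(dx, y)) / sxx
b0 = my - b1 * mx
resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]

# Classical standard error: one common error variance for every observation.
sigma2 = sum(r * r for r in resid) / (n - 2)
se_classical = (sigma2 / sxx) ** 0.5

# Sandwich (HC0): bread = 1 / sxx, meat = sum of dx_i^2 * resid_i^2.
meat = sum((d * r) ** 2 for d, r in zip(dx, resid))
se_hc0 = (meat / sxx ** 2) ** 0.5

print(f"slope = {b1:.3f}")
print(f"classical SE = {se_classical:.4f}, robust (HC0) SE = {se_hc0:.4f}")
```

The coefficient estimate is identical under both approaches; only the standard error differs, and here the classical formula understates the uncertainty.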

1.7.3 Clustering Standard Errors

In the real world, though, you can never assume that errors are independent draws from the same distribution. You need to know how your variables were constructed in the first place in order to choose the correct error structure for calculating your standard errors. If you have aggregate variables, like class size, then you’ll need to cluster at that level. If some treatment occurred at the state level, then you’ll need to cluster at that level.
When the units of analysis are clustered into groups and the researcher suspects that the errors are correlated within (but not across) groups, it may be appropriate to employ variance estimators that are robust to the clustered nature of the data.
When we cluster standard errors at the state level, we allow for arbitrary serial correlation within state.
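Multi-way clustering extends this idea to settings where errors may be correlated along more than one dimension at once (for example, within state and within year).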

1.7.3.1 When Should You Adjust Standard Errors for Clustering?

Abadie et al. (2022)

Formally, clustered standard errors adjust for the correlations induced by sampling the outcome variable from a data-generating process with unobserved cluster-level components.

Source

The authors argue that there are two reasons for clustering standard errors:

1- a sampling design reason, which arises because you have sampled data from a population using clustered sampling, and want to say something about the broader population;

2- and an experimental design reason, where the assignment mechanism for some causal treatment of interest is clustered.

Let me go through each in turn, by way of examples, and end with some of their takeaways.

A Sampling Design reason

Consider running a simple Mincer earnings regression of the form: Log(wages) = a + b*years of schooling + c*experience + d*experience^2 + e

You present this model, and are deciding whether to cluster the standard errors. Referee 1 tells you “the wage residual is likely to be correlated within local labor markets, so you should cluster your standard errors by state or village.” But referee 2 argues “The wage residual is likely to be correlated for people working in the same industry, so you should cluster your standard errors by industry”, and referee 3 argues that “the wage residual is likely to be correlated by age cohort, so you should cluster your standard errors by cohort”. What should you do?

Under the sampling perspective, what matters for clustering is how the sample was selected and whether there are clusters in the population of interest that are not represented in the sample. So, we can imagine different scenarios here:

You want to say something about the association between schooling and wages in a particular population, and are using a random sample of workers from this population. Then there is no need to adjust the standard errors for clustering at all, even if clustering would change the standard errors.
The sample was selected by randomly sampling 100 towns and villages from within the country, and then randomly sampling people in each; and your goal is to say something about the return to education in the overall population. Here you should cluster standard errors by village, since there are villages in the population of interest beyond those seen in the sample.
This same logic makes it clear why you generally wouldn’t cluster by age cohort (it seems unlikely that we would randomly sample some age cohorts and not others, and then try and say something about all ages);
and that we would only want to cluster by industry if the sample was drawn by randomly selecting a sample of industries, and then sampling individuals from within each.

Even in the second case, Abadie et al. note that the usual robust (Eicker-Huber-White or EHW) standard errors and the clustered standard errors (which they call Liang-Zeger or LZ standard errors) can both be correct; they are simply correct for different estimands. That is, if you are content with saying something about the particular sample of individuals you have, without trying to generalize to the population, the EHW standard errors are all you need; but if you want to say something about the broader population, the LZ standard errors are necessary.

The Experimental Design Reason for Clustering

The second reason for clustering is the one we are probably more familiar with, which is when clusters of units, rather than individual units, are assigned to a treatment. Let’s take the same equation as above, but assume that we have a binary treatment that assigns more schooling to people. So now we have: Log(wages) = a + b*Treatment + e

Then if the treatment is assigned at the individual level, there is no need to cluster (*).

There has been much confusion about this, as Chris Blattman explored in two earlier posts about this issue (the fabulously titled clusterjerk and clusterjerk the sequel), and I still occasionally get referees suggesting I try clustering by industry or something similar in an individually-randomized experiment. This Abadie et al. paper is now finally a good reference to explain why this is not necessary.

(*) unless you are using multiple time periods, and then you will want to cluster by individual, since the unit of randomization is individual, and not individual-time period.

What if your treatment is assigned at the village level? Then cluster by village. This is also why you want to cluster difference-in-differences at the state level when you have a source of variation that comes from differences across states, and why a “treatment” like being on one side of a border vs the other is problematic (because you have only 2 clusters).
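The village-level case can be sketched with simulated data. The setup is assumed for illustration: 200 villages of 50 people, treatment assigned by village, and a village-level shock shared by everyone in the village. The EHW and cluster-robust (Liang-Zeger) standard errors of the treatment coefficient are computed by hand, without small-sample corrections.

```python
import random

rng = random.Random(7)
n_clusters, per_cluster = 200, 50

x, y, g = [], [], []
for c in range(n_clusters):
    treated = 1.0 if rng.random() < 0.5 else 0.0   # assigned at the village level
    shock = rng.gauss(0, 1)                        # error component shared within village
    for _ in range(per_cluster):
        x.append(treated)
        g.append(c)
        y.append(1.0 * treated + shock + rng.gauss(0, 1))

n = len(x)
mx = sum(x) / n
my = sum(y) / n
dx = [xi - mx for xi in x]
sxx = sum(d * d for d in dx)

b1 = sum(d * (yi - my) for d, yi in zip(dx, y)) / sxx
b0 = my - b1 * mx
resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
scores = [d * r for d, r in zip(dx, resid)]

# EHW: treats every observation's score as independent.
se_ehw = sum(s * s for s in scores) ** 0.5 / sxx

# Liang-Zeger: sum the scores within each village before squaring, which
# allows arbitrary correlation of errors inside a village.
cluster_sums = [0.0] * n_clusters
for gi, s in zip(g, scores):
    cluster_sums[gi] += s
se_cluster = sum(s * s for s in cluster_sums) ** 0.5 / sxx

print(f"treatment coefficient = {b1:.3f}")
print(f"EHW SE = {se_ehw:.4f}, clustered SE = {se_cluster:.4f}")
```

Because both the treatment and part of the error are constant within a village, the effective number of independent observations is closer to the number of villages than to the number of people, and the clustered standard error is correspondingly larger.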

1.8 Types of Biases

In econometrics and statistical analysis, various types of biases can affect the validity and reliability of estimates and inferences. Here are some common types of bias:

1.8.1 Selection Bias

Definition: Occurs when the sample is not representative of the population due to non-random selection of observations.

Examples:

Sample Selection Bias: When individuals are selected into the sample based on criteria related to the outcome of interest (e.g., studying the impact of education on earnings but only sampling employed individuals).
Survivorship Bias: When only surviving or existing subjects are included in the analysis, ignoring those that have dropped out or failed (e.g., analyzing the performance of mutual funds without including funds that have closed).
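Selection on the outcome can be illustrated with a short simulation. This is a stylized sketch, not an estimate from real data: `x` stands in for (standardized) education, `y` for a latent earnings index with a true slope of 1.0, and only individuals with `y > 0` are assumed to be observed, loosely mimicking a sample of employed people.

```python
import random

def mean(v):
    return sum(v) / len(v)

def slope(x, y):
    """OLS slope from a simple regression of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(9)
n = 40_000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 * xi + rng.gauss(0, 1) for xi in x]   # assumed true slope of 1.0

slope_full = slope(x, y)

# Keep only observations selected on the outcome itself.
kept = [(xi, yi) for xi, yi in zip(x, y) if yi > 0]
slope_selected = slope([k[0] for k in kept], [k[1] for k in kept])

print(f"full population: {slope_full:.3f}")
print(f"selected sample: {slope_selected:.3f}")
```

In this setup the slope in the selected sample is pulled toward zero, even though the relationship in the full population is unchanged.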

1.8.2 2. Omitted Variable Bias

Definition: Arises when a relevant variable that affects both the dependent and independent variables is left out of the model, causing the included variables to capture the effect of the omitted variable.

Examples:

Studying the effect of education on earnings without controlling for ability, where ability affects both education and earnings.
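The education and ability example can be sketched as a simulation. All coefficients here are assumptions chosen for illustration (a true return to education of 1, an ability effect of 2, and education loading 0.5 on ability). Under the standard omitted variable formula, the short regression slope should converge to $\beta + \gamma \cdot \mathrm{Cov}(x, a)/\mathrm{Var}(x) = 1 + 2 \cdot 0.5/1.25 = 1.8$. The long regression is computed by partialling ability out of both variables (the Frisch-Waugh-Lovell approach), using hypothetical helper functions.

```python
import random

def ols_slope(xs, ys):
    """Slope from a simple regression of ys on xs (with intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def residualize(ys, xs):
    """Residuals from a simple regression of ys on xs (with intercept)."""
    slope = ols_slope(xs, ys)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return [y - mean_y - slope * (x - mean_x) for x, y in zip(xs, ys)]

rng = random.Random(2)
n = 20000
ability = [rng.gauss(0, 1) for _ in range(n)]
education = [0.5 * a + rng.gauss(0, 1) for a in ability]
earnings = [1.0 * s + 2.0 * a + rng.gauss(0, 1) for s, a in zip(education, ability)]

# Short regression: ability is omitted.
short_slope = ols_slope(education, earnings)

# Long regression: partial ability out of both education and earnings first.
long_slope = ols_slope(residualize(education, ability),
                       residualize(earnings, ability))

print(f"education slope, ability omitted:    {short_slope:.2f}")
print(f"education slope, ability controlled: {long_slope:.2f}")
```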

1.8.3 3. Measurement Bias

Definition: Occurs when there are errors in measuring the variables, leading to inaccuracies in the estimated relationships.

Examples:

Systematic Measurement Error: Consistent, predictable errors (e.g., a scale that always adds 2 pounds).
Random Measurement Error: Errors that vary without pattern (e.g., human errors in data entry).
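The two kinds of error have different consequences for a regression slope, which a small simulation can illustrate. The setup is assumed for the example: a true weight with standard deviation 10, an outcome with a true slope of 0.5 on weight, a scale that always adds 2 pounds, and a second scale with random noise of standard deviation 10. Under classical measurement error in a regressor, the slope shrinks by the factor $\mathrm{Var}(x)/(\mathrm{Var}(x) + \mathrm{Var}(u))$, which is 0.5 here.

```python
import random

def ols_slope(xs, ys):
    """Slope from a simple regression of ys on xs (with intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(3)
n = 20000
true_weight = [rng.gauss(150, 10) for _ in range(n)]
outcome = [0.5 * w + rng.gauss(0, 1) for w in true_weight]   # true slope is 0.5

# Systematic error: the scale always adds 2 pounds.
shifted_weight = [w + 2 for w in true_weight]

# Random error: noise with the same variance as the true weight.
noisy_weight = [w + rng.gauss(0, 10) for w in true_weight]

slope_true = ols_slope(true_weight, outcome)
slope_shifted = ols_slope(shifted_weight, outcome)
slope_noisy = ols_slope(noisy_weight, outcome)

print(f"slope, true weight:        {slope_true:.3f}")
print(f"slope, constant +2 error:  {slope_shifted:.3f}")
print(f"slope, random noise error: {slope_noisy:.3f}")
```

In this sketch the constant error leaves the slope unchanged (it only moves the intercept), while the random error attenuates the slope toward zero.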

1.8.4 4. Response Bias

Definition: Happens when respondents give inaccurate or false answers, often due to social desirability, recall issues, or misunderstanding the questions.

Examples:

Survey participants underreporting their alcohol consumption due to social desirability.

1.8.5 5. Attrition Bias

Definition: Results from systematic differences between those who drop out of a study and those who remain, leading to a non-representative sample over time.

Examples:

A long-term study on the effects of a diet where participants who do not see results drop out, leaving only those who benefit.

1.8.6 6. Publication Bias

Definition: Arises when studies with significant or positive results are more likely to be published than studies with non-significant or negative results.

Examples:

Meta-analyses showing inflated effects due to the exclusion of unpublished studies with null results.
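The inflation in a meta-analysis can be illustrated with a simulated literature. The numbers are assumptions for the sketch: a true effect of 0.1, every study having a standard error of 0.1, and a publication rule that keeps only estimates with a z-statistic above 1.96.

```python
import random
import statistics

TRUE_EFFECT = 0.1       # assumed true effect
STANDARD_ERROR = 0.1    # assumed to be the same for every study

rng = random.Random(4)
estimates = [rng.gauss(TRUE_EFFECT, STANDARD_ERROR) for _ in range(5000)]

# Publication filter: only positive, statistically significant results appear.
published = [b for b in estimates if b / STANDARD_ERROR > 1.96]

mean_all = statistics.mean(estimates)
mean_published = statistics.mean(published)

print(f"studies run: {len(estimates)}, published: {len(published)}")
print(f"mean estimate, all studies:       {mean_all:.3f}")
print(f"mean estimate, published studies: {mean_published:.3f}")
```

Because every published estimate must exceed 1.96 standard errors, the published average is necessarily above 0.196 in this setup, roughly double the assumed true effect or more.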

1.8.7 7. Survivorship Bias

Definition: Occurs when only successful subjects or cases are considered, ignoring those that failed or were excluded from the sample.

Examples:

Analyzing the performance of companies that are currently listed on the stock exchange, ignoring those that went bankrupt.

1.8.8 8. Recall Bias

Definition: Happens when participants do not remember past events accurately, leading to inaccurate data.

Examples:

Patients in a medical study failing to accurately recall past health behaviors.

1.8.9 9. Confirmation Bias

Definition: The tendency to search for, interpret, and remember information that confirms pre-existing beliefs, leading to biased outcomes.

Examples:

Researchers focusing on data that supports their hypothesis while disregarding data that contradicts it.

1.8.10 10. Confounding Bias

Definition: Occurs when the effect of the primary explanatory variable on the outcome is mixed with the effect of another variable (confounder) that is related to both the explanatory variable and the outcome.

Examples:

Studying the effect of smoking on lung cancer without controlling for age, where age is a confounder.
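One simple way to see confounding by age is to compare a naive risk difference with one computed within age groups. The simulation below is a sketch with made-up probabilities (they are not epidemiological estimates): older people are assumed more likely to smoke, smoking is assumed to add 0.10 to cancer risk, and being old is assumed to add 0.15.

```python
import random

rng = random.Random(5)
rows = []
for _ in range(100000):
    old = rng.random() < 0.5
    smoker = rng.random() < (0.6 if old else 0.3)   # age affects smoking
    risk = 0.05 + 0.10 * smoker + 0.15 * old        # age also affects the outcome
    cancer = rng.random() < risk
    rows.append((old, smoker, cancer))

def risk_difference(data):
    """Cancer rate among smokers minus the rate among non-smokers."""
    smokers = [c for _, s, c in data if s]
    non_smokers = [c for _, s, c in data if not s]
    return sum(smokers) / len(smokers) - sum(non_smokers) / len(non_smokers)

# Naive comparison mixes the smoking effect with the age effect.
naive = risk_difference(rows)

# Stratify by age, then average the within-stratum differences,
# weighting each stratum by its share of the sample.
adjusted = 0.0
for age_group in (False, True):
    stratum = [r for r in rows if r[0] == age_group]
    adjusted += len(stratum) / len(rows) * risk_difference(stratum)

print(f"naive risk difference:        {naive:.3f}")
print(f"age-adjusted risk difference: {adjusted:.3f}")
```

Under these assumptions the naive difference is about 0.145, while the age-adjusted difference recovers the assumed effect of 0.10.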

1.8.11 11. Endogeneity Bias

Definition: Arises when an explanatory variable is correlated with the error term in the model, often due to omitted variables, measurement error, or reverse causality.

Examples:

Estimating the impact of education on earnings without accounting for the fact that higher ability individuals are more likely to obtain more education and earn higher wages.

1.8.12 12. Non-Response Bias

Definition: Occurs when individuals who do not respond to a survey differ in meaningful ways from those who do respond.

Examples:

A survey on household income where higher-income households are less likely to respond, skewing results towards lower-income households.

1.8.13 13. Observer Bias

Definition: Happens when researchers’ expectations or knowledge influence the outcome of the study or the interpretation of results.

Examples:

A researcher subtly influencing participants’ responses in a study on therapy effectiveness due to their belief in the therapy’s efficacy.

1.8.14 14. Overfitting Bias

Definition: In predictive modeling, overfitting occurs when the model is too complex and captures noise rather than the underlying relationship, leading to poor generalization to new data.

Examples:

A regression model with too many parameters that fits the training data very well but performs poorly on validation data.
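The gap between training and validation performance can be shown with a small sketch. Instead of a regression with many parameters, it uses a 1-nearest-neighbour predictor as a stand-in for a model flexible enough to memorize the training data; that substitution, the linear data-generating process, and all function names are assumptions made to keep the example dependency-free.

```python
import random

def fit_line(xs, ys):
    """Return (intercept, slope) from a simple least-squares fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def one_nn_predict(x, train_x, train_y):
    """Predict with the outcome of the single closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def mse(predictions, ys):
    return sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(ys)

rng = random.Random(6)

def draw(n):
    xs = [rng.random() for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0, 1) for x in xs]   # linear signal plus noise
    return xs, ys

train_x, train_y = draw(200)
val_x, val_y = draw(1000)

intercept, slope = fit_line(train_x, train_y)
train_mse_line = mse([intercept + slope * x for x in train_x], train_y)
val_mse_line = mse([intercept + slope * x for x in val_x], val_y)

train_mse_nn = mse([one_nn_predict(x, train_x, train_y) for x in train_x], train_y)
val_mse_nn = mse([one_nn_predict(x, train_x, train_y) for x in val_x], val_y)

print(f"linear fit: train MSE {train_mse_line:.2f}, validation MSE {val_mse_line:.2f}")
print(f"1-NN:       train MSE {train_mse_nn:.2f}, validation MSE {val_mse_nn:.2f}")
```

The memorizing model fits the training data perfectly because it reproduces the noise, and it pays for that on the validation set, where the simple line does better.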

1.8.15 Addressing Biases

To mitigate these biases, researchers can use various strategies, including:

Randomization: Randomly assigning subjects to treatment and control groups to avoid selection bias and confounding.
Control Groups: Including control groups to help identify causal effects.
Instrumental Variables: Using instruments to address endogeneity.
Robustness Checks: Performing sensitivity analyses to check the stability of results under different assumptions.
Data Cleaning and Validation: Ensuring accurate measurement and data entry.
Blinding: Keeping participants and researchers unaware of group assignments to reduce observer and response biases.

Understanding and addressing these biases is crucial for improving the validity and reliability of econometric analyses.
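As one illustration of these strategies, the sketch below applies an instrumental variable to the education and ability problem. The data-generating process is entirely assumed: ability is unobserved, the instrument `z` shifts education but has no direct effect on earnings, and the true return to education is 1. The estimator shown is the simple ratio $\mathrm{Cov}(z, y)/\mathrm{Cov}(z, x)$ for a single instrument and a single regressor, not a general two-stage least squares routine.

```python
import random

def covariance(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

rng = random.Random(7)
n = 20000
ability = [rng.gauss(0, 1) for _ in range(n)]      # unobserved by the analyst
instrument = [rng.gauss(0, 1) for _ in range(n)]   # hypothetical valid instrument
education = [z + a + rng.gauss(0, 1) for z, a in zip(instrument, ability)]
earnings = [1.0 * s + 2.0 * a + rng.gauss(0, 1) for s, a in zip(education, ability)]

# OLS: biased because education is correlated with omitted ability.
ols_estimate = covariance(education, earnings) / covariance(education, education)

# IV: uses only the variation in education that comes from the instrument.
iv_estimate = covariance(instrument, earnings) / covariance(instrument, education)

print(f"OLS estimate: {ols_estimate:.2f}")
print(f"IV estimate:  {iv_estimate:.2f}")
```

In this setup OLS converges to about 1.67, while the IV ratio recovers the assumed effect of 1. The result depends entirely on the instrument being unrelated to ability and to the earnings error, which is assumed here and cannot be checked from the data alone.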

1.8.16 Self-Selection Bias

Self-Selection Bias occurs when individuals select themselves into a group, causing a non-random sample that may not be representative of the population. This type of bias can severely impact the validity of causal inferences because the differences in outcomes between groups may be driven by the self-selection process rather than the treatment or intervention itself.

1.8.16.1 Examples of Self-Selection Bias:

Online Surveys:
- If participation in an online survey is voluntary, those who choose to respond may have different characteristics or opinions compared to those who do not participate.
Program Participation:
- When studying the impact of a job training program, individuals who opt into the program might be more motivated or have better baseline skills than those who do not, leading to biased estimates of the program’s effectiveness.
Product Reviews:
- Customers who leave reviews for a product might have extreme opinions (either very positive or very negative), while those with moderate opinions are less likely to leave a review, skewing the perception of the product’s quality.
Health Studies:
- People who enroll in health studies or clinical trials might be more health-conscious or have a particular interest in their health, which could lead to differences in health outcomes compared to the general population.

1.8.16.2 Addressing Self-Selection Bias

Randomized Controlled Trials (RCTs):
- The gold standard for addressing self-selection bias is to conduct an RCT in which participants are randomly assigned to treatment and control groups, so that the groups are comparable in expectation and differences in outcomes can be attributed to the intervention.
Propensity Score Matching (PSM):
- This method involves matching participants in the treatment group with similar participants in the control group based on a set of observed characteristics, aiming to mimic randomization.
Instrumental Variables (IV):
- Using instruments that affect participation in the treatment but do not directly affect the outcome can help isolate the causal effect of the treatment by accounting for the self-selection.
Heckman Correction (Selection Models):
- The Heckman two-step correction involves modeling the selection process and then incorporating this model into the outcome equation to correct for self-selection bias.
Difference-in-Differences (DiD):
- This approach compares changes in outcomes over time between a treatment group and a control group. It assumes that, absent the intervention, the two groups would have followed parallel trends, so that any divergence in trends can be attributed to the intervention.
Control Variables:
- Including control variables in the regression model that capture the factors influencing self-selection can help mitigate bias, although it relies on the assumption that all relevant factors are observed and included.
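Of the methods above, difference-in-differences is the easiest to show as arithmetic. The following sketch simulates a two-group, two-period setting under assumed values: the self-selected treatment group starts at a higher baseline level, both groups share the same time trend (so parallel trends holds by construction), and the true effect is 2.

```python
import random
import statistics

EFFECT = 2.0   # assumed true treatment effect
rng = random.Random(8)

def outcome(treated_group, post):
    level = 5.0 if treated_group else 3.0   # groups differ at baseline
    trend = 1.0 if post else 0.0            # common time trend
    effect = EFFECT if (treated_group and post) else 0.0
    return level + trend + effect + rng.gauss(0, 1)

def cell_mean(treated_group, post, n=5000):
    return statistics.mean(outcome(treated_group, post) for _ in range(n))

treated_pre = cell_mean(True, False)
treated_post = cell_mean(True, True)
control_pre = cell_mean(False, False)
control_post = cell_mean(False, True)

# Comparing groups after treatment mixes the effect with the baseline gap.
naive_post_difference = treated_post - control_post

# Difference-in-differences removes the baseline gap and the common trend.
did_estimate = (treated_post - treated_pre) - (control_post - control_pre)

print(f"naive post-period difference: {naive_post_difference:.2f}")
print(f"difference-in-differences:    {did_estimate:.2f}")
```

The naive comparison comes out near 4 because it adds the baseline gap of 2 to the effect, while the difference-in-differences estimate is close to the assumed effect of 2. If the two groups had different underlying trends, this estimate would be biased as well.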

1.8.16.3 Conclusion

Self-selection bias is a common challenge in observational studies where individuals choose their own treatment or participation status. It can lead to biased and inconsistent estimates if not properly addressed. Methods such as propensity score matching, instrumental variables, and randomized controlled trials can help mitigate self-selection bias and provide more reliable causal inferences.