Concepts (R)

Confounding

Confounding is a pervasive concern in epidemiology, especially in observational studies aiming at causal inference. Epidemiologists need to carefully select confounders to avoid biased results due to third factors affecting the relationship between exposure and outcome. Commonly used methods for selecting confounders, such as the change-in-estimate approach or sole reliance on p-value-based statistical methods, may be inadequate or even problematic.

Epidemiologists need a more formalized system for confounder selection, incorporating causal diagrams (Greenland, Pearl, and Robins 1999; Tennant et al. 2021) and counterfactual reasoning. This includes an understanding of the underlying causal relationships and the potential impacts of different variables on the observed association. Understanding the temporal order and causal pathways is crucial for accurate confounder control.

However, it is possible that epidemiologists may lack comprehensive knowledge about the causal roles of all variables and hence may need to resort to empirical criteria (VanderWeele 2019) such as the disjunctive cause criterion, or other variable selection methods such as machine learning approaches. While these methods can provide more sophisticated analyses and help address the high dimensionality and complex structures of modern epidemiological data, epidemiologists need to understand how these approaches function, along with their benefits and limitations, to avoid introducing additional bias into the analysis.

Effect modifier

Effect modification and interaction are two distinct concepts in epidemiology (VanderWeele 2009; Bours 2021). Effect modification occurs when the causal effect of an exposure (A) on an outcome (Y) varies based on the levels of a third factor (B).

In this scenario, the association between the exposure and the outcome differs within the strata of a second exposure, which acts as the effect modifier. For instance, the impact of alcohol (A) on oral cancer (Y) might differ based on tobacco smoking (B).

On the other hand, interaction refers to the joint causal effect of two exposures (A and B) on an outcome (Y). It examines how the combination of multiple exposures influences the outcome, such as the combined effect of alcohol (A) and tobacco smoking (B) on oral cancer (Y).

In essence, while effect modification looks at how a third factor influences the relationship between an exposure and an outcome, interaction focuses on the combined effect of two exposures on the outcome.

Table 2 fallacy

The “Table 2 Fallacy” in epidemiology refers to the misleading practice of presenting multiple adjusted effect estimates from a single statistical model in one table, often resulting in misinterpretation. This occurs when researchers report the effects of both the primary exposure and secondary exposures (often adjustment variables for the primary exposure) without adequately distinguishing between the types of effects or considering the causal relationships among variables.

This idea highlights the potential for misunderstanding in interpreting the effects of various exposures on an outcome when they are reported together, leading to confusion over the nature and magnitude of the relationships and possibly influencing the design and interpretation of further studies (Westreich and Greenland 2013). The fallacy demonstrates the need for careful consideration of the types of effects estimated and reported in statistical models, urging researchers to be clear about the distinctions and implications of controlled direct effects, total effects, and the presence of confounding or mediating variables.

Reading list

Confounding key reference: (VanderWeele 2019; Tennant et al. 2021)

Effect modification key reference: (VanderWeele 2009; Bours 2021)

Table 2 fallacy key reference: (Westreich and Greenland 2013)

Optional reading:

Video Lessons

The Epistemological Divide: Explanatory versus Predictive Modeling

Before dissecting specific confounder selection techniques, it is crucial to establish the epistemological distinction that governs variable selection: the divergence between predictive and causal inference goals. This distinction is frequently conflated in practice, leading to the misapplication of algorithms designed for one purpose to the problems of the other.

The Goal of Prediction

In predictive modeling, the objective is to minimize the expected loss (e.g., mean squared error) between the predicted and observed outcome values. In this context, a “good” variable is one that is strongly correlated with the outcome, regardless of the direction of causality. A variable that is a consequence of the outcome (a proxy) or a mediator of the exposure can be an excellent predictor. Variable selection methods in this domain, such as standard stepwise regression, Akaike Information Criterion (AIC) minimization, or standard Lasso regularization, are designed to identify a parsimonious set of correlates that maximize model fit and reduce prediction error.
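As a small illustration, the sketch below (simulated data with hypothetical variable names) uses base R’s step() to select variables purely on AIC; it happily retains a mediator because the mediator improves fit, even though that choice would distort the total causal effect of the exposure.

```r
# Minimal sketch: prediction-oriented selection with AIC on simulated data
set.seed(123)
n <- 500
age     <- rnorm(n, 50, 10)
smoking <- rbinom(n, 1, plogis(-2 + 0.04 * age))                       # exposure
stress  <- rnorm(n, 0.5 * smoking, 1)                                  # mediator of smoking
bp      <- 120 + 0.3 * age + 5 * smoking + 3 * stress + rnorm(n, 0, 8) # outcome

full <- lm(bp ~ age + smoking + stress)
pred <- step(full, direction = "backward", trace = FALSE)  # minimizes AIC
summary(pred)
# step() keeps the mediator 'stress' because it improves prediction, even though
# adjusting for it biases the estimate of the *total* effect of smoking.
```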

The Goal of Causal Explanation

In causal inference, the objective is to isolate the specific marginal effect of an intervention (exposure) on an outcome. Here, the correlation is only useful if it reflects a structural cause-effect relationship. Including a mediator in the model will increase the \(R^2\) (predictive power) but will bias the estimation of the total causal effect toward the null.

Consequently, variable selection methods optimized for prediction are often mathematically antagonistic to causal inference. Techniques that rely on “goodness-of-fit” or statistical significance can inadvertently select colliders (inducing bias) or drop weak confounders that are critical for validity. The failure to distinguish these goals is a primary source of methodological error in the medical literature, motivating the need for distinct, causally-grounded selection strategies.

The Counterfactual Framework for Defining Causality

Defining the Causal Effect: Potential Outcomes

To understand causality, one must first be able to imagine a world that does not exist. The potential outcomes framework formalizes this by defining the causal effect of an exposure in terms of what would have happened under different exposure scenarios. Let us define the key notations:

  • A: The exposure status of an individual (e.g., \(A=1\) if a smoker, \(A=0\) if a non-smoker).
  • Y: The outcome of interest (e.g., hypertension).
  • L: A measured covariate or potential confounder.
  • U: An unmeasured variable.

For any individual, we can define two potential outcomes:

  • Y(A=1): The outcome that would be observed if the individual were a smoker.
  • Y(A=0): The outcome that would be observed if that same individual were a non-smoker at the same point in time.

The Individual Treatment Effect (TE) is the difference between these two potential outcomes for a single person: \(TE = Y(A=1) - Y(A=0)\). For example, if a patient named John smokes (\(A=1\)) and develops hypertension, while he would not have developed hypertension had he not smoked (\(A=0\)), the causal effect of smoking for John is present.

The Fundamental Problem of Causal Inference

The definition of the individual TE immediately presents a profound challenge. For any given individual, we can only ever observe one of their potential outcomes. If John smokes, we observe \(Y(A=1)\), but his counterfactual outcome, \(Y(A=0)\), remains unobserved forever. This is known as the fundamental problem of causal inference; it is a problem of missing data where half the data is always missing for every subject.

Because the individual TE is unobservable, the goal of epidemiology shifts from the individual to the population. We instead seek to estimate the Average Treatment Effect (ATE), defined as the average of the individual effects across all subjects in a population: \(ATE = E[Y(A=1)] - E[Y(A=0)]\).

From Association to Causation: The Role of Confounding

In the real world, we cannot directly observe both potential outcomes for a population. Instead, we observe outcomes in two different groups of people: those who happened to be exposed (smokers) and those who were not (non-smokers). We can calculate the associational difference between these groups: \(E[Y \mid A=1] - E[Y \mid A=0]\). A critical error is to assume this associational difference is equal to the causal ATE.

This difference arises because of confounding. The groups of smokers and non-smokers may differ systematically on factors that also affect the outcome. For instance, individuals with lower socioeconomic status may be more likely to smoke and also have a higher underlying risk of hypertension for reasons unrelated to smoking (e.g., diet, stress). In this case, the observed difference in outcomes is a mixture of the true treatment effect and these pre-existing, systematic differences between the groups.

The Observational Study Solution: Conditional Exchangeability

Randomized Controlled Trials (RCTs) are the gold standard for causal inference because the process of randomization, with a large enough sample size, ensures that the exposed and unexposed groups are, on average, identical (“exchangeable”) on all baseline characteristics, both measured and unmeasured. In an RCT, any systematic differences are eliminated, making the associational difference a valid estimate of the causal ATE.

In observational studies, where randomization is not possible, we cannot achieve this level of exchangeability. Instead, we strive for conditional exchangeability. This is the assumption that, within strata of the measured confounders, the exposed and unexposed groups are exchangeable. By estimating the effect of smoking separately within each level of the confounder(s) \(L\) (e.g., estimating the effect of smoking separately for different age groups) and then averaging these stratum-specific effects, we can aim to reconstruct the causal ATE. This process of stratification, or “adjustment,” is the conceptual basis for controlling for confounding in observational research. However, its validity rests entirely on the critical and untestable assumption that we have successfully identified and measured all important common causes of the exposure and the outcome.
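A small simulation sketch (hypothetical variable names) can make both points concrete: when a common cause is present, the associational difference overstates the true ATE, while stratum-specific estimates averaged over the confounder distribution recover it.

```r
# Sketch: confounding biases the crude contrast; stratification recovers the ATE
set.seed(1)
n   <- 1e5
ses <- rbinom(n, 1, 0.5)                      # low socioeconomic status (confounder L)
a   <- rbinom(n, 1, plogis(-1 + 1.5 * ses))   # smoking is more common when ses = 1
p0  <- 0.10 + 0.10 * ses                      # baseline risk of hypertension
y1  <- rbinom(n, 1, p0 + 0.05)                # potential outcome if smoker
y0  <- rbinom(n, 1, p0)                       # potential outcome if non-smoker
y   <- ifelse(a == 1, y1, y0)                 # only one potential outcome is observed

mean(y1) - mean(y0)                           # true ATE (~0.05), known only in simulation
mean(y[a == 1]) - mean(y[a == 0])             # associational difference, inflated (~0.09)

# Conditional exchangeability: estimate within strata of ses, then standardize
rd0 <- mean(y[a == 1 & ses == 0]) - mean(y[a == 0 & ses == 0])
rd1 <- mean(y[a == 1 & ses == 1]) - mean(y[a == 0 & ses == 1])
rd0 * mean(ses == 0) + rd1 * mean(ses == 1)   # standardized RD, close to the true ATE
```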

What is included in this Video Lesson:

  • 0:00 Introduction
  • 0:16 Notations
  • 2:40 Treatment Effect
  • 6:13 Real-world Problem of the counterfactual definition
  • 9:44 Real-world Solution in Observational Setting

The timestamps are also included in the YouTube video description.

Structural and Knowledge-Based Selection Techniques

To properly address confounding, researchers need a tool to translate their subject-matter knowledge and assumptions about the world into a formal structure. Directed Acyclic Graphs (DAGs) serve this purpose, providing a visual language and a set of rigorous rules for identifying sources of bias and guiding statistical analysis.

The Grammar of Causal Diagrams

A DAG is a graphical model of causal relationships between variables. Its components follow a simple grammar:

  • Nodes: Represent variables (e.g., smoking, hypertension, age).
  • Arrows (Directed Edges): Represent a direct causal effect from one variable to another.
  • Directed: The arrows have a single head, indicating the assumed direction of causality.
  • Acyclic: A path of arrows cannot form a closed loop. This enforces the principle of temporality: a variable cannot be its own cause.

Crucially, the most powerful assumptions in a DAG are the absent arrows. The absence of an arrow between two variables represents a strong claim of no direct causal effect.

Paths: Causal and Non-Causal

A path is any sequence of arrows connecting two variables, regardless of the direction of the arrowheads. When assessing the relationship between an exposure like smoking (\(A\)) and an outcome like hypertension (\(Y\)), paths can be categorized into two critical types:

  • Causal Paths (Front-door paths): These are paths that begin with an arrow originating from \(A\) and moving toward \(Y\) (e.g., \(A \rightarrow \text{Stress} \rightarrow Y\)). These paths transmit the causal effect of \(A\) on \(Y\) that we wish to estimate.
  • Non-Causal Paths (Back-door paths): These are paths between \(A\) and \(Y\) that begin with an arrow pointing into \(A\) (e.g., \(A \leftarrow \text{Age} \rightarrow Y\)). These paths are sources of non-causal association (confounding) that can bias our estimate. The goal of adjustment is to “block” these backdoor paths.

The Three Elementary Causal Structures

All complex DAGs are composed of three fundamental building blocks. Understanding how information flows through these structures is the key to using DAGs to identify and control for bias.

  • The Fork (Confounding): The structure is \(A \leftarrow L \rightarrow Y\). Here, \(L\) is a common cause of both the exposure \(A\) and the outcome \(Y\).
    • Example: Age (\(L\)) is a common cause of both smoking habits (\(A\)) and hypertension (\(Y\)).
    • Rule: The backdoor path through a common cause is open by default, creating a spurious association. To remove this confounding, one must condition on the confounder \(L\), which blocks the path.
  • The Chain (Mediation): The structure is \(A \rightarrow M \rightarrow Y\). Here, \(M\) is a mediator that lies on the causal pathway.
    • Example: Smoking (\(A\)) causes chronic inflammation (\(M\)), which in turn causes hypertension (\(Y\)).
    • Rule: The causal path through a mediator is open by default. To estimate the total effect of \(A\) on \(Y\), one must not condition on the mediator \(M\). Doing so would block this part of the causal effect.
  • The Collider (Selection/Collider Bias): The structure is \(A \rightarrow L \leftarrow Y\). Here, \(L\) is a common effect of both \(A\) and \(Y\).
    • Example: Both smoking (\(A\)) and a genetic predisposition (\(Y\)) can lead to a specific biomarker level (\(L\)).
    • Rule: The path through a collider is blocked by default. However, conditioning on the collider \(L\) opens the path, inducing a spurious, non-causal association between \(A\) and \(Y\). Adjusting for a collider is a critical error that introduces bias.

Applying the Rules with Dagitty

In practice, causal systems can be highly complex. Software such as Dagitty.net automates the application of these path-blocking rules. Given a user-drawn DAG, Dagitty can identify all open backdoor paths and determine the minimal sufficient adjustment sets: the smallest set of covariates that, if conditioned on, will block all backdoor paths and allow for an unbiased estimation of the total causal effect.
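A minimal sketch with the dagitty R package (assuming the hypothetical DAG used in the examples above, with Age as a confounder and Stress as a mediator) illustrates how adjustment sets are obtained programmatically.

```r
# Sketch: minimal sufficient adjustment sets with the dagitty R package
library(dagitty)

g <- dagitty("dag {
  Age     -> Smoking
  Age     -> Hypertension
  Smoking -> Stress
  Stress  -> Hypertension
  Smoking -> Hypertension
}")

adjustmentSets(g, exposure = "Smoking", outcome = "Hypertension", effect = "total")
# { Age } -- Age blocks the only backdoor path; the mediator Stress is not included
```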

The video lesson is split into 3 parts

DAG codes:

Example DAG codes can be accessed from this GitHub repository folder

Empirical Criteria for DAG-Deficient Scenarios

In many practical epidemiological investigations, particularly those involving novel exposures or complex metabolic pathways, the full causal structure is unknown. The uncertainty regarding the presence or direction of arrows makes the strict construction of a DAG impossible. In such “DAG-deficient” scenarios, epidemiologists must resort to pragmatic heuristics or empirical criteria that aim to approximate the Backdoor Criterion with less stringent assumptions.

In the absence of a fully specified DAG, researchers can rely on a set of empirical criteria that require less stringent assumptions.

Pre-treatment Criterion

One of the simplest and most intuitive heuristics is the Pre-treatment Criterion, which dictates adjusting for all covariates measured chronologically before the exposure was administered or assigned.

Rationale: The logic is grounded in temporal causality; a variable occurring before the exposure cannot be a downstream effect (mediator) of the exposure. Therefore, adjusting for pre-treatment variables avoids the error of overadjustment via mediation.

Critique:

  1. While this criterion successfully avoids adjusting for mediators, it fails to protect against M-bias. A pre-treatment variable can still be a collider if it is caused by two unobserved latent variables—one linked to the exposure and one to the outcome. Adjusting for such a pre-treatment collider introduces bias.

  2. This “kitchen sink” approach often leads to the inclusion of Instrumental Variables (IVs)—pre-treatment variables that cause the exposure but have no independent effect on the outcome. As discussed later, adjusting for IVs inflates the variance of the estimator and can amplify bias due to residual unmeasured confounding (Z-bias).

Thus, while the Pre-treatment Criterion is a helpful starting point, it is often too crude for high-stakes causal inference.

Common Cause Criterion

The Common Cause Criterion refines the selection process by narrowing the adjustment set to variables known (or suspected) to be causes of both the exposure and the outcome.

Rationale: This criterion targets the classical epidemiological definition of a confounder. By restricting selection to common causes, it theoretically avoids colliders (which are effects) and instruments (which are causes of exposure only).

Critique: The major limitation of this approach is its reliance on definitive knowledge. If a researcher is unsure whether a variable causes the outcome, the strict application of this criterion would lead to its exclusion. However, standard bias analysis suggests that omitting a true confounder (due to uncertainty) generally introduces more bias than including a non-confounder. Therefore, the Common Cause Criterion is often viewed as overly conservative, potentially leading to residual confounding in the pursuit of parsimony.

Disjunctive Cause Criterion

To address the limitations of the Common Cause Criterion, the Disjunctive Cause Criterion is proposed as a pragmatic strategy for confounder selection (VanderWeele 2019).

The Rule: Control for any pre-exposure covariate that is

  1. a cause of the exposure, OR
  2. a cause of the outcome, OR
  3. both.

Mechanism: This union-based approach ensures that all common causes (confounders) are included, as they satisfy the condition of being a cause of both. By including variables that are only causes of the outcome, the method improves the precision of the estimate (reducing standard error) without introducing bias. By including variables that are only causes of the exposure (potential instruments), it risks some variance inflation, but this is often considered an acceptable trade-off to ensure no confounders are missed.

Strength: The primary strength of the Disjunctive Cause Criterion is its robustness to uncertainty regarding the full causal structure. The researcher does not need to know if a variable affects both exposure and outcome; knowing it affects at least one is sufficient for inclusion. This effectively minimizes the risk of unadjusted confounding while generally avoiding colliders (which are effects, not causes).

Modified Disjunctive Cause Criterion

Refining the Disjunctive Cause Criterion further, the Modified Disjunctive Cause Criterion incorporates specific exclusions and inclusions to optimize both validity and efficiency.

Exclude IVs: Recognizing the variance inflation and Z-bias risks associated with instruments, the modified criterion explicitly removes variables known to affect the exposure but not the outcome. This requires some structural knowledge but yields a more efficient estimator.

Include Proxies: Acknowledging that true confounders are often unmeasured, the modified criterion mandates the inclusion of measured variables that serve as proxies for the unmeasured common causes. Even if a proxy is not a direct cause, adjusting for it partially blocks the backdoor path transmitted through the unobserved parent variable.

Modelling criteria for variable selection

Statistical methods can also be used for variable selection, but their application requires careful consideration of the research goal: prediction versus causal inference.

Change-in-Estimate

The Change-in-Estimate (CIE) method represents an operationalization of the definition of confounding: if a variable is a confounder, adjusting for it should change the estimated effect of the exposure.

The Procedure: The researcher begins with a “crude” model containing only the exposure and outcome. Potential confounders are added to the model one by one (or removed from a full model). If the regression coefficient for the exposure changes by more than a specified percentage (commonly 10%), the variable is deemed a confounder and retained in the model.

The Non-Collapsibility Trap: A critical flaw of the CIE method arises when using non-collapsible effect measures, such as the odds ratio (OR) or hazard ratio (HR). In logistic regression, the addition of a covariate that is strongly associated with the outcome (but independent of the exposure) will increase the magnitude of the exposure’s OR, driving it further from the null. This occurs not because of confounding bias, but because of a mathematical property known as non-collapsibility. A CIE algorithm would interpret this change as evidence of confounding and select the variable, potentially leading to over-adjustment or misinterpretation of the effect measure. Thus, CIE is safer for risk differences (RDs) or risk ratios (RRs) but hazardous for ORs.
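The following is a minimal sketch of a CIE check restricted to a collapsible scale (a linear model for the risk difference); the helper cie_check and all variable names in the example call are hypothetical.

```r
# Hypothetical helper: flag a candidate as a confounder if adjusting for it
# moves the exposure coefficient by more than 10%
cie_check <- function(data, outcome, exposure, candidate, threshold = 0.10) {
  b_crude <- coef(lm(reformulate(exposure, outcome), data = data))[exposure]
  b_adj   <- coef(lm(reformulate(c(exposure, candidate), outcome), data = data))[exposure]
  unname(abs((b_adj - b_crude) / b_crude) > threshold)   # TRUE -> retain the candidate
}
# Example call (hypothetical data frame and variable names):
# cie_check(mydata, outcome = "sbp", exposure = "smoking", candidate = "age")
```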

Statistical Significance (Stepwise Selection)

Stepwise selection algorithms (forward selection, backward elimination, or bidirectional search) rely on statistical significance (p-values) to determine variable inclusion.

The Procedure: Variables are added to the model if their association with the outcome yields a p-value below a certain threshold (e.g., 0.05) or removed if the p-value exceeds it.

The Confounding vs. Significance Fallacy: The most fundamental critique of this approach is that “confounding is not a significance test.” A variable can be a strong confounder—systematically biasing the effect estimate—even if its association with the outcome fails to reach statistical significance in a specific sample, particularly in small studies. Relying on p-values often leads to under-adjustment and residual confounding.

Post-Selection Inference: Stepwise selection invalidates the statistical theory behind confidence intervals. The final model treats the selected variables as if they were specified a priori, ignoring the immense “data dredging” and multiple testing that occurred during the selection process. This results in standard errors that are systematically too small and confidence intervals that are too narrow, creating a false sense of precision.

Prediction vs. Causation: Ultimately, stepwise algorithms are designed to maximize model fit (prediction). They will happily select a collider or a mediator if it is strongly correlated with the outcome, thereby maximizing \(R^2\) while destroying the validity of the causal coefficient.

Purposeful Selection of Covariates

Recognizing the limitations of purely mechanical stepwise regression, the “Purposeful Selection” algorithm was proposed as a hybrid approach (Hosmer, Lemeshow, and Sturdivant 2013; Bursac et al. 2008) that combines statistical criteria with researcher judgment and confounding checks.

The Algorithm:

  1. Univariate Screening:
    • Evaluate all covariates individually.
    • Retain any variable with a univariate p-value \(< 0.25\). This relaxed threshold is crucial; it aims to capture potential confounders that may be weak individually but strong jointly, or whose effects are masked in univariate analysis.
  2. Multivariable Model:
    • Fit a model with all candidates identified in step 1.
    • Remove variables that are not significant at traditional levels (e.g., \(p < 0.05\)).
  3. Confounding Check: This is the distinguishing feature.
    • Before permanently discarding a variable, the analyst must check if its removal induces a major change (\(>15-20\%\)) in the coefficients of the remaining variables.
    • If it does, the variable is added back into the model as a confounder, regardless of its statistical significance.
  4. Refinement and Interactions: Excluded variables are added back one by one to check for residual significance. Finally, the model is checked for plausible interactions.
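A rough sketch of Steps 1 and 2 on simulated data (hypothetical, numeric covariates), with the confounding check of Step 3 described in comments:

```r
# Sketch of purposeful selection, Steps 1-2, on simulated data
set.seed(7)
n <- 1000
age <- rnorm(n); sex <- rbinom(n, 1, 0.5); bmi <- rnorm(n); alcohol <- rnorm(n)
smoking <- rbinom(n, 1, plogis(0.6 * age))                 # age confounds the association
htn <- rbinom(n, 1, plogis(-1 + 0.5 * smoking + 0.5 * age + 0.2 * bmi))
d <- data.frame(htn, smoking, age, sex, bmi, alcohol)

candidates <- c("age", "sex", "bmi", "alcohol")
uni_p <- sapply(candidates, function(v)                    # Step 1: screen at p < 0.25
  coef(summary(glm(reformulate(v, "htn"), family = binomial, data = d)))[v, "Pr(>|z|)"])
keep <- candidates[uni_p < 0.25]
fit  <- glm(reformulate(c("smoking", keep), "htn"), family = binomial, data = d)  # Step 2
summary(fit)
# Step 3: before discarding a covariate that is non-significant in 'fit', refit without
# it and retain it anyway if the coefficient on 'smoking' changes by more than ~15-20%.
```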

Insight: Purposeful Selection is widely cited in epidemiology because it operationalizes the definition of confounding within the selection process. Unlike rigid stepwise regression, it prioritizes the stability of the exposure coefficient over the parsimony of the outcome model. It forces the analyst to examine the data at each step, acting as a safeguard against the automation of causal errors.

Criticism: Purposeful Selection is now considered outdated and flawed by modern causal inference standards. Its fundamental weakness is that it remains entirely driven by statistical associations within the data rather than by a priori causal structure. The “confounding check” (Step 3), its distinguishing feature, is ironically its most critical flaw. This change-in-estimate (CIE) criterion cannot distinguish true confounders from colliders or mediators. In the case of a collider, adjusting for it induces a spurious association (bias), which causes a large change in the exposure’s coefficient. The algorithm misinterprets this induced bias as a sign of confounding and therefore retains the collider, leading to a biased final estimate. Because it is “causally blind,” it is not a safeguard against causal errors and is superseded by methods like those based on DAGs.

Machine Learning (ML)

Algorithms such as LASSO and Random Forests are excellent for high-dimensional prediction. Their primary role in causal inference is in developing propensity score (PS) models, which is a prediction task for the exposure model (Karim and Lei 2025). The goal is to create a score that balances measured covariates between the exposed and unexposed groups, mimicking randomization.

Criticism: Variance estimation can be poor depending on the machine learning method used for variable selection, often resulting in poor confidence interval coverage.
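As a sketch of this workflow (assuming the glmnet package; the data are simulated and the objects X, a, ps, and w are hypothetical), a LASSO exposure model can produce propensity scores and inverse-probability weights.

```r
# Sketch: LASSO-based propensity score model and inverse-probability weights
library(glmnet)
set.seed(11)
n <- 2000; p <- 50
X <- matrix(rnorm(n * p), n, p)                         # measured covariates / proxies
a <- rbinom(n, 1, plogis(0.5 * X[, 1] - 0.5 * X[, 2]))  # exposure

cvfit <- cv.glmnet(X, a, family = "binomial")           # LASSO exposure (PS) model
ps <- as.numeric(predict(cvfit, newx = X, s = "lambda.min", type = "response"))
w  <- ifelse(a == 1, 1 / ps, 1 / (1 - ps))              # IPTW weights
summary(w)
# Naive model-based variance estimates after ML-based selection can be misleading;
# bootstrap or doubly robust estimators are commonly used to restore coverage.
```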

Advanced Causal Inference Methods, often incorporating ML
  1. High-Dimensional Propensity Score (hdPS) (Schneeweiss et al. 2009; Karim et al. 2025): designed for healthcare databases. It algorithmically scans thousands of proxy variables (e.g., prior diagnoses, medications) and selects those that are most likely to be confounders to include in the propensity score model.
  2. Machine learning versions of hdPS (Karim 2025; Karim, Pang, and Platt 2018): These models are excellent at capturing complex, non-linear relationships and interactions among covariates. See external workshop materials here.
  3. Post-double-selection method (Belloni, Chernozhukov, and Hansen 2014): It formally recognizes that a confounder must be related to both the exposure and the outcome. It uses a machine learning method (e.g., LASSO) to select all covariates that are predictive of the outcome, and then uses LASSO again to select all covariates that are predictive of the exposure. The final set of confounders to adjust for is the union (all variables from both lists), which algorithmically mimics the “Disjunctive Cause Criterion” (adjust for causes of exposure or outcome). It is robust and avoids the biases of selecting based only on the outcome. The final estimate comes from a simple (non-penalized) regression adjusting for the union set.
  4. Outcome-Adaptive Lasso (Shortreed and Ertefaie 2017; Baldé, Yang, and Lefebvre 2023): This is a variation of LASSO that essentially performs “double selection” in a single step. It’s a penalized regression (LASSO) for the outcome model, but the penalty for each covariate is adapted (weighted). Covariates that are strongly predictive of the exposure are given a smaller penalty, making them more likely to be kept in the final outcome model, regardless of their association with the outcome.
  5. Collaborative Targeted Maximum Likelihood Estimation (C-TMLE) (Laan and Gruber 2010): It uses machine learning (often a “Super Learner” that combines many ML algorithms) to build the best possible outcome model. Then, it collaboratively uses information from that model to decide which covariates also need to go into the propensity score model to minimize bias. This is an extension of the TMLE method that we cover later.
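A compact sketch of post-double-selection using glmnet on simulated data (variable names and coefficient values are hypothetical):

```r
# Sketch: post-double-selection with LASSO, then a non-penalized final regression
library(glmnet)
set.seed(22)
n <- 2000; p <- 50
X <- matrix(rnorm(n * p), n, p); colnames(X) <- paste0("x", 1:p)
a <- rbinom(n, 1, plogis(X[, 1] + 0.5 * X[, 2]))       # x1, x2 drive the exposure
y <- 1 + 0.5 * a + X[, 1] + X[, 3] + rnorm(n)          # x1, x3 drive the outcome

sel <- function(fit) {                                 # names of nonzero LASSO coefficients
  cf <- coef(fit, s = "lambda.min")
  setdiff(rownames(cf)[as.vector(cf) != 0], "(Intercept)")
}
s_y <- sel(cv.glmnet(X, y))                            # covariates predicting the outcome
s_a <- sel(cv.glmnet(X, a, family = "binomial"))       # covariates predicting the exposure
adj <- union(s_y, s_a)                                 # union set (x1, x2, x3 expected)

d <- data.frame(y = y, a = a, X)
coef(lm(reformulate(c("a", adj), "y"), data = d))["a"] # final non-penalized estimate

```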
Collapsibility and the Choice of Effect Measure

A crucial, and often overlooked, aspect of statistical adjustment is the concept of collapsibility. An effect measure is said to be collapsible if the marginal (crude) measure of association is equal to a weighted average of the stratum-specific measures of association after conditioning on another variable. This property has profound implications for how we interpret adjusted estimates.

In the absence of confounding, some effect measures, like the Risk Difference (RD) and Risk Ratio (RR), are collapsible. This means that if a variable is not a confounder, adjusting for it will not change the effect estimate. However, other common measures, most notably the Odds Ratio (OR), are non-collapsible.

The non-collapsibility of the odds ratio is a mathematical property stemming from the non-linearity of the logistic model’s link function. It means that the adjusted OR can be different from the crude OR even when there is no confounding. This phenomenon, where an association in a population differs from the association within its subgroups, is also known as Simpson’s Paradox (in the absence of confounding). This is precisely why the change-in-estimate criterion for confounder selection is invalid when using odds ratios—a change in the OR upon adjustment does not necessarily signal the presence of confounding.
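A short simulation sketch makes the point: with a randomized exposure and a strong risk factor that is independent of it (hence no confounding), the adjusted OR still differs from the crude OR.

```r
# Sketch: non-collapsibility of the OR in the absence of confounding
set.seed(3)
n <- 1e5
a <- rbinom(n, 1, 0.5)                              # randomized exposure
l <- rbinom(n, 1, 0.5)                              # strong risk factor, independent of a
y <- rbinom(n, 1, plogis(-2 + 1 * a + 2 * l))

exp(coef(glm(y ~ a,     family = binomial))["a"])   # crude (marginal) OR
exp(coef(glm(y ~ a + l, family = binomial))["a"])   # conditional OR, farther from 1
# l is not a confounder, yet the OR changes on adjustment: non-collapsibility, not bias.
```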

Simpson’s Paradox: A Case Study in Bias

Simpson’s Paradox is a statistical phenomenon where an association observed in a population is different from—and often in the opposite direction of—the associations observed in all of its subgroups. This paradox is a powerful illustration of how failing to account for a key third variable (a confounder or a collider) can lead to completely erroneous conclusions.

A famous example is the “Birthweight Paradox,” where maternal smoking appeared to be protective against infant mortality among low-birthweight infants, a finding that contradicted the known harms of smoking. This occurred because birthweight acted as a collider. Adjusting for it induced a spurious association between smoking and other unmeasured causes of mortality (e.g., birth defects).

Unpacking Effect Heterogeneity: Interaction vs. Effect Modification

The effect of an exposure may not be uniform across a population. A third variable can alter the exposure-outcome relationship, a phenomenon that leads to frequent confusion between two distinct concepts: interaction and effect modification.

Formal Definitions

While often used interchangeably, these terms address different causal questions:

  • Effect Modification: This occurs when the causal effect of a single exposure (e.g., smoking) on an outcome (hypertension) differs across strata of a second variable (e.g., education level). The question is: “Is the effect of smoking different for people with high education versus people with low education?” This involves only one intervention (on smoking). The variable ‘education’ is treated as a baseline characteristic defining subgroups.
  • Interaction: This refers to the joint causal effect of two exposures (e.g., smoking and low education) on an outcome (hypertension). The question is: “Is the effect of intervening on both smoking and education greater than the sum of the effects of intervening on each one alone?” This involves two distinct interventions and assesses synergy or antagonism.

Implications for Confounding Control

The distinction is critical for analytical strategy:

  • To assess Effect Modification: When investigating if education modifies the effect of smoking on hypertension, a researcher only needs to control for the set of confounders of the smoking -> hypertension relationship.
  • To assess Interaction: When investigating the causal interaction between smoking and education, a researcher must control for all confounders of the smoking -> hypertension relationship AND all confounders of the education -> hypertension relationship. This is a much more demanding requirement.

The Role of the Scale: Effect Measure Modification

Whether modification is detected can depend on the statistical scale used (e.g., additive scale for Risk Difference vs. multiplicative scale for Risk Ratio). For this reason, the more precise term is effect measure modification. A statistical finding of interaction is a property of the chosen model and does not necessarily correspond to a specific biological mechanism.
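As a brief sketch (simulated data; the binary exposures a and b and the coefficient values are hypothetical), the same fitted logistic model yields different answers on the additive and multiplicative scales, summarized here by RERI and the ratio of odds ratios (ROR).

```r
# Sketch: additive vs multiplicative interaction from one logistic model
set.seed(5)
n <- 5e4
a <- rbinom(n, 1, 0.5)                            # exposure A (e.g., smoking)
b <- rbinom(n, 1, 0.5)                            # exposure B / potential modifier
y <- rbinom(n, 1, plogis(-3 + 0.7 * a + 0.7 * b + 0.5 * a * b))

fit  <- glm(y ~ a * b, family = binomial)
or10 <- unname(exp(coef(fit)["a"]))               # A alone vs neither
or01 <- unname(exp(coef(fit)["b"]))               # B alone vs neither
or11 <- exp(sum(coef(fit)[c("a", "b", "a:b")]))   # A and B jointly vs neither

c(RERI = or11 - or10 - or01 + 1,                  # additive-scale interaction (OR-based)
  ROR  = unname(exp(coef(fit)["a:b"])))           # multiplicative-scale interaction
```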

Reporting guideline

See Knol and VanderWeele (2012)

Recommendations from Knol and VanderWeele (2012):

| Reporting Component | Guideline for Effect Modification | Guideline for Interaction |
|---|---|---|
| Purpose | Show how the effect of one primary exposure (A) is modified across strata of another factor (X). | Show the causal, joint effect of two distinct exposures (A and B) acting together. |
| Step 1: Joint effects | Required (e.g., ORs for all A/X combinations vs. a single reference, A=0 and X=0). | Required (e.g., ORs for all A/B combinations vs. a single reference, A=0 and B=0). |
| Step 2: Stratum-specific effects | Required (subset): show only the effect of A within each stratum of X. | Required (full): show the effect of A within each stratum of B and the effect of B within each stratum of A. |
| Step 3: Interaction measures | Required: report both additive (e.g., RERI) and multiplicative (e.g., ROR) measures, with CIs and p-values. | Required: report both additive (e.g., RERI) and multiplicative (e.g., ROR) measures, with CIs and p-values. |
| Step 4: Confounder adjustment | Required: adjust for confounders of the primary exposure-outcome (A-Y) relationship. | Required: adjust for confounders of both the A-Y and the B-Y relationships. |
Important: To revisit or deepen your grasp of these two concepts, consider reviewing this external tutorial.

Avoiding Misinterpretation: The Table 2 Fallacy

One of the most common errors in reporting observational research is the Table 2 Fallacy. This fallacy is the practice of presenting a single multivariable regression model and interpreting the coefficients for all variables—the primary exposure and all adjustment covariates—as if they are equally valid estimates of the total causal effect of each variable on the outcome.

Why A Single Model Fails: A DAG-Based Explanation

A multivariable regression model is built to answer a single, specific causal question. The adjustment set required to estimate the causal effect of one variable is often different from the set required to estimate the effect of another.

Consider a DAG for the effects of smoking, age, and hypertension:

  • Causal Question 1: What is the total effect of Smoking on Hypertension?
    • Assume Age is a common cause of both Smoking and Hypertension. To get an unbiased estimate of the total effect of Smoking, one must adjust for Age. The appropriate model is: Hypertension ~ Smoking + Age. The coefficient for Smoking can be interpreted as the total causal effect.
  • Causal Question 2: What is the total effect of Age on Hypertension?
    • In this same DAG, Smoking may be a mediator of the effect of Age (i.e., Age -> Smoking -> Hypertension). To estimate the total effect of Age, one must not adjust for the mediator, Smoking. The model built for Question 1 does adjust for Smoking. Therefore, the coefficient for Age in that first model is not an estimate of the total effect; it is an estimate of the controlled direct effect—the effect of Age on Hypertension that does not operate through the Smoking pathway.

Best Practices for Reporting

To avoid the Table 2 Fallacy, analysis and reporting must be driven by a “one exposure, one model” principle:

  • Be Explicit: Clearly state the single primary exposure of interest for each model.
  • Use Multiple Models: If causal effects are desired for multiple variables, fit a separate, correctly specified model for each one.
  • Structure Tables Clearly: The primary results table should only show the effect estimate for the main exposure of interest. The covariates used for adjustment should be listed in a footnote, not in the table with their own effect estimates.

Video Lesson Slides

Confounding

Effect modification

Table 2 fallacy

References

Baldé, Ismaila, Yi Archer Yang, and Geneviève Lefebvre. 2023. “Reader Reaction to ‘Outcome-Adaptive Lasso: Variable Selection for Causal Inference’ by Shortreed and Ertefaie (2017).” Biometrics 79 (1): 514–20.
Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen. 2014. “Inference on Treatment Effects After Selection Among High-Dimensional Controls.” Review of Economic Studies 81 (2): 608–50.
Bours, Martijn JL. 2021. “Tutorial: A Nontechnical Explanation of the Counterfactual Definition of Effect Modification and Interaction.” Journal of Clinical Epidemiology 134: 113–24.
Bursac, Zoran, C Heath Gauss, David Keith Williams, and David W Hosmer. 2008. “Purposeful Selection of Variables in Logistic Regression.” Source Code for Biology and Medicine 3 (1): 17.
Etminan, Mahyar, Gary S Collins, and Mohammad Ali Mansournia. 2020. “Using Causal Diagrams to Improve the Design and Interpretation of Medical Research.” Chest 158 (1): S21–28.
Greenland, Sander, Judea Pearl, and James M Robins. 1999. “Causal Diagrams for Epidemiologic Research.” Epidemiology, 37–48.
Heinze, Georg, Christine Wallisch, and Daniela Dunkler. 2018. “Variable Selection–a Review and Recommendations for the Practicing Statistician.” Biometrical Journal 60 (3): 431–49.
Hosmer, Jr., David W., Stanley Lemeshow, and Rodney X. Sturdivant. 2013. Applied Logistic Regression, 3rd Edition. Hoboken, NJ: John Wiley & Sons.
Karim, Mohammad Ehsanul. 2025. “High-Dimensional Propensity Score and Its Machine Learning Extensions in Residual Confounding Control.” The American Statistician 79 (1): 72–90.
Karim, Mohammad Ehsanul, Md Belal Hossain, Huah Shin Ng, Feng Zhu, Hanna A Frank, and Helen Tremlett. 2025. “Evaluating the Role of High-Dimensional Proxy Data in Confounding Adjustment in Multiple Sclerosis Research: A Case Study.” Pharmacoepidemiology and Drug Safety 34 (2): e70112.
Karim, Mohammad Ehsanul, and Yang Lei. 2025. “How Effective Are Machine Learning and Doubly Robust Estimators in Incorporating High-Dimensional Proxies to Reduce Residual Confounding?” Pharmacoepidemiology and Drug Safety 34 (5): e70155.
Karim, Mohammad Ehsanul, Menglan Pang, and Robert W Platt. 2018. “Can We Train Machine Learning Methods to Outperform the High-Dimensional Propensity Score Algorithm?” Epidemiology 29 (2): 191–98.
Knol, Mirjam J, and Tyler J VanderWeele. 2012. “Recommendations for Presenting Analyses of Effect Modification and Interaction.” International Journal of Epidemiology 41 (2): 514–20.
Laan, Mark J van der, and Susan Gruber. 2010. “Collaborative Double Robust Targeted Maximum Likelihood Estimation.” The International Journal of Biostatistics 6 (1): 17.
Lederer, David J, Scott C Bell, Richard D Branson, James D Chalmers, Rebecca Marshall, David M Maslove, Peter W Stewart, et al. 2019. “Control of Confounding and Reporting of Results in Causal Inference Studies: Guidance for Authors from Editors of Respiratory, Sleep, and Critical Care Journals.” Annals of the American Thoracic Society 16 (1): 22–28.
Schneeweiss, Sebastian, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. 2009. “High-Dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data.” Epidemiology 20 (4): 512–22.
Shortreed, Susan M, and Ashkan Ertefaie. 2017. “Outcome-Adaptive Lasso: Variable Selection for Causal Inference.” Biometrics 73 (4): 1111–22.
Tennant, Paul W, Elizabeth J Murray, Kathryn F Arnold, Leigh Berrie, Matthew P Fox, Samantha C Gadd, and George TH Ellison. 2021. “Use of Directed Acyclic Graphs (DAGs) to Identify Confounders in Applied Health Research: Review and Recommendations.” International Journal of Epidemiology 50 (2): 620–32.
VanderWeele, Tyler J. 2009. “On the Distinction Between Interaction and Effect Modification.” Epidemiology, 863–71.
———. 2019. “Principles of Confounder Selection.” European Journal of Epidemiology 34 (3): 211–19.
Westreich, Daniel, and Sander Greenland. 2013. “The Table 2 Fallacy: Presenting and Interpreting Confounder and Modifier Coefficients.” American Journal of Epidemiology 177 (4): 292–98.