Collider

In causal inference, understanding the role of colliders is crucial. A collider is a variable that is a common effect of two or more variables. Adjusting for a collider can introduce bias into your estimates.

# Load required packages
library(simcausal)

In a DAG, a collider is a variable influenced by two or more other variables. In our setting, L is a collider because it is affected by both A (the treatment) and Y (the outcome). Adjusting for a collider like L opens a non-causal path between A and Y and can bias your estimates, as the examples below demonstrate.
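Before turning to simcausal, the phenomenon can be reproduced in a few lines of base R. This is a minimal sketch using the same structural equations as the DAG below, except that treatment is made common (probability 0.5) for simplicity; the coefficients 1.3 and 10 and the noise SDs mirror the data generating process used later:

```r
# Minimal base-R illustration of collider bias
set.seed(123)
n <- 1e5
A <- rbinom(n, 1, 0.5)              # treatment (made common here for simplicity)
Y <- 1.3 * A + rnorm(n, sd = 0.1)   # outcome: true effect of A is 1.3
L <- 10 * Y + 1.3 * A + rnorm(n)    # collider: affected by both A and Y

coef(lm(Y ~ A))["A"]      # close to the true 1.3
coef(lm(Y ~ A + L))["A"]  # biased (about 0.58): conditioning on L opens A -> L <- Y
```

Note that the biased coefficient does not depend on how common treatment is, only on the linear coefficients and noise variances, which is why it matches the simcausal results below.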

Let us consider an example: A could be a genetic variable (e.g., skin color), Y an environmental variable (e.g., indoor air pollution), and L a disease condition (e.g., number of comorbidities).

Non-null effect

  • True treatment effect = 1.3

Data generating process

D <- DAG.empty()
D <- D + 
  node("A", distr = "rbern", prob = plogis(-10)) +
  node("Y", distr = "rnorm", mean = 1.3 * A, sd = .1) +
  node("L", distr = "rnorm", mean = 10 * Y + 1.3 * A, sd = 1)
Dset <- set.DAG(D)
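Note that `prob = plogis(-10)` makes treatment extremely rare: each unit is treated with probability plogis(-10) ≈ 4.5e-05, so a sample of 1,000,000 contains only about 45 treated units. A quick check:

```r
plogis(-10)        # ~4.54e-05: P(A = 1) for each unit
1e6 * plogis(-10)  # ~45 expected treated units in a sample of one million
```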

Plot DAG

plotDAG(Dset, xjitter = 0.1, yjitter = .9,
        edge_attrs = list(width = 0.5, arrow.width = 0.4, arrow.size = 0.7),
        vertex_attrs = list(size = 12, label.cex = 0.8))

Generate data

require(simcausal)
Obs.Data <- sim(DAG = Dset, n = 1000000, rndseed = 123)
head(Obs.Data)

Estimate effect

# Not adjusted for L
fit0 <- glm(Y ~ A, family = "gaussian", data = Obs.Data)
round(coef(fit0), 2)
#> (Intercept)           A 
#>        0.00        1.29

# Adjusted for L
fit <- glm(Y ~ A + L, family = "gaussian", data = Obs.Data)
round(coef(fit), 2)
#> (Intercept)           A           L 
#>        0.00        0.58        0.05
Important

When not adjusting for L, we recover an estimate (1.29) close to the true effect of 1.3. Adjusting for L introduces collider bias, pulling the estimate down to 0.58 and making it unreliable.
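The size of the biased estimate can be checked by hand. This is a back-of-envelope derivation (not part of the original text) that assumes the linear-Gaussian structure above, with Var(e_Y) = 0.1² and Var(e_L) = 1²: partialling out L gives a coefficient on L of 10·0.01 / (100·0.01 + 1) = 0.05, and a coefficient on A of 1.3 − 0.05·(10·1.3 + 1.3) = 0.585, matching the regression output:

```r
# Analytic collider-bias calculation from the DGP's noise variances
var_eY <- 0.1^2   # variance of Y's noise
var_eL <- 1^2     # variance of L's noise
b_L <- 10 * var_eY / (100 * var_eY + var_eL)  # coefficient on L: 0.05
b_A <- 1.3 - b_L * (10 * 1.3 + 1.3)           # biased coefficient on A: 0.585
c(b_L = b_L, b_A = b_A)
```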

Null effect

  • True treatment effect = 0

Data generating process

D <- DAG.empty()
D <- D + 
  node("A", distr = "rbern", prob = plogis(-10)) +
  node("Y", distr = "rnorm", mean = 0, sd = .1) +
  node("L", distr = "rnorm", mean = 10 * Y + 1.3 * A, sd = 1)
Dset <- set.DAG(D)

Plot DAG

plotDAG(Dset, xjitter = 0.1, yjitter = .9,
        edge_attrs = list(width = 0.5, arrow.width = 0.4, arrow.size = 0.7),
        vertex_attrs = list(size = 12, label.cex = 0.8))

Generate data

require(simcausal)
Obs.Data <- sim(DAG = Dset, n = 1000000, rndseed = 123)
head(Obs.Data)

Estimate effect

# Not adjusted for L
fit0 <- glm(Y ~ A, family = "gaussian", data = Obs.Data)
round(coef(fit0), 2)
#> (Intercept)           A 
#>        0.00       -0.01

# Adjusted for L
fit <- glm(Y ~ A + L, family = "gaussian", data = Obs.Data)
round(coef(fit), 2)
#> (Intercept)           A           L 
#>        0.00       -0.07        0.05
Important

When the true effect is null, not adjusting for L yields an estimate close to zero (-0.01). Adjusting for L moves the estimate away from the null (-0.07), introducing bias where none existed.
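The same back-of-envelope calculation (a hand check assuming the DGP above, not from the original text) explains the -0.07: the coefficient on L is still 10·0.01 / (100·0.01 + 1) = 0.05, and the biased coefficient on A is 0 − 0.05·1.3 = −0.065, close to the regression estimate:

```r
# Analytic biased coefficient under the null (true effect = 0)
b_L <- 10 * 0.1^2 / (100 * 0.1^2 + 1^2)  # still 0.05
b_A <- 0 - b_L * (10 * 0 + 1.3)          # -0.065, close to the -0.07 estimate
c(b_L = b_L, b_A = b_A)
```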

Even with 1,000,000 observations, the unadjusted estimates do not exactly recover the true treatment effects (1.29 vs. 1.3, and -0.01 vs. 0), but they are close enough. The residual discrepancy is sampling variability: with prob = plogis(-10), only about 45 of the million units are treated.