Directed Acyclic Graphs (DAGs)
Spring 2026
By the end of this lecture, you will be able to:
Reading
Huntington-Klein, The Effect: Chapters 6, 7, and 8 (available free at theeffectbook.net)
Every dataset we observe is produced by a data generating process (DGP) — the real-world laws, behaviors, and institutions that determine the values we see.
Our challenge as econometricians:
Identification
Identification is the process of figuring out what part of the variation in your data answers your research question.
Association
Two variables move together (correlation). \(X\) and \(Y\) are related statistically.
Causation
Changing \(X\) causes \(Y\) to change. Formally: if we could intervene to change \(X\), the distribution of \(Y\) would change as a result.
A directed acyclic graph (DAG) is a visual representation of a data generating process. Every word in the name matters:
The acyclic part is a real constraint: DAGs assume causality flows forward. If \(X\) causes \(Y\) and \(Y\) simultaneously causes \(X\), a DAG can’t represent that. (We’ll return to this when we discuss simultaneity.)
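The acyclic requirement can be checked mechanically. Below is a minimal sketch using Python's standard-library `graphlib`; the node names `X`, `Y`, `Z` and the helper `is_acyclic` are illustrative, not part of the lecture materials.

```python
from graphlib import TopologicalSorter, CycleError

def is_acyclic(parents):
    """parents maps each node to the set of nodes with an arrow into it."""
    try:
        # prepare() raises CycleError if the graph contains a directed cycle.
        TopologicalSorter(parents).prepare()
    except CycleError:
        return False
    return True

# Z -> X, Z -> Y, X -> Y: causality only flows forward, so this is a DAG.
valid = {"Z": set(), "X": {"Z"}, "Y": {"X", "Z"}}

# X -> Y and Y -> X at the same time: a cycle, so not a DAG.
simultaneous = {"X": {"Y"}, "Y": {"X"}}

print(is_acyclic(valid))         # True
print(is_acyclic(simultaneous))  # False
```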
Key insight: Once we draw the DAG, we can mechanically determine which variables to control for to identify a causal effect.
Important assumption
Every variable and arrow that is not on the diagram is an assumption we’re making. Drawing a DAG forces you to be explicit about your identifying assumptions.
Node: A variable in the system.
Each circle represents a variable — either observed or unobserved.
Arrow: \(X \rightarrow Y\) means “\(X\) causes \(Y\).”


Variables:
Causal arrows:
Building a DAG is a structured process (HK, The Effect, Ch. 6):
Step 1: List all relevant variables. What might cause your treatment? What might cause your outcome? What affects both? Cast a wide net.
Step 2: For each pair of variables, ask “does one cause the other?” If yes, draw an arrow from cause to effect. If you draw no arrow, you’re assuming no direct causal link between those variables.
Step 3: Simplify. Remove any variable that is not on any path between your treatment and outcome. If it doesn’t connect \(X\) and \(Y\) (directly or indirectly), it doesn’t affect your identification — so it can be dropped.
Step 4: Determine what to control for. Apply the backdoor criterion to find the minimal set of controls that closes all backdoor paths. (We’ll get to this soon!)
What if two variables are correlated but neither causes the other?
Sometimes you have two variables — say, Income and SES — that are clearly correlated, but you don’t think one directly causes the other.
Solution: Introduce a shared unobserved cause \(U\) that drives both.
\[U \rightarrow \text{Income} \quad \text{and} \quad U \rightarrow \text{SES}\]
This is more honest than arbitrarily picking a direction for the arrow. You’re saying: “something generates both of these, but I don’t need to name it.” In a DAG, draw \(U\) as a gray (unobserved) node with arrows pointing to both variables.
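A small simulation can illustrate what the shared cause \(U\) implies. This is a sketch under made-up assumptions (standard normal noise, unit coefficients, the variable names below): Income and SES are correlated in observed data, but setting Income from outside the system leaves SES unchanged because no arrow runs from Income to SES.

```python
import random

rng = random.Random(3)
n = 20000

def corr(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5

# U -> Income and U -> SES; there is no arrow between Income and SES.
u = [rng.gauss(0, 1) for _ in range(n)]
income = [ui + rng.gauss(0, 1) for ui in u]
ses = [ui + rng.gauss(0, 1) for ui in u]
corr_observed = corr(income, ses)

# Intervention: overwrite Income with values chosen independently of U.
# SES is generated exactly as before, so it cannot respond to the change.
income_set = [rng.gauss(0, 1) for _ in range(n)]
corr_after_intervention = corr(income_set, ses)

print(f"observed correlation:      {corr_observed:.2f}")
print(f"after setting Income:      {corr_after_intervention:.2f}")
```

Under these assumptions the observed correlation should be near 0.5 and the post-intervention correlation near 0: association without causation.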
\(X \rightarrow Y\) (or \(X \rightarrow M \rightarrow Y\))
Direct or indirect effect of \(X\) on \(Y\), with all arrows pointing away from \(X\).
These are good paths — they represent what we want to estimate.
Example: Class Size → Test Scores
\(X \leftarrow Z \rightarrow Y\)
\(X\) and \(Y\) are connected through a common cause \(Z\). These are bad paths — they create confounding.
Example: Class Size ← Wealth → Test Scores

\(X \rightarrow Z \leftarrow Y\)
\(X\) and \(Y\) both cause \(Z\), so \(Z\) is a collider. These paths are closed by default: no association flows through them unless we control for \(Z\).
When we say “control for \(Z\)” in a regression, we are comparing units that have the same value of \(Z\). This removes the variation in \(X\) and \(Y\) that is driven by \(Z\).
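In DAG terms: Controlling for a variable changes the status of every path that passes through it. It closes a path where the variable is a common cause or a mediator, and it opens a path where the variable is a collider. This can be good or bad: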
| Path Type | Default Status | What controlling does |
|---|---|---|
| Backdoor (\(X \leftarrow Z \rightarrow Y\)) | Open | Closes it → removes bias |
| Collider (\(X \rightarrow Z \leftarrow Y\)) | Closed | Opens it → creates bias |
| Mediator (\(X \rightarrow Z \rightarrow Y\)) | Open | Closes it → blocks that part of the causal effect |
Our goal: Close all bad (backdoor) paths while keeping good (causal) paths open — and not accidentally opening closed ones.
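The three rows of the table can be checked with simulated data. The sketch below is illustrative only: the coefficients, sample size, and helper names (`slope`, `residualize`, `slope_controlling`) are assumptions, and "controlling for \(Z\)" is implemented by residualizing on \(Z\) (the Frisch-Waugh-Lovell result), which gives the same coefficient as including \(Z\) in the regression.

```python
import random

rng = random.Random(0)
n = 20000

def slope(y, x):
    """OLS slope from regressing y on x (with an intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residualize(v, z):
    """Remove the part of v that is linearly explained by z."""
    b = slope(v, z)
    mv, mz = sum(v) / len(v), sum(z) / len(z)
    return [vi - mv - b * (zi - mz) for vi, zi in zip(v, z)]

def slope_controlling(y, x, z):
    """Coefficient on x when regressing y on both x and z."""
    return slope(residualize(y, z), residualize(x, z))

def noise():
    return rng.gauss(0, 1)

# Backdoor: X <- Z -> Y, plus X -> Y with a true effect of 2.
z = [noise() for _ in range(n)]
x = [zi + noise() for zi in z]
y = [2 * xi + 3 * zi + noise() for xi, zi in zip(x, z)]
backdoor = (slope(y, x), slope_controlling(y, x, z))

# Collider: X -> Z <- Y, and X has no effect on Y.
x = [noise() for _ in range(n)]
y = [noise() for _ in range(n)]
z = [xi + yi + noise() for xi, yi in zip(x, y)]
collider = (slope(y, x), slope_controlling(y, x, z))

# Mediator: X -> Z -> Y, total effect of X on Y is 1.5.
x = [noise() for _ in range(n)]
z = [xi + noise() for xi in x]
y = [1.5 * zi + noise() for zi in z]
mediator = (slope(y, x), slope_controlling(y, x, z))

results = [("backdoor", backdoor), ("collider", collider), ("mediator", mediator)]
for name, (naive, controlled) in results:
    print(f"{name:9s} naive={naive:6.2f}   controlling for Z={controlled:6.2f}")
```

With these made-up coefficients, the population values are about 3.5 versus 2 for the backdoor case (controlling removes the bias), 0 versus -0.5 for the collider (controlling creates an association that is not causal), and 1.5 versus 0 for the mediator (controlling removes the effect we wanted).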
Confounding
When \(X\) and \(Y\) share a common cause \(Z\), they are confounded. A regression of \(Y\) on \(X\) will pick up both the causal effect AND the confounding association.
\[\text{Association} = \underbrace{\text{Causal Effect}}_{\text{Class Size} \rightarrow \text{Scores}} + \underbrace{\text{Confounding Bias}}_{\text{via Wealth}}\]
Remember omitted variable bias from Chapter 6? Every case of OVB is an open backdoor path. The omitted variable creates a backdoor path that stays open, and the bias flows through it into \(\hat{\beta}_1\).
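The decomposition above can be written out with the standard omitted variable bias formula. As a sketch, assume (the linear form and the symbols \(\beta_0, \beta_1, \beta_2, u\) are illustrative assumptions, not part of the slides) that Wealth is the only confounder:

\[\text{Scores} = \beta_0 + \beta_1\,\text{Class Size} + \beta_2\,\text{Wealth} + u\]

If Wealth is omitted from the regression, the estimated slope on Class Size converges to

\[\hat{\beta}_1 \;\xrightarrow{p}\; \underbrace{\beta_1}_{\text{Causal Effect}} + \underbrace{\beta_2\,\frac{\text{Cov}(\text{Class Size},\,\text{Wealth})}{\text{Var}(\text{Class Size})}}_{\text{Confounding Bias}}\]

The bias term is nonzero only when both arrows of the backdoor path exist: Wealth must affect Scores (\(\beta_2 \neq 0\)) and Wealth must be related to Class Size (nonzero covariance).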
Think About It
If we regress Test Scores on Class Size without controlling for Wealth, will the class size coefficient be biased up or down?
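Answer: The coefficient on class size is biased away from zero (it overstates the negative effect). Wealthier districts tend to have both smaller classes and higher test scores, so omitting Wealth makes small classes look more beneficial than they really are.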
To identify the causal effect of \(X\) on \(Y\):
Backdoor Criterion
Control for a set of variables \(\mathbf{Z}\) such that:
In plain language:
A systematic approach to finding your controls:
Example: Class Size and Test Scores
| Path | Type | Default | Action |
|---|---|---|---|
| Class Size → Test Scores | Causal | Open | Leave open |
| Class Size ← Wealth → Test Scores | Backdoor | Open | Control for Wealth |
After controlling for Wealth: the only remaining open path is causal. The effect is identified.
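The path table above can be produced mechanically. The sketch below is a simplified illustration, not a full implementation of the backdoor criterion: the function names are hypothetical, it enumerates paths by brute force (fine for small DAGs only), and it does not check the additional requirement that controls must not be descendants of the treatment.

```python
def all_paths(edges, start, end):
    """All simple paths from start to end, ignoring arrow direction."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    paths = []
    def walk(path):
        if path[-1] == end:
            paths.append(path)
            return
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in path:
                walk(path + [nxt])
    walk([start])
    return paths

def descendants(edges, node):
    """Every node reachable from node by following arrows forward."""
    seen, stack = set(), [node]
    while stack:
        cur = stack.pop()
        for a, b in edges:
            if a == cur and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def classify(edges, path):
    """'causal' if every arrow points away from X, 'backdoor' if the path
    starts with an arrow into X, otherwise 'other'."""
    edge_set = set(edges)
    if all((a, b) in edge_set for a, b in zip(path, path[1:])):
        return "causal"
    return "backdoor" if (path[1], path[0]) in edge_set else "other"

def is_open(edges, path, controls):
    """Apply the open/closed rules to each middle node on the path."""
    edge_set, controls = set(edges), set(controls)
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        if (prev, mid) in edge_set and (nxt, mid) in edge_set:
            # Collider: closed unless it or one of its descendants is controlled.
            if mid not in controls and not (descendants(edges, mid) & controls):
                return False
        elif mid in controls:
            # Common cause or mediator: closed once we control for it.
            return False
    return True

def open_backdoors(edges, x, y, controls):
    return [p for p in all_paths(edges, x, y)
            if classify(edges, p) == "backdoor" and is_open(edges, p, controls)]

# The class size example: each pair is (cause, effect).
edges = [("Wealth", "ClassSize"), ("Wealth", "TestScores"),
         ("ClassSize", "TestScores")]

for p in all_paths(edges, "ClassSize", "TestScores"):
    print(" - ".join(p), "|", classify(edges, p))
print("open backdoors, no controls:", open_backdoors(edges, "ClassSize", "TestScores", set()))
print("open backdoors, Wealth:     ", open_backdoors(edges, "ClassSize", "TestScores", {"Wealth"}))
```

With no controls the Wealth path is listed as an open backdoor; once Wealth is in the control set the list is empty, matching the conclusion that the effect is identified.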
Bad Control: Mediator
If \(X \rightarrow M \rightarrow Y\), then \(M\) is a mediator. Controlling for \(M\) blocks part of the causal effect of \(X\) on \(Y\).
Example: Does education affect earnings?

Figure: DAG for the education and earnings example (gray node = unobserved variable)
Challenge: Ability is unobserved! We can’t directly control for it.
This is why economists turn to identification strategies — instruments, experiments, panel data — to close backdoor paths when confounders are unobservable.

The problem without randomization:
With an RCT:
Important
This is why randomized controlled trials (RCTs) are the gold standard for causal inference: when treatment is assigned at random, nothing else in the DAG causes treatment, so no backdoor path into it can exist.
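To see the contrast in numbers, here is a hedged simulation sketch. Everything in it is an assumption chosen for illustration: a single confounder `z`, a true treatment effect of 1.0, and selection into treatment driven by `z` in the observational world.

```python
import random
from statistics import fmean

rng = random.Random(1)
n = 20000
TRUE_EFFECT = 1.0

def outcome(treated, z):
    """Y depends on treatment and on the confounder Z."""
    return TRUE_EFFECT * treated + 2.0 * z + rng.gauss(0, 1)

def diff_in_means(treat, y):
    """Average outcome of treated units minus average outcome of controls."""
    treated = [yi for ti, yi in zip(treat, y) if ti == 1]
    control = [yi for ti, yi in zip(treat, y) if ti == 0]
    return fmean(treated) - fmean(control)

z = [rng.gauss(0, 1) for _ in range(n)]

# Without randomization: units with high Z select into treatment (Z -> X),
# so the backdoor path X <- Z -> Y is open.
x_obs = [1 if zi + rng.gauss(0, 1) > 0 else 0 for zi in z]
y_obs = [outcome(t, zi) for t, zi in zip(x_obs, z)]
estimate_observational = diff_in_means(x_obs, y_obs)

# With an RCT: a coin flip decides treatment, so Z has no arrow into X.
x_rct = [rng.randint(0, 1) for _ in range(n)]
y_rct = [outcome(t, zi) for t, zi in zip(x_rct, z)]
estimate_rct = diff_in_means(x_rct, y_rct)

print(f"true effect:             {TRUE_EFFECT:.2f}")
print(f"observational estimate:  {estimate_observational:.2f}")
print(f"RCT estimate:            {estimate_rct:.2f}")
```

Under these assumptions the observational comparison lands well above the true effect, while the randomized comparison recovers it up to sampling noise, without ever measuring `z`.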

Reverse Causation
Sometimes the arrow between \(X\) and \(Y\) runs in the opposite direction from what we assume.
In DAG terms: we’ve drawn the wrong DAG. The true DGP has the arrow reversed — or both arrows exist.
Remember: the “A” in DAG stands for acyclic. DAGs cannot represent simultaneity — when \(X\) causes \(Y\) and \(Y\) simultaneously causes \(X\).
Two approaches when you suspect reverse causation or simultaneity:
Approach 1: Use time subscripts
Separate variables by time period: \(X_t \rightarrow Y_{t+1}\) and \(Y_t \rightarrow X_{t+1}\). This converts a cycle into an acyclic graph — causality flows forward through time. Works well when the feedback loop operates with a lag (e.g., crime this year → police budget next year).
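The time-subscript trick can be sketched in code. The helper `unroll` below is hypothetical: it takes lagged causal links and builds a time-indexed graph in which each variable appears once per period. Python's standard-library `graphlib` then confirms that the unrolled graph has no cycle, while the untimed version does.

```python
from graphlib import TopologicalSorter, CycleError

def is_acyclic(parents):
    """parents maps each node to the set of nodes with an arrow into it."""
    try:
        TopologicalSorter(parents).prepare()
    except CycleError:
        return False
    return True

def unroll(lagged_edges, periods):
    """Turn each (cause, effect) pair into cause_t -> effect_{t+1}.
    Nodes of the result are (variable, period) tuples."""
    variables = {v for edge in lagged_edges for v in edge}
    parents = {(v, t): set() for v in variables for t in range(periods)}
    for cause, effect in lagged_edges:
        for t in range(periods - 1):
            parents[(effect, t + 1)].add((cause, t))
    return parents

# Feedback loop: crime affects next year's police budget, and police
# affect next year's crime.
feedback = [("Crime", "Police"), ("Police", "Crime")]

# Without time subscripts the two arrows form a cycle.
untimed = {"Crime": {"Police"}, "Police": {"Crime"}}
print(is_acyclic(untimed))              # False

# With time subscripts, causality only flows forward through time.
timed = unroll(feedback, periods=3)
print(is_acyclic(timed))                # True
print(timed[("Police", 1)])             # {('Crime', 0)}
```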
Approach 2: Use instrumental variables
When the relationship is truly simultaneous (e.g., price and quantity in a market), the DAG framework alone can’t solve it. This is where instrumental variables and 2SLS become the right tools — you need an external source of variation that shifts one variable without directly affecting the other. We’ll cover this in detail later in the course.

Gray = unobserved
Measurement Error
When we can’t perfectly measure a variable in our DAG, we introduce measurement error.
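One well-known consequence can be illustrated with a simulation. The sketch below assumes classical measurement error, meaning the recorded value equals the true value plus noise that is unrelated to everything else; that assumption, the coefficients, and the variable names are illustrative and not taken from the slides. In DAG terms, the true variable causes both the outcome and its own noisy measurement, and we only observe the measurement.

```python
import random

rng = random.Random(2)
n = 20000

def slope(y, x):
    """OLS slope from regressing y on x (with an intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# True X causes Y with an effect of 2.
x_true = [rng.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + rng.gauss(0, 1) for xi in x_true]

# We only record X with noise: X_true -> X_measured <- noise.
x_measured = [xi + rng.gauss(0, 1) for xi in x_true]

slope_true = slope(y, x_true)
slope_measured = slope(y, x_measured)

print(f"slope using true X:      {slope_true:.2f}")
print(f"slope using measured X:  {slope_measured:.2f}")
```

Under these assumptions the slope on the mismeasured variable is pulled toward zero (here, to roughly half the true effect, because the noise variance equals the variance of the true variable).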

Does Income confound the effect of Health Insurance on Health?
To estimate the effect of Health Insurance on Health, should we control for Income?
Answer

Questions:
Answer
St Leger, Cochrane, & Moore (1979). The Lancet.
In 1979, researchers found a strong negative association between wine consumption and heart disease across 18 developed countries.


Red wine sales in the US jumped nearly 40% in 1992. The idea that wine was good for your heart became conventional wisdom.
Moderate wine drinkers tend to be more educated, wealthier, more physically active, and more likely to have health insurance.
And many “non-drinkers” were actually ex-drinkers who had quit because of health problems.
In DAG terms: What kinds of bias are these?

Analyses using Mendelian randomization, which treats genetic variants as instruments, found that alcohol consumption at all levels was associated with increased cardiovascular risk.

Even among “low-risk” drinkers, alcohol was associated with higher mortality after accounting for health and SES risk factors.
Now apply DAG thinking to this question using the worksheet handout:
Tip from The Effect
Start with your treatment and outcome. Then ask: “What are all the things that cause my treatment? What are all the things that cause my outcome? Do any of those overlap?” Those overlapping causes are your confounders.
The DGP is what we’re trying to understand. A DAG is our best model of that process.
Association ≠ Causation. DAGs help us think systematically about why.
OVB = an open backdoor path. If a confounder is omitted, the path stays open and your estimate is biased.
The Backdoor Criterion is mechanical. Once you draw the DAG, you can determine what to control for.
Collider bias is real. Controlling for the wrong variable can introduce bias by opening a closed path.
Randomization is powerful. Random assignment closes backdoor paths by design.
These ideas connect directly to Chapter 9: Assessing Regression Validity:
The DAG framework gives you a visual language for diagnosing every threat to internal validity.
The backdoor criterion is built on a more general concept called d-separation, the rule that determines whether a path is open or closed given a set of controls.
You don’t need to know d-separation for this course — the backdoor criterion is the tool you’ll use. But if you want to go deeper, see The Effect Chapter 8 or Pearl (2009).
ECON3500 | Causal Diagrams