Instrumental Variables: When Observational Data Meets Unmeasured Confounding
Every propensity score analysis carries the same caveat: "we cannot rule out unmeasured confounding." Instrumental variable (IV) methods are one of the few approaches that can address this — under the right conditions. This guide covers what instruments are, where to find them in clinical research, and why most IV analyses fail on assumptions researchers never check.
In this guide
- 1. What an instrumental variable actually is
- 2. The three conditions (and why the third is untestable)
- 3. Common instruments in clinical research
- 4. LATE: the estimand you didn't know you were estimating
- 5. Four ways IV analyses go wrong
- 6. Reporting checklist for IV studies
- 7. Mendelian randomization: genes as instruments
- 8. Automated IV critique with Aqrab
1. What an instrumental variable actually is
An instrument is a variable that affects the outcome only through the treatment. It nudges people toward or away from treatment without directly influencing the outcome itself. Think of it as a natural experiment embedded in observational data — something that shifts treatment assignment quasi-randomly.
The classic analogy: distance to hospital
Patients who live closer to a specialized hospital are more likely to receive the specialized procedure. The working assumption is that distance from home does not directly cause better or worse outcomes; it only changes where you are treated, and therefore which treatment you get. If that assumption holds, distance is the instrument.
The power of IV methods is that they can handle unmeasured confounders — the exact problem that propensity score methods cannot solve. The cost is a narrower estimand and stronger assumptions.
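A small simulation can make this concrete. The sketch below is illustrative only: the data-generating process, the coefficients, and the helper names (`cov`, `simulate`) are assumptions, not taken from any real study. It builds data with an unmeasured confounder U, then compares a naive regression slope with the simple IV ratio (reduced form divided by first stage).

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (n - 1)

def simulate(n=20000, true_effect=2.0, seed=0):
    """Simulated data with an unmeasured confounder U (all values hypothetical)."""
    rng = random.Random(seed)
    z, d, y = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)                        # unmeasured confounder
        zi = 1.0 if rng.random() < 0.5 else 0.0    # instrument, independent of U
        di = 1.0 * zi + u + rng.gauss(0, 1)        # treatment depends on Z and U
        yi = true_effect * di + 2.0 * u + rng.gauss(0, 1)  # Z reaches Y only via D
        z.append(zi)
        d.append(di)
        y.append(yi)
    return z, d, y

z, d, y = simulate()
ols = cov(d, y) / cov(d, d)   # naive slope of Y on D, confounded by U
iv = cov(z, y) / cov(z, d)    # IV ratio: reduced form / first stage
print(f"OLS: {ols:.2f}  IV: {iv:.2f}  truth: 2.00")
```

In this setup the naive slope should land well above the true effect, while the IV ratio should land close to it. That only holds because the simulation builds in a valid instrument by construction.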
2. The three conditions (and why the third is untestable)
For a variable Z to be a valid instrument for the effect of treatment D on outcome Y, three conditions must hold:
1. Relevance — Z must predict treatment D
The instrument must have a strong first-stage effect on the treatment. A weak instrument (F-statistic < 10 in the first stage) inflates standard errors and biases results toward the OLS estimate. This is the only condition you can fully test with your data.
2. Independence — Z is as good as randomly assigned
The instrument should not share common causes with the outcome. You can check balance on observables (like checking covariate balance in an RCT), but you cannot test whether unmeasured factors correlate with Z. DAGs help here — draw the assumed causal structure and look for back-door paths from Z to Y.
3. Exclusion restriction — Z affects Y only through D
This is the assumption that breaks most IV analyses. The instrument must have no direct effect on the outcome — it works only by changing treatment. You cannot test this with data. It must be defended on substantive, domain-knowledge grounds. If this fails, your IV estimate is biased in unknown directions.
The asymmetry is important: relevance is empirically checkable, independence is partially checkable, but the exclusion restriction is purely an argument. This is why IV papers live or die on their narrative defense of this assumption.
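Since relevance is the one condition that can be checked directly, here is a minimal sketch of the first-stage F-statistic for a single instrument, using the classical (homoskedastic) formula. The function names and simulated effect sizes are assumptions chosen for illustration; a real analysis would normally include covariates and robust standard errors.

```python
import random

def first_stage_f(z, d):
    """Classical F-statistic for one instrument in the regression D = a + b*Z + e."""
    n = len(z)
    mz, md = sum(z) / n, sum(d) / n
    szz = sum((zi - mz) ** 2 for zi in z)
    szd = sum((zi - mz) * (di - md) for zi, di in zip(z, d))
    b = szd / szz                      # first-stage slope
    a = md - b * mz                    # first-stage intercept
    rss = sum((di - a - b * zi) ** 2 for zi, di in zip(z, d))
    se_b = (rss / (n - 2) / szz) ** 0.5
    return (b / se_b) ** 2             # with one instrument, F equals the squared t-statistic

def simulate_first_stage(strength, n=2000, seed=0):
    """Hypothetical data: binary Z shifts D by `strength` on average."""
    rng = random.Random(seed)
    z = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]
    d = [strength * zi + rng.gauss(0, 1) for zi in z]
    return z, d

f_strong = first_stage_f(*simulate_first_stage(strength=0.5))
f_weak = first_stage_f(*simulate_first_stage(strength=0.02))
print(f"strong instrument F = {f_strong:.1f}, weak instrument F = {f_weak:.1f}")
```

The same sample size gives very different F-statistics depending on how much Z moves D, which is why the first stage should be reported rather than assumed.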
3. Common instruments in clinical research
The best instruments come from institutional or natural variation that is plausibly unrelated to patient characteristics. Randomization from a prior trial is the gold standard. Everything else requires careful justification.
4. LATE: the estimand you didn't know you were estimating
Without further assumptions such as a constant treatment effect, IV methods do not estimate the Average Treatment Effect (ATE). Under monotonicity they estimate the Local Average Treatment Effect (LATE): the effect of treatment among "compliers," the subpopulation whose treatment status is changed by the instrument.
The four principal strata
Compliers
Take treatment when Z=1, don't when Z=0. The group IV identifies.
Always-takers
Take treatment regardless of Z. IV learns nothing about them.
Never-takers
Refuse treatment regardless of Z. IV learns nothing about them.
Defiers
Do the opposite of Z. Assumed not to exist (monotonicity assumption).
This matters because LATE may not generalize. If you use distance to hospital as an instrument, you're estimating the treatment effect for patients whose treatment choice was influenced by geography — not the effect for all patients. Whether this subpopulation is clinically interesting is a question you must answer before choosing IV methods.
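The link between the IV ratio and the complier effect can be written out in potential-outcomes notation. The symbols Y(d) and D(z) are introduced here for illustration; the sketch assumes a binary instrument and a binary treatment, with independence, exclusion, and monotonicity all holding:

```latex
\begin{align*}
E[Y \mid Z=1] - E[Y \mid Z=0]
  &= E\big[(Y(1) - Y(0))\,(D(1) - D(0))\big] \\
  &= E\big[Y(1) - Y(0) \mid D(1) > D(0)\big]\,\Pr\big(D(1) > D(0)\big), \\
E[D \mid Z=1] - E[D \mid Z=0]
  &= \Pr\big(D(1) > D(0)\big), \\
\text{LATE}
  &= \frac{E[Y \mid Z=1] - E[Y \mid Z=0]}{E[D \mid Z=1] - E[D \mid Z=0]}
   = E\big[Y(1) - Y(0) \mid D(1) > D(0)\big].
\end{align*}
```

Always-takers and never-takers have D(1) = D(0), so they drop out of the first line. Only compliers, for whom D(1) > D(0), contribute.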
5. Four ways IV analyses go wrong
1. Weak instruments
An instrument with a first-stage F-statistic below 10 tends to produce biased estimates and confidence intervals that are wide yet can still undercover. With weak instruments, 2SLS is biased toward OLS, which defeats the entire purpose. Use the Stock-Yogo critical values or the effective F-statistic (Montiel Olea & Pflueger, 2013) for proper diagnosis.
2. Exclusion restriction violations
The most common failure. Distance to hospital may correlate with SES. Physician preference may correlate with expertise. Calendar time instruments may co-occur with other policy changes. If Z reaches Y by any path that bypasses D, whether a direct effect or a shared cause such as SES, the estimate is biased, and you cannot detect this from data alone.
3. Monotonicity violations
IV assumes no defiers — no patients who systematically do the opposite of what the instrument predicts. In practice, this fails when treatment decisions involve complex trade-offs. A patient near a surgical center might specifically seek conservative treatment elsewhere because they distrust the local hospital.
4. Overinterpreting LATE as ATE
Researchers frequently present IV estimates as if they apply to the full population. They do not. If only 15% of patients are compliers, your estimate reflects a narrow subgroup. Always discuss the likely characteristics of compliers and whether the LATE is policy-relevant.
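The gap between LATE and ATE is easy to see in a simulation with heterogeneous effects. In the sketch below, the strata shares (15% compliers, to match the example above) and the per-stratum effects are hypothetical, and there are no defiers by construction. With a binary instrument and treatment, the first-stage difference estimates the complier share and the IV ratio estimates the complier effect.

```python
import random

SHARES = {"complier": 0.15, "always": 0.30, "never": 0.55}   # hypothetical strata shares
EFFECTS = {"complier": 1.0, "always": 0.5, "never": 0.0}     # hypothetical treatment effects

def simulate(n=200000, seed=1):
    """Draw (z, d, y) triples; the stratum itself is never observed by the analyst."""
    rng = random.Random(seed)
    strata, weights = list(SHARES), list(SHARES.values())
    data = []
    for _ in range(n):
        stratum = rng.choices(strata, weights=weights)[0]
        z = rng.random() < 0.5                                   # randomized instrument
        d = z if stratum == "complier" else (stratum == "always")
        y = EFFECTS[stratum] * d + rng.gauss(0, 1)
        data.append((z, d, y))
    return data

def mean(values):
    values = list(values)
    return sum(values) / len(values)

data = simulate()
d1 = mean(d for z, d, _ in data if z)
d0 = mean(d for z, d, _ in data if not z)
y1 = mean(y for z, _, y in data if z)
y0 = mean(y for z, _, y in data if not z)

complier_share = d1 - d0                              # first stage = share of compliers
late = (y1 - y0) / complier_share                     # IV ratio = effect among compliers
ate = sum(SHARES[s] * EFFECTS[s] for s in SHARES)     # population average, known only in simulation
print(f"complier share: {complier_share:.3f}  LATE: {late:.2f}  ATE: {ate:.2f}")
```

Here the IV estimate tracks the complier effect, not the population average. In real data the ATE line cannot be computed, which is why the complier subpopulation has to be described rather than assumed away.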
6. Reporting checklist for IV studies
IV studies face higher scrutiny because the assumptions are strong and largely untestable. Here is what reviewers expect to see:
Instrument definition: what is Z, how is it measured, and why is it plausibly exogenous
DAG: directed acyclic graph showing assumed causal structure with Z, D, Y, and potential confounders
First-stage strength: F-statistic (≥ 10 as the conventional minimum; with a single instrument, roughly 104.7 is needed for the standard 5% t-test to have correct size, per Lee et al., 2022)
Balance check: show that the instrument is not associated with measured confounders
Exclusion restriction defense: substantive argument for why Z cannot directly affect Y
Monotonicity argument: why defiers are implausible in this clinical setting
Estimand stated explicitly: LATE, and description of the likely complier subpopulation
Method: 2SLS, LIML, or other estimator with justification
Sensitivity analysis: what if the exclusion restriction is slightly violated? (Conley et al., 2012; van Kippersluis & Rietveld, 2018)
Comparison with OLS/PSM: show results side-by-side, discuss why they differ
Overidentification test (if multiple instruments): Hansen J test of the overidentifying restrictions, noting that a non-rejection does not prove the instruments are valid
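For the sensitivity-analysis item, one simple version of the idea in Conley et al. (2012) is to assume the instrument has a direct effect gamma on the outcome and see how the IV estimate moves as gamma varies. The sketch below handles only the single-instrument, no-covariate case and returns point estimates, not the interval procedures from the paper. The function name and the sample moments are hypothetical.

```python
def iv_given_direct_effect(cov_zy, cov_zd, var_z, gamma):
    """IV estimate of the D -> Y effect if Z also shifts Y directly by gamma per unit of Z.

    Equivalent to running the simple IV ratio on the adjusted outcome Y - gamma * Z.
    """
    return (cov_zy - gamma * var_z) / cov_zd

# Hypothetical sample moments, for illustration only
cov_zy, cov_zd, var_z = 0.50, 0.25, 0.25

gammas = [-0.2, -0.1, 0.0, 0.1, 0.2]   # assumed range of plausible direct effects
estimates = {g: iv_given_direct_effect(cov_zy, cov_zd, var_z, g) for g in gammas}

for g, b in estimates.items():
    print(f"gamma = {g:+.1f}  ->  IV estimate = {b:.2f}")
```

If the conclusion survives across the range of gamma you consider defensible, the exclusion-restriction argument carries less of the weight. Choosing that range is itself a substantive judgment.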
7. Mendelian randomization: genes as instruments
Mendelian randomization (MR) uses genetic variants as instruments. Because alleles are assigned at conception (Mendel's second law), they plausibly satisfy the independence condition: within a population they are distributed approximately at random with respect to most confounders.
Strengths
- Independence is biologically grounded
- Large GWAS datasets readily available
- Two-sample MR avoids need for individual-level data
- Can test causal direction (bidirectional MR)
Threats
- Horizontal pleiotropy: gene affects Y through paths other than D
- Population stratification: ancestry confounds
- Dynastic effects: parental genotype affects offspring environment
- Weak instrument bias (rare variants, small effect sizes)
- Winner's curse: GWAS hits may overestimate instrument strength
MR has become one of the most active areas in causal inference. Methods like MR-Egger, weighted median, and MR-PRESSO help detect and correct for pleiotropy. But the fundamental tension remains: the more variants you include for power, the higher the risk that at least one violates the exclusion restriction.
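To show how summary-level MR estimators react to pleiotropy, here is a minimal sketch of the inverse-variance weighted (IVW) estimate and the MR-Egger regression, computed from per-variant associations. The summary statistics are made up for illustration, the function names are hypothetical, and only point estimates are returned (no standard errors). MR-Egger also relies on its own assumption that pleiotropic effects are independent of instrument strength.

```python
def ivw(bx, by, se_y):
    """Inverse-variance weighted estimate: weighted regression of by on bx through the origin."""
    w = [1 / s ** 2 for s in se_y]
    num = sum(wi * x * y for wi, x, y in zip(w, bx, by))
    den = sum(wi * x * x for wi, x in zip(w, bx))
    return num / den

def mr_egger(bx, by, se_y):
    """MR-Egger: the same weighted regression, but with an intercept.

    A non-zero intercept suggests directional pleiotropy; the slope is the causal estimate.
    """
    # Orient each variant so that its association with the exposure is positive
    by = [y if x >= 0 else -y for x, y in zip(bx, by)]
    bx = [abs(x) for x in bx]
    w = [1 / s ** 2 for s in se_y]
    sw = sum(w)
    xbar = sum(wi * x for wi, x in zip(w, bx)) / sw
    ybar = sum(wi * y for wi, y in zip(w, by)) / sw
    sxy = sum(wi * (x - xbar) * (y - ybar) for wi, x, y in zip(w, bx, by))
    sxx = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, bx))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Hypothetical summary statistics: variant-exposure (bx) and variant-outcome (by) associations
bx = [0.10, 0.15, 0.20, 0.25, 0.30]
se_y = [0.02, 0.02, 0.03, 0.03, 0.04]
by_clean = [0.5 * x for x in bx]          # true effect 0.5, no pleiotropy
by_pleio = [0.1 + 0.5 * x for x in bx]    # same effect plus a constant pleiotropic shift

ivw_clean, ivw_pleio = ivw(bx, by_clean, se_y), ivw(bx, by_pleio, se_y)
egger_intercept, egger_slope = mr_egger(bx, by_pleio, se_y)
print(f"IVW clean: {ivw_clean:.2f}  IVW with pleiotropy: {ivw_pleio:.2f}")
print(f"Egger intercept: {egger_intercept:.2f}  Egger slope: {egger_slope:.2f}")
```

With the constant pleiotropic shift, the IVW estimate moves away from 0.5, while the Egger intercept absorbs the shift and the slope stays at 0.5. Real data are noisy, so this clean separation should not be expected in practice.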
8. Automated IV critique with Aqrab
Aqrab checks your IV assumptions before a reviewer does. Paste your study protocol and get a structured critique — instrument strength assessment, exclusion restriction analysis, estimand alignment check, and specific recommendations for strengthening your analysis.
Further Reading
- Angrist JD, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables. Journal of the American Statistical Association. 1996;91(434):444-455.
- Hernán MA, Robins JM. Instruments for causal inference: an epidemiologist's dream? Epidemiology. 2006;17(4):360-372.
- Davies NM, Holmes MV, Davey Smith G. Reading Mendelian randomisation studies: a guide, glossary, and checklist for clinicians. BMJ. 2018;362:k601.
- Stock JH, Yogo M. Testing for weak instruments in linear IV regression. In: Identification and Inference for Econometric Models. Cambridge University Press. 2005.
- Swanson SA, Hernán MA. Commentary: how to report instrumental variable analyses. Epidemiology. 2013;24(3):370-374.
- Burgess S, Thompson SG. Mendelian Randomization: Methods for Using Genetic Variants in Causal Estimation. Chapman & Hall/CRC. 2015.