Causal Inference · Instrumental Variables · Unmeasured Confounding

Instrumental Variables: When Observational Data Meets Unmeasured Confounding

March 26, 2026 · 14 min read · By Coefficients Health Analytics

Every propensity score analysis carries the same caveat: "we cannot rule out unmeasured confounding." Instrumental variable (IV) methods are one of the few approaches that can address this — under the right conditions. This guide covers what instruments are, where to find them in clinical research, and why most IV analyses fail on assumptions researchers never check.

In this guide

1. What an instrumental variable actually is
2. The three conditions (and why the third is untestable)
3. Common instruments in clinical research
4. LATE: the estimand you didn't know you were estimating
5. Four ways IV analyses go wrong
6. Reporting checklist for IV studies
7. Mendelian randomization: genes as instruments
8. Automated IV critique with Aqrab

1. What an instrumental variable actually is

An instrument is a variable that changes who gets treated but affects the outcome only through the treatment. It nudges people toward or away from treatment without directly influencing the outcome itself. Think of it as a natural experiment embedded in observational data: something that shifts treatment assignment quasi-randomly.

The classic analogy: distance to hospital

Patients who live closer to a specialized hospital are more likely to receive the specialized procedure. The argument is that distance from home doesn't directly cause better or worse outcomes; it only changes where you get treated, and therefore which treatment you get. If that argument holds, distance is the instrument.

The power of IV methods is that they can handle unmeasured confounders — the exact problem that propensity score methods cannot solve. The cost is a narrower estimand and stronger assumptions.
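A small simulation can make this trade-off concrete. The sketch below is illustrative only: the data-generating process, the coefficient values, and the function names `simulate`, `ols_slope`, and `iv_slope` are assumptions chosen for the example, not taken from any real study. It compares a naive regression slope with the simple IV ratio estimator when a confounder `u` is never observed.

```python
import numpy as np

def simulate(n=200_000, true_effect=1.0, seed=0):
    """Simulated data with an unmeasured confounder (all values are assumptions)."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                 # unmeasured confounder
    z = rng.binomial(1, 0.5, size=n)       # instrument, independent of u
    d = 0.8 * z + u + rng.normal(size=n)   # treatment intensity depends on z and u
    y = true_effect * d + 2.0 * u + rng.normal(size=n)  # z enters y only through d
    return z, d, y

def ols_slope(x, y):
    """Naive regression slope of y on x (ignores confounding)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def iv_slope(z, d, y):
    """Single-instrument IV estimate: Cov(Z, Y) / Cov(Z, D)."""
    return np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]

z, d, y = simulate()
print("naive slope:", ols_slope(d, y))
print("IV estimate:", iv_slope(z, d, y))
```

With these assumed coefficients, the naive slope should land well above the true effect of 1.0, while the IV ratio should sit close to it, because `z` is independent of `u` by construction. In real data that independence is exactly what has to be argued.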

2. The three conditions (and why the third is untestable)

For a variable Z to be a valid instrument for the effect of treatment D on outcome Y, three conditions must hold:

Testable

1. Relevance — Z must predict treatment D

The instrument must have a strong first-stage effect on the treatment. A weak instrument (F-statistic < 10 in the first stage) inflates standard errors and biases results toward the OLS estimate. This is the only condition you can fully test with your data.
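As a rough illustration of the relevance check, the sketch below computes the first-stage F-statistic by hand for one instrument and no covariates. The function name `first_stage_f` is hypothetical, and the calculation assumes homoskedastic errors; robust or "effective" F-statistics are not implemented here.

```python
import numpy as np

def first_stage_f(z, d):
    """F-statistic for H0: b = 0 in the first stage D = a + b*Z + e.

    Assumes a single instrument, no covariates, homoskedastic errors.
    With one instrument, F equals the squared t-statistic on Z.
    """
    z = np.asarray(z, dtype=float)
    d = np.asarray(d, dtype=float)
    n = len(z)
    X = np.column_stack([np.ones(n), z])          # intercept + instrument
    beta, *_ = np.linalg.lstsq(X, d, rcond=None)  # first-stage coefficients
    resid = d - X @ beta
    sigma2 = resid @ resid / (n - 2)              # residual variance
    var_b = sigma2 * np.linalg.inv(X.T @ X)[1, 1]  # variance of the Z coefficient
    return beta[1] ** 2 / var_b

# Simulated example: the first-stage coefficient 0.5 is an assumption
rng = np.random.default_rng(0)
z = rng.normal(size=1000)
d = 0.5 * z + rng.normal(size=1000)
print("first-stage F:", first_stage_f(z, d))
```

In an applied analysis you would normally take this number from your regression software and include the same covariates you use in the second stage.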

Partially testable

2. Independence — Z is as good as randomly assigned

The instrument should not share common causes with the outcome. You can check balance on observables (like checking covariate balance in an RCT), but you cannot test whether unmeasured factors correlate with Z. DAGs help here — draw the assumed causal structure and look for back-door paths from Z to Y.
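One way to run the balance check on observables is to compute standardized mean differences of each measured covariate across instrument levels. The sketch below assumes a binary instrument; the function name `standardized_mean_differences` and the toy values are hypothetical. A common rule of thumb flags absolute differences above 0.1, but that threshold is a convention, not a test of independence.

```python
import numpy as np

def standardized_mean_differences(z, covariates):
    """Standardized mean difference of each covariate between Z=1 and Z=0.

    z: binary instrument (0/1). covariates: dict mapping name -> array.
    Only measured covariates can be checked; balance here says nothing
    about unmeasured ones.
    """
    z = np.asarray(z).astype(bool)
    result = {}
    for name, values in covariates.items():
        x = np.asarray(values, dtype=float)
        x1, x0 = x[z], x[~z]
        pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)
        result[name] = (x1.mean() - x0.mean()) / pooled_sd
    return result

# Toy data (made up for illustration)
z = [0, 0, 0, 1, 1, 1]
covs = {"age": [61, 62, 63, 63, 64, 65], "bmi": [24, 27, 30, 25, 27, 29]}
for name, smd in standardized_mean_differences(z, covs).items():
    flag = "check" if abs(smd) > 0.1 else "ok"
    print(f"{name}: SMD = {smd:.2f} ({flag})")
```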

Untestable

3. Exclusion restriction — Z affects Y only through D

This is the assumption that breaks most IV analyses. The instrument must have no direct effect on the outcome — it works only by changing treatment. You cannot test this with data. It must be defended on substantive, domain-knowledge grounds. If this fails, your IV estimate is biased in unknown directions.

The asymmetry is important: relevance is empirically checkable, independence is partially checkable, but the exclusion restriction is purely an argument. This is why IV papers live or die on their narrative defense of this assumption.

3. Common instruments in clinical research

| Instrument | Mechanism | Exclusion risk |
|---|---|---|
| Geographic distance | Distance to specialist center → treatment type | Distance may correlate with SES, rurality, comorbidity burden |
| Physician preference | Doctors have stable treatment habits | Preference may correlate with skill, volume, or patient selection |
| Calendar time / policy change | Guideline or formulary changes shift treatment | Co-interventions may change simultaneously |
| Day of admission | Weekend vs. weekday affects staffing and procedures | Weekend patients may differ in severity (presentation bias) |
| Genetic variants (MR) | Alleles affect biomarker levels | Pleiotropy: gene may affect outcome through other pathways |
| Randomization in prior trial | Original trial assignment as instrument | Cleanest instrument; exclusion risk minimal if trial was well-conducted |

The best instruments come from institutional or natural variation that is plausibly unrelated to patient characteristics. Randomization from a prior trial is the gold standard. Everything else requires careful justification.

4. LATE: the estimand you didn't know you were estimating

IV methods do not estimate the Average Treatment Effect (ATE). They estimate the Local Average Treatment Effect (LATE) — the effect of treatment among "compliers," the subpopulation whose treatment status is changed by the instrument.

The four principal strata

Compliers

Take treatment when Z=1, don't when Z=0. The group IV identifies.

Always-takers

Take treatment regardless of Z. IV learns nothing about them.

Never-takers

Refuse treatment regardless of Z. IV learns nothing about them.

Defiers

Do the opposite of Z. Assumed not to exist (monotonicity assumption).

This matters because LATE may not generalize. If you use distance to hospital as an instrument, you're estimating the treatment effect for patients whose treatment choice was influenced by geography — not the effect for all patients. Whether this subpopulation is clinically interesting is a question you must answer before choosing IV methods.
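The gap between LATE and ATE is easy to see in a simulation where you know each person's stratum, which you never do in real data. The sketch below is illustrative: the stratum shares, the effect sizes, and the names `simulate_strata` and `wald_estimate` are all assumptions. It uses the Wald ratio, the effect of Z on Y divided by the effect of Z on D, for a binary instrument and treatment.

```python
import numpy as np

def simulate_strata(n=400_000, seed=1):
    """Binary Z and D with known principal strata (shares are assumptions)."""
    rng = np.random.default_rng(seed)
    # 0 = complier, 1 = always-taker, 2 = never-taker; no defiers (monotonicity)
    stratum = rng.choice(3, size=n, p=[0.3, 0.3, 0.4])
    z = rng.binomial(1, 0.5, size=n)
    d = np.where(stratum == 0, z, np.where(stratum == 1, 1, 0))
    # Hypothetical effects: 2.0 for compliers, 0.5 for everyone else
    effect = np.where(stratum == 0, 2.0, 0.5)
    y = effect * d + rng.normal(size=n)
    return z, d, y, effect

def wald_estimate(z, d, y):
    """Return (Wald ratio, first-stage difference).

    The first-stage difference E[D|Z=1] - E[D|Z=0] also estimates the
    complier share when there are no defiers.
    """
    itt_y = y[z == 1].mean() - y[z == 0].mean()
    itt_d = d[z == 1].mean() - d[z == 0].mean()
    return itt_y / itt_d, itt_d

z, d, y, effect = simulate_strata()
late, complier_share = wald_estimate(z, d, y)
print("Wald / LATE estimate:", late)
print("estimated complier share:", complier_share)
print("true population average effect:", effect.mean())
```

Under these assumed values the Wald ratio should track the complier effect (2.0) rather than the population average effect (about 0.95), and the first-stage difference should be close to the 30% complier share.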

5. Four ways IV analyses go wrong

Critical

1. Weak instruments

An instrument with a first-stage F-statistic below 10 produces biased estimates with wide, unreliable confidence intervals. With weak instruments, 2SLS is biased toward OLS, which defeats the entire purpose. Use the Stock-Yogo critical values or the effective F-statistic (Montiel Olea & Pflueger, 2013) for proper diagnosis.
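A quick Monte Carlo shows the drift toward OLS. This is a sketch under assumed values: the sample size, the first-stage coefficients 0.5 and 0.02, and the function names `iv_estimate` and `median_iv` are all made up for illustration. It reports the median across replications because the single-instrument IV estimator has very heavy tails, so the mean is not a useful summary.

```python
import numpy as np

def iv_estimate(z, d, y):
    """Single-instrument IV estimate: Cov(Z, Y) / Cov(Z, D)."""
    return np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]

def median_iv(pi, n=500, reps=2000, seed=0):
    """Median IV estimate over repeated samples; pi is the first-stage coefficient."""
    rng = np.random.default_rng(seed)
    estimates = np.empty(reps)
    for r in range(reps):
        z = rng.normal(size=n)
        u = rng.normal(size=n)                      # unmeasured confounder
        d = pi * z + u + rng.normal(size=n)
        y = 1.0 * d + 2.0 * u + rng.normal(size=n)  # true effect is 1.0
        estimates[r] = iv_estimate(z, d, y)
    return np.median(estimates)

print("strong instrument (pi = 0.5): ", median_iv(0.5))
print("weak instrument   (pi = 0.02):", median_iv(0.02))
```

In this setup the confounded OLS slope is roughly 2. The strong-instrument median should sit near the true effect of 1.0, while the weak-instrument median should be pulled most of the way toward the OLS value.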

Critical

2. Exclusion restriction violations

The most common failure. Distance to hospital may correlate with SES. Physician preference may correlate with expertise. Calendar time instruments may co-occur with other policy changes. If Z has any direct path to Y, the estimate is biased — and you cannot detect this from data alone.
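To get a feel for how much damage a small violation can do, consider a linear sketch in which Z has a direct effect `gamma` on Y and a first-stage effect `pi` on D. In that setting the IV estimate converges to the true effect plus `gamma / pi`, so a weak first stage magnifies even a tiny direct effect. The model, the parameter values, and the function name `iv_with_direct_effect` are assumptions for illustration.

```python
import numpy as np

def iv_with_direct_effect(gamma, pi, n=200_000, true_effect=1.0, seed=0):
    """IV estimate when Z affects Y directly with coefficient gamma.

    In this assumed linear model the estimate converges to
    true_effect + gamma / pi.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    u = rng.normal(size=n)                # unmeasured confounder
    d = pi * z + u + rng.normal(size=n)   # first stage
    y = true_effect * d + gamma * z + 2.0 * u + rng.normal(size=n)
    return np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]

for gamma, pi in [(0.0, 0.2), (0.1, 0.8), (0.1, 0.2)]:
    est = iv_with_direct_effect(gamma, pi)
    print(f"gamma={gamma}, pi={pi}: estimate={est:.3f}, "
          f"expected={1.0 + gamma / pi:.3f}")
```

Nothing in the observed data distinguishes the biased runs from the valid one, which is the practical meaning of "untestable".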

Major

3. Monotonicity violations

IV assumes no defiers — no patients who systematically do the opposite of what the instrument predicts. In practice, this fails when treatment decisions involve complex trade-offs. A patient near a surgical center might specifically seek conservative treatment elsewhere because they distrust the local hospital.
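To see why defiers matter, it helps to write out what the Wald ratio converges to when they exist. In the expression below, $\pi_c$ and $\pi_d$ are the population shares of compliers and defiers, and $\tau_c$ and $\tau_d$ are their average treatment effects. This notation is introduced here for illustration and assumes a binary instrument and treatment, with independence and the exclusion restriction holding.

```latex
\[
\frac{E[Y \mid Z=1] - E[Y \mid Z=0]}{E[D \mid Z=1] - E[D \mid Z=0]}
  = \frac{\pi_c \tau_c - \pi_d \tau_d}{\pi_c - \pi_d}
\]
```

With $\pi_d = 0$ this reduces to $\tau_c$, the LATE. With defiers present, one of the two weights is negative, so the ratio can fall outside the range spanned by $\tau_c$ and $\tau_d$ unless the two effects happen to be equal.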

Major

4. Overinterpreting LATE as ATE

Researchers frequently present IV estimates as if they apply to the full population. They do not. If only 15% of patients are compliers, your estimate reflects a narrow subgroup. Always discuss the likely characteristics of compliers and whether the LATE is policy-relevant.

6. Reporting checklist for IV studies

IV studies face higher scrutiny because the assumptions are strong and the most important ones are untestable. Here is what reviewers expect to see:

Instrument definition: what is Z, how is it measured, and why is it plausibly exogenous

DAG: directed acyclic graph showing assumed causal structure with Z, D, Y, and potential confounders

First-stage strength: F-statistic (≥ 10 as the conventional minimum; ≥ 104.7 for a standard t-test to keep its nominal 5% size with one instrument, per Lee et al., 2022)

Balance check: show that the instrument is not associated with measured confounders

Exclusion restriction defense: substantive argument for why Z cannot directly affect Y

Monotonicity argument: why defiers are implausible in this clinical setting

Estimand stated explicitly: LATE, and description of the likely complier subpopulation

Method: 2SLS, LIML, or other estimator with justification

Sensitivity analysis: what if the exclusion restriction is slightly violated? (Conley et al., 2012; van Kippersluis & Rietveld, 2018)

Comparison with OLS/PSM: show results side-by-side, discuss why they differ

Overidentification test (if multiple instruments): Hansen J test for instrument validity
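The sensitivity-analysis item in the checklist can be sketched in a few lines. The code below is loosely in the spirit of the "plausibly exogenous" idea: pick a range of plausible direct effects `gamma` of Z on Y, subtract `gamma * Z` from the outcome, re-estimate, and report the union of the resulting confidence intervals. It is a simplified sketch, not a reimplementation of any published procedure: it assumes one instrument, no covariates, homoskedastic errors, and a range of `gamma` values that you must justify yourself. The names `iv_fit` and `union_of_intervals` are hypothetical.

```python
import numpy as np

def iv_fit(z, d, y):
    """Single-instrument IV estimate and homoskedastic standard error."""
    zc, dc, yc = z - z.mean(), d - d.mean(), y - y.mean()
    beta = (zc @ yc) / (zc @ dc)
    resid = yc - beta * dc
    sigma2 = resid @ resid / (len(y) - 2)
    se = np.sqrt(sigma2 * (zc @ zc)) / abs(zc @ dc)
    return beta, se

def union_of_intervals(z, d, y, gammas, crit=1.96):
    """Union of 95% intervals over assumed direct effects gamma of Z on Y."""
    lows, highs = [], []
    for g in gammas:
        beta, se = iv_fit(z, d, y - g * z)   # remove the assumed direct effect
        lows.append(beta - crit * se)
        highs.append(beta + crit * se)
    return min(lows), max(highs)

# Simulated example: exclusion actually holds, true effect is 1.0 (assumed values)
rng = np.random.default_rng(0)
n = 50_000
z = rng.normal(size=n)
u = rng.normal(size=n)
d = 0.5 * z + u + rng.normal(size=n)
y = 1.0 * d + 2.0 * u + rng.normal(size=n)

beta, se = iv_fit(z, d, y)
low, high = union_of_intervals(z, d, y, np.linspace(-0.05, 0.05, 11))
print(f"standard 95% CI: ({beta - 1.96 * se:.3f}, {beta + 1.96 * se:.3f})")
print(f"union over gamma in [-0.05, 0.05]: ({low:.3f}, {high:.3f})")
```

The wider interval shows how much of your conclusion depends on the exclusion restriction holding exactly. If the sign of the effect survives across the plausible range of `gamma`, that is worth reporting.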

7. Mendelian randomization: genes as instruments

Mendelian randomization (MR) uses genetic variants as instruments. Because alleles are randomly allocated at conception (Mendel's second law), they plausibly satisfy the independence condition: they are distributed independently of most environmental confounders.

Strengths

• Independence is biologically grounded
• Large GWAS datasets readily available
• Two-sample MR avoids need for individual-level data
• Can test causal direction (bidirectional MR)

Threats

• Horizontal pleiotropy — gene affects Y through paths other than D
• Population stratification — ancestry confounds
• Dynastic effects — parental genotype affects offspring environment
• Weak instrument bias (rare variants, small effect sizes)
• Winner's curse — GWAS hits may overestimate instrument strength

MR has become one of the most active areas in causal inference. Methods like MR-Egger, weighted median, and MR-PRESSO help detect and correct for pleiotropy. But the fundamental tension remains: the more variants you include for power, the higher the risk that at least one violates the exclusion restriction.
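For readers new to two-sample MR, the sketch below shows the basic arithmetic behind two of the estimators, using per-variant summary statistics. It is a bare-bones illustration with made-up numbers: it returns point estimates only, ignores uncertainty in the variant-exposure associations, and the function names `ivw_estimate` and `mr_egger` are hypothetical. Dedicated MR software should be used for real analyses.

```python
import numpy as np

def ivw_estimate(beta_x, beta_y, se_y):
    """Inverse-variance weighted estimate: weighted regression of
    variant-outcome on variant-exposure associations through the origin."""
    beta_x, beta_y, se_y = (np.asarray(a, dtype=float) for a in (beta_x, beta_y, se_y))
    w = 1.0 / se_y**2
    return np.sum(w * beta_x * beta_y) / np.sum(w * beta_x**2)

def mr_egger(beta_x, beta_y, se_y):
    """MR-Egger point estimates: the same weighted regression with an intercept.

    A nonzero intercept suggests directional pleiotropy.
    Returns (intercept, slope).
    """
    beta_x, beta_y, se_y = (np.asarray(a, dtype=float) for a in (beta_x, beta_y, se_y))
    sign = np.sign(beta_x)                 # orient variants to positive exposure effects
    bx, by = beta_x * sign, beta_y * sign
    w = 1.0 / se_y**2
    X = np.column_stack([np.ones_like(bx), bx])
    Xw = X * w[:, None]                    # apply the weights row by row
    intercept, slope = np.linalg.solve(Xw.T @ X, Xw.T @ by)
    return intercept, slope

# Made-up summary statistics: causal slope 0.5 plus a constant pleiotropic shift of 0.02
beta_x = np.array([0.1, 0.2, 0.3, 0.4])
beta_y = 0.5 * beta_x + 0.02
se_y = np.full(4, 0.01)

print("IVW estimate:", ivw_estimate(beta_x, beta_y, se_y))
print("MR-Egger (intercept, slope):", mr_egger(beta_x, beta_y, se_y))
```

In this toy example every variant carries the same pleiotropic shift, so the IVW estimate absorbs it into the slope while the Egger intercept picks it up. Real pleiotropy is rarely that tidy, which is why the robust methods rest on their own assumptions.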

8. Automated IV critique with Aqrab

Aqrab checks your IV assumptions before a reviewer does. Paste your study protocol and get a structured critique — instrument strength assessment, exclusion restriction analysis, estimand alignment check, and specific recommendations for strengthening your analysis.
