Propensity Score Matching: A Practical Guide for Clinical Researchers
Propensity score matching (PSM) is one of the most widely used — and most frequently misapplied — methods in observational clinical research. This guide covers what PSM actually does, when it works, when it fails, and how to report it in a way that satisfies reviewers and holds up to methodological scrutiny.
1. What PSM actually does (and doesn't do)
A propensity score is the predicted probability of receiving treatment, given a set of observed covariates. Matching on this score creates a pseudo-population where treated and untreated groups are balanced on those observed covariates — mimicking what a randomized trial would have achieved.
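To make the definition concrete, here is a minimal sketch of estimating a propensity score on synthetic data. The covariates (`age`, `severity`) and coefficients are illustrative assumptions, not from any real study, and the hand-rolled gradient-descent fit stands in for what you would normally do with `statsmodels` or `sklearn`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic cohort with two baseline covariates (hypothetical names).
n = 1000
age = rng.normal(60, 10, n)
severity = rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), (age - 60) / 10, severity])

# Treatment assignment depends on the covariates (confounding by indication).
true_logit = -0.5 + 0.8 * X[:, 1] + 0.6 * X[:, 2]
treated = rng.random(n) < 1 / (1 + np.exp(-true_logit))

# Fit a logistic regression by plain gradient descent (stand-in for a
# library fit); beta converges to the treatment-assignment model.
beta = np.zeros(3)
for _ in range(5000):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += 0.1 * X.T @ (treated - p) / n

# The propensity score: each patient's predicted probability of treatment.
ps = 1 / (1 + np.exp(-X @ beta))
```

Matching then pairs treated and untreated patients with similar `ps` values, so the matched groups have similar covariate distributions.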
✓ What it does
- Balances observed covariates between treatment groups
- Reduces dimensionality of confounding adjustment to a single score
- Provides an intuitive "like-with-like" comparison
- Forces researchers to think about the treatment assignment mechanism
✗ What it does NOT do
- Control for unmeasured confounders — PSM cannot fix what you cannot measure
- Replace randomization — observational is still observational
- Guarantee causal inference — it reduces, but does not eliminate, bias
- Handle time-varying confounding — standard PSM is for baseline treatment only
The fundamental limitation is unmeasured confounding. If a variable influences both treatment assignment and the outcome but isn't in your data, PSM cannot help. This is why reporting should always include sensitivity analysis (e.g., E-value) to quantify how strong unmeasured confounding would need to be to explain away the result.
2. When to use PSM — and when not to
Good candidates for PSM
- Binary treatment with clear assignment mechanism
- Large sample (n > 500) with reasonable overlap
- Rich covariate data on confounders
- Goal: estimate ATT (average treatment effect on the treated)
- Target audience expects matched-cohort design
Consider alternatives when
- Treatment is continuous or multi-level
- Small sample (< 200 per group)
- Poor overlap in propensity scores
- Time-varying treatment exposure
- You need ATE, not ATT
- Rare outcomes requiring maximum power
A critical question most researchers skip: what is your estimand? PSM naturally estimates the ATT — the effect of treatment on those who were actually treated. If you need the ATE (the effect across the entire population), inverse probability of treatment weighting (IPTW) is generally more appropriate. Confusing these estimands is one of the most common errors reviewers catch.
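The estimand distinction maps directly onto how weights are constructed in a weighting analysis. This sketch shows the two standard IPTW formulas side by side (variable names are illustrative):

```python
import numpy as np

def iptw_weights(ps, treated, estimand="ATT"):
    """Inverse-probability weights for a chosen target estimand.

    ATE: w = T/e + (1-T)/(1-e)  -> reweight everyone to the full population
    ATT: w = T + (1-T)*e/(1-e)  -> reweight controls to resemble the treated
    where e is the propensity score and T the treatment indicator.
    """
    ps = np.asarray(ps, dtype=float)
    treated = np.asarray(treated, dtype=float)
    if estimand == "ATE":
        return treated / ps + (1 - treated) / (1 - ps)
    if estimand == "ATT":
        return treated + (1 - treated) * ps / (1 - ps)
    raise ValueError("estimand must be 'ATE' or 'ATT'")
```

Note that under the ATT weights every treated patient keeps weight 1 — the treated group *is* the target population — which is exactly why PSM (which keeps treated patients and finds look-alike controls) naturally targets the ATT.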
3. Five ways PSM fails in practice
1. Immortal time bias
If treatment exposure is defined in a way that requires patients to survive a certain period (e.g., "patients who received Drug X during hospitalization"), then the treated group is inherently biased toward survival. PSM cannot fix this — you need a landmark analysis or target trial emulation.
2. No balance verification
Running PSM without checking whether covariates are actually balanced after matching is disturbingly common. Always report standardized mean differences (SMDs). The conventional threshold is an absolute SMD below 0.1 for each covariate. A Love plot makes this immediately visual.
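The SMD itself is a one-line computation — the difference in means divided by the pooled standard deviation. A minimal sketch:

```python
import numpy as np

def smd(x_treated, x_control):
    """Standardized mean difference, pooled SD in the denominator."""
    m1, m0 = np.mean(x_treated), np.mean(x_control)
    v1, v0 = np.var(x_treated, ddof=1), np.var(x_control, ddof=1)
    return (m1 - m0) / np.sqrt((v1 + v0) / 2)

# Groups offset by one pooled SD -> SMD = 1.0, far above the 0.1 threshold.
x_t = np.array([2.0, 3.0, 4.0])
x_c = np.array([1.0, 2.0, 3.0])
```

Compute this for every covariate, before and after matching, and tabulate both columns — that pair of columns is the balance table reviewers expect (and the input to a Love plot).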
3. Conditioning on post-treatment variables
Including variables that are affected by treatment in the propensity score model introduces collider bias. Only variables measured before (or at the time of) treatment assignment should enter the model. A DAG helps clarify this.
4. Ignoring unmatched subjects
PSM discards unmatched patients. If you lose 40% of your sample to matching, your results may not generalize to the full population. Always report: how many subjects were matched, what the matched cohort looks like, and whether the unmatched differ meaningfully.
5. No sensitivity analysis
Without an E-value or similar sensitivity measure, you cannot quantify vulnerability to unmeasured confounding. Reviewers at top journals increasingly require this. The E-value tells readers: "an unmeasured confounder would need to have at least X association with both treatment and outcome to explain away our result."
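For a risk ratio, the E-value has a closed form (VanderWeele & Ding 2017): E = RR + sqrt(RR × (RR − 1)), with protective effects (RR < 1) inverted first. A minimal sketch:

```python
import math

def e_value(rr):
    """E-value for an observed risk ratio (VanderWeele & Ding 2017).

    For protective effects (RR < 1), the formula is applied to 1/RR.
    Apply the same function to the CI limit closer to the null to get
    the E-value for the confidence interval bound.
    """
    rr = 1 / rr if rr < 1 else rr
    return rr + math.sqrt(rr * (rr - 1))
```

For example, an observed RR of 2.0 gives an E-value of about 3.41: an unmeasured confounder would need risk-ratio associations of at least 3.41 with both treatment and outcome to fully explain away the result.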
4. What reviewers expect: the reporting checklist
Journals increasingly follow structured reporting guidelines for propensity score studies. Here's the minimum reviewers expect:
Propensity score model: covariates, model type (logistic regression vs. GBM), variable selection rationale
Matching algorithm: nearest-neighbor, caliper width (typically 0.2 × SD of logit PS), with/without replacement
Balance table: SMDs before and after matching for all covariates (Love plot preferred)
Sample retention: N matched / N total, characteristics of unmatched subjects
Estimand stated explicitly: ATT vs. ATE, and why
Outcome model: what was run on the matched cohort (conditional vs. marginal)
Variance estimation: account for matching-induced correlation (cluster-robust SE, bootstrap, or Abadie-Imbens)
Sensitivity analysis: E-value for primary finding + confidence interval bound
Subgroup / interaction: pre-specified, not data-dredged
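As a concrete illustration of the matching-algorithm items above, here is a minimal greedy 1:1 nearest-neighbor matcher, without replacement, using the conventional caliper of 0.2 × SD of the logit of the propensity score. It is a teaching sketch, not a replacement for an established implementation such as R's MatchIt:

```python
import numpy as np

def caliper_match(ps, treated, caliper_mult=0.2, seed=0):
    """Greedy 1:1 nearest-neighbor matching on logit(PS), no replacement.

    Returns (pairs, caliper), where pairs is a list of
    (treated_index, control_index) tuples and caliper is the
    absolute logit-PS distance allowed within a pair.
    """
    ps = np.asarray(ps, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    logit = np.log(ps / (1 - ps))
    caliper = caliper_mult * logit.std(ddof=1)

    t_idx = np.flatnonzero(treated)
    rng = np.random.default_rng(seed)
    rng.shuffle(t_idx)  # random processing order to avoid ordering artifacts

    available = set(np.flatnonzero(~treated).tolist())
    pairs = []
    for t in t_idx:
        if not available:
            break
        cands = np.array(sorted(available))
        d = np.abs(logit[cands] - logit[t])
        j = d.argmin()
        if d[j] <= caliper:               # discard treated units with no
            pairs.append((t, cands[j]))   # control inside the caliper
            available.remove(cands[j])
    return pairs, caliper
```

Treated patients with no control inside the caliper go unmatched — which is precisely why the sample-retention item on the checklist matters: report how many were dropped and how they differ from the matched cohort.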
5. When to use IPTW or other methods instead
The choice isn't "PSM vs. everything else." It's about which method best answers your specific research question given the data structure, sample size, and target estimand. A DAG should guide this decision — not convention or reviewer preference.
6. Getting automated critique with Aqrab
Aqrab is a methodology critique engine built by researchers from Mount Sinai, Johns Hopkins, and King Abdulaziz University. Paste your study protocol and get a structured critique — including bias detection, estimand alignment, and the specific fixes listed above — in under 60 seconds.
Try Aqrab free
3 free critiques. No credit card. See if Aqrab catches what you missed.
Further Reading
- Austin PC. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research. 2011;46(3):399-424.
- Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41-55.
- VanderWeele TJ, Ding P. Sensitivity analysis in observational research: introducing the E-value. Annals of Internal Medicine. 2017;167(4):268-274.
- Stuart EA. Matching methods for causal inference: a review and a look forward. Statistical Science. 2010;25(1):1-21.
- Hernán MA, Robins JM. Causal Inference: What If. Chapman & Hall/CRC. 2020.