Propensity Score Matching: A Practical Guide for Clinical Researchers
Propensity score matching (PSM) is one of the most widely used — and most frequently misapplied — methods in observational clinical research. This guide covers what PSM actually does, when it works, when it fails, and how to report it in a way that satisfies reviewers and holds up to methodological scrutiny.
1. What PSM actually does (and doesn't do)
A propensity score is the predicted probability of receiving treatment, given a set of observed covariates. Matching on this score creates a pseudo-population where treated and untreated groups are balanced on those observed covariates — mimicking what a randomized trial would have achieved.
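To make the definition concrete, here is a minimal sketch of estimating a propensity score on synthetic data. The covariates (`age`, `severity`) and coefficients are illustrative assumptions, not from any real study, and the hand-rolled gradient-descent fit stands in for what you would normally do with `statsmodels` or `sklearn`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic cohort with two baseline covariates (hypothetical names).
n = 1000
age = rng.normal(60, 10, n)
severity = rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), (age - 60) / 10, severity])

# Treatment assignment depends on the covariates (confounding by indication).
true_logit = -0.5 + 0.8 * X[:, 1] + 0.6 * X[:, 2]
treated = rng.random(n) < 1 / (1 + np.exp(-true_logit))

# Fit a logistic regression by plain gradient descent (stand-in for a
# library fit); beta converges to the treatment-assignment model.
beta = np.zeros(3)
for _ in range(5000):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += 0.1 * X.T @ (treated - p) / n

# The propensity score: each patient's predicted probability of treatment.
ps = 1 / (1 + np.exp(-X @ beta))
```

Matching then pairs treated and untreated patients with similar `ps` values, so the matched groups have similar covariate distributions.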
✓ What it does
- Balances observed covariates between treatment groups
- Reduces dimensionality of confounding adjustment to a single score
- Provides an intuitive "like-with-like" comparison
- Forces researchers to think about the treatment assignment mechanism
✗ What it does NOT do
- Control for unmeasured confounders — PSM cannot fix what you cannot measure
- Replace randomization — observational is still observational
- Guarantee causal inference — it reduces, but does not eliminate, bias
- Handle time-varying confounding — standard PSM is for baseline treatment only
The fundamental limitation is unmeasured confounding. If a variable influences both treatment assignment and the outcome but isn't in your data, PSM cannot help. This is why reporting should always include sensitivity analysis (e.g., E-value) to quantify how strong unmeasured confounding would need to be to explain away the result.
2. When to use PSM — and when not to
Good candidates for PSM
- Binary treatment with clear assignment mechanism
- Large sample (n > 500) with reasonable overlap
- Rich covariate data on confounders
- Goal: estimate ATT (average treatment effect on the treated)
- Target audience expects matched-cohort design
Consider alternatives when
- Treatment is continuous or multi-level
- Small sample (< 200 per group)
- Poor overlap in propensity scores
- Time-varying treatment exposure
- You need ATE, not ATT
- Rare outcomes requiring maximum power
A critical question most researchers skip: what is your estimand? PSM naturally estimates the ATT — the effect of treatment on those who were actually treated. If you need the ATE (the effect across the entire population), inverse probability of treatment weighting (IPTW) is generally more appropriate. Confusing these estimands is one of the most common errors reviewers catch.
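The estimand distinction maps directly onto how weights are constructed in a weighting analysis. This sketch shows the two standard IPTW formulas side by side (variable names are illustrative):

```python
import numpy as np

def iptw_weights(ps, treated, estimand="ATT"):
    """Inverse-probability weights for a chosen target estimand.

    ATE: w = T/e + (1-T)/(1-e)  -> reweight everyone to the full population
    ATT: w = T + (1-T)*e/(1-e)  -> reweight controls to resemble the treated
    where e is the propensity score and T the treatment indicator.
    """
    ps = np.asarray(ps, dtype=float)
    treated = np.asarray(treated, dtype=float)
    if estimand == "ATE":
        return treated / ps + (1 - treated) / (1 - ps)
    if estimand == "ATT":
        return treated + (1 - treated) * ps / (1 - ps)
    raise ValueError("estimand must be 'ATE' or 'ATT'")
```

Note that under the ATT weights every treated patient keeps weight 1 — the treated group *is* the target population — which is exactly why PSM (which keeps treated patients and finds look-alike controls) naturally targets the ATT.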
3. Five ways PSM fails in practice
1. Immortal time bias
If treatment exposure is defined in a way that requires patients to survive a certain period (e.g., "patients who received Drug X during hospitalization"), then the treated group is inherently biased toward survival. PSM cannot fix this — you need a landmark analysis or target trial emulation.
2. No balance verification
Running PSM without checking whether covariates are actually balanced after matching is disturbingly common. Always report standardized mean differences (SMDs). The conventional threshold is an absolute SMD below 0.1 for each covariate. A Love plot makes this immediately visual.
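The SMD itself is a one-line computation — the difference in means divided by the pooled standard deviation. A minimal sketch:

```python
import numpy as np

def smd(x_treated, x_control):
    """Standardized mean difference, pooled SD in the denominator."""
    m1, m0 = np.mean(x_treated), np.mean(x_control)
    v1, v0 = np.var(x_treated, ddof=1), np.var(x_control, ddof=1)
    return (m1 - m0) / np.sqrt((v1 + v0) / 2)

# Groups offset by one pooled SD -> SMD = 1.0, far above the 0.1 threshold.
x_t = np.array([2.0, 3.0, 4.0])
x_c = np.array([1.0, 2.0, 3.0])
```

Compute this for every covariate, before and after matching, and tabulate both columns — that pair of columns is the balance table reviewers expect (and the input to a Love plot).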
3. Conditioning on post-treatment variables
Including variables that are affected by treatment in the propensity score model introduces collider bias. Only variables measured before (or at the time of) treatment assignment should enter the model. A DAG helps clarify this.
4. Ignoring unmatched subjects
PSM discards unmatched patients. If you lose 40% of your sample to matching, your results may not generalize to the full population. Always report: how many subjects were matched, what the matched cohort looks like, and whether the unmatched differ meaningfully.
5. No sensitivity analysis
Without an E-value or similar sensitivity measure, you cannot quantify vulnerability to unmeasured confounding. Reviewers at top journals increasingly require this. The E-value tells readers: "an unmeasured confounder would need to have at least X association with both treatment and outcome to explain away our result."
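For a risk ratio, the E-value has a closed form (VanderWeele & Ding 2017): E = RR + sqrt(RR × (RR − 1)), with protective effects (RR < 1) inverted first. A minimal sketch:

```python
import math

def e_value(rr):
    """E-value for an observed risk ratio (VanderWeele & Ding 2017).

    For protective effects (RR < 1), the formula is applied to 1/RR.
    Apply the same function to the CI limit closer to the null to get
    the E-value for the confidence interval bound.
    """
    rr = 1 / rr if rr < 1 else rr
    return rr + math.sqrt(rr * (rr - 1))
```

For example, an observed RR of 2.0 gives an E-value of about 3.41: an unmeasured confounder would need risk-ratio associations of at least 3.41 with both treatment and outcome to fully explain away the result.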
4. What reviewers expect: the reporting checklist
Journals increasingly follow structured reporting guidelines for propensity score studies. Here's the minimum reviewers expect:
Propensity score model: covariates, model type (logistic regression vs. GBM), variable selection rationale
Matching algorithm: nearest-neighbor, caliper width (typically 0.2 × SD of logit PS), with/without replacement
Balance table: SMDs before and after matching for all covariates (Love plot preferred)
Sample retention: N matched / N total, characteristics of unmatched subjects
Estimand stated explicitly: ATT vs. ATE, and why
Outcome model: what was run on the matched cohort (conditional vs. marginal)
Variance estimation: account for matching-induced correlation (cluster-robust SE, bootstrap, or Abadie-Imbens)
Sensitivity analysis: E-value for primary finding + confidence interval bound
Subgroup / interaction: pre-specified, not data-dredged
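As a concrete illustration of the matching-algorithm items above, here is a minimal greedy 1:1 nearest-neighbor matcher, without replacement, using the conventional caliper of 0.2 × SD of the logit of the propensity score. It is a teaching sketch, not a replacement for an established implementation such as R's MatchIt:

```python
import numpy as np

def caliper_match(ps, treated, caliper_mult=0.2, seed=0):
    """Greedy 1:1 nearest-neighbor matching on logit(PS), no replacement.

    Returns (pairs, caliper), where pairs is a list of
    (treated_index, control_index) tuples and caliper is the
    absolute logit-PS distance allowed within a pair.
    """
    ps = np.asarray(ps, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    logit = np.log(ps / (1 - ps))
    caliper = caliper_mult * logit.std(ddof=1)

    t_idx = np.flatnonzero(treated)
    rng = np.random.default_rng(seed)
    rng.shuffle(t_idx)  # random processing order to avoid ordering artifacts

    available = set(np.flatnonzero(~treated).tolist())
    pairs = []
    for t in t_idx:
        if not available:
            break
        cands = np.array(sorted(available))
        d = np.abs(logit[cands] - logit[t])
        j = d.argmin()
        if d[j] <= caliper:               # discard treated units with no
            pairs.append((t, cands[j]))   # control inside the caliper
            available.remove(cands[j])
    return pairs, caliper
```

Treated patients with no control inside the caliper go unmatched — which is precisely why the sample-retention item on the checklist matters: report how many were dropped and how they differ from the matched cohort.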
5. When to use IPTW or other methods instead
The choice isn't "PSM vs. everything else." It's about which method best answers your specific research question given the data structure, sample size, and target estimand. A DAG should guide this decision — not convention or reviewer preference.
6. Getting automated critique with Aqrab
Aqrab is a methodology critique engine built by researchers from Mount Sinai, Johns Hopkins, and King Abdulaziz University. Paste your study protocol and get a structured critique — including bias detection, estimand alignment, and the specific fixes listed above — in under 60 seconds.
Try Aqrab free
3 free critiques. No credit card. See if Aqrab catches what you missed.
Further Reading
- Austin PC. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research. 2011;46(3):399-424.
- Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41-55.
- VanderWeele TJ, Ding P. Sensitivity analysis in observational research: introducing the E-value. Annals of Internal Medicine. 2017;167(4):268-274.
- Stuart EA. Matching methods for causal inference: a review and a look forward. Statistical Science. 2010;25(1):1-21.
- Hernán MA, Robins JM. Causal Inference: What If. Chapman & Hall/CRC. 2020.