Inverse Probability Weighting: When PSM Discards Your Data
Propensity score matching (PSM) has a dirty secret: it throws away patients. Every unmatched subject is deleted from the analysis. Inverse probability weighting (IPW) keeps everyone — and lets the math decide who matters. This guide explains why weighting is usually the better choice, and how to use it without getting burned by the assumptions you never see coming.
In this guide
- 1. Why PSM discards data (and why that matters)
- 2. How IPW creates a pseudo-population
- 3. The positivity assumption (and when it breaks)
- 4. Stabilized weights vs unstabilized IPTW
- 5. Marginal structural models: IPW for time-varying treatments
- 6. Five ways IPW analyses go wrong
- 7. IPW vs PSM vs G-computation: when to use each
- 8. Reporting checklist for IPW studies
- 9. Automated IPW critique with Aqrab
1. Why PSM discards data (and why that matters)
PSM matches each treated patient to one or more controls with similar propensity scores. The problem: patients without good matches are dropped. In studies with poor overlap — where treated and untreated populations are fundamentally different — PSM can discard 20–40% of your sample.
⚠️ The bias-variance tradeoff PSM ignores
PSM reduces bias by creating comparable groups — but at the cost of a smaller sample. You accept higher variance in exchange for lower bias, yet the lost information is rarely reported. IPW achieves the same bias reduction without discarding anyone.
Worse, the patients PSM drops are often the most informative: extreme cases with unique covariate combinations that would help estimate the treatment effect at the margins. By deleting them, PSM narrows your estimand without telling you.
2. How IPW creates a pseudo-population
Instead of matching and deleting, IPW keeps every patient and assigns a weight. Patients who are underrepresented in their treatment group get more weight. Patients who are overrepresented get less.
The weight for each patient is the inverse of their probability of receiving the treatment they actually received, given their covariates:

w_i = 1 / P(T = t_i | X_i)

where t_i is the treatment patient i actually received, and X_i are their covariates. A treated patient with a predicted probability of 0.2 gets a weight of 5: they are rare in the treated group, so they count more. A treated patient with probability 0.8 gets weight 1.25: they are typical, so they count about normal.
When you weight the entire sample by these inverse probabilities, the resulting pseudo-population has the covariate distribution of a randomized trial. Treatment is independent of X in the weighted sample.
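The weighting step can be sketched in a few lines. This is a minimal illustration on synthetic data with a logistic propensity model, not a template for a real analysis:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Synthetic cohort: one confounder X drives treatment assignment
X = rng.normal(size=(n, 1))
true_ps = 1 / (1 + np.exp(-X[:, 0]))
T = rng.binomial(1, true_ps)

# Fit a propensity model, then weight by 1 / P(T = t_i | X_i)
ps = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
w = np.where(T == 1, 1 / ps, 1 / (1 - ps))

# Pseudo-population check: each weighted arm re-creates roughly
# the full cohort of n patients
print(round(w[T == 1].sum()), round(w[T == 0].sum()), n)
```

The check at the end is the pseudo-population idea made concrete: each arm's weights sum to approximately the full sample size, so treatment is (approximately) independent of X in the weighted sample.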
3. The positivity assumption (and when it breaks)
IPW requires that every patient has a strictly positive probability of receiving each treatment. If some patients could only ever have received one treatment, the positivity (or experimental treatment assignment) assumption is violated.
⚠️ Near-violation is the real danger
Complete positivity violations are obvious. Near-violations are not. A patient with a predicted probability of 0.001 gets a weight of 1,000 — one patient now dominates your analysis and the variance explodes. This is the most common failure mode in published IPW studies.
Clinical examples of positivity violations:
- Neonates with birth weight under 500g never receive certain surgeries — treatment assignment is deterministic
- Patients with severe liver disease are contraindicated for specific drugs — the indication makes treatment certain
- Age-based guidelines create hard cutoffs — patients over 80 are systematically excluded from treatment
How to diagnose it: Plot the distribution of propensity scores stratified by treatment. Where the two distributions barely overlap, positivity is near-violated. Weight truncation (see below) is one solution, but it reintroduces some bias.
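The same diagnosis can be done numerically. The sketch below simulates a hypothetical age-based near-violation (the cutoff and coefficients are invented for illustration) and summarizes the fitted score range per arm and the share of extreme weights:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000

# Hypothetical guideline: treatment becomes vanishingly rare past age ~80
age = rng.uniform(20, 90, n)
p_treat = 1 / (1 + np.exp((age - 60) / 5))
T = rng.binomial(1, p_treat)

A = age.reshape(-1, 1)
ps = LogisticRegression().fit(A, T).predict_proba(A)[:, 1]
w = np.where(T == 1, 1 / ps, 1 / (1 - ps))

# Numeric overlap summary: score range per arm, plus extreme-weight share
for arm in (1, 0):
    print(f"arm {arm}: PS range [{ps[T == arm].min():.3f}, {ps[T == arm].max():.3f}]")
print("max weight:", round(w.max(), 1))
print("share of weights above 10:", round(float((w > 10).mean()), 4))
```

A handful of weights far above 10 is exactly the warning sign described above: a few patients starting to dominate the weighted analysis.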
4. Stabilized weights vs unstabilized IPTW
The weight formula above is the basic inverse probability of treatment weight (IPTW). It works, but it inflates variance unnecessarily. Stabilized weights fix this:

sw_i = P(T = t_i) / P(T = t_i | X_i)

The numerator is the marginal probability of receiving the treatment (marginal over all patients). The denominator is the conditional probability given covariates. The ratio centers the mean weight near 1.0 instead of allowing it to drift.
When to use stabilized vs unstabilized:
- Unstabilized IPTW: Estimates the marginal treatment effect (ATE). Appropriate when you want the population-average effect.
- Stabilized IPTW: Estimates the same marginal effect (ATE) with tighter confidence intervals. Prefer it by default.
- Inverse probability of censoring weights (IPCW): Use stabilized weights when adjusting for informative censoring or missing data.
The choice between stabilized and unstabilized does not change the point estimate much — but it dramatically affects standard errors. Always report which you used.
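The difference is easy to see numerically. A sketch on synthetic data, assuming a simple logistic propensity model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
T = rng.binomial(1, p)

ps = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
p_marg = T.mean()  # numerator: marginal P(T = 1)

w_unstab = np.where(T == 1, 1 / ps, 1 / (1 - ps))
w_stab = np.where(T == 1, p_marg / ps, (1 - p_marg) / (1 - ps))

# Unstabilized weights average near 2; stabilized weights average near 1,
# and the largest stabilized weight is strictly smaller
print(round(w_unstab.mean(), 2), round(w_stab.mean(), 2))
print("max:", round(w_unstab.max(), 1), "vs", round(w_stab.max(), 1))
```

Because the stabilized weight is the unstabilized weight multiplied by a marginal probability (a number below 1), every stabilized weight is smaller, which is where the variance reduction comes from.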
5. Marginal structural models: IPW for time-varying treatments
Most IPW guides stop at single-timepoint treatment. But clinical research often involves time-varying treatments — a patient starts a drug, switches doses, stops, restarts. Traditional regression fails here because of time-dependent confounding: the treatment at time t affects future confounders, which also affect future treatment.
Marginal structural models (MSMs) extend IPW to this setting using inverse probability of treatment and censoring weights (IPTC weights). Each patient is weighted by the cumulative inverse probability of following their observed treatment and censoring history:

sw_i(t) = ∏_{k=0}^{t} P(T_k = t_{i,k} | treatment history up to k−1) / P(T_k = t_{i,k} | treatment history and covariate history up to k)
This is where IPW becomes truly powerful: MSMs are one of the few methods that can handle time-varying confounding without G-computation or parametric assumptions about the outcome model.
💡 Clinical example
Studying the effect of statin initiation on cardiovascular events over 10 years, where statin use is modified by annual cholesterol checks (time-varying confounders that also predict future treatment decisions). Standard Cox regression gives biased estimates. MSMs with stabilized weights correctly identify the causal effect of the treatment strategy.
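Once per-visit probabilities are in hand, the cumulative stabilized weight is just a running product. A minimal sketch with placeholder probabilities (in a real MSM analysis, p_num and p_den would come from pooled logistic regressions):

```python
import numpy as np

rng = np.random.default_rng(3)
n_subjects, n_times = 4, 5

# Stand-ins for fitted per-visit probabilities of the observed treatment:
# p_num from the numerator model (past treatment history only),
# p_den from the denominator model (past treatment + time-varying covariates)
p_num = rng.uniform(0.4, 0.9, size=(n_subjects, n_times))
p_den = rng.uniform(0.3, 0.95, size=(n_subjects, n_times))

# Cumulative stabilized weight: running product of numerator / denominator
sw = np.cumprod(p_num / p_den, axis=1)
print(np.round(sw[:, -1], 2))  # final weight per subject at end of follow-up
```

Note how each additional visit multiplies in another ratio: with long follow-up, these products can drift far from 1, which is why MSM reports should show the cumulative weight distribution over time.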
6. Five ways IPW analyses go wrong
Mistake 1: Truncating weights without reporting it
Weight truncation (capping at the 1st/99th percentile) reduces variance from extreme weights. But it introduces bias. If you truncate, report the cutoff and compare truncated vs untruncated estimates in a sensitivity analysis.
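A sketch of percentile truncation with the recommended side-by-side comparison (the weights and outcome here are simulated placeholders, not a real propensity fit):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
w = rng.lognormal(0.0, 1.0, n)   # placeholder for fitted IPTW weights
y = rng.normal(1.0, 1.0, n)      # placeholder outcome

# Cap at the 1st/99th percentiles, then report both estimates side by side
lo, hi = np.percentile(w, [1, 99])
w_trunc = np.clip(w, lo, hi)

est_raw = np.average(y, weights=w)
est_trunc = np.average(y, weights=w_trunc)
print("raw:", round(est_raw, 3), " truncated:", round(est_trunc, 3))
print("max weight:", round(w.max(), 1), "->", round(w_trunc.max(), 1))
```

Reporting both numbers is the sensitivity analysis: a large gap between the raw and truncated estimates means the result depends on a few extreme weights.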
Mistake 2: Ignoring model misspecification
IPW is only as good as the propensity score model. If your logistic regression misses nonlinear interactions between covariates, the weights are wrong and bias persists. Modern alternatives: boosted trees, super learner, or targeted maximum likelihood estimation (TMLE) for more robust propensity estimation.
Mistake 3: Not checking covariate balance
After weighting, you must verify that covariates are balanced between groups. Standardized mean differences (SMD) should be below 0.1 for all covariates. If not, your model needs revision. This is not optional — it is the entire point of IPW.
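A weighted SMD is straightforward to compute directly. This sketch uses the true weights on synthetic data to show balance before and after weighting:

```python
import numpy as np

def weighted_smd(x, t, w):
    """Standardized mean difference of covariate x between arms, under weights w."""
    m1 = np.average(x[t == 1], weights=w[t == 1])
    m0 = np.average(x[t == 0], weights=w[t == 0])
    v1 = np.average((x[t == 1] - m1) ** 2, weights=w[t == 1])
    v0 = np.average((x[t == 0] - m0) ** 2, weights=w[t == 0])
    return (m1 - m0) / np.sqrt((v1 + v0) / 2)

rng = np.random.default_rng(5)
n = 10_000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-x))                  # treatment depends on x -> imbalance
t = rng.binomial(1, p)
w = np.where(t == 1, 1 / p, 1 / (1 - p))  # true weights, for the demo

print(round(abs(weighted_smd(x, t, np.ones(n))), 2))  # unweighted: well above 0.1
print(round(abs(weighted_smd(x, t, w)), 2))           # weighted: close to zero
```

In practice you would run this for every covariate with the fitted (not true) weights; any SMD still above 0.1 sends you back to the propensity model.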
Mistake 4: Using the wrong variance estimator
Weighted observations violate the equal-variance assumptions behind default variance formulas, so the standard errors standard software reports are wrong. Use the sandwich (robust) variance estimator or the bootstrap. Stata users: svyset with pweights. R users: the sandwich package with vcovHC().
Mistake 5: Treating IPW as the final analysis
IPW is a method for creating balance, not an analysis in itself. The weighted analysis still needs a proper outcome model (linear, logistic, Cox, etc.). Researchers who weight and then run t-tests are doing two things wrong at once.
7. IPW vs PSM vs G-computation: when to use each
All three methods estimate the same thing — the average treatment effect — but they handle the assumptions differently:
PSM (Propensity Score Matching)
Pros: Intuitive, widely understood, produces matched pairs reviewers can see.
Cons: Discards unmatched patients, poor efficiency, no clear standard error.
Use when: Reviewers demand it, sample size is large, overlap is excellent.
IPW (Inverse Probability Weighting)
Pros: Keeps all patients, estimates ATE directly, extends to time-varying treatments.
Cons: Extreme weights inflate variance, requires positivity, needs robust variance estimation.
Use when: You want the ATE, you have reasonable overlap, you can check balance.
G-computation / Outcome Regression
Pros: No weight-related instability, natural confidence intervals, works with small samples.
Cons: Heavily model-dependent, biased if outcome model is misspecified.
Use when: You have a well-specified outcome model, small sample, or extreme weights.
Best practice: Run at least two methods. If IPW and g-computation give the same result, the effect is robust to modeling assumptions. If they diverge, the difference tells you which assumption is driving the result. Estimators that formally combine the two — such as augmented IPW (AIPW) or TMLE — are called doubly robust: they remain consistent if either the propensity model or the outcome model is correctly specified, and they are the gold standard in modern causal inference.
8. Reporting checklist for IPW studies
- 1. Propensity score model specification (covariates, functional form, interactions)
- 2. Distribution of propensity scores by treatment group (overlapping histograms)
- 3. Weight distribution: mean, median, range, proportion exceeding threshold
- 4. Weight truncation: yes/no, cutoff used, sensitivity analysis comparing truncated vs untruncated
- 5. Covariate balance table: SMD before and after weighting (target: all < 0.1)
- 6. Estimand clearly stated: ATE, ATT (also called ATET), or ATU
- 7. Variance estimation method: robust/sandwich, bootstrap, or survey-weighted
- 8. Sensitivity analysis for unmeasured confounding (E-value or Rosenbaum bounds)
- 9. If MSM: cumulative weight distribution, evidence of stability, follow-up duration
- 10. Software and package versions reported
9. Automated IPW critique with Aqrab
Aqrab automatically flags IPW-specific problems in published trials: missing balance tables, untruncated extreme weights, incorrect variance estimators, and unstated estimands. Paste any PMID and get a structured critique in under a minute.
Aqrab checks 16 methodology dimensions across 47 reporting standards (CONSORT, STROBE, PROBAST, PRISMA, SPIRIT, ARRIVE, MOOSE, STARD, TRIPOD, RECODE). Every critique scores study design, outcome measurement, confounding control, sample size, and clinical relevance.
Try Aqrab Free →

Published by Coefficients Health Analytics. Part of our causal inference guide series.