
Double Machine Learning: A Practical Guide for Clinical Researchers

April 1, 2026 · 15 min read · By Coefficients Health Analytics

Double Machine Learning (DML) is a causal inference method that lets you use machine learning for confounder adjustment without invalidating your inference. Developed by Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins, DML addresses a problem that workhorse methods like regression and propensity scores struggle with: what do you do when you have dozens (or hundreds) of covariates and no clear functional form? This guide covers what DML does, when to use it, how it fails, and how to report it.

1. What DML actually does

The core idea of DML is deceptively simple. You have a causal question — say, "does this drug reduce hospitalization risk?" — and a dataset with many potential confounders (demographics, comorbidities, lab values, medications, utilization history). Traditional regression requires you to correctly specify the functional form for every single confounder. Get one wrong and bias creeps in.

DML lets you throw machine learning models at the confounders — lasso, random forests, gradient boosting, neural networks, whatever works — to flexibly predict (a) the treatment from confounders and (b) the outcome from confounders. Then it uses the residuals from those predictions to estimate the causal effect of treatment on outcome. This "doubling" — two ML steps, one final regression — is where the name comes from.
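In the simplest case — the partially linear model from the Chernozhukov et al. paper — this residual-on-residual idea can be written out explicitly (standard notation, with W denoting confounders):

```latex
% Partially linear model: theta is the causal effect of interest
\begin{aligned}
Y &= \theta T + g(W) + \varepsilon, \qquad \mathbb{E}[\varepsilon \mid T, W] = 0 \\
T &= m(W) + v, \qquad \mathbb{E}[v \mid W] = 0
\end{aligned}

% Fit ML estimates \hat{\ell}(W) \approx \mathbb{E}[Y \mid W] and
% \hat{m}(W) \approx \mathbb{E}[T \mid W], then regress residual on residual:
\hat{\theta} \;=\; \frac{\sum_i \big(T_i - \hat{m}(W_i)\big)\big(Y_i - \hat{\ell}(W_i)\big)}{\sum_i \big(T_i - \hat{m}(W_i)\big)^2}
```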

✓ What it does

  • Automates confounder adjustment when you have many covariates and no clear functional form
  • Protects against regularization bias — using ML alone introduces bias; DML removes it
  • Provides valid confidence intervals even when the ML models are somewhat imperfect (they need only converge at modest rates, not be exactly right)
  • Estimates the average treatment effect (ATE) or conditional average treatment effect (CATE)
  • Works with both binary and continuous treatments

✗ What it does NOT do

  • It does NOT identify which variables are confounders — you must decide which covariates to include
  • It does NOT handle unmeasured confounding — if you miss a key confounder, bias remains
  • It does NOT replace a DAG — DML is an estimation method, not an identification strategy
  • It does NOT work well with very small samples (n < 200) where cross-fitting creates noisy estimates
  • It does NOT automatically select the right ML algorithm — you must validate model performance

The key theoretical property is Neyman orthogonality: errors in the ML models for the nuisance functions (the confounder models) have only a second-order effect on the final treatment effect estimate. This is the "double" protection — you only need to get the confounders approximately right, not perfectly right.

2. Cross-fitting: why it matters

You cannot use the same data to fit the ML models AND estimate the treatment effect. If you do, the ML models partially memorize the very observations used for the final estimate, and this overfitting contaminates the residuals and biases the effect estimate. DML solves this with cross-fitting.

Here's how it works. Split the data into K folds (typically K = 5). For each fold:

  1. Fit the ML models (treatment and outcome) on the other K-1 folds.
  2. Use those fitted models to predict treatment and outcome for the held-out fold.
  3. Compute residuals (actual minus predicted treatment, actual minus predicted outcome) on the held-out fold.
  4. Regress outcome residuals on treatment residuals to get the causal effect estimate.

Average the K estimates. Because each fold was predicted by a model trained on different data, there's no overfitting contamination. This is identical in spirit to cross-validation, but used for causal estimation rather than prediction accuracy.
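The steps above can be sketched in a few lines with scikit-learn on synthetic data. This is a minimal illustration, not a production implementation — it pools residuals across folds (sometimes called the "DML2" variant) rather than averaging per-fold estimates, and the data-generating process and model choices are invented for the demo:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 2000
W = rng.normal(size=(n, 5))                       # confounders
p = 1 / (1 + np.exp(-W[:, 0] + 0.5 * W[:, 1]))    # treatment depends on W
T = rng.binomial(1, p)
Y = 0.5 * T + np.sin(W[:, 0]) + W[:, 1] ** 2 + rng.normal(size=n)  # true effect = 0.5

t_res = np.empty(n)
y_res = np.empty(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(W):
    # Steps 1-2: fit nuisance models on the other folds, predict on the held-out fold
    m_t = RandomForestClassifier(random_state=0).fit(W[train], T[train])
    m_y = RandomForestRegressor(random_state=0).fit(W[train], Y[train])
    # Step 3: residualize treatment and outcome on the held-out fold
    t_res[test] = T[test] - m_t.predict_proba(W[test])[:, 1]
    y_res[test] = Y[test] - m_y.predict(W[test])

# Step 4: regress outcome residuals on treatment residuals
theta_hat = (t_res @ y_res) / (t_res @ t_res)
print(f"estimated effect: {theta_hat:.2f}")  # should land near the true 0.5
```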

⚡ Practical note

The number of folds matters less than you think — 5-fold and 10-fold give similar results. What matters is that you do cross-fitting. Using the same data for both ML fitting and effect estimation invalidates your confidence intervals.

3. When DML shines — and when it doesn't

Strong use cases

  • Large observational datasets (claims, EHR) with 20+ covariates
  • Complex confounder relationships (interactions, non-linearities, high-dimensional data like diagnosis codes)
  • When you don't trust your functional form assumptions in regression
  • Binary or continuous treatment exposure
  • Research question: average treatment effect (ATE)
  • Electronic health record data with thousands of variables

Consider alternatives when

  • Small sample (n < 200 per group) — DML variance explodes
  • You have a clear, simple DAG with few confounders — standard regression is fine
  • You need to handle time-varying confounding — use marginal structural models or g-estimation
  • No overlap between treatment groups — consider instrumental variables or redefining the study population
  • Your research question is about mechanisms/mediation — use mediation analysis
  • Reviewers in your field don't know DML — prepare for pushback

The most common mistake with DML is using it when a simpler method would suffice. If your confounders are 8 variables with clear linear relationships, running DML with gradient boosting is overkill — and harder to explain to reviewers. DML's value is proportional to the complexity and dimensionality of your confounding problem.

4. Clinical example: SGLT2 inhibitors and heart failure

You want to estimate the effect of SGLT2 inhibitors (e.g., dapagliflozin) on hospitalization for heart failure in a US claims database. Your dataset includes:

Treatment

  • SGLT2i prescription (binary: yes/no within first 90 days)

Outcome

  • Hospitalization for heart failure within 2 years

Confounders (15+)

  • Age, sex, race/ethnicity
  • Baseline HbA1c, eGFR, BMI
  • Comorbidity burden (diabetes type, HF severity, CKD stage)
  • Concomitant medications (ACE inhibitors, beta-blockers, diuretics)
  • Healthcare utilization (ER visits, hospitalizations in prior year)
  • Socioeconomic proxies (ZIP-level median income, insurance type)

Challenges

  • Non-linear confounding (HbA1c interacts with CKD)
  • High utilization predicts both treatment and outcome
  • Immortal time bias possible if treatment window is poorly defined

A linear regression with 15 covariates assumes all relationships are linear. A propensity score model with logistic regression assumes the right functional form for the treatment assignment mechanism. DML lets gradient boosting trees capture the non-linearities and interactions in both the treatment and outcome models, while still producing a valid, debiased estimate of the SGLT2i effect.

⚠ Critical step before DML

You still need a DAG. DML does not decide which variables to adjust for — that's a causal identification question. Your DAG says: these are confounders, these are mediators (exclude), these are colliders (don't adjust). Only then does DML handle the functional form.

5. Five ways DML fails in practice

Critical

1. Adjusting for mediators or colliders

DML is not a substitute for causal thinking. If you adjust for a mediator (e.g., post-treatment kidney function decline), you block part of the causal pathway and bias the effect estimate downward. If you adjust for a collider (e.g., being in a specialized clinic), you induce spurious associations. Always build your DAG first.

Critical

2. Using DML without cross-fitting

Some implementations skip cross-fitting for speed. This is fatal. Without cross-fitting, your confidence intervals are wrong — the ML overfitting contaminates the treatment effect estimate. Always verify your DML implementation uses cross-fitting (most do, but check).

Major

3. Using DML with tiny samples

DML with n = 80 and 5-fold cross-fitting leaves only 64 observations to train each nuisance model and ~16 in each held-out fold. The resulting estimates have enormous variance and the asymptotic normality that justifies the confidence intervals hasn't kicked in. DML works best with n > 500; below 200, consider simpler methods.

Major

4. Treating ML prediction performance as causal validity

A random forest with AUC 0.95 for treatment prediction does NOT mean your causal estimate is good. ML prediction accuracy is about fit; causal validity is about unbiasedness. You can have a terrible treatment model (AUC 0.55) and still get a valid DML estimate — the method is designed for this. Don't optimize for prediction metrics.

Major

5. Ignoring positivity violations

DML requires the positivity assumption — every patient must have a non-zero probability of receiving either treatment. In practice, if certain patient types (e.g., ICU patients with eGFR < 15) virtually never receive SGLT2i, the ML model cannot learn the treatment effect for them. Check overlap and consider trimming extreme propensity scores.
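A basic overlap check only takes a few lines. The sketch below uses out-of-fold propensity scores (in-sample scores from a flexible model look deceptively well-calibrated); the data and the 0.05/0.95 trimming thresholds are illustrative conventions, not a universal rule:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
n = 3000
W = rng.normal(size=(n, 4))                        # confounders (toy data)
T = rng.binomial(1, 1 / (1 + np.exp(-2 * W[:, 0])))  # strong selection on W[:, 0]

# Out-of-fold propensity scores to avoid overfitting flattery
ps = cross_val_predict(GradientBoostingClassifier(random_state=0), W, T,
                       cv=5, method="predict_proba")[:, 1]

print(f"propensity range: {ps.min():.3f} to {ps.max():.3f}")

# A common (but ad hoc) convention: trim observations with extreme scores
keep = (ps > 0.05) & (ps < 0.95)
print(f"kept {keep.sum()} of {n} after trimming")
```

In a real analysis you would also plot the propensity score distributions by treatment group, and report the range and the number trimmed, as the checklist in the next section recommends.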

6. What reviewers expect: the reporting checklist

DML is still relatively new in clinical research. Reviewers may not be familiar with it, which means your reporting needs to be exceptionally clear. Here's what to include:

1. DAG: specify the causal structure that justifies which covariates enter the confounder set (mediators, colliders, instruments excluded)
2. Confounder list: all variables included, with justification (measured pre-treatment, associated with both treatment and outcome)
3. ML algorithm(s): what you used for each nuisance model (outcome model, treatment model), with rationale
4. Cross-fitting specification: K-fold number, random vs. stratified splits, seed for reproducibility
5. Nuisance model diagnostics: report performance metrics for both ML models (RMSE for outcome, AUC/accuracy for treatment) — but explain these don't determine causal validity
6. Overlap check: histogram or density plot of propensity scores; report range and any trimming applied
7. Main estimate: point estimate, 95% CI, p-value, with units and clinical interpretation
8. Sensitivity to ML choice: compare at least two ML algorithms (e.g., lasso vs. gradient boosting) — if estimates agree, report it; if they diverge, investigate
9. Sensitivity to confounding: E-value for unmeasured confounding, or a formal sensitivity analysis (e.g., Oster's delta, Cinelli & Hazlett)
10. Sample size: total N, number of treated vs. untreated, and note if the sample is large enough for asymptotic properties to hold

One additional tip: include a transparency statement about whether your results are reproducible. Provide your code, ML hyperparameters, and random seeds. DML implementations vary across packages, and exact results depend on these choices.
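For checklist item 9, the E-value is simple enough to compute by hand. A minimal helper using the VanderWeele & Ding formula (the function name and structure here are our own):

```python
import math

def e_value(rr: float) -> float:
    """E-value for a risk ratio (VanderWeele & Ding, 2017):
    the minimum strength of association an unmeasured confounder
    would need with both treatment and outcome to explain away
    the observed RR. For protective effects (RR < 1), the formula
    is applied to 1/RR."""
    if rr < 1:
        rr = 1 / rr
    return rr + math.sqrt(rr * (rr - 1))

# Example: an observed risk ratio of 1.5
print(f"E-value: {e_value(1.5):.2f}")  # 1.5 + sqrt(1.5 * 0.5) ≈ 2.37
```

Report the E-value for both the point estimate and the CI limit closer to the null.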

7. Software and implementation

Several packages implement DML with cross-fitting. Here are the main options:

| Package | Language | Key features | Best for |
|---|---|---|---|
| EconML (Microsoft) | Python | DML, DRLearner, causal forests, IV | General-purpose, well-documented |
| CausalML (Uber) | Python | DML, meta-learners, uplift trees | Uplift modeling, industry use |
| DoubleML | R | DML for partially linear, interactive models | R users, simulation studies |
| causal-drf | R | Doubly robust estimation | Flexible nuisance models |
| statsmodels | Python | Basic OLS on residuals | Manual DML implementation |

EconML is the most widely cited in health services research. It provides a scikit-learn compatible API, so you can plug in any estimator as the nuisance model. The LinearDML and NonParamDML classes handle the cross-fitting automatically. For R users, the DoubleML package by Bach, Chernozhukov, and others is equally rigorous.

⚡ Quick implementation in Python (EconML)

from econml.dml import LinearDML
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

model = LinearDML(
    model_y=GradientBoostingRegressor(),
    model_t=GradientBoostingClassifier(),
    discrete_treatment=True,  # binary treatment; pairs with the classifier for model_t
    cv=5,                     # 5-fold cross-fitting
    random_state=42
)

model.fit(Y, T, X=X, W=W)
# Y = outcome, T = treatment
# X = heterogeneity variables (optional)
# W = confounders

ate = model.ate(X)
lb, ub = model.ate_interval(X, alpha=0.05)
print(f"ATE: {ate:.3f} (95% CI: {lb:.3f} to {ub:.3f})")

8. Getting automated critique with Aqrab

Aqrab is a methodology critique engine built by researchers from Mount Sinai, Johns Hopkins, and King Abdulaziz University. Paste your study protocol and get a structured critique — including bias detection, estimand alignment, and the specific fixes listed above — in under 60 seconds.


Further Reading

  • Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, Robins J. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal. 2018;21(1):C1-C68.
  • Hájek J, Błaszczyński J, Biecek P. Explainable AI for causal inference with DoubleML. arXiv preprint. 2023.
  • Belloni A, Chernozhukov V, Hansen C. High-dimensional methods and inference on structural and treatment effects. Journal of Economic Perspectives. 2014;28(2):29-50.
  • Farrell MH, Liang T, Misra S. Deep neural networks for estimation and inference. Econometrica. 2021;89(1):181-213.
  • Hernán MA, Robins JM. Causal Inference: What If. Chapman & Hall/CRC. 2020.