
Double Machine Learning: A Practical Guide for Clinical Researchers

April 1, 2026 · 15 min read · By Coefficients Health Analytics

Double Machine Learning (DML) is a causal inference method that lets you use machine learning for confounder adjustment without invalidating your inference. Developed by Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins, DML addresses a problem that workhorse methods like regression and propensity scores struggle with: what do you do when you have dozens (or hundreds) of covariates and no clear functional form? This guide covers what DML does, when to use it, how it fails, and how to report it.

1. What DML actually does

The core idea of DML is deceptively simple. You have a causal question — say, "does this drug reduce hospitalization risk?" — and a dataset with many potential confounders (demographics, comorbidities, lab values, medications, utilization history). Traditional regression requires you to correctly specify the functional form for every single confounder. Get one wrong and bias creeps in.

DML lets you throw machine learning models at the confounders — lasso, random forests, gradient boosting, neural networks, whatever works — to flexibly predict (a) the treatment from confounders and (b) the outcome from confounders. Then it uses the residuals from those predictions to estimate the causal effect of treatment on outcome. This "doubling" — two ML steps, one final regression — is where the name comes from.
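In the simplest case — the partially linear model from the Chernozhukov et al. paper — this residual-on-residual idea can be written out explicitly (standard notation, with W denoting confounders):

```latex
% Partially linear model: theta is the causal effect of interest
\begin{aligned}
Y &= \theta T + g(W) + \varepsilon, \qquad \mathbb{E}[\varepsilon \mid T, W] = 0 \\
T &= m(W) + v, \qquad \mathbb{E}[v \mid W] = 0
\end{aligned}

% Fit ML estimates \hat{\ell}(W) \approx \mathbb{E}[Y \mid W] and
% \hat{m}(W) \approx \mathbb{E}[T \mid W], then regress residual on residual:
\hat{\theta} \;=\; \frac{\sum_i \big(T_i - \hat{m}(W_i)\big)\big(Y_i - \hat{\ell}(W_i)\big)}{\sum_i \big(T_i - \hat{m}(W_i)\big)^2}
```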

✓ What it does

  • Automates confounder adjustment when you have many covariates and no clear functional form
  • Protects against regularization bias — using ML alone introduces bias; DML removes it
  • Provides valid confidence intervals even when the ML models are somewhat imperfect (they need only converge at modest rates, not be exactly right)
  • Estimates the average treatment effect (ATE) or conditional average treatment effect (CATE)
  • Works with both binary and continuous treatments

✗ What it does NOT do

  • It does NOT identify which variables are confounders — you must decide which covariates to include
  • It does NOT handle unmeasured confounding — if you miss a key confounder, bias remains
  • It does NOT replace a DAG — DML is an estimation method, not an identification strategy
  • It does NOT work well with very small samples (n < 200) where cross-fitting creates noisy estimates
  • It does NOT automatically select the right ML algorithm — you must validate model performance

The key theoretical property is Neyman orthogonality: errors in the ML models for the nuisance functions (the confounder models) have only a second-order effect on the final treatment effect estimate. This is the "double" protection — you only need to get the confounders approximately right, not perfectly right.

2. Cross-fitting: why it matters

You cannot use the same data to fit the ML models AND estimate the treatment effect. If you do, the ML models partially memorize the very observations used for the final estimate, and this overfitting contaminates the residuals and biases the effect estimate. DML solves this with cross-fitting.

Here's how it works. Split the data into K folds (typically K = 5). For each fold:

  1. Fit the ML models (treatment and outcome) on the other K-1 folds.
  2. Use those fitted models to predict treatment and outcome for the held-out fold.
  3. Compute residuals (actual minus predicted treatment, actual minus predicted outcome) on the held-out fold.
  4. Regress outcome residuals on treatment residuals to get the causal effect estimate.

Average the K estimates. Because each fold was predicted by a model trained on different data, there's no overfitting contamination. This is identical in spirit to cross-validation, but used for causal estimation rather than prediction accuracy.
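The steps above can be sketched in a few lines with scikit-learn on synthetic data. This is a minimal illustration, not a production implementation — it pools residuals across folds (sometimes called the "DML2" variant) rather than averaging per-fold estimates, and the data-generating process and model choices are invented for the demo:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 2000
W = rng.normal(size=(n, 5))                       # confounders
p = 1 / (1 + np.exp(-W[:, 0] + 0.5 * W[:, 1]))    # treatment depends on W
T = rng.binomial(1, p)
Y = 0.5 * T + np.sin(W[:, 0]) + W[:, 1] ** 2 + rng.normal(size=n)  # true effect = 0.5

t_res = np.empty(n)
y_res = np.empty(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(W):
    # Steps 1-2: fit nuisance models on the other folds, predict on the held-out fold
    m_t = RandomForestClassifier(random_state=0).fit(W[train], T[train])
    m_y = RandomForestRegressor(random_state=0).fit(W[train], Y[train])
    # Step 3: residualize treatment and outcome on the held-out fold
    t_res[test] = T[test] - m_t.predict_proba(W[test])[:, 1]
    y_res[test] = Y[test] - m_y.predict(W[test])

# Step 4: regress outcome residuals on treatment residuals
theta_hat = (t_res @ y_res) / (t_res @ t_res)
print(f"estimated effect: {theta_hat:.2f}")  # should land near the true 0.5
```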

⚡ Practical note

The number of folds matters less than you think — 5-fold and 10-fold give similar results. What matters is that you do cross-fitting. Using the same data for both ML fitting and effect estimation invalidates your confidence intervals.

3. When DML shines — and when it doesn't

Strong use cases

  • Large observational datasets (claims, EHR) with 20+ covariates
  • Complex confounder relationships (interactions, non-linearities, high-dimensional data like diagnosis codes)
  • When you don't trust your functional form assumptions in regression
  • Binary or continuous treatment exposure
  • Research question: average treatment effect (ATE)
  • Electronic health record data with thousands of variables

Consider alternatives when

  • Small sample (n < 200 per group) — DML variance explodes
  • You have a clear, simple DAG with few confounders — standard regression is fine
  • You need to handle time-varying confounding — use marginal structural models or g-estimation
  • No overlap between treatment groups — consider instrumental variables or redefining the study population
  • Your research question is about mechanisms/mediation — use mediation analysis
  • Reviewers in your field don't know DML — prepare for pushback

The most common mistake with DML is using it when a simpler method would suffice. If your confounders are 8 variables with clear linear relationships, running DML with gradient boosting is overkill — and harder to explain to reviewers. DML's value is proportional to the complexity and dimensionality of your confounding problem.

4. Clinical example: SGLT2 inhibitors and heart failure

You want to estimate the effect of SGLT2 inhibitors (e.g., dapagliflozin) on hospitalization for heart failure in a US claims database. Your dataset includes:

Treatment

  • SGLT2i prescription (binary: yes/no within first 90 days)

Outcome

  • Hospitalization for heart failure within 2 years

Confounders (15+)

  • Age, sex, race/ethnicity
  • Baseline HbA1c, eGFR, BMI
  • Comorbidity burden (diabetes type, HF severity, CKD stage)
  • Concomitant medications (ACE inhibitors, beta-blockers, diuretics)
  • Healthcare utilization (ER visits, hospitalizations in prior year)
  • Socioeconomic proxies (ZIP-level median income, insurance type)

Challenges

  • Non-linear confounding (HbA1c interacts with CKD)
  • High utilization predicts both treatment and outcome
  • Immortal time bias possible if treatment window is poorly defined

A linear regression with 15 covariates assumes all relationships are linear. A propensity score model with logistic regression assumes the right functional form for the treatment assignment mechanism. DML lets gradient boosting trees capture the non-linearities and interactions in both the treatment and outcome models, while still producing a valid, debiased estimate of the SGLT2i effect.

⚠ Critical step before DML

You still need a DAG. DML does not decide which variables to adjust for — that's a causal identification question. Your DAG says: these are confounders, these are mediators (exclude), these are colliders (don't adjust). Only then does DML handle the functional form.

5. Five ways DML fails in practice

Critical

1. Adjusting for mediators or colliders

DML is not a substitute for causal thinking. If you adjust for a mediator (e.g., post-treatment kidney function decline), you block part of the causal pathway and bias the effect estimate downward. If you adjust for a collider (e.g., being in a specialized clinic), you induce spurious associations. Always build your DAG first.

Critical

2. Using DML without cross-fitting

Some implementations skip cross-fitting for speed. This is fatal. Without cross-fitting, your confidence intervals are wrong — the ML overfitting contaminates the treatment effect estimate. Always verify your DML implementation uses cross-fitting (most do, but check).

Major

3. Using DML with tiny samples

DML with n = 80 and 5-fold cross-fitting leaves only 64 observations to train each nuisance model and ~16 in each held-out fold. The resulting estimates have enormous variance and the asymptotic normality that justifies the confidence intervals hasn't kicked in. DML works best with n > 500; below 200, consider simpler methods.

Major

4. Treating ML prediction performance as causal validity

A random forest with AUC 0.95 for treatment prediction does NOT mean your causal estimate is good. ML prediction accuracy is about fit; causal validity is about unbiasedness. You can have a terrible treatment model (AUC 0.55) and still get a valid DML estimate — the method is designed for this. Don't optimize for prediction metrics.

Major

5. Ignoring positivity violations

DML requires the positivity assumption — every patient must have a non-zero probability of receiving either treatment. In practice, if certain patient types (e.g., ICU patients with eGFR < 15) virtually never receive SGLT2i, the ML model cannot learn the treatment effect for them. Check overlap and consider trimming extreme propensity scores.
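A basic overlap check only takes a few lines. The sketch below uses out-of-fold propensity scores (in-sample scores from a flexible model look deceptively well-calibrated); the data and the 0.05/0.95 trimming thresholds are illustrative conventions, not a universal rule:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
n = 3000
W = rng.normal(size=(n, 4))                        # confounders (toy data)
T = rng.binomial(1, 1 / (1 + np.exp(-2 * W[:, 0])))  # strong selection on W[:, 0]

# Out-of-fold propensity scores to avoid overfitting flattery
ps = cross_val_predict(GradientBoostingClassifier(random_state=0), W, T,
                       cv=5, method="predict_proba")[:, 1]

print(f"propensity range: {ps.min():.3f} to {ps.max():.3f}")

# A common (but ad hoc) convention: trim observations with extreme scores
keep = (ps > 0.05) & (ps < 0.95)
print(f"kept {keep.sum()} of {n} after trimming")
```

In a real analysis you would also plot the propensity score distributions by treatment group, and report the range and the number trimmed, as the checklist in the next section recommends.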

6. What reviewers expect: the reporting checklist

DML is still relatively new in clinical research. Reviewers may not be familiar with it, which means your reporting needs to be exceptionally clear. Here's what to include:

1. DAG: specify the causal structure that justifies which covariates enter the confounder set (mediators, colliders, instruments excluded)
2. Confounder list: all variables included, with justification (measured pre-treatment, associated with both treatment and outcome)
3. ML algorithm(s): what you used for each nuisance model (outcome model, treatment model), with rationale
4. Cross-fitting specification: K-fold number, random vs. stratified splits, seed for reproducibility
5. Nuisance model diagnostics: report performance metrics for both ML models (RMSE for outcome, AUC/accuracy for treatment) — but explain these don't determine causal validity
6. Overlap check: histogram or density plot of propensity scores; report range and any trimming applied
7. Main estimate: point estimate, 95% CI, p-value, with units and clinical interpretation
8. Sensitivity to ML choice: compare at least two ML algorithms (e.g., lasso vs. gradient boosting) — if estimates agree, report it; if they diverge, investigate
9. Sensitivity to confounding: E-value for unmeasured confounding, or a formal sensitivity analysis (e.g., Oster's delta, Cinelli & Hazlett)
10. Sample size: total N, number of treated vs. untreated, and note if the sample is large enough for asymptotic properties to hold

One additional tip: include a transparency statement about whether your results are reproducible. Provide your code, ML hyperparameters, and random seeds. DML implementations vary across packages, and exact results depend on these choices.
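For checklist item 9, the E-value is simple enough to compute by hand. A minimal helper using the VanderWeele & Ding formula (the function name and structure here are our own):

```python
import math

def e_value(rr: float) -> float:
    """E-value for a risk ratio (VanderWeele & Ding, 2017):
    the minimum strength of association an unmeasured confounder
    would need with both treatment and outcome to explain away
    the observed RR. For protective effects (RR < 1), the formula
    is applied to 1/RR."""
    if rr < 1:
        rr = 1 / rr
    return rr + math.sqrt(rr * (rr - 1))

# Example: an observed risk ratio of 1.5
print(f"E-value: {e_value(1.5):.2f}")  # 1.5 + sqrt(1.5 * 0.5) ≈ 2.37
```

Report the E-value for both the point estimate and the CI limit closer to the null.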

7. Software and implementation

Several packages implement DML with cross-fitting. Here are the main options:

| Package | Language | Key features | Best for |
|---|---|---|---|
| EconML (Microsoft) | Python | DML, DRLearner, causal forests, IV | General-purpose, well-documented |
| CausalML (Uber) | Python | DML, meta-learners, uplift trees | Uplift modeling, industry use |
| DoubleML | R | DML for partially linear, interactive models | R users, simulation studies |
| causal-drf | R | Doubly robust estimation | Flexible nuisance models |
| statsmodels | Python | Basic OLS on residuals | Manual DML implementation |

EconML is the most widely cited in health services research. It provides a scikit-learn compatible API, so you can plug in any estimator as the nuisance model. The LinearDML and NonParamDML classes handle the cross-fitting automatically. For R users, the DoubleML package by Bach, Chernozhukov, and others is equally rigorous.

⚡ Quick implementation in Python (EconML)

from econml.dml import LinearDML
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

model = LinearDML(
    model_y=GradientBoostingRegressor(),
    model_t=GradientBoostingClassifier(),
    discrete_treatment=True,  # binary treatment; pairs with the classifier for model_t
    cv=5,                     # 5-fold cross-fitting
    random_state=42
)

model.fit(Y, T, X=X, W=W)
# Y = outcome, T = treatment
# X = heterogeneity variables (optional)
# W = confounders

ate = model.ate(X)
lb, ub = model.ate_interval(X, alpha=0.05)
print(f"ATE: {ate:.3f} (95% CI: {lb:.3f} to {ub:.3f})")

8. Getting automated critique with Aqrab

Aqrab is a methodology critique engine built by researchers from Mount Sinai, Johns Hopkins, and King Abdulaziz University. Paste your study protocol and get a structured critique — including bias detection, estimand alignment, and the specific fixes listed above — in under 60 seconds.


Further Reading

  • Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, Robins J. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal. 2018;21(1):C1-C68.
  • Hájek J, Błaszczyński J, Biecek P. Explainable AI for causal inference with DoubleML. arXiv preprint. 2023.
  • Belloni A, Chernozhukov V, Hansen C. High-dimensional methods and inference on structural and treatment effects. Journal of Economic Perspectives. 2014;28(2):29-50.
  • Farrell MH, Liang T, Misra S. Deep neural networks for estimation and inference. Econometrica. 2021;89(1):181-213.
  • Hernán MA, Robins JM. Causal Inference: What If. Chapman & Hall/CRC. 2020.