Skip to contents

Why This Matters

When covariates are regressed out on the full dataset before cross-validation, held-out test information leaks into the training pipeline. This can inflate or distort out-of-sample performance.

In cpmr, leakage-safe behavior means:

  • fit covariate regression on each fold’s training split only;
  • apply the learned regression parameters to both train and test data inside the same fold;
  • keep other parameter-learning steps (for example scaling or model selection) inside CV as well.

Minimal Example with Synthetic Data

library(cpmr)

set.seed(123)
n <- 80
p <- 120

# Synthetic connectivity matrix (rows = subjects, cols = edges)
conmat <- matrix(rnorm(n * p), nrow = n, ncol = p)

# A synthetic covariate (for example age-like nuisance variable)
covariates <- matrix(rnorm(n), ncol = 1)

# Build behavior with both signal and covariate effects
edge_signal <- rowMeans(conmat[, 1:10, drop = FALSE])
behav <- 0.7 * edge_signal + 0.6 * covariates[, 1] + rnorm(n, sd = 0.5)

# Leakage-safe call: covariates are handled fold-wise inside CV
fit <- cpm(
  conmat = conmat,
  behav = behav,
  covariates = covariates,
  kfolds = 5
)

summary(fit)
#> CPM summary:
#>   Performance (Pearson):
#>     Positive: 0.279
#>     Negative: -0.104
#>     Combined: 0.259
#>   Prop. edges (50% folds):
#>     Positive: 1.67%
#>     Negative: 0.00%

Migration from confounds to covariates

covariates is now the primary argument name.

# Preferred
fit_new <- cpm(conmat, behav, covariates = covariates, kfolds = 5)

The old name confounds is still accepted as a deprecated alias for backward compatibility.

# Backward-compatible alias (deprecated)
fit_old <- cpm(conmat, behav, confounds = covariates, kfolds = 5)
#> Warning: `confounds` is deprecated; please use `covariates` instead.

Do not pass both in the same call.

cpm(conmat, behav, covariates = covariates, confounds = covariates)
#> Error in `resolve_covariates()`:
#> ! Please provide only one of `covariates` or `confounds`.

Anti-Leakage Checklist

Use this checklist when building CPM workflows:

  • Split train/test folds before any parameter-learning step.
  • Fit covariate regression on the training split only.
  • Apply training-fitted covariate regression to the fold’s test split.
  • Keep scaling and feature selection inside each fold.
  • If comparing multiple settings, use nested CV for model selection.
  • Report the CV design and covariate strategy explicitly.

References