Skip to contents

Efficient computations for linear and generalized linear models

The most straightforward way to implement cross-validation in R for statistical modeling functions that are written in the canonical manner is to use update() to refit the model with each fold removed. This is the approach taken in the default method for cv(), and it is appropriate if the cases are independently sampled. Refitting the model in this manner for each fold is generally feasible when the number of folds in modest, but can be prohibitively costly for leave-one-out cross-validation when the number of cases is large.

The "lm" and "glm" methods for cv() take advantage of computational efficiencies by avoiding refitting the model with each fold removed. Consider, in particular, the weighted linear model 𝐲nΓ—1=𝐗nΓ—p𝛃pΓ—1+𝛆nΓ—1\mathbf{y}_{n \times 1} = \mathbf{X}_{n \times p}\boldsymbol{\beta}_{p \times 1} + \boldsymbol{\varepsilon}_{n \times 1}, where π›†βˆΌπn(𝟎,Οƒ2𝐖nΓ—nβˆ’1)\boldsymbol{\varepsilon} \sim \mathbf{N}_n \left(\mathbf{0}, \sigma^2 \mathbf{W}^{-1}_{n \times n}\right). Here, 𝐲\mathbf{y} is the response vector, 𝐗\mathbf{X} the model matrix, and 𝛆\boldsymbol{\varepsilon} the error vector, each for nn cases, and 𝛃\boldsymbol{\beta} is the vector of pp population regression coefficients. The errors are assumed to be multivariately normally distributed with 0 means and covariance matrix Οƒ2π–βˆ’1\sigma^2 \mathbf{W}^{-1}, where 𝐖=diag(wi)\mathbf{W} = \mathrm{diag}(w_i) is a diagonal matrix of inverse-variance weights. For the linear model with constant error variance, the weight matrix is taken to be 𝐖=𝐈n\mathbf{W} = \mathbf{I}_n, the order-nn identity matrix.

The weighted-least-squares (WLS) estimator of 𝛃\boldsymbol{\beta} is (see, e.g., Fox, 2016, sec. 12.2.2) 1 𝐛WLS=(𝐗T𝐖𝐗)βˆ’1𝐗T𝐖𝐲 \mathbf{b}_{\mathrm{WLS}} = \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{W} \mathbf{y}

Fitted values are then 𝐲̂=𝐗𝐛WLS\widehat{\mathbf{y}} = \mathbf{X}\mathbf{b}_{\mathrm{WLS}}.

The LOO fitted value for the iith case can be efficiently computed by yΜ‚βˆ’i=yiβˆ’ei/(1βˆ’hi)\widehat{y}_{-i} = y_i - e_i/(1 - h_i) where hi=𝐱iT(𝐗T𝐖𝐗)βˆ’1𝐱ih_i = \mathbf{x}^T_i \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{x}_i (the so-called β€œhatvalue”). Here, 𝐱iT\mathbf{x}^T_i is the iith row of 𝐗\mathbf{X}, and 𝐱i\mathbf{x}_i is the iith row written as a column vector. This approach can break down when one or more hatvalues are equal to 1, in which case the formula for yΜ‚βˆ’i\widehat{y}_{-i} requires division by 0.

To compute cross-validated fitted values when the folds contain more than one case, we make use of the Woodbury matrix identity (Hager, 1989), (𝐀mΓ—m+𝐔mΓ—k𝐂kΓ—k𝐕kΓ—m)βˆ’1=π€βˆ’1βˆ’π€βˆ’1𝐔(π‚βˆ’1+π•π€βˆ’1𝐔)βˆ’1π•π€βˆ’1 \left(\mathbf{A}_{m \times m} + \mathbf{U}_{m \times k} \mathbf{C}_{k \times k} \mathbf{V}_{k \times m} \right)^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{U} \left(\mathbf{C}^{-1} + \mathbf{VA}^{-1}\mathbf{U} \right)^{-1} \mathbf{VA}^{-1} where 𝐀\mathbf{A} is a nonsingular order-nn matrix. We apply this result by letting 𝐀=𝐗T𝐖𝐗𝐔=𝐗𝐣T𝐕=βˆ’π—π£π‚=𝐖𝐣\begin{align*} \mathbf{A} &= \mathbf{X}^T \mathbf{W} \mathbf{X} \\ \mathbf{U} &= \mathbf{X}_\mathbf{j}^T \\ \mathbf{V} &= - \mathbf{X}_\mathbf{j} \\ \mathbf{C} &= \mathbf{W}_\mathbf{j} \\ \end{align*} where the subscript 𝐣=(ij1,…,ijm)T\mathbf{j} = (i_{j1}, \ldots, i_{jm})^T represents the vector of indices for the cases in the jjth fold, j=1,…,kj = 1, \ldots, k. The negative sign in 𝐕=βˆ’π—π£\mathbf{V} = - \mathbf{X}_\mathbf{j} reflects the removal, rather than addition, of the cases in 𝐣\mathbf{j}.

Applying the Woodbury identity isn’t quite as fast as using the hatvalues, but it is generally much faster than refitting the model. A disadvantage of the Woodbury identity, however, is that it entails explicit matrix inversion and thus may be numerically unstable. The inverse of 𝐀=𝐗T𝐖𝐗\mathbf{A} = \mathbf{X}^T \mathbf{W} \mathbf{X} is available directly in the "lm" object, but the second term on the right-hand side of the Woodbury identity requires a matrix inversion with each fold deleted. (In contrast, the inverse of each 𝐂=𝐖𝐣\mathbf{C} = \mathbf{W}_\mathbf{j} is straightforward because 𝐖\mathbf{W} is diagonal.)

The Woodbury identity also requires that the model matrix be of full rank. We impose that restriction in our code by removing redundant regressors from the model matrix for all of the cases, but that doesn’t preclude rank deficiency from surfacing when a fold is removed. Rank deficiency of 𝐗\mathbf{X} doesn’t disqualify cross-validation because all we need are fitted values under the estimated model.

glm() computes the maximum-likelihood estimates for a generalized linear model by iterated weighted least squares (see, e.g., Fox & Weisberg, 2019, sec. 6.12). The last iteration is therefore just a WLS fit of the β€œworking response” on the model matrix using β€œworking weights.” Both the working weights and the working response at convergence are available from the information in the object returned by glm().

We then treat re-estimation of the model with a case or cases deleted as a WLS problem, using the hatvalues or the Woodbury matrix identity. The resulting fitted values for the deleted fold aren’t exactβ€”that is, except for the Gaussian family, the result isn’t identical to what we would obtain by literally refitting the modelβ€”but in our (limited) experience, the approximation is very good, especially for LOO CV, which is when we would be most tempted to use it. Nevertheless, because these results are approximate, the default for the "glm" cv() method is to perform the exact computation, which entails refitting the model with each fold omitted.

Let’s compare the efficiency of the various computational methods for linear and generalized linear models. Consider, for example, leave-one-out cross-validation for the quadratic regression of mpg on horsepower in the Auto data, from the introductory β€œCross-validating regression models” vignette, repeated here:

data("Auto", package="ISLR2")
m.auto <- lm(mpg ~ poly(horsepower, 2), data = Auto)
summary(m.auto)
#> 
#> Call:
#> lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -14.714  -2.594  -0.086   2.287  15.896 
#> 
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)            23.446      0.221   106.1   <2e-16 ***
#> poly(horsepower, 2)1 -120.138      4.374   -27.5   <2e-16 ***
#> poly(horsepower, 2)2   44.090      4.374    10.1   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.37 on 389 degrees of freedom
#> Multiple R-squared:  0.688,  Adjusted R-squared:  0.686 
#> F-statistic:  428 on 2 and 389 DF,  p-value: <2e-16

library("cv")
#> Loading required package: doParallel
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: parallel
cv(m.auto, k = "loo")  # default method = "hatvalues"
#> n-Fold Cross Validation
#> method: hatvalues
#> criterion: mse
#> cross-validation criterion = 19.248
cv(m.auto, k = "loo", method = "naive")
#> n-Fold Cross Validation
#> method: naive
#> criterion: mse
#> cross-validation criterion = 19.248
#> bias-adjusted cross-validation criterion = 19.248
#> full-sample criterion = 18.985
cv(m.auto, k = "loo", method = "Woodbury")
#> n-Fold Cross Validation
#> method: Woodbury
#> criterion: mse
#> cross-validation criterion = 19.248
#> bias-adjusted cross-validation criterion = 19.248
#> full-sample criterion = 18.985

This is a small regression problem and all three computational approaches are essentially instantaneous, but it is still of interest to investigate their relative speed. In this comparison, we include the cv.glm() function from the boot package (Canty & Ripley, 2022; Davison & Hinkley, 1997), which takes the naive approach, and for which we have to fit the linear model as an equivalent Gaussian GLM. We use the microbenchmark() function from the package of the same name for the timings (Mersmann, 2023):

m.auto.glm <- glm(mpg ~ poly(horsepower, 2), data = Auto)
boot::cv.glm(Auto, m.auto.glm)$delta
#> [1] 19.248 19.248

microbenchmark::microbenchmark(
  hatvalues = cv(m.auto, k = "loo"),
  Woodbury = cv(m.auto, k = "loo", method = "Woodbury"),
  naive = cv(m.auto, k = "loo", method = "naive"),
  cv.glm = boot::cv.glm(Auto, m.auto.glm),
  times = 10
)
#> Warning in microbenchmark::microbenchmark(hatvalues = cv(m.auto, k = "loo"), :
#> less accurate nanosecond times to avoid potential integer overflows
#> Unit: milliseconds
#>       expr      min      lq     mean  median       uq      max neval
#>  hatvalues   1.1035   1.198   1.3907   1.415   1.4806   1.6819    10
#>   Woodbury  13.6232  13.968  16.9573  15.566  20.5877  22.7541    10
#>      naive 260.7600 282.213 304.0214 294.541 320.3055 386.6713    10
#>     cv.glm 448.4287 462.121 531.9857 532.362 567.2140 674.7202    10

On our computer, using the hatvalues is about an order of magnitude faster than employing Woodbury matrix updates, and more than two orders of magnitude faster than refitting the model.2

Similarly, let’s return to the logistic-regression model fit to Mroz’s data on women’s labor-force participation, also employed as an example in the introductory vignette:

data("Mroz", package="carData")
m.mroz <- glm(lfp ~ ., data = Mroz, family = binomial)
summary(m.mroz)
#> 
#> Call:
#> glm(formula = lfp ~ ., family = binomial, data = Mroz)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  3.18214    0.64438    4.94  7.9e-07 ***
#> k5          -1.46291    0.19700   -7.43  1.1e-13 ***
#> k618        -0.06457    0.06800   -0.95  0.34234    
#> age         -0.06287    0.01278   -4.92  8.7e-07 ***
#> wcyes        0.80727    0.22998    3.51  0.00045 ***
#> hcyes        0.11173    0.20604    0.54  0.58762    
#> lwg          0.60469    0.15082    4.01  6.1e-05 ***
#> inc         -0.03445    0.00821   -4.20  2.7e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1029.75  on 752  degrees of freedom
#> Residual deviance:  905.27  on 745  degrees of freedom
#> AIC: 921.3
#> 
#> Number of Fisher Scoring iterations: 4

cv(m.mroz, # default method = "exact"
   k = "loo", 
   criterion = BayesRule)
#> n-Fold Cross Validation
#> method: exact
#> criterion: BayesRule
#> cross-validation criterion = 0.32005
#> bias-adjusted cross-validation criterion = 0.3183
#> 95% CI for bias-adjusted CV criterion = (0.28496, 0.35164)
#> full-sample criterion = 0.30677
cv(m.mroz,
   k = "loo",
   criterion = BayesRule,
   method = "Woodbury")
#> n-Fold Cross Validation
#> method: Woodbury
#> criterion: BayesRule
#> cross-validation criterion = 0.32005
#> bias-adjusted cross-validation criterion = 0.3183
#> 95% CI for bias-adjusted CV criterion = (0.28496, 0.35164)
#> full-sample criterion = 0.30677
cv(m.mroz,
   k = "loo",
   criterion = BayesRule,
   method = "hatvalues")
#> n-Fold Cross Validation
#> method: hatvalues
#> criterion: BayesRule
#> cross-validation criterion = 0.32005

As for linear models, we report some timings for the various cv() methods of computation in LOO CV as well as for the cv.glm() function from the boot package (which, recall, refits the model with each case removed, and thus is comparable to cv() with method="exact"):

microbenchmark::microbenchmark(
  hatvalues = cv(
    m.mroz,
    k = "loo",
    criterion = BayesRule,
    method = "hatvalues"
  ),
  Woodbury = cv(
    m.mroz,
    k = "loo",
    criterion = BayesRule,
    method = "Woodbury"
  ),
  exact = cv(m.mroz, k = "loo", criterion = BayesRule),
  cv.glm = boot::cv.glm(Mroz, m.mroz,
                        cost = BayesRule),
  times = 10
)
#> Unit: milliseconds
#>       expr       min        lq      mean    median        uq       max neval
#>  hatvalues    1.3072    1.5904    1.9441    1.7067    1.8755    3.2658    10
#>   Woodbury   44.2169   46.5797   57.2528   50.5824   53.8776  126.2914    10
#>      exact 2054.0891 2286.9873 2362.5855 2415.9008 2459.2381 2643.5164    10
#>     cv.glm 2537.1694 2629.1091 2737.9928 2705.9460 2799.6988 3105.5054    10

There is a substantial time penalty associated with exact computations.

Computation of the bias-corrected CV criterion and confidence intervals

Let CV(𝐲,𝐲̂)\mathrm{CV}(\mathbf{y}, \widehat{\mathbf{y}}) represent a cross-validation cost criterion, such as mean-squared error, computed for all of the nn values of the response 𝐲\mathbf{y} based on fitted values 𝐲̂\widehat{\mathbf{y}} from the model fit to all of the data. We require that CV(𝐲,𝐲̂)\mathrm{CV}(\mathbf{y}, \widehat{\mathbf{y}}) is the mean of casewise components, that is, CV(𝐲,𝐲̂)=1nβˆ‘i=1ncv(yi,yΜ‚i)\mathrm{CV}(\mathbf{y}, \widehat{\mathbf{y}}) = \frac{1}{n}\sum_{i=1}^n\mathrm{cv}(y_i, \widehat{y}_i).3 For example, MSE(𝐲,𝐲̂)=1nβˆ‘i=1n(yiβˆ’yΜ‚i)2\mathrm{MSE}(\mathbf{y}, \widehat{\mathbf{y}}) = \frac{1}{n}\sum_{i=1}^n (y_i - \widehat{y}_i)^2.

We divide the nn cases into kk folds of approximately njβ‰ˆn/kn_j \approx n/k cases each, where n=βˆ‘njn = \sum n_j. As above, let 𝐣\mathbf{j} denote the indices of the cases in the jjth fold.

Now define CVj=CV(𝐲,𝐲̂(j))\mathrm{CV}_j = \mathrm{CV}(\mathbf{y}, \widehat{\mathbf{y}}^{(j)}). The superscript (j)(j) on 𝐲̂(j)\widehat{\mathbf{y}}^{(j)} represents fitted values computed for all of the cases from the model with fold jj omitted. Let 𝐲̂(βˆ’i)\widehat{\mathbf{y}}^{(-i)} represent the vector of fitted values for all nn cases where the fitted value for the iith case is computed from the model fit with the fold including the iith case omitted (i.e., fold jj for which i∈𝐣i \in \mathbf{j}).

Then the cross-validation criterion is just CV=CV(𝐲,𝐲̂(βˆ’i))\mathrm{CV} = \mathrm{CV}(\mathbf{y}, \widehat{\mathbf{y}}^{(-i)}). Following Davison & Hinkley (1997, pp. 293–295), the bias-adjusted cross-validation criterion is CVadj=CV+CV(𝐲,𝐲̂)βˆ’1nβˆ‘j=1knjCVj \mathrm{CV}_{\mathrm{adj}} = \mathrm{CV} + \mathrm{CV}(\mathbf{y}, \widehat{\mathbf{y}}) - \frac{1}{n} \sum_{j=1}^{k} n_j \mathrm{CV}_j

We compute the standard error of CV as SE(CV)=1nβˆ‘i=1n[cv(yi,yΜ‚i(βˆ’i))βˆ’CV]2nβˆ’1 \mathrm{SE}(\mathrm{CV}) = \frac{1}{\sqrt n} \sqrt{ \frac{\sum_{i=1}^n \left[ \mathrm{cv}(y_i, \widehat{y}_i^{(-i)} ) - \mathrm{CV} \right]^2 }{n - 1} } that is, as the standard deviation of the casewise components of CV divided by the square-root of the number of cases.

We then use SE(CV)\mathrm{SE}(\mathrm{CV}) to construct a 100Γ—(1βˆ’Ξ±)100 \times (1 - \alpha)% confidence interval around the adjusted CV estimate of error: [CVadjβˆ’z1βˆ’Ξ±/2SE(CV),CVadj+z1βˆ’Ξ±/2SE(CV)] \left[ \mathrm{CV}_{\mathrm{adj}} - z_{1 - \alpha/2}\mathrm{SE}(\mathrm{CV}), \mathrm{CV}_{\mathrm{adj}} + z_{1 - \alpha/2}\mathrm{SE}(\mathrm{CV}) \right] where z1βˆ’Ξ±/2z_{1 - \alpha/2} is the 1βˆ’Ξ±/21 - \alpha/2 quantile of the standard-normal distribution (e.g, zβ‰ˆ1.96z \approx 1.96 for a 95% confidence interval, for which 1βˆ’Ξ±/2=.9751 - \alpha/2 = .975).

Bates, Hastie, & Tibshirani (2023) show that the coverage of this confidence interval is poor for small samples, and they suggest a much more computationally intensive procedure, called nested cross-validation, to compute better estimates of error and confidence intervals with better coverage for small samples. We may implement Bates et al.’s approach in a later release of the cv package. At present we use the confidence interval above for sufficiently large nn, which, based on Bates et al.’s results, we take by default to be nβ‰₯400n \ge 400.

Why the complement of AUC isn’t a casewise CV criterion

Consider calculating AUC for folds in which a validation set contains nvn_v observations. To calculate AUC in the validation set, we need the vector of prediction criteria, 𝐲̂v(nvΓ—1)=(yΜ‚1,...,yΜ‚nv)T\widehat{\mathbf{y}}_{v_{(n_v \times 1)}} = (\widehat{y}_1, ..., \widehat{y}_{n_v})^T, and the vector of observed responses in the validation set, 𝐲v(nvΓ—1)=(y1,…,ynv)T\mathbf{y}_{v_{(n_v \times 1)}} = (y_1, \ldots, y_{n_v})^T with yi∈{0,1},i=1,…,nvy_i \in \{0,1\}, \; i = 1, \ldots, n_v.

To construct the ROC curve, only the ordering of the values in 𝐲̂v\mathbf{\widehat{y}}_v is relevant. Thus, assuming that there are no ties, and reordering observations if necessary, we can set 𝐲̂v=(1,2,…,nv)T\mathbf{\widehat{y}}_v = (1, 2, \ldots, n_v)^T.

If the AUC can be expressed as the casewise mean or sum of a function cv(yΜ‚i,yi)\mathrm{cv}(\widehat{y}_i,y_i), where cv:{1,2,...,nv}Γ—{0,1}β†’[0,1]\mathrm{cv}: \{1,2,...,n_v\}\times\{0,1\} \rightarrow [0,1], then βˆ‘i=1nvcv(yΜ‚i,yi)=AUC(𝐲̂v,𝐲v)\begin{equation} \label{eq:cw} \tag{1} \sum_{i=1}^{n_v} \mathrm{cv}(\widehat{y}_i,y_i) = \mathrm{AUC}(\mathbf{\widehat{y}}_v,\mathbf{y}_v) \end{equation} must hold for all 2nv2^{n_v} possible values of 𝐲v=(y1,...,ynv)T\mathbf{y}_v = (y_1,...,y_{n_v})^T. If all ysy\mathrm{s} have the same value, either 1 or 0, then the definition of AUC is ambiguous. AUC could be considered undefined, or it could be set to 0 if all yys are 0 and to 1 if all yys are 1. If AUC is considered to be undefined in these cases, we have 2nvβˆ’22^{n_v} - 2 admissible values for 𝐲v\mathbf{y}_v.

Thus, equation () produces either 2nv2^{n_v} or 2nvβˆ’22^{n_v}-2 constraints. Although there are only 2nv2n_v possible values for the cv(β‹…)\mathrm{cv(\cdot)} function, equation () could, nevertheless, have consistent solutions. We therefore need to determine whether there is a value of nvn_v for which () has no consistent solution for all admissible values of 𝐲v\mathbf{y}_v. In that eventuality, we will have shown that AUC cannot, in general, be expressed through a casewise sum.

If nv=3n_v=3, we show below that () has no consistent solution if we include all possibilities for 𝐲v\mathbf{y}_v, but does if we exclude cases where all yys have the same value. If nv=4n_v=4, we show that there are no consistent solutions in either case.

The following R function computes AUC from 𝐲̂v\mathbf{\widehat{y}}_v and 𝐲v\mathbf{y}_v, accommodating the cases where 𝐲v\mathbf{y}_v is all 0s or all 1s:

AUC <- function(y, yhat = seq_along(y)) {
  s <- sum(y)
  if (s == 0)
    return(0)
  if (s == length(y))
    return(1)
  Metrics::auc(y, yhat)
}

We then define a function to generate all possible 𝐲v\mathbf{y}_vs of length nvn_v as rows of the matrix 𝐘(2nvΓ—nv)\mathbf{Y}_{(2^{n_v} \times n_v)}:

Ymat <- function(n_v, exclude_identical = FALSE) {
  stopifnot(n_v > 0 &&
              round(n_v) == n_v)    # n_v must be a positive integer
  ret <- sapply(0:(2 ^ n_v - 1),
                function(x)
                  as.integer(intToBits(x)))[1:n_v,]
  ret <- if (is.matrix(ret))
    t(ret)
  else
    matrix(ret)
  colnames(ret) <- paste0("y", 1:ncol(ret))
  if (exclude_identical)
    ret[-c(1, nrow(ret)),]
  else
    ret
}

For nv=3n_v=3,

Ymat(3)
#>      y1 y2 y3
#> [1,]  0  0  0
#> [2,]  1  0  0
#> [3,]  0  1  0
#> [4,]  1  1  0
#> [5,]  0  0  1
#> [6,]  1  0  1
#> [7,]  0  1  1
#> [8,]  1  1  1

If we exclude 𝐲v\mathbf{y}_vs with identical values, then

Ymat(3, exclude_identical = TRUE)
#>      y1 y2 y3
#> [1,]  1  0  0
#> [2,]  0  1  0
#> [3,]  1  1  0
#> [4,]  0  0  1
#> [5,]  1  0  1
#> [6,]  0  1  1

Here is 𝐘\mathbf{Y} with corresponding values of AUC:

cbind(Ymat(3), AUC = apply(Ymat(3), 1, AUC))
#>      y1 y2 y3 AUC
#> [1,]  0  0  0 0.0
#> [2,]  1  0  0 0.0
#> [3,]  0  1  0 0.5
#> [4,]  1  1  0 0.0
#> [5,]  0  0  1 1.0
#> [6,]  1  0  1 0.5
#> [7,]  0  1  1 1.0
#> [8,]  1  1  1 1.0

The values of cv(yΜ‚i,yi)\mathrm{cv}(\widehat{y}_i, y_i) that express AUC as a sum of casewise values are solutions of equation (), which can be written as solutions of the following system of 2nv2^{n_v} linear simultaneous equations in 2nv2n_v unknowns: (π”βˆ’π˜)𝐜0+𝐘𝐜1=[π”βˆ’π˜,𝐘][𝐜0𝐜1]=AUC(π˜Μ‚,𝐘)\begin{equation} \label{eq:lin} \tag{2} (\mathbf{U} -\mathbf{Y}) \mathbf{c}_0 + \mathbf{Y} \mathbf{c}_1 = [\mathbf{U} -\mathbf{Y}, \mathbf{Y}] \begin{bmatrix} \mathbf{c}_0 \\ \mathbf{c}_1 \end{bmatrix} = \mathrm{AUC}(\mathbf{\widehat{Y}},\mathbf{Y}) \end{equation} where 𝐔(2nvΓ—nv)\mathbf{U}_{(2^{n_v} \times n_v)} is a matrix of 1s conformable with 𝐘\mathbf{Y}; 𝐜0=[cv(1,0),c(2,0),...,cv(nv,0)]T\mathbf{c}_0 = [\mathrm{cv}(1,0), c(2,0), ..., \mathrm{cv}(n_v,0)]^T; 𝐜1=[cv(1,1),c(2,1),...,cv(nv,1)]T\mathbf{c}_1 = [\mathrm{cv}(1,1), c(2,1), ..., \mathrm{cv}(n_v,1)]^T; [π”βˆ’π˜,𝐘](2nvΓ—2nv)[\mathbf{U} -\mathbf{Y}, \mathbf{Y}]_{(2^{n_v} \times 2n_v)} and [𝐜0𝐜1](2nvΓ—1)\begin{bmatrix}\begin{aligned} \mathbf{c}_0 \\ \mathbf{c}_1 \end{aligned} \end{bmatrix}_{(2n_v \times 1)} are partitioned matrices; and π˜Μ‚(2nvΓ—nv)\mathbf{\widehat{Y}}_{(2^{n_v} \times n_v)} is a matrix each of whose rows consists of the integers 1 to nvn_v.

We can test whether equation () has a solution for any given nvn_v by trying to solve it as a least-squares problem, considering whether the residuals of the associated linear model are all 0, using the β€œdesign matrix” [π”βˆ’π˜,𝐘][\mathbf{U} -\mathbf{Y}, \mathbf{Y}] to predict the β€œoutcome” AUC(π˜Μ‚,𝐘)(2nvΓ—1)\mathrm{AUC}(\mathbf{\widehat{Y}},\mathbf{Y})_{(2^{n_v} \times 1)}:

resids <- function(n_v,
                   exclude_identical = FALSE,
                   tol = sqrt(.Machine$double.eps)) {
  Y <- Ymat(n_v, exclude_identical = exclude_identical)
  AUC <- apply(Y, 1, AUC)
  X <- cbind(1 - Y, Y)
  opts <- options(warn = -1)
  on.exit(options(opts))
  fit <- lsfit(X, AUC, intercept = FALSE)
  ret <- max(abs(residuals(fit)))
  if (ret < tol) {
    ret <- 0
    solution <- coef(fit)
    names(solution) <- paste0("c(", c(1:n_v, 1:n_v), ",",
                              rep(0:1, each = n_v), ")")
    attr(ret, "solution") <- zapsmall(solution)
  }
  ret
}

The case nv=3n_v=3, excluding identical yys, has a solution:

resids(3, exclude_identical = TRUE)
#> [1] 0
#> attr(,"solution")
#> c(1,0) c(2,0) c(3,0) c(1,1) c(2,1) c(3,1) 
#>    1.0    0.0   -0.5    0.5    0.0    0.0

But, if identical yys are included, the equation is not consistent:

resids(3, exclude_identical = FALSE)
#> [1] 0.125

For nv=4n_v=4, there are no solutions in either case:

resids(4, exclude_identical = TRUE)
#> [1] 0.083333
resids(4, exclude_identical = FALSE)
#> [1] 0.25

Consequently, the widely employed AUC measure of fit for binary regression cannot in general be used for a casewise cross-validation criterion.

References

Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40–79. Retrieved from https://doi.org/10.1214/09-SS054
Bates, S., Hastie, T., & Tibshirani, R. (2023). Cross-validation: What does it estimate and how well does it do it? Journal of the American Statistical Association, in press. Retrieved from https://doi.org/10.1080/01621459.2023.2197686
Canty, A., & Ripley, B. D. (2022). Boot: Bootstrap R (S-plus) functions.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their applications. Cambridge: Cambridge University Press.
Fox, J. (2016). Applied regression analysis and generalized linear models (Second edition). Thousand Oaks CA: Sage.
Fox, J., & Weisberg, S. (2019). An R companion to applied regression (Third edition). Thousand Oaks CA: Sage.
Hager, W. W. (1989). Updating the inverse of a matrix. SIAM Review, 31(2), 221–239.
Mersmann, O. (2023). Microbenchmark: Accurate timing functions. Retrieved from https://CRAN.R-project.org/package=microbenchmark