The cv() "function" method is a general function to cross-validate a model-selection procedure, such as the following: selectStepAIC() applies the stepAIC() model-selection function in the MASS package; selectTrans() selects predictor and response transformations in regression, using the powerTransform() function in the car package; selectTransStepAIC() combines predictor and response transformations with predictor selection; and selectModelList() uses cross-validation to select a model from a list of models created by models(), employing (recursive) cross-validation to assess the predictive accuracy of this procedure.

Usage

# S3 method for `function`
cv(
  model,
  data,
  criterion = mse,
  k = 10L,
  reps = 1L,
  seed = NULL,
  working.model = NULL,
  y.expression = NULL,
  confint = n >= 400L,
  level = 0.95,
  details = k <= 10L,
  save.model = FALSE,
  ncores = 1L,
  ...
)

selectStepAIC(
  data,
  indices,
  model,
  criterion = mse,
  AIC = TRUE,
  details = TRUE,
  save.model = FALSE,
  ...
)

selectTrans(
  data,
  indices,
  details = TRUE,
  save.model = FALSE,
  model,
  criterion = mse,
  predictors,
  response,
  family = c("bcPower", "bcnPower", "yjPower", "basicPower"),
  family.y = c("bcPower", "bcnPower", "yjPower", "basicPower"),
  rounded = TRUE,
  ...
)

selectTransStepAIC(
  data,
  indices,
  details = TRUE,
  save.model = FALSE,
  model,
  criterion = mse,
  predictors,
  response,
  family = c("bcPower", "bcnPower", "yjPower", "basicPower"),
  family.y = c("bcPower", "bcnPower", "yjPower", "basicPower"),
  rounded = TRUE,
  AIC = TRUE,
  ...
)

selectModelList(
  data,
  indices,
  model,
  criterion = mse,
  k = 10L,
  k.recurse = k,
  details = k <= 10L,
  save.model = FALSE,
  seed = FALSE,
  quietly = TRUE,
  ...
)

compareFolds(object, digits = 3, ...)

# S3 method for cvSelect
coef(object, average, NAs = 0, ...)

Arguments

model

a regression model object fit to data, or for the cv() "function" method, a model-selection procedure function (see Details).

data

full data frame for model selection.

criterion

a CV criterion ("cost" or lack-of-fit) function.

k

perform k-fold cross-validation (default is 10); k may be a number or "loo" or "n" for n-fold (leave-one-out) cross-validation.

reps

number of times to replicate k-fold CV (default is 1).

seed

for R's random number generator; not used for n-fold cross-validation. If not explicitly set, a seed is randomly generated and saved to make the results reproducible. In some cases, for internal use only, seed is set to FALSE to suppress automatically setting the seed.

working.model

a regression model object fit to data, typically to begin a model-selection process; for use with selectModelList(), a list of competing models created by models().

y.expression

normally the response variable is found from the model or working.model argument; but if, for a particular selection procedure, the model or working.model argument is absent, or if the response can't be inferred from the model, the response can be specified by an expression, such as expression(log(income)), to be evaluated within the data set provided by the data argument.

confint

if TRUE (the default if the number of cases is 400 or greater), compute a confidence interval for the bias-corrected CV criterion, if the criterion is the average of casewise components.

level

confidence level (default 0.95).

details

if TRUE, save detailed information about the value of the CV criterion for the cases in each fold and the regression coefficients (and possibly other information) with that fold deleted; default is TRUE if k is 10 or smaller, FALSE otherwise.

save.model

save the model that's selected using the full data set (default, FALSE).

ncores

number of cores to use for parallel computations (default is 1, i.e., computations aren't done in parallel).

...

for cvSelect() and the cv() "function" method, arguments to be passed to procedure(); for selectStepAIC() and selectTransStepAIC(), arguments to be passed to stepAIC().

indices

indices of cases in data defining the current fold.

AIC

if TRUE (the default) use the AIC as the model-selection criterion; if FALSE, use the BIC. The k argument to stepAIC() is set accordingly (note that this is distinct from the number of folds k).

predictors

character vector of names of the predictors in the model to transform; if missing, no predictors will be transformed.

response

name of the response variable; if missing, the response won't be transformed.

family

transformation family for the predictors, one of "bcPower", "bcnPower", "yjPower", "basicPower", with "bcPower" as the default. These are the names of transformation functions in the car package; see bcPower().

family.y

transformation family for the response, with "bcPower" as the default.

rounded

if TRUE (the default) use nicely rounded versions of the estimated transformation parameters (see bcPower()).

k.recurse

the number of folds for recursive CV; defaults to the value of k; may be specified as "loo" or "n" as well as an integer.

quietly

if TRUE (the default), simple messages (for example about the value to which the random-number generator seed is set), but not warnings or errors, are suppressed.

object

an object of class "cvSelect".

digits

significant digits for printing coefficients (default 3).

average

if supplied, a function, such as mean or median, to use in averaging estimates across folds; if missing, the estimates for each fold are returned.

NAs

values to substitute for NAs in calculating averaged estimates; the default, 0, is appropriate, e.g., for regression coefficients; the value 1 might be appropriate for power-transformation estimates.
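A note on the AIC argument above: a sketch (hypothetical illustration, not package code) of how the penalty multiplier k passed to stepAIC() distinguishes AIC-based from BIC-based selection, assuming the MASS package is available:

```r
library(MASS)

# With AIC=FALSE, the selection penalty corresponds to the BIC:
# stepAIC()'s k is set to log(n) rather than the AIC's k = 2.
n <- nrow(mtcars)
m <- lm(mpg ~ ., data = mtcars)

m.aic <- stepAIC(m, k = 2, trace = FALSE)      # AIC-based selection
m.bic <- stepAIC(m, k = log(n), trace = FALSE) # BIC-based selection
```

Because log(n) > 2 for n > 7, the BIC penalizes each parameter more heavily and so tends to select smaller models.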

Value

An object of class "cvSelect", inheriting from class "cv", with the CV criterion ("CV crit"), the bias-adjusted CV criterion ("adj CV crit"), the criterion for the model applied to the full data ("full crit"), the confidence interval and level for the bias-adjusted CV criterion ("confint"), the number of folds ("k"), the seed for R's random-number generator ("seed"), and (optionally) a list of coefficients (or, in the case of selectTrans(), estimated transformation parameters, and in the case of selectTransStepAIC(), both regression coefficients and transformation parameters) for the selected models for each fold ("coefficients"). If reps > 1, then an object of class c("cvSelectList", "cvList") is returned, which is literally a list of c("cvSelect", "cv") objects.

Details

The model-selection function supplied as the procedure (for cvSelect()) or model (for cv()) argument should accept the following arguments:

data

set to the data argument to cvSelect() or cv().

indices

the indices of the rows of data defining the current fold; if missing, the model-selection procedure is applied to the full data.

other arguments

to be passed via ... from cvSelect() or cv().

procedure() or model() should return a list with the following named elements: fit.i, the vector of predicted values for the cases in the current fold computed from the model omitting these cases; crit.all.i, the CV criterion computed for all of the cases using the model omitting the current fold; and (optionally) coefficients, parameter estimates from the model computed omitting the current fold.

When the indices argument is missing, procedure() returns the cross-validation criterion for all of the cases based on the model fit to all of the cases.
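This interface can be illustrated with a trivial procedure (a sketch, not part of the package) that performs no actual selection and simply refits the working model; the helper selectIdentity() and its local mse() are hypothetical names introduced here for illustration (the package itself supplies an mse() criterion):

```r
# Casewise mean-squared error; defined locally so the sketch is self-contained.
mse <- function(y, yhat) mean((y - yhat)^2)

selectIdentity <- function(data, indices, model, criterion = mse, ...) {
  # response values, recovered from the model formula
  y <- model.response(model.frame(formula(model), data = data))
  if (missing(indices)) {
    # no fold supplied: return the criterion for the model fit to all cases
    fit <- update(model, data = data)
    return(criterion(y, predict(fit, newdata = data)))
  }
  fit.i <- update(model, data = data[-indices, , drop = FALSE]) # omit the fold
  pred  <- predict(fit.i, newdata = data)                       # predict all cases
  list(
    fit.i        = pred[indices],      # predictions for the held-out cases
    crit.all.i   = criterion(y, pred), # criterion for all cases, fold-omitted fit
    coefficients = coef(fit.i)         # estimates with the current fold deleted
  )
}

# usage sketch: cv(selectIdentity, data=Duncan, working.model=m1, seed=123)
```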

For examples of model-selection functions for the procedure argument, see the code for selectStepAIC(), selectTrans(), and selectTransStepAIC().

For additional information, see the "Cross-validating model selection" vignette (vignette("cv-select", package="cv")) and the "Extending the cv package" vignette (vignette("cv-extend", package="cv")).

Functions

  • cv(`function`): cv() method for applying a model-selection (or specification) procedure.

  • selectStepAIC(): select a regression model using the stepAIC() function in the MASS package.

  • selectTrans(): select transformations of the predictors and response using powerTransform() in the car package.

  • selectTransStepAIC(): select transformations of the predictors and response, and then select predictors.

  • selectModelList(): select a model using (recursive) CV.

  • compareFolds(): print the coefficients from the selected models for the several folds.

  • coef(cvSelect): extract the coefficients from the selected models for the several folds and possibly average them.

Examples

data("Auto", package="ISLR2")
m.auto <- lm(mpg ~ . - name - origin, data=Auto)
cv(selectStepAIC, Auto, seed=123, working.model=m.auto)
#> R RNG seed set to 123
#> 10-Fold Cross Validation
#> criterion: mse
#> cross-validation criterion = 11.87795
#> bias-adjusted cross-validation criterion = 11.8753
#> full-sample criterion = 11.65549 
cv(selectStepAIC, Auto, seed=123, working.model=m.auto,
         AIC=FALSE, k=5, reps=3) # via BIC
#> R RNG seed set to 123
#> R RNG seed set to 319729
#> R RNG seed set to 889422
#> 
#> Replicate 1:
#> 5-Fold Cross Validation
#> criterion: mse
#> cross-validation criterion = 11.91891
#> bias-adjusted cross-validation criterion = 11.8895
#> full-sample criterion = 11.65549 
#> 
#> Replicate 2:
#> 5-Fold Cross Validation
#> criterion: mse
#> cross-validation criterion = 11.80601
#> bias-adjusted cross-validation criterion = 11.78946
#> full-sample criterion = 11.65549 
#> 
#> Replicate 3:
#> 5-Fold Cross Validation
#> criterion: mse
#> cross-validation criterion = 11.81343
#> bias-adjusted cross-validation criterion = 11.79576
#> full-sample criterion = 11.65549 
#> 
#> Average:
#> 5-Fold Cross Validation
#> criterion: mse
#> cross-validation criterion = 11.86432 (0.06311657)
#> bias-adjusted cross-validation criterion = 11.84105 (0.05600268)
#> full-sample criterion = 11.65549 
data("Prestige", package="carData")
m.pres <- lm(prestige ~ income + education + women,
             data=Prestige)
cvt <- cv(selectTrans, data=Prestige, working.model=m.pres, seed=123,
          predictors=c("income", "education", "women"),
          response="prestige", family="yjPower")
#> R RNG seed set to 123
cvt
#> 10-Fold Cross Validation
#> criterion: mse
#> cross-validation criterion = 58.68193
#> bias-adjusted cross-validation criterion = 58.26258
#> full-sample criterion = 50.60016 
compareFolds(cvt)
#> CV criterion by folds:
#>    fold.1    fold.2    fold.3    fold.4    fold.5    fold.6    fold.7    fold.8 
#> 100.01805  27.79099  66.84054  60.97286  30.75663  17.98406 110.27446  66.35537 
#>    fold.9   fold.10 
#>  37.32475  67.45709 
#> 
#> Coefficients by folds:
#>         lam.education lam.income lam.women lambda
#> Fold 1          1.000      0.330     0.330      1
#> Fold 2          1.000      0.330     0.330      1
#> Fold 3          1.000      0.330     0.000      1
#> Fold 4          1.000      0.330     0.330      1
#> Fold 5          1.000      0.330     0.000      1
#> Fold 6          1.000      0.330     0.000      1
#> Fold 7          1.000      0.000     0.000      1
#> Fold 8          1.000      0.330     0.000      1
#> Fold 9          1.000      0.500     0.330      1
#> Fold 10         1.000      0.330     0.159      1
coef(cvt, average=median, NAs=1) # NAs not really needed here
#> lam.education    lam.income     lam.women        lambda 
#>    1.00000000    0.33000000    0.07948918    1.00000000 
cv(m.pres, seed=123)
#> R RNG seed set to 123
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: mse
#> cross-validation criterion = 66.91929
#> bias-adjusted cross-validation criterion = 66.45514
#> full-sample criterion = 59.15265 
Auto$year <- as.factor(Auto$year)
Auto$origin <- factor(Auto$origin,
                      labels=c("America", "Europe", "Japan"))
rownames(Auto) <- make.names(Auto$name, unique=TRUE)
Auto$name <- NULL
m.auto <- lm(mpg ~ . , data=Auto)
cvs <- cv(selectTransStepAIC, data=Auto, seed=76692, working.model=m.auto,
          criterion=medAbsErr,
          predictors=c("cylinders", "displacement", "horsepower",
                       "weight", "acceleration"),
          response="mpg", AIC=FALSE)
#> R RNG seed set to 76692
cvs
#> 10-Fold Cross Validation
#> criterion: medAbsErr
#> cross-validation criterion = 1.476272
#> full-sample criterion = 1.339604 
compareFolds(cvs)
#> CV criterion by folds:
#>   fold.1   fold.2   fold.3   fold.4   fold.5   fold.6   fold.7   fold.8 
#> 1.563930 1.562867 1.369843 1.282772 1.246066 1.582635 1.340277 1.183112 
#>   fold.9  fold.10 
#> 1.138892 1.585400 
#> 
#> Coefficients by folds:
#>         (Intercept) horsepower lam.acceleration lam.cylinders lam.displacement
#> Fold 1      9.71384   -0.17408          0.50000      -0.50000          0.18555
#> Fold 2      9.09355   -0.31285          0.00000      -0.50000          0.19452
#> Fold 3      9.61824   -0.19248          0.00000      -0.50000          0.15461
#> Fold 4      9.49410   -0.25380          0.00000      -0.50000          0.12800
#> Fold 5      9.14403   -0.14934          0.00000      -0.50000          0.18388
#> Fold 6      9.63481   -0.16739          0.00000      -0.50000          0.17122
#> Fold 7      9.60487   -0.18119          0.00000       0.00000          0.19063
#> Fold 8      8.92286   -0.29237          0.00000      -0.50000          0.33000
#> Fold 9      8.71492   -0.22484          0.00000      -0.50000          0.16881
#> Fold 10     9.61727   -0.17086          0.00000       0.00000          0.16638
#>         lam.horsepower lam.weight   lambda   weight   year71   year72   year73
#> Fold 1         0.00000    0.00000  0.00000 -0.74636  0.03764 -0.00327 -0.02477
#> Fold 2         0.00000    0.00000  0.00000 -0.48573  0.02141 -0.01416 -0.03720
#> Fold 3         0.00000    0.00000  0.00000 -0.72085  0.01128 -0.02569 -0.03872
#> Fold 4         0.00000    0.00000  0.00000 -0.57844  0.02269 -0.02130 -0.05069
#> Fold 5         0.00000    0.00000  0.00000 -0.69081  0.02531 -0.01062 -0.04625
#> Fold 6         0.00000    0.00000  0.00000 -0.74049  0.02456  0.00759 -0.03412
#> Fold 7         0.00000    0.00000  0.00000 -0.72741  0.02341 -0.01438 -0.04241
#> Fold 8         0.00000    0.00000  0.00000 -0.48913  0.02720 -0.01844 -0.05477
#> Fold 9         0.00000    0.00000  0.00000 -0.48188  0.00896 -0.03530 -0.04815
#> Fold 10        0.00000    0.00000  0.00000 -0.73550  0.02937 -0.00899 -0.03814
#>           year74   year75   year76   year77   year78   year79   year80   year81
#> Fold 1   0.05606  0.07080  0.07250  0.14420  0.14281  0.23266  0.35127  0.25635
#> Fold 2   0.04374  0.04010  0.06713  0.13116  0.14857  0.21825  0.33253  0.26173
#> Fold 3   0.05187  0.03837  0.06399  0.11593  0.12601  0.20499  0.32821  0.24478
#> Fold 4   0.04488  0.04238  0.05907  0.12717  0.13902  0.22986  0.33253  0.25407
#> Fold 5   0.05039  0.05596  0.07044  0.13356  0.14724  0.24675  0.33331  0.26938
#> Fold 6   0.06266  0.06940  0.07769  0.14211  0.14647  0.23532  0.34761  0.26737
#> Fold 7   0.04368  0.03327  0.07175  0.12777  0.13816  0.23722  0.33822  0.27453
#> Fold 8   0.04670  0.06595  0.08192  0.13289  0.13899  0.23003  0.33022  0.25909
#> Fold 9   0.01986  0.02981  0.05843  0.10584  0.11625  0.20625  0.31601  0.23350
#> Fold 10  0.05408  0.04881  0.07862  0.14101  0.14313  0.23258  0.35649  0.26214
#>           year82 acceleration displacement cylinders originEurope originJapan
#> Fold 1   0.30546                                                             
#> Fold 2   0.30876     -0.19017     -0.03233                                   
#> Fold 3   0.29204                                                             
#> Fold 4   0.28048     -0.14790               -0.30064                         
#> Fold 5   0.32594                                          0.06261        0.04
#> Fold 6   0.33062                                                             
#> Fold 7   0.30436                                                             
#> Fold 8   0.30500     -0.17392     -0.01683                                   
#> Fold 9   0.29284     -0.14623     -0.05409                                   
#> Fold 10  0.32421                                                             
data("Duncan", package="carData")
m1 <- lm(prestige ~ income + education, data=Duncan)
m2 <- lm(prestige ~ income + education + type, data=Duncan)
m3 <- lm(prestige ~ (income + education)*type, data=Duncan)
cv(selectModelList, data=Duncan, seed=5962,
   working.model=models(m1, m2, m3)) # recursive CV
#> R RNG seed set to 5962
#> 10-Fold Cross Validation
#> criterion: mse
#> cross-validation criterion = 107.3751
#> bias-adjusted cross-validation criterion = 135.4726
#> full-sample criterion = 113.7905