R caret: leave-subject-out cross-validation with a data subset for training?

Updated: 2024-10-28 10:33:14

Problem description

I want to perform leave-subject-out cross-validation with R caret (cf. this example) but use only a subset of the data in training when creating the CV models. The left-out CV partition should still be used as a whole, because I need to test on all data of a left-out subject, even if that is millions of samples that cannot be used in training due to computational restrictions.
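To make the intended fold structure concrete, here is a minimal base-R sketch on made-up data. The data frame `toy`, its `subject` column, and the cap `n_train` are illustrative assumptions and not part of the question's code. Each fold trains on a random subsample of the other subjects' rows and tests on every row of the held-out subject.

```r
set.seed(1)

# Hypothetical toy data: 3 subjects with different numbers of samples
toy <- data.frame(
  subject = rep(c("a", "b", "c"), times = c(50, 30, 20)),
  x       = rnorm(100)
)

n_train  <- 10                   # assumed cap on training rows per fold
subjects <- unique(toy$subject)

folds <- lapply(subjects, function(s) {
  other_rows <- which(toy$subject != s)
  list(
    train = sample(other_rows, min(n_train, length(other_rows))),  # subsampled
    test  = which(toy$subject == s)                                # whole subject
  )
})
names(folds) <- paste0("sub_", subjects)

# Rows per fold: training is capped, testing keeps the whole subject
sapply(folds, function(f) c(train = length(f$train), test = length(f$test)))
```

Only the training side is subsampled here; the test side is never touched, which is the property the question asks caret to preserve.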

I've created a minimal 2 class classification example using the subset and index parameters of caret::train and caret::trainControl to achieve this. From my observation this should solve the problem, but I have a hard time actually ensuring that the evaluation is still done in a leave-subject-out way. Maybe someone with experience in this task could shed some light on this:

```r
library(plyr)
library(caret)
library(pROC)
library(ggplot2)

# with diamonds we want to predict cut and look at results for different colors = subjects
d <- as.data.frame(diamonds)  # plain data frame so that d[, 3] returns a vector
d <- d[d$cut %in% c('Premium', 'Ideal'), ]  # make a 2 class problem
d$cut <- factor(d$cut)
indexes_data <- c(1, 5, 6, 8:10)
indexes_labels <- 2

# population independent CV indexes for trainControl
index <- llply(unique(d[, 3]), function(cls) c(which(d[, 3] != cls)))
names(index) <- paste0('sub_', unique(d[, 3]))
str(index)  # indexes used for training models with CV = OK

m3 <- train(x = d[, indexes_data],
            y = d[, indexes_labels],
            method = 'glm',
            metric = 'ROC',
            # does this subset the data used for training and obtaining models, but not
            # the left out partition used for estimating CV performance?
            subset = sample(nrow(d), 5000),
            trControl = trainControl(returnResamp = 'final',
                                     savePredictions = TRUE,
                                     classProbs = TRUE,
                                     summaryFunction = twoClassSummary,
                                     index = index))
str(m3$resample)  # all samples used once = OK

# performance over all subjects
myRoc <- roc(predictor = m3$pred[, 3], response = m3$pred$obs)
plot(myRoc, main = 'all')

# performance per subject
l_ply(unique(m3$pred$Resample), .fun = function(cls) {
  pred_sub <- m3$pred[m3$pred$Resample == cls, ]
  myRoc <- roc(predictor = pred_sub[, 3], response = pred_sub$obs)
  plot(myRoc, main = cls)
})
```

Thanks for your time!

Recommended answer

Using both the index and indexOut parameter in caret::trainControl at the same time seems to do the trick (thanks to Max for the hint in this question). Here is the updated code:

```r
library(plyr)
library(caret)
library(pROC)
library(ggplot2)
str(diamonds)

# with diamonds we want to predict cut and look at results for different colors = subjects
d <- as.data.frame(diamonds)  # plain data frame so that d[, j] returns a vector
d <- d[d$cut %in% c('Premium', 'Ideal'), ]  # make a 2 class problem
d$cut <- factor(d$cut)
indexes_data <- c(1, 5, 6, 8:10)
indexes_labels <- 2

# population independent CV partitions for training and left out partitions for evaluation
indexes_populationIndependence_subjects <- 3
index <- llply(unique(d[, indexes_populationIndependence_subjects]),
               function(cls) c(which(d[, indexes_populationIndependence_subjects] != cls)))
names(index) <- paste0('sub_', unique(d[, indexes_populationIndependence_subjects]))
indexOut <- llply(index, function(part) (1:nrow(d))[-part])
names(indexOut) <- paste0('sub_', unique(d[, indexes_populationIndependence_subjects]))

# subsample partitions for training
index <- llply(index, function(i) sample(i, 1000))

m3 <- train(x = d[, indexes_data],
            y = d[, indexes_labels],
            method = 'glm',
            metric = 'ROC',
            trControl = trainControl(returnResamp = 'final',
                                     savePredictions = TRUE,
                                     classProbs = TRUE,
                                     summaryFunction = twoClassSummary,
                                     index = index,
                                     indexOut = indexOut))
m3$resample   # seems OK
str(m3$pred)  # seems OK

myRoc <- roc(predictor = m3$pred[, 3], response = m3$pred$obs)
plot(myRoc, main = 'all')

# analyze results per subject
l_ply(unique(m3$pred$Resample), .fun = function(cls) {
  pred_sub <- m3$pred[m3$pred$Resample == cls, ]
  myRoc <- roc(predictor = pred_sub[, 3], response = pred_sub$obs)
  plot(myRoc, main = cls)
})
```
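One way to sanity-check such partitions before training is to compare them against the subject column directly. The helper `check_leave_subject_out()` below is a hypothetical name, not a caret function. It only assumes that `index` and `indexOut` are lists of row numbers of the kind passed to `trainControl`. It is demonstrated on a small made-up subject vector so that it runs on its own.

```r
# Hypothetical helper (not part of caret): returns TRUE for each fold whose
# training rows avoid the held-out subject and whose held-out rows cover
# that subject completely.
check_leave_subject_out <- function(index, indexOut, subject) {
  subject <- as.character(subject)
  mapply(function(train_rows, test_rows) {
    held_out <- unique(subject[test_rows])
    length(held_out) == 1 &&                          # exactly one subject is held out
      !any(subject[train_rows] %in% held_out) &&      # no leakage into training
      setequal(test_rows, which(subject == held_out)) # every row of that subject is tested
  }, index, indexOut)
}

# Made-up subject labels standing in for the diamonds colour column
set.seed(42)
subject  <- rep(c("D", "E", "F"), times = c(40, 35, 25))
full     <- lapply(unique(subject), function(s) which(subject != s))
indexOut <- lapply(full, function(part) seq_along(subject)[-part])
index    <- lapply(full, function(i) sample(i, 20))   # subsample the training rows only

check_leave_subject_out(index, indexOut, subject)
```

Applied to the objects from the answer, the call would presumably be `check_leave_subject_out(index, indexOut, d[, indexes_populationIndependence_subjects])`, assuming `d` is a plain data frame so that the column comes back as a vector.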

Still, I'm not absolutely sure whether this actually performs the estimation in a population-independent way, so if anybody knows the details, please share your thoughts!
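A complementary check is possible after training, because with `savePredictions` enabled caret stores a `rowIndex` and a `Resample` column in `m3$pred`. The sketch below uses a hand-made stand-in for that table (`pred` and `subject` are made-up data, not output from the model above) and counts the distinct subjects evaluated in each resample. Leave-subject-out evaluation should give exactly one per resample.

```r
# Made-up stand-in for m3$pred: only the two columns needed for the check
subject <- c("D", "D", "E", "E", "E", "F")
pred <- data.frame(
  rowIndex = c(1, 2, 3, 4, 5, 6),
  Resample = c("sub_D", "sub_D", "sub_E", "sub_E", "sub_E", "sub_F")
)

# Number of distinct subjects among the rows evaluated in each resample
subjects_per_resample <- tapply(
  subject[pred$rowIndex], pred$Resample,
  function(s) length(unique(s))
)
subjects_per_resample

# Rows evaluated per resample versus rows available per subject:
# the counts should match if each left-out partition is used as a whole
table(pred$Resample)
table(subject)
```

With the real model, `pred` would be `m3$pred` and `subject` would be the subject column of `d`. If any resample reports more than one subject, or fewer rows than that subject has in `d`, the evaluation is not leave-subject-out on the full partition.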