How to specify a validation holdout set for caret

Problem description

I really like using caret for at least the early stages of modeling, especially because its resampling methods are so easy to use. However, I'm working on a model where the training set has a fair number of cases added via semi-supervised self-training, and my cross-validation results are really skewed because of it. My solution is to use a validation set to measure model performance, but I can't see a way to use a validation set directly within caret - am I missing something, or is this just not supported? I know that I can write my own wrappers to do what caret would normally do for me, but it would be really nice if there were a work-around without having to do that.

Here is a trivial example of what I am experiencing:

> library(caret)
> set.seed(1)
>
> #training/validation sets
> i <- sample(150,50)
> train <- iris[-i,]
> valid <- iris[i,]
>
> #make my model
> tc <- trainControl(method="cv")
> model.rf <- train(Species ~ ., data=train, method="rf", trControl=tc)
>
> #model parameters are selected using CV results...
> model.rf
100 samples
  4 predictors
  3 classes: 'setosa', 'versicolor', 'virginica'

No pre-processing
Resampling: Cross-Validation (10 fold)

Summary of sample sizes: 90, 90, 90, 89, 90, 92, ...

Resampling results across tuning parameters:

  mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
  2     0.971     0.956  0.0469       0.0717
  3     0.971     0.956  0.0469       0.0717
  4     0.971     0.956  0.0469       0.0717

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

> #have to manually check validation set
> valid.pred <- predict(model.rf, valid)
> table(valid.pred, valid$Species)

valid.pred   setosa versicolor virginica
  setosa         17          0         0
  versicolor      0         20         1
  virginica       0          2        10

> mean(valid.pred==valid$Species)
[1] 0.94

I originally thought I could do this by creating a custom summaryFunction() for a trainControl() object, but I cannot see how to reference my model object to get predictions from the validation set (the documentation - caret.r-forge.r-project/training.html - lists only "data", "lev" and "model" as possible parameters). For example, this clearly will not work:

tc$summaryFunction <- function(data, lev = NULL, model = NULL){
  data.frame(Accuracy=mean(predict(<model object>,valid)==valid$Species))
}
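(For reference, a summaryFunction only ever sees the held-out predictions that caret hands it, so any custom metric has to be computed from the obs and pred columns of data rather than from a fitted model. A minimal sketch of a summary function that would at least run:)

#minimal sketch: a summaryFunction can only use what caret passes in,
#i.e. data$obs and data$pred for the current resample - not a fitted model
acc.summary <- function(data, lev = NULL, model = NULL){
  c(Accuracy = mean(data$pred == data$obs))
}
tc <- trainControl(method="cv", summaryFunction=acc.summary)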

EDIT: In an attempt to come up with a truly ugly fix, I've been looking to see if I can access the model object from the scope of another function, but I'm not even seeing the model stored anywhere. Hopefully there is some elegant solution that I'm not even coming close to seeing...

> tc$summaryFunction <- function(data, lev = NULL, model = NULL){
+   browser()
+   data.frame(Accuracy=mean(predict(model,valid)==valid$Species))
+ }
> train(Species ~ ., data=train, method="rf", trControl=tc)
note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .

Called from: trControl$summaryFunction(testOutput, classLevels, method)
Browse[1]> lapply(sys.frames(), function(x) ls(envi=x))
[[1]]
[1] "x"

[[2]]
 [1] "cons"      "contrasts" "data"      "form"      "m"         "na.action" "subset"
 [8] "Terms"     "w"         "weights"   "x"         "xint"      "y"

[[3]]
[1] "x"

[[4]]
 [1] "classLevels" "funcCall"    "maximize"    "method"      "metric"      "modelInfo"
 [7] "modelType"   "paramCols"   "ppMethods"   "preProcess"  "startTime"   "testOutput"
[13] "trainData"   "trainInfo"   "trControl"   "tuneGrid"    "tuneLength"  "weights"
[19] "x"           "y"

[[5]]
[1] "data"  "lev"   "model"
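(The frame listing above also confirms the problem: the call is trControl$summaryFunction(testOutput, classLevels, method), so the model argument is just the method name as a character string, never a fitted model object. A quick sketch to verify this, assuming the same tc and train() call as above:)

tc$summaryFunction <- function(data, lev = NULL, model = NULL){
  #'model' here is only the character name of the method, e.g. "rf",
  #so predict(model, valid) cannot work inside a summaryFunction
  cat("class of 'model':", class(model), "- value:", model, "\n")
  defaultSummary(data, lev, model)
}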

Answer

I think I may have found a work-around for this, but I'm not 100% sure that it is doing what I want, and I am still hoping that someone comes up with something a bit more elegant. Anyway, I realized that it probably makes the most sense to include the validation set inside my training set and just define the resampling so that performance is measured only on the validation data. I think this should do the trick for the example above:

> library(caret)
> set.seed(1)
>
> #training/validation set indices
> i <- sample(150,50)   #note - I no longer need to explicitly create train/validation sets
>
> #explicitly define the cross-validation indices to be those from the validation set
> tc <- trainControl(method="cv", number=1, index=list(Fold1=(1:150)[-i]), savePredictions=T)
> (model.rf <- train(Species ~ ., data=iris, method="rf", trControl=tc))
150 samples
  4 predictors
  3 classes: 'setosa', 'versicolor', 'virginica'

No pre-processing
Resampling: Cross-Validation (1 fold)

Summary of sample sizes: 100

Resampling results across tuning parameters:

  mtry  Accuracy  Kappa
  2     0.94      0.907
  3     0.94      0.907
  4     0.94      0.907

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

> #i think this worked because the resampling indices line up?
> all(sort(unique(model.rf$pred$rowIndex)) == sort(i))
[1] TRUE
> #exact contingency from above also indicates that this works
> table(model.rf$pred[model.rf$pred$.mtry==model.rf$bestTune[[1]], c("obs","pred")])
            pred
obs          setosa versicolor virginica
  setosa         17          0         0
  versicolor      0         20         2
  virginica       0          1        10
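(A small follow-up sketch, my own variation on the above rather than part of the original answer: trainControl() also takes an indexOut argument, so the held-out rows can be named explicitly instead of being left as the complement of index.)

#sketch: name both the fitting rows (index) and the scoring rows (indexOut) explicitly;
#i is the same validation-row sample as above
library(caret)
set.seed(1)
i <- sample(150, 50)
tc <- trainControl(method="cv", number=1,
                   index=list(Fold1=(1:150)[-i]),
                   indexOut=list(Fold1=i),
                   savePredictions=TRUE)
model.rf <- train(Species ~ ., data=iris, method="rf", trControl=tc)
#every held-out prediction should now come from the validation rows,
#so this check should again be TRUE
all(sort(unique(model.rf$pred$rowIndex)) == sort(i))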
