Converting predicted probabilities after downsampling to actual probabilities in classification (using mlr)

Updated: 2024-10-09 20:24:39

Question

If I use undersampling to train a model on an unbalanced binary target variable, the prediction method calculates probabilities under the assumption of a balanced data set. How can I convert these probabilities to actual probabilities for the unbalanced data? Is there a conversion argument or function implemented in the mlr package or in another package? For example:

library(mlr)

# Unbalanced binary target: class "0" is the minority (about 10%)
a <- data.frame(y = factor(sample(0:1, prob = c(0.1, 0.9), replace = TRUE, size = 100)))
a$x <- as.numeric(a$y) + rnorm(n = 100, sd = 1)
task <- makeClassifTask(data = a, target = "y", positive = "0")
learner <- makeLearner("classif.binomial", predict.type = "prob")
# Keep only 10% of the majority class "1" during training
learner <- makeUndersampleWrapper(learner, usw.rate = 0.1, usw.cl = "1")
model <- train(learner, task, subset = 1:50)
pred <- predict(model, task, subset = 51:100)
head(pred$data)

Recommended answer

A method for this is described in [Dal Pozzolo et al., 2015].

Paper title: "Calibrating Probability with Undersampling for Unbalanced Classification", by Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson, and Gianluca Bontempi.

It is specifically designed to tackle the issue of calibration in the case of downsampling, i.e. transforming the predicted probabilities of your classifier into actual probabilities for the unbalanced case.

You just have to correct your predicted probability p_s using the following formula:

p = beta * p_s / ((beta-1) * p_s + 1)

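where beta is the ratio of the number of majority-class instances after undersampling to the number of majority-class instances in the original training set, and p_s is the probability of the minority class predicted by the model trained on the undersampled data.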
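The correction above can be sketched in base R as follows. This is a minimal sketch and not part of mlr: the function name `correct_undersampling_prob` is hypothetical, and it assumes that `p_s` is the predicted probability of the minority (positive) class and that `beta` is the fraction of majority-class instances kept by undersampling.

```r
# Hypothetical helper (not an mlr function): maps probabilities predicted by a
# model trained on undersampled data back to the original class distribution.
# p_s  : predicted probability of the minority (positive) class, in [0, 1]
# beta : fraction of majority-class instances kept by undersampling, in (0, 1]
correct_undersampling_prob <- function(p_s, beta) {
  stopifnot(is.numeric(p_s), all(p_s >= 0 & p_s <= 1))
  stopifnot(is.numeric(beta), length(beta) == 1, beta > 0, beta <= 1)
  beta * p_s / ((beta - 1) * p_s + 1)
}

# Example: with 10% of the majority class kept, predicted probabilities
# strictly between 0 and 1 are pulled towards 0; the endpoints are unchanged.
p_s <- c(0, 0.25, 0.5, 0.9, 1)
p <- correct_undersampling_prob(p_s, beta = 0.1)
print(round(p, 4))
```

In the question's setup, `usw.rate = 0.1` is the fraction of class "1" that is kept, so `beta = 0.1` would be a natural choice, and `p_s` would presumably be the `prob.0` column of `pred$data`. Both are assumptions to check against your own data.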

Other methods

Other methods that are not specifically focused on the downsampling bias have also been proposed. The most popular ones are the following: