Is it acceptable to scale the target values for regression metrics?

Updated: 2024-10-24 20:14:29

Problem description

I am getting very high RMSE and MAE (30,000+) for MLPRegressor, ForestRegression and linear regression when only the input variables are scaled. However, when I scale the target values as well, I get an RMSE of 0.2. I would like to know whether that is an acceptable thing to do.

Secondly, is it normal to have a better R squared value for test than for train (i.e. 0.98 for test and 0.85 for train)?

Thanks

Recommended answer

Answering your first question: I think you are being misled by the performance measures you have chosen to evaluate your model. Both RMSE and MAE are sensitive to the range of the target variable. If you scale the target down, the RMSE and MAE values will certainly go down with it. Let's take an example to illustrate that.

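    import numpy as np

    def rmse(y_true, y_pred):
        # Root mean squared error: square root of the mean squared difference
        return np.sqrt(np.mean(np.square(y_true - y_pred)))

    def mae(y_true, y_pred):
        # Mean absolute error: mean of the absolute differences
        return np.mean(np.abs(y_true - y_pred))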

I have written two functions to compute RMSE and MAE. Now let's plug in some values and see what happens.

    y_true = np.array([2, 5, 9, 7, 10, -5, -2, 2])
    y_pred = np.array([3, 4, 7, 9, 8, -3, -2, 1])

For the time being, let's assume that the true and predicted values are as shown above. Now we are ready to compute RMSE and MAE for this data.

    rmse(y_true, y_pred)   # 1.541103500742244
    mae(y_true, y_pred)    # 1.375

Now let's scale our target variable down by a factor of 10 and compute the same measures again.

    y_scaled_true = np.array([2, 5, 9, 7, 10, -5, -2, 2]) / 10
    y_scaled_pred = np.array([3, 4, 7, 9, 8, -3, -2, 1]) / 10

    rmse(y_scaled_true, y_scaled_pred)   # 0.15411035007422444
    mae(y_scaled_true, y_scaled_pred)    # 0.1375

We can now clearly see that just by scaling the target variable, our RMSE and MAE scores have dropped, creating the illusion that our model has improved when it actually has NOT. When we scale the model's predictions back to the original units, we are in the same state as before.
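To make the "same state" point concrete, here is a minimal sketch that maps the scaled predictions back to the original units before scoring. It reuses the illustrative values from above; the name `scale` and the assumption that the target was simply divided by a constant are mine, not part of the question.

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean(np.square(y_true - y_pred)))

# Same illustrative values as above, in the original units.
y_true = np.array([2, 5, 9, 7, 10, -5, -2, 2], dtype=float)
y_pred = np.array([3, 4, 7, 9, 8, -3, -2, 1], dtype=float)

scale = 10.0  # hypothetical factor the target was divided by before training

# RMSE measured in the scaled space looks 10x smaller...
rmse_scaled = rmse(y_true / scale, y_pred / scale)

# ...but undoing the scaling on the predictions brings the error back.
y_pred_restored = (y_pred / scale) * scale
rmse_restored = rmse(y_true, y_pred_restored)

print(rmse_scaled)    # about 0.154
print(rmse_restored)  # about 1.541
print(np.isclose(rmse_scaled * scale, rmse_restored))  # True
```

So scaling the target for training is fine in itself; the score is only comparable with the unscaled runs if it is reported in the original units.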

So, coming to the point: MAPE (Mean Absolute Percentage Error) could be a better way to measure the performance of your model, because it does not change when the variables are rescaled by a constant factor. If you compute MAPE for both sets of values, you will see that they are the same.

    def mape(y, y_pred):
        # Mean absolute percentage error; assumes y contains no zeros
        return np.mean(np.abs((y - y_pred) / y))

    mape(y_true, y_pred)                 # 0.28849206349206347
    mape(y_scaled_true, y_scaled_pred)   # 0.2884920634920635

So it is better to rely on MAPE rather than MAE or RMSE if you want your performance measure to be independent of the scale in which the target is measured.
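One caveat worth checking before relying on this: MAPE is unchanged when the target is multiplied or divided by a constant, but a scaler that also shifts the values (for example, standardizing by subtracting the mean) changes the denominator. The sketch below, again using the illustrative values from above, compares both cases. The standardization step is my assumption about how the target might have been scaled; the question does not say.

```python
import numpy as np

def mape(y, y_pred):
    # Assumes y contains no zeros, otherwise the division is undefined.
    return np.mean(np.abs((y - y_pred) / y))

y_true = np.array([2, 5, 9, 7, 10, -5, -2, 2], dtype=float)
y_pred = np.array([3, 4, 7, 9, 8, -3, -2, 1], dtype=float)

base = mape(y_true, y_pred)

# Dividing both arrays by a constant leaves MAPE unchanged.
divided = mape(y_true / 10, y_pred / 10)

# Standardizing (subtract the mean, divide by the standard deviation)
# also shifts the values, so MAPE in that space is generally different.
mu, sigma = y_true.mean(), y_true.std()
standardized = mape((y_true - mu) / sigma, (y_pred - mu) / sigma)

print(np.isclose(base, divided))       # True
print(np.isclose(base, standardized))  # False
```

If the target was standardized or min-max scaled, it is probably safer to transform the predictions back to the original units first and compute MAPE there.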

Answering your second question: since you are dealing with fairly complex models such as MLPRegressor and ForestRegression, which have hyper-parameters that need to be tuned to avoid overfitting, the best way to find good hyper-parameter values is to divide the data into train, validation and test sets and use techniques such as K-Fold Cross Validation to find the optimal setting. It is quite difficult to say whether the above values are acceptable just by looking at this one case.
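As a rough illustration of the K-fold idea, here is a minimal NumPy-only sketch. It uses an ordinary least-squares fit as a stand-in for the real models and synthetic data, and the helper names `r2_score` and `k_fold_r2` are hypothetical. The point is only that looking at train and validation R squared over several splits is more informative than a single train/test pair, where the test score can come out higher than the train score by chance.

```python
import numpy as np

def r2_score(y_true, y_pred):
    # Coefficient of determination: 1 - residual SS / total SS
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def k_fold_r2(X, y, k=5, seed=0):
    """Return one (train R^2, validation R^2) row per fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    Xb = np.column_stack([X, np.ones(len(y))])  # add an intercept column
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the training folds only (least squares as a stand-in model).
        coef, *_ = np.linalg.lstsq(Xb[train_idx], y[train_idx], rcond=None)
        train_r2 = r2_score(y[train_idx], Xb[train_idx] @ coef)
        val_r2 = r2_score(y[val_idx], Xb[val_idx] @ coef)
        scores.append((train_r2, val_r2))
    return np.array(scores)

# Synthetic data, purely for illustration.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

scores = k_fold_r2(X, y, k=5)
print(scores)               # one (train R^2, validation R^2) row per fold
print(scores.mean(axis=0))  # averages across the folds
```

With a real model, the same loop would be repeated for each candidate hyper-parameter setting, keeping a final test set untouched until the setting has been chosen.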