OLS Regression: Scikit vs. Statsmodels?


Problem Description


Short version: I was using scikit-learn's LinearRegression on some data, but I'm used to p-values, so I put the same data into statsmodels' OLS. Although the R^2 values are about the same, the variable coefficients all differ by large amounts. This concerns me, since the most likely explanation is that I've made an error somewhere, and now I don't feel confident in either output (I've probably built one of the models incorrectly, but I don't know which one).

Longer version: Because I don't know where the issue is, I don't know exactly which details to include, and including everything is probably too much. I'm also not sure whether to include code or data.

I am under the impression that scikit-learn's LinearRegression and statsmodels' OLS should both be doing ordinary least squares, and as far as I know, OLS is OLS, so the results should be the same.

For scikit-learn's LinearRegression, the results are (statistically) the same whether I set normalize=True or normalize=False, which I find somewhat strange.
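
That behavior is, as far as I can tell, expected: normalize rescaled the columns of X internally but reported the coefficients back on the original scale, so the fitted model was unchanged. A minimal sketch with stand-in data (the normalize parameter has since been removed from scikit-learn, so the rescaling here is done by hand with StandardScaler):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.random_sample((100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.random_sample(100)

raw = LinearRegression().fit(X, y)

# Fit on rescaled columns, then map the slopes back to the original scale
scaler = StandardScaler().fit(X)
rescaled = LinearRegression().fit(scaler.transform(X), y)
back = rescaled.coef_ / scaler.scale_

print(np.allclose(raw.coef_, back))  # True: rescaling X does not change the model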

For statsmodels' OLS, I standardize the data using StandardScaler from sklearn. I add a column of ones so that the model includes an intercept (since scikit-learn's output includes one). More on that here: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html (Adding this column did not change the variable coefficients to any notable degree, and the intercept was very close to zero.) StandardScaler didn't like that my ints weren't floats, so I tried this: https://github.com/scikit-learn/scikit-learn/issues/1709 That makes the warning go away, but the results are exactly the same.
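
A sketch of that preprocessing step with stand-in data (casting to float is the fix referenced in the sklearn issue above):

import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_raw = rng.randint(1, 60, size=(100, 2))               # stand-in integer features
y = X_raw @ np.array([0.1, 0.05]) + rng.random_sample(100)

X = StandardScaler().fit_transform(X_raw.astype(float))  # floats silence the dtype warning
X = sm.add_constant(X)                                   # column of ones -> explicit intercept

res = sm.OLS(y, X).fit()
print(res.params)     # intercept first, then coefficients on the standardized scale
print(res.rsquared)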

Granted, I'm using 5-fold CV for the sklearn approach (the R^2 values are consistent for both test and training data each time), while for statsmodels I just give it all the data at once.
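
For reference, a sketch of the two evaluation setups side by side (stand-in data again):

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.random_sample((100, 3))
y = X @ np.array([0.5, 1.0, 2.0]) + rng.random_sample(100)

# sklearn side: R^2 on each of 5 held-out folds
print(cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2'))

# statsmodels side: in-sample R^2 on all of the data at once
print(sm.OLS(y, sm.add_constant(X)).fit().rsquared)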

R^2 is about 0.41 for both sklearn and statsmodels (which is good for social science). This could be a good sign or just a coincidence.

The data consists of observations of avatars in World of Warcraft (from http://mmnet.iis.sinica.edu.tw/dl/wowah/), which I munged to make weekly, with some different features. Originally this was a class project for a data science course.

Independent variables include the number of observations in a week (int), character level (int), whether the character is in a guild (Boolean), when the character was seen (Booleans for weekday day, weekday eve, and weekday late, plus the same three for the weekend), dummies for character class (at the time of data collection there were only 8 classes in WoW, so there are 7 dummy variables and the original string categorical variable is dropped), and others.
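
A sketch of that dummy encoding (the column name is illustrative): dropping one of the 8 levels keeps the dummies from being perfectly collinear with an intercept column.

import pandas as pd

dat = pd.DataFrame({'charclass': ['Mage', 'Rogue', 'Priest', 'Mage', 'Warrior']})

# One dummy per class minus a dropped reference level (8 classes -> 7 dummies)
dummies = pd.get_dummies(dat['charclass'], prefix='class', drop_first=True)
dat = pd.concat([dat.drop(columns='charclass'), dummies], axis=1)
print(dat.head())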

The dependent variable is how many levels each character gained during that week (int).

Interestingly, some of the relative ordering within groups of related variables is maintained across statsmodels and sklearn. So the rank order of the "when seen" coefficients is the same even though the loadings are very different, and the rank order of the character-class dummies is the same even though, again, the loadings are very different.

I think this question is similar to this one: Difference in Python statsmodels OLS and R's lm

I am good enough at Python and stats to make a go of it, but not good enough to figure out something like this. I tried reading the sklearn docs and the statsmodels docs, but if the answer was there staring me in the face, I did not understand it.

What I would like to know:

1. Which output might be accurate? (Granted, if I missed a kwarg, they might both be.)
2. If I made a mistake, what is it and how do I fix it?
3. Could I have figured this out without asking here, and if so, how?

I know this question has some rather vague bits (no code, no data, no output), but I think it is more about the general processes of the two packages. Sure, one seems to be more stats and one more machine learning, but they're both OLS, so I don't understand why the outputs aren't the same.

(I even tried some other OLS calls to triangulate: one gave a much lower R^2, one looped for five minutes before I killed it, and one crashed.)

Thanks!

Recommended Answer

It sounds like you are not feeding the same matrix of regressors X to both procedures (but see below). Here's an example showing which options you need to use for sklearn and statsmodels to produce identical results.

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Generate artificial data (2 regressors + constant)
nobs = 100 
X = np.random.random((nobs, 2)) 
X = sm.add_constant(X)
beta = [1, .1, .5] 
e = np.random.random(nobs)
y = np.dot(X, beta) + e 

# Fit with statsmodels (X already contains the constant column)
sm.OLS(y, X).fit().params
# >> array([ 1.4507724 ,  0.08612654,  0.60129898])

# Fit with sklearn; fit_intercept=False because X already has the constant
LinearRegression(fit_intercept=False).fit(X, y).coef_
# >> array([ 1.4507724 ,  0.08612654,  0.60129898])
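
A short follow-up sketch, continuing from the variables above, of how the question's symptom arises: standardizing the regressors for one fit but not the other rescales each slope by that column's standard deviation (and shifts the intercept) while leaving R^2 unchanged.

# Standardize the non-constant columns only, then refit
Xs = X.copy()
Xs[:, 1:] = (Xs[:, 1:] - Xs[:, 1:].mean(axis=0)) / Xs[:, 1:].std(axis=0)

res_raw = sm.OLS(y, X).fit()
res_std = sm.OLS(y, Xs).fit()
print(res_raw.params, res_std.params)      # slopes differ by each column's std. dev.
print(res_raw.rsquared, res_std.rsquared)  # identical R^2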

As a commenter suggested, even if you are giving both programs the same X, X may not have full column rank, and sm/sk could be taking (different) actions under the hood to make the OLS computation go through (i.e., dropping different columns).

I recommend you use pandas and patsy to take care of this:

import pandas as pd
from patsy import dmatrices

dat = pd.read_csv('wow.csv')
# patsy builds y and a full-rank X: intercept added, categoricals expanded
# with one reference level dropped
y, X = dmatrices('levels ~ week + character + guild', data=dat)
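
Both libraries can then be fit on exactly the same full-rank design (a sketch, continuing with the answer's illustrative wow.csv columns):

import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Same X for both fits; patsy already added the intercept column,
# so sklearn's own intercept is turned off
print(sm.OLS(y, X).fit().params)
print(LinearRegression(fit_intercept=False).fit(X, y).coef_)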

Alternatively, use the statsmodels formula interface:

import statsmodels.formula.api as smf

dat = pd.read_csv('wow.csv')
mod = smf.ols('levels ~ week + character + guild', data=dat).fit()
print(mod.summary())  # coefficient table with p-values, plus R^2
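
The formula interface runs the same patsy machinery under the hood, so categorical variables are expanded with a reference level dropped and an intercept is added automatically; the fitted results also include the p-values asked about above.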

This example may be useful: http://statsmodels.sourceforge.net/devel/example_formulas.html
