Linear models and dplyr - a better solution?

Updated: 2024-10-07 20:26:47
Problem description

I got a lot of good feedback on a question I recently asked and was guided to use dplyr to transform some data. I'm having an issue with lm() while trying to find a slope from this transformed data, so I thought I'd open a new question.

First, I have data that looks like this:

    Var1 Var2 Var3 Time      Temp
    a    w    j    9/9/2014  20
    a    w    j    9/9/2014  15
    a    w    k    9/20/2014 10
    a    w    j    9/10/2014 0
    b    x    L    9/12/2014 30
    b    x    L    9/12/2014 10
    b    y    k    9/13/2014 20
    b    y    k    9/13/2014 15
    c    z    j    9/14/2014 20
    c    z    j    9/14/2014 10
    c    z    k    9/14/2014 11
    c    w    l    9/10/2014 45
    a    d    j    9/22/2014 20
    a    d    k    9/15/2014 4
    a    d    l    9/15/2014 23
    a    d    k    9/15/2014 11

And I want it in this form (values for Slope and Pearson are simulated for illustration):

    V1 V2 V3 Slope Pearson
    a  w  j  -3    -0.9
    a  w  k  2     0
    a  d  j  1.5   0.6
    a  d  k  0     0.5
    a  d  l  -0.5  -0.6
    b  x  L  12    0.7
    b  y  k  4     0.6
    c  z  j  -1    -0.5
    c  z  k  -3    -0.4
    c  w  l  -10   -0.9

The slope is the linear least-squares slope. In theory, the script would look like this:

    library(dplyr)

    data <- read.table("clipboard", sep = "\t", quote = "", header = TRUE)

    newdata <- summarise(
      group_by(data, Var1, Var2, Var3),
      Slope   = lm(Temp ~ Time)$coeff[2],
      Pearson = cor(Time, Temp, method = "pearson")
    )

But R throws an error as if it can't find Time or Temp. It can run lm(data$Temp ~ data$Time)$coeff[2], but that returns the slope for the entire data set rather than for each subset. cor() seems to run just fine in the group_by section, so is there a specific syntax I need to pass to lm() to make it run in a similar manner, or should I use a different function entirely to get a slope for each subset?
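To see why the data$ version ignores the grouping, compare a column referenced through data$ with the bare column name inside a grouped summarise(). This is a minimal sketch, not part of the original question: it rebuilds the sample data inline with read.table(text = ...) instead of reading the clipboard, uses mean() as a stand-in for the slope, and the names m_all and m_grp are made up for illustration.

```r
library(dplyr)

# Sample data from the question, rebuilt inline (the original read it from the clipboard).
data <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Var1 Var2 Var3 Time Temp
a w j 9/9/2014 20
a w j 9/9/2014 15
a w k 9/20/2014 10
a w j 9/10/2014 0
b x L 9/12/2014 30
b x L 9/12/2014 10
b y k 9/13/2014 20
b y k 9/13/2014 15
c z j 9/14/2014 20
c z j 9/14/2014 10
c z k 9/14/2014 11
c w l 9/10/2014 45
a d j 9/22/2014 20
a d k 9/15/2014 4
a d l 9/15/2014 23
a d k 9/15/2014 11
")

res <- data %>%
  group_by(Var1) %>%
  summarise(
    m_all = mean(data$Temp), # data$Temp is always the full 16-row column
    m_grp = mean(Temp)       # bare Temp is only the rows of the current group
  )

print(res)
```

m_all is identical in every row, which is the same reason lm(data$Temp ~ data$Time) returns a single slope for the whole data set.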

Solution

You have several issues here.

  • If you group your data by 3 variables (or even 2), you don't have enough distinct values in each group to fit a linear regression model.
  • Pearson correlation requires two numeric variables, but Time is a factor, and converting it to numeric won't make much sense.
  • You will need to use do() in order to run your linear model within each group.

Here's an illustration that groups only on Var1:

    data %>%
      group_by(Var1) %>% # You can add additional grouping variables here if your real data set allows it
      do(mod = lm(Temp ~ Time, data = .)) %>%
      mutate(Slope = summary(mod)$coeff[2]) %>%
      select(-mod)

    # Source: local data frame [3 x 2]
    # Groups: <by row>
    #
    #   Var1     Slope
    # 1    a  12.66667
    # 2    b  -2.50000
    # 3    c -31.33333
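The first two points can be checked on the sample data. The sketch below is an illustration rather than part of the original answer: it assumes dplyr 1.0 or later (for the .groups argument), rebuilds the data inline, and introduces the hypothetical columns Date and Day and the hypothetical table group_sizes.

```r
library(dplyr)

# Sample data from the question, rebuilt inline.
data <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Var1 Var2 Var3 Time Temp
a w j 9/9/2014 20
a w j 9/9/2014 15
a w k 9/20/2014 10
a w j 9/10/2014 0
b x L 9/12/2014 30
b x L 9/12/2014 10
b y k 9/13/2014 20
b y k 9/13/2014 15
c z j 9/14/2014 20
c z j 9/14/2014 10
c z k 9/14/2014 11
c w l 9/10/2014 45
a d j 9/22/2014 20
a d k 9/15/2014 4
a d l 9/15/2014 23
a d k 9/15/2014 11
")

# Parse the month/day/year strings as dates, then count days since the
# earliest observation so that Time becomes a genuine numeric variable.
data <- data %>%
  mutate(Date = as.Date(Time, format = "%m/%d/%Y"),
         Day  = as.numeric(Date - min(Date)))

# Rows and distinct days per three-way group; a slope over time needs
# at least two distinct days in the group.
group_sizes <- data %>%
  group_by(Var1, Var2, Var3) %>%
  summarise(n = n(), n_days = n_distinct(Day), .groups = "drop")

print(group_sizes)
```

In the sample data only the a/w/j group spans more than one day, so a three-way grouping cannot give a meaningful slope for the other nine groups.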

If you do have two numeric variables, you can use do() to calculate the correlation too. For example (I will create some dummy numeric variables for illustration):

    data %>%
      mutate(test1 = sample(1:3, n(), replace = TRUE), # Creating some numeric variables
             test2 = sample(1:3, n(), replace = TRUE)) %>%
      group_by(Var1) %>%
      do(mod = lm(Temp ~ Time, data = .),
         mod2 = cor(.$test1, .$test2, method = "pearson")) %>%
      mutate(Slope = summary(mod)$coeff[2],
             Pearson = mod2[1]) %>%
      select(-mod, -mod2)

    # Source: local data frame [3 x 3]
    # Groups: <by row>
    #
    #   Var1     Slope     Pearson
    # 1    a  12.66667  0.25264558
    # 2    b  -2.50000 -0.09090909
    # 3    c -31.33333  0.30151134
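If Time is parsed as a date, it becomes a numeric variable, and both a per-day slope and the Pearson correlation can be computed directly in summarise(). The following is a hedged alternative to the do() approach, not the original answer's method: it uses the closed-form least-squares slope cov(x, y) / var(x) instead of calling lm(), and the column name Day is an assumption introduced here.

```r
library(dplyr)

# Sample data from the question, rebuilt inline.
data <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Var1 Var2 Var3 Time Temp
a w j 9/9/2014 20
a w j 9/9/2014 15
a w k 9/20/2014 10
a w j 9/10/2014 0
b x L 9/12/2014 30
b x L 9/12/2014 10
b y k 9/13/2014 20
b y k 9/13/2014 15
c z j 9/14/2014 20
c z j 9/14/2014 10
c z k 9/14/2014 11
c w l 9/10/2014 45
a d j 9/22/2014 20
a d k 9/15/2014 4
a d l 9/15/2014 23
a d k 9/15/2014 11
")

result <- data %>%
  # Days since 1970-01-01; only differences between days affect the slope.
  mutate(Day = as.numeric(as.Date(Time, format = "%m/%d/%Y"))) %>%
  group_by(Var1) %>%
  summarise(
    Slope   = cov(Day, Temp) / var(Day),            # simple-regression slope of Temp on Day
    Pearson = cor(Day, Temp, method = "pearson")
  )

print(result)
```

These slopes are in temperature units per day and will generally differ from the values printed above, where Time is treated as a factor and the second coefficient is a contrast between two date levels rather than a rate of change.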

Bonus solution: you can do this quite efficiently and easily with the data.table package too:

    library(data.table)

    setDT(data)[, list(Slope = summary(lm(Temp ~ Time))$coeff[2]), Var1]

    #    Var1     Slope
    # 1:    a  12.66667
    # 2:    b  -2.50000
    # 3:    c -31.33333

Or, if we want to create some dummy variables too:

    library(data.table)

    setDT(data)[, `:=`(test1 = sample(1:3, .N, replace = TRUE),
                       test2 = sample(1:3, .N, replace = TRUE))][
      , list(Slope = summary(lm(Temp ~ Time))$coeff[2],
             Pearson = cor(test1, test2, method = "pearson")),
      Var1]

    #    Var1     Slope     Pearson
    # 1:    a  12.66667 -0.02159168
    # 2:    b  -2.50000 -0.81649658
    # 3:    c -31.33333 -1.00000000
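The same date conversion can be combined with the data.table form. This is a sketch under the same assumptions as before (the Day column is introduced here and is not in the original answer); it calls lm() inside j and extracts the coefficient by name.

```r
library(data.table)

# Sample data from the question, rebuilt inline.
data <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Var1 Var2 Var3 Time Temp
a w j 9/9/2014 20
a w j 9/9/2014 15
a w k 9/20/2014 10
a w j 9/10/2014 0
b x L 9/12/2014 30
b x L 9/12/2014 10
b y k 9/13/2014 20
b y k 9/13/2014 15
c z j 9/14/2014 20
c z j 9/14/2014 10
c z k 9/14/2014 11
c w l 9/10/2014 45
a d j 9/22/2014 20
a d k 9/15/2014 4
a d l 9/15/2014 23
a d k 9/15/2014 11
")

setDT(data)

# Add a numeric day column by reference.
data[, Day := as.numeric(as.Date(Time, format = "%m/%d/%Y"))]

# One row per Var1 with the per-day slope and the Pearson correlation.
result <- data[, .(Slope   = coef(lm(Temp ~ Day))[["Day"]],
                   Pearson = cor(Day, Temp, method = "pearson")),
               by = Var1]

print(result)
```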

Published: 2023-11-29 01:57:36
Article link: https://www.elefans.com/category/jswz/34/1644834.html