Linear model and dplyr - a better solution?

Updated: 2024-10-08 00:33:08

Question

I got a lot of good feedback on a question I recently asked and was guided to use dplyr to transform some data. I'm having an issue with lm() and trying to find a slope from this transformed data and thought I'd open up a new question.

First I have data that looks like this:

    Var1 Var2 Var3 Time      Temp
    a    w    j    9/9/2014  20
    a    w    j    9/9/2014  15
    a    w    k    9/20/2014 10
    a    w    j    9/10/2014 0
    b    x    L    9/12/2014 30
    b    x    L    9/12/2014 10
    b    y    k    9/13/2014 20
    b    y    k    9/13/2014 15
    c    z    j    9/14/2014 20
    c    z    j    9/14/2014 10
    c    z    k    9/14/2014 11
    c    w    l    9/10/2014 45
    a    d    j    9/22/2014 20
    a    d    k    9/15/2014 4
    a    d    l    9/15/2014 23
    a    d    k    9/15/2014 11

And I want it in the form of this (values for Slope and Pearson simulated for illustration):

    V1 V2 V3 Slope Pearson
    a  w  j  -3    -0.9
    a  w  k  2     0
    a  d  j  1.5   0.6
    a  d  k  0     0.5
    a  d  l  -0.5  -0.6
    b  x  L  12    0.7
    b  y  k  4     0.6
    c  z  j  -1    -0.5
    c  z  k  -3    -0.4
    c  w  l  -10   -0.9

The slope being a linear-least-squares slope. In theory, the script would look like so:

    library(dplyr)
    data <- read.table("clipboard", sep=" ", quote="", header=T)
    newdata = summarise(group_by(data, Var1, Var2, Var3),
                        Slope = lm(Temp ~ Time)$coeff[2],
                        Pearson = cor(Time, Temp, method="pearson"))

But R throws an error like it can't find Time or Temp. It can run lm(data$Temp ~ data$Time)$coeff[2], but returns the slope for the entire data set and not the subsetted form that I'm looking for. cor() seems to run just fine in the group_by section, so is there a specific syntax I need to pass to lm() to have it run in a similar manner or use a different function entirely to get a slope passed from the subset?

Recommended answer

There are several issues here.

  • If you group your data by 3 variables (or even 2), you don't have enough distinct values to run a linear regression model
  • Pearson correlation requires two numeric variables, but Time is a factor, and converting it directly to numeric would not make much sense
  • The third issue is that you will need to use do in order to run your linear model

Here is an illustration on Var1 only:

    data %>%
      group_by(Var1) %>% # You can add here additional grouping variables if your real data set enables it
      do(mod = lm(Temp ~ Time, data = .)) %>%
      mutate(Slope = summary(mod)$coeff[2]) %>%
      select(-mod)
    # Source: local data frame [3 x 2]
    # Groups: <by row>
    #
    #   Var1     Slope
    # 1    a  12.66667
    # 2    b  -2.50000
    # 3    c -31.33333
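One caveat about the numbers above: because Time is read as a factor (or character) column, lm(Temp ~ Time) fits one dummy coefficient per date, so summary(mod)$coeff[2] is the difference in mean Temp between the first two date levels in each group rather than a change per unit of time. A minimal sketch of one way to get a per-day slope follows. It assumes the dates are month/day/year and that turning them into a day count suits your analysis; the Day column and the slopes name are introduced here for illustration and are not part of the original answer.

```r
library(dplyr)

# The 16 rows from the question, read from inline text instead of the clipboard
data <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Var1 Var2 Var3 Time Temp
a w j 9/9/2014 20
a w j 9/9/2014 15
a w k 9/20/2014 10
a w j 9/10/2014 0
b x L 9/12/2014 30
b x L 9/12/2014 10
b y k 9/13/2014 20
b y k 9/13/2014 15
c z j 9/14/2014 20
c z j 9/14/2014 10
c z k 9/14/2014 11
c w l 9/10/2014 45
a d j 9/22/2014 20
a d k 9/15/2014 4
a d l 9/15/2014 23
a d k 9/15/2014 11
")

slopes <- data %>%
  # Assumption: Time is month/day/year; Day counts days since 1970-01-01
  mutate(Day = as.numeric(as.Date(Time, format = "%m/%d/%Y"))) %>%
  group_by(Var1) %>%
  # do() returns one data-frame row per group; [[2]] picks the Day coefficient
  do(data.frame(Slope = coef(lm(Temp ~ Day, data = .))[[2]])) %>%
  ungroup()

print(slopes)
```

For group b the per-day slope is still -2.5, but only because its two dates happen to be exactly one day apart; the other groups change once Time is treated as a number.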

If you do have two numeric variables, you can use do to calculate the correlation too, for example (I will create some dummy numeric variables for illustration):

    data %>%
      mutate(test1 = sample(1:3, n(), replace = TRUE), # Creating some numeric variables
             test2 = sample(1:3, n(), replace = TRUE)) %>%
      group_by(Var1) %>%
      do(mod = lm(Temp ~ Time, data = .),
         mod2 = cor(.$test1, .$test2, method = "pearson")) %>%
      mutate(Slope = summary(mod)$coeff[2], Pearson = mod2[1]) %>%
      select(-mod, -mod2)
    # Source: local data frame [3 x 3]
    # Groups: <by row>
    #
    #   Var1     Slope     Pearson
    # 1    a  12.66667  0.25264558
    # 2    b  -2.50000 -0.09090909
    # 3    c -31.33333  0.30151134
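To see the first issue from the list concretely, and to show that the correlation needs no do() once both columns are numeric, here is a hedged sketch. It assumes dplyr 1.0 or later (for the .groups argument), month/day/year dates, and a day-count column named Day that is not in the original answer. The slope uses the closed form cov(x, y) / var(x) for a one-predictor least-squares fit, and groups with fewer than two distinct days get NA instead of an error.

```r
library(dplyr)

# The 16 rows from the question, read from inline text instead of the clipboard
data <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Var1 Var2 Var3 Time Temp
a w j 9/9/2014 20
a w j 9/9/2014 15
a w k 9/20/2014 10
a w j 9/10/2014 0
b x L 9/12/2014 30
b x L 9/12/2014 10
b y k 9/13/2014 20
b y k 9/13/2014 15
c z j 9/14/2014 20
c z j 9/14/2014 10
c z k 9/14/2014 11
c w l 9/10/2014 45
a d j 9/22/2014 20
a d k 9/15/2014 4
a d l 9/15/2014 23
a d k 9/15/2014 11
")

result <- data %>%
  # Assumption: Time is month/day/year; Day counts days since 1970-01-01
  mutate(Day = as.numeric(as.Date(Time, format = "%m/%d/%Y"))) %>%
  group_by(Var1, Var2, Var3) %>%
  summarise(
    n_days  = n_distinct(Day),
    # Least-squares slope of Temp on Day; undefined when Day does not vary
    Slope   = if (n_distinct(Day) > 1) cov(Day, Temp) / var(Day) else NA_real_,
    Pearson = if (n_distinct(Day) > 1) cor(Day, Temp, method = "pearson") else NA_real_,
    .groups = "drop"
  )

print(result)
```

With the question's 16 rows, only one of the ten three-variable groups (a, w, j) contains two distinct dates, so it is the only group that gets a slope. That is the "not enough distinct values" problem in practice.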

Bonus solution: you can do this quite efficiently and easily with the data.table package too:

    library(data.table)
    setDT(data)[, list(Slope = summary(lm(Temp ~ Time))$coeff[2]), Var1]
    #    Var1     Slope
    # 1:    a  12.66667
    # 2:    b  -2.50000
    # 3:    c -31.33333

Or if we want to create some dummy variables too:

    library(data.table)
    setDT(data)[, `:=`(test1 = sample(1:3, .N, replace = TRUE),
                       test2 = sample(1:3, .N, replace = TRUE))][,
                  list(Slope = summary(lm(Temp ~ Time))$coeff[2],
                       Pearson = cor(test1, test2, method = "pearson")), Var1]
    #    Var1     Slope     Pearson
    # 1:    a  12.66667 -0.02159168
    # 2:    b  -2.50000 -0.81649658
    # 3:    c -31.33333 -1.00000000
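The data.table calls above also regress on Time as a factor. A rough data.table counterpart that regresses on a numeric day count instead might look like the sketch below. It assumes month/day/year dates; the Day column and the res name are illustrative additions, and .SD is passed to lm() explicitly so the model only sees the current group's rows.

```r
library(data.table)

# The 16 rows from the question, read from inline text instead of the clipboard
data <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Var1 Var2 Var3 Time Temp
a w j 9/9/2014 20
a w j 9/9/2014 15
a w k 9/20/2014 10
a w j 9/10/2014 0
b x L 9/12/2014 30
b x L 9/12/2014 10
b y k 9/13/2014 20
b y k 9/13/2014 15
c z j 9/14/2014 20
c z j 9/14/2014 10
c z k 9/14/2014 11
c w l 9/10/2014 45
a d j 9/22/2014 20
a d k 9/15/2014 4
a d l 9/15/2014 23
a d k 9/15/2014 11
")

setDT(data)

# Assumption: Time is month/day/year; Day counts days since 1970-01-01
data[, Day := as.numeric(as.Date(Time, format = "%m/%d/%Y"))]

# One row per Var1: per-day slope from lm() and Pearson correlation of Day and Temp
res <- data[, list(Slope   = coef(lm(Temp ~ Day, data = .SD))[[2]],
                   Pearson = cor(Day, Temp, method = "pearson")),
            by = Var1]

print(res)
```

Here the Pearson column is computed from the real Day and Temp columns, so unlike the dummy test1/test2 example it does not change from run to run.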

Published: 2023-11-29 01:57:19
Link: https://www.elefans.com/category/jswz/34/1644832.html