Problem description
n <- 3
strata <- rep(1:4, each=n)
y <- rnorm(n = 12)
x <- 1:12
category <- rep(c("A", "B", "C"), times = 4)
df <- cbind.data.frame(y, x, strata, category)
I want to first split my data into a list by "strata", and then split every data frame in that list again by "category". Finally, I want to regress y on x within each of the resulting data frames. (In this example each data frame has only one row, but in the actual data the strata have different lengths and contain different numbers of categories.)
Recommended answer
The canonical way in R is to use split:
L <- split(df, df[,c("strata","category")])
L
# $`1.A`
# y x strata category
# 1 -1.120867 1 1 A
# $`2.A`
# y x strata category
# 4 -1.023001 4 2 A
# $`3.A`
# y x strata category
# 7 0.5411806 7 3 A
# $`4.A`
# y x strata category
# 10 1.546789 10 4 A
# $`1.B`
# y x strata category
# 2 0.6730641 2 1 B
# $`2.B`
# y x strata category
# 5 -1.466816 5 2 B
# $`3.B`
# y x strata category
# 8 -0.1955617 8 3 B
# $`4.B`
# y x strata category
# 11 -0.660904 11 4 B
# $`1.C`
# y x strata category
# 3 -0.9880206 3 1 C
# $`2.C`
# y x strata category
# 6 0.4111802 6 2 C
# $`3.C`
# y x strata category
# 9 -0.03311637 9 3 C
# $`4.C`
# y x strata category
# 12 0.6799109 12 4 C
The names of the 12-element list (here) are the string concatenation of the two categorical variables, delimited by "."; this can easily be overridden (manually).
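As a minimal sketch of overriding the names, one option is to build the grouping factor with interaction(), whose sep argument sets the delimiter; another is to rename the list after the fact. The underscore delimiter, the fixed seed, and the names g and L2 are illustrative assumptions, not part of the original answer.

```r
# Rebuild the example data from the question (seed added only for reproducibility).
set.seed(1)
n <- 3
df <- data.frame(
  y = rnorm(12),
  x = 1:12,
  strata = rep(1:4, each = n),
  category = rep(c("A", "B", "C"), times = 4)
)

# Option 1: interaction() combines both grouping variables into one factor;
# sep controls the delimiter used in the resulting level names.
g <- interaction(df$strata, df$category, sep = "_")
L2 <- split(df, g)
names(L2)

# Option 2: split as before, then rewrite the names by hand.
L <- split(df, df[, c("strata", "category")])
names(L) <- sub(".", "_", names(L), fixed = TRUE)
```

Both routes give the same element order, so the two lists end up with identical names.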
From here, to run a regression on every element, you'd likely do something like:
models <- lapply(L, function(x) lm(..., data=x))
(or whichever regression tool you are planning to use).
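For the question's regression of y on x, the call above might be filled in as follows. This is a sketch under the assumption that the formula is y ~ x; the fixed seed and the names d and coefs are illustrative.

```r
# Rebuild the example data from the question (seed added only for reproducibility).
set.seed(1)
df <- data.frame(
  y = rnorm(12),
  x = 1:12,
  strata = rep(1:4, each = 3),
  category = rep(c("A", "B", "C"), times = 4)
)

# drop = TRUE omits strata/category combinations that have no rows,
# so lm() is never called on an empty data frame.
L <- split(df, df[, c("strata", "category")], drop = TRUE)

# Fit one model per group.
models <- lapply(L, function(d) lm(y ~ x, data = d))

# Collect the coefficients into a matrix with one row per group.
coefs <- t(sapply(models, coef))
head(coefs)
```

With only one row per group, as in this toy data, the slope on x cannot be estimated and is reported as NA. In data where some strata lack some categories, drop = TRUE matters more, since lm() fails on a zero-row data frame.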
You can do this in one step if you'd like,
results <- by(df, df[,c("strata","category")], function(x) lm(..., data=x))
The benefit is that it does everything in one step. The value returned by by can look a bit odd, but it is really just a list with a special print.by method applied; you can still index it like a list as needed.
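To illustrate indexing the by result, here is a small sketch that again assumes the formula y ~ x; the fixed seed and the names d and m are illustrative.

```r
# Rebuild the example data from the question (seed added only for reproducibility).
set.seed(1)
df <- data.frame(
  y = rnorm(12),
  x = 1:12,
  strata = rep(1:4, each = 3),
  category = rep(c("A", "B", "C"), times = 4)
)

results <- by(df, df[, c("strata", "category")],
              function(d) lm(y ~ x, data = d))

# Underneath the "by" class, the result is a 4 x 3 array of mode list,
# with one dimension per grouping variable.
dim(results)

# Pull out a single model by its strata and category labels...
m <- results[["1", "A"]]
coef(m)

# ...or by position, as with any list.
class(results[[1]])
```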
Another way to do this, using dplyr:
library(dplyr)
results <- df %>%
group_by(strata, category) %>%
summarize(model = list(lm(y ~ x)))
results
# # A tibble: 12 x 3
# # Groups: strata [4]
# strata category model
# <int> <chr> <list>
# 1 1 A <lm>
# 2 1 B <lm>
# 3 1 C <lm>
# 4 2 A <lm>
# 5 2 B <lm>
# 6 2 C <lm>
# 7 3 A <lm>
# 8 3 B <lm>
# 9 3 C <lm>
# 10 4 A <lm>
# 11 4 B <lm>
# 12 4 C <lm>
results$model[[1]]
# Call:
# lm(formula = y ~ x)
# Coefficients:
# (Intercept) x
# -1.121 NA
As pointed out by Onyambu (thank you!), this works without data= because the formula lists its variables explicitly, and they will be found within each group. If your regression formula uses . instead, for example, you may want to formalize it a little with:
results <- df %>%
group_by(strata, category) %>%
summarize(model = list(lm(y ~ ., data = cur_data())))
y ~ x will work without it, but y ~ . will not, hence data = cur_data().
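In dplyr 1.1.0 and later, cur_data() is deprecated in favor of pick(). Assuming such a version is installed, the same idea might be written as below; the fixed seed and the .groups = "drop" argument are additions for this sketch, not part of the original answer.

```r
library(dplyr)

# Rebuild the example data from the question (seed added only for reproducibility).
set.seed(1)
df <- data.frame(
  y = rnorm(12),
  x = 1:12,
  strata = rep(1:4, each = 3),
  category = rep(c("A", "B", "C"), times = 4)
)

results <- df %>%
  group_by(strata, category) %>%
  # pick(everything()) returns the current group's non-grouping columns
  # (here y and x), so the "." in the formula expands to x only.
  summarize(model = list(lm(y ~ ., data = pick(everything()))),
            .groups = "drop")

coef(results$model[[1]])
```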