Problem description
I want to match 2 controls for every case under two conditions:
① the age difference should be within ±2;
② the income difference should be within ±2.
If a case has more than 2 matching controls, I just need to select 2 of them at random.
Here is an example:
dat = structure(list(id = c(1, 2, 3, 4, 111, 222, 333, 444, 555, 666,
777, 888, 999, 1000),
age = c(10, 20, 44, 11, 12, 11, 8, 12, 11, 22, 21, 18, 21, 18),
income = c(35, 72, 11, 35, 37, 36, 33, 70, 34, 74, 70, 44, 76, 70),
group = c("case", "case", "case", "case", "control", "control",
"control", "control", "control", "control", "control",
"control", "control", "control")),
row.names = c(NA, -14L), class = c("tbl_df", "tbl", "data.frame"))
> dat
# A tibble: 14 x 4
id age income group
<dbl> <dbl> <dbl> <chr>
1 1 10 35 case
2 2 20 72 case
3 3 44 11 case
4 4 11 35 case
5 111 12 37 control
6 222 11 36 control
7 333 8 33 control
8 444 12 70 control
9 555 11 34 control
10 666 22 74 control
11 777 21 70 control
12 888 18 44 control
13 999 21 76 control
14 1000 18 70 control
Expected outcome
For id = 1, the matched controls are as below, and I just need to select 2 of them at random.
|id|age|income|group|
|:----|:----|:----|:----|
|111|12|37|control|
|222|11|36|control|
|333|8|33|control|
|555|11|34|control|
For id = 2, the matched controls are as below, and I just need to select 2 of them at random.
|id|age|income|group|
|:----|:----|:----|:----|
|666|22|74|control|
|777|21|70|control|
|1000|18|70|control|
For id = 3, there are no matching controls in dat.
For id = 4, the matched controls are as below, and I just need to select 2 of them at random.
|id|age|income|group|
|:----|:----|:----|:----|
|111|12|37|control|
|222|11|36|control|
|555|11|34|control|
One thing to note here is that the controls for id = 1 and id = 4 overlap. I don't want two cases to share a control. What I need is this: if id = 1 chooses id = 111 and id = 222 as controls, then id = 4 can only choose id = 555 as a control, and if id = 1 chooses id = 111 and id = 333 as controls, then id = 4 can only choose id = 222 and id = 555 as controls.
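The candidate tables above can be reproduced by pairing every case with every control and keeping the pairs that satisfy both conditions. The following is only a sketch in base R: the names `cases`, `controls`, `idx` and `candidates` are made up for illustration, and `dat` is rebuilt here as a plain data frame so the snippet runs on its own.

```r
# Sample data from the question, rebuilt as a plain data.frame
dat <- data.frame(
  id     = c(1, 2, 3, 4, 111, 222, 333, 444, 555, 666, 777, 888, 999, 1000),
  age    = c(10, 20, 44, 11, 12, 11, 8, 12, 11, 22, 21, 18, 21, 18),
  income = c(35, 72, 11, 35, 37, 36, 33, 70, 34, 74, 70, 44, 76, 70),
  group  = rep(c("case", "control"), times = c(4, 10))
)

cases    <- dat[dat$group == "case", ]
controls <- dat[dat$group == "control", ]

# Every (case row, control row) combination
idx <- expand.grid(case = seq_len(nrow(cases)),
                   control = seq_len(nrow(controls)))

# Keep the pairs whose age AND income differ by at most 2
ok <- abs(cases$age[idx$case] - controls$age[idx$control]) <= 2 &
  abs(cases$income[idx$case] - controls$income[idx$control]) <= 2

candidates <- data.frame(case_id    = cases$id[idx$case[ok]],
                         control_id = controls$id[idx$control[ok]])

# Candidate control ids per case; cases without candidates do not appear
split(candidates$control_id, candidates$case_id)
```

For the sample data this lists 111, 222, 333 and 555 for case 1; 666, 777 and 1000 for case 2; and 111, 222 and 555 for case 4. Case 3 has no entry because nothing matches it.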
The final output may look like this (the ids in the control group are randomly selected from the ids that meet the conditions):
|id|age|income|group|
|:----|:----|:----|:----|
|1|10|35|case|
|2|20|72|case|
|3|44|11|case|
|4|11|35|case|
|111|12|37|control|
|222|11|36|control|
|333|8|33|control|
|555|11|34|control|
|777|21|70|control|
|1000|18|70|control|
NOTE
I've looked at some websites, but they don't meet my needs. I don't know how to implement my requirements in R code.
Any help would be highly appreciated!
1.https://stackoverflow/questions/56026700/is-there-any-package-for-case-control-matching-individual-1n-matching-in-r-n
2.Case-control matching in R (or SPSS), based on age, sex and ethnicity?
3.Matching case-controls in R using the ccoptimalmatch package
4.Exact matching in R
Recommended answer
As per the modified requirement, I propose the following for loop:
library(dplyr, warn.conflicts = F)
dat %>%
split(.$group) %>%
list2env(envir = .GlobalEnv)
#> <environment: R_GlobalEnv>
control$FILTER <- FALSE
control
#> # A tibble: 10 x 5
#> id age income group FILTER
#> <dbl> <dbl> <dbl> <chr> <lgl>
#> 1 111 12 37 control FALSE
#> 2 222 11 36 control FALSE
#> 3 333 8 33 control FALSE
#> 4 444 12 70 control FALSE
#> 5 555 11 34 control FALSE
#> 6 666 22 74 control FALSE
#> 7 777 21 70 control FALSE
#> 8 888 18 44 control FALSE
#> 9 999 21 76 control FALSE
#> 10 1000 18 70 control FALSE
set.seed(123)
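# visit the cases in order; a control flagged TRUE is no longer available
for(i in seq_len(nrow(case))){
  # unflagged controls within ±2 of this case on both age and income
  x <- which(between(control$age, case$age[i] -2, case$age[i] +2) &
               between(control$income, case$income[i] -2, case$income[i] + 2) &
               !control$FILTER)
  # flag up to 2 of them at random
  control$FILTER[sample(x, min(2, length(x)))] <- TRUE
}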
control
#> # A tibble: 10 x 5
#> id age income group FILTER
#> <dbl> <dbl> <dbl> <chr> <lgl>
#> 1 111 12 37 control TRUE
#> 2 222 11 36 control TRUE
#> 3 333 8 33 control TRUE
#> 4 444 12 70 control FALSE
#> 5 555 11 34 control TRUE
#> 6 666 22 74 control FALSE
#> 7 777 21 70 control TRUE
#> 8 888 18 44 control FALSE
#> 9 999 21 76 control FALSE
#> 10 1000 18 70 control TRUE
bind_rows(case, control) %>% filter(FILTER | is.na(FILTER)) %>% select(-FILTER)
#> # A tibble: 10 x 4
#> id age income group
#> <dbl> <dbl> <dbl> <chr>
#> 1 1 10 35 case
#> 2 2 20 72 case
#> 3 3 44 11 case
#> 4 4 11 35 case
#> 5 111 12 37 control
#> 6 222 11 36 control
#> 7 333 8 33 control
#> 8 555 11 34 control
#> 9 777 21 70 control
#> 10 1000 18 70 control
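The loop above flags the selected controls but does not record which case each control was matched to. A possible variant, sketched here in base R, wraps the same greedy idea in a function and returns the case/control pairs. `match_controls`, `k`, `tol`, `used` and `pairs` are hypothetical names introduced for this sketch, not part of the original answer.

```r
# Sample data from the question, rebuilt as a plain data.frame
dat <- data.frame(
  id     = c(1, 2, 3, 4, 111, 222, 333, 444, 555, 666, 777, 888, 999, 1000),
  age    = c(10, 20, 44, 11, 12, 11, 8, 12, 11, 22, 21, 18, 21, 18),
  income = c(35, 72, 11, 35, 37, 36, 33, 70, 34, 74, 70, 44, 76, 70),
  group  = rep(c("case", "control"), times = c(4, 10))
)

# Hypothetical helper: greedy matching of up to k controls per case,
# never reusing a control. tol is the allowed absolute difference.
match_controls <- function(dat, k = 2, tol = 2) {
  cases    <- dat[dat$group == "case", ]
  controls <- dat[dat$group == "control", ]
  used  <- rep(FALSE, nrow(controls))  # TRUE once a control is taken
  pairs <- data.frame(case_id = numeric(0), control_id = numeric(0))
  for (i in seq_len(nrow(cases))) {
    # Free controls that satisfy both conditions for this case
    x <- which(abs(controls$age - cases$age[i]) <= tol &
                 abs(controls$income - cases$income[i]) <= tol &
                 !used)
    if (length(x) == 0) next  # no control left for this case
    # Index through sample.int() so a single candidate stays that candidate
    pick <- x[sample.int(length(x), min(k, length(x)))]
    used[pick] <- TRUE
    pairs <- rbind(pairs,
                   data.frame(case_id    = rep(cases$id[i], length(pick)),
                              control_id = controls$id[pick]))
  }
  pairs
}

set.seed(123)
pairs <- match_controls(dat)
pairs
```

Because the cases are processed in order, an earlier case can use up controls that a later case also needs, so id = 4 may end up with only one control depending on what id = 1 draws.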
Checking the result with a different seed
set.seed(234)
# note: control$FILTER still carries the flags set by the previous run
for(i in seq_len(nrow(case))){
  x <- which(between(control$age, case$age[i] -2, case$age[i] +2) &
               between(control$income, case$income[i] -2, case$income[i] + 2) &
               !control$FILTER)
  control$FILTER[sample(x, min(2, length(x)))] <- TRUE
}
control
bind_rows(case, control) %>% filter(FILTER | is.na(FILTER)) %>% select(-FILTER)
#> # A tibble: 10 x 4
#> id age income group
#> <dbl> <dbl> <dbl> <chr>
#> 1 1 10 35 case
#> 2 2 20 72 case
#> 3 3 44 11 case
#> 4 4 11 35 case
#> 5 111 12 37 control
#> 6 222 11 36 control
#> 7 333 8 33 control
#> 8 555 11 34 control
#> 9 777 21 70 control
#> 10 1000 18 70 control
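One detail worth keeping in mind with `sample(x, min(2, length(x)))`: when `x` is a single number of at least 1, base R's `sample()` draws from `1:x` rather than returning `x` itself, so a case with exactly one remaining candidate could flag a different row. The sketch below illustrates this and a common workaround; `safe_sample` is a made-up helper name.

```r
set.seed(1)

x <- 6  # pretend row 6 is the only candidate left

# sample() expands a single number to 1:x, so other positions can come back
draws <- replicate(200, sample(x, 1))
range(draws)

# Hypothetical helper: sample positions of x, then index into x
safe_sample <- function(x, size) x[sample.int(length(x), size)]

# With one candidate, the helper can only return that candidate
safe_draws <- replicate(200, safe_sample(x, 1))
all(safe_draws == 6)
```

For vectors with two or more elements, `sample(x, size)` and `safe_sample(x, size)` both pick from the elements of `x`, so only the single-candidate case behaves differently.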
dat modified before proceeding for id 3
split the data into two groups, case and control, using base R's split
save the two as separate data frames using list2env
using purrr::map_dfr, you can sample 2 rows for each case, once for age and once for income
library(tidyverse)
dat = structure(list(id = c(1, 2, 3, 111, 222, 333, 444, 555, 666, 777, 888, 999, 1000),
age = c(10, 20, 44, 12, 11, 8, 12, 11, 22, 21, 18, 21, 18),
income = c(35, 72, 11, 37, 36, 33, 70, 34, 74, 70, 44, 76, 70),
group = c("case", "case", "case", "control", "control", "control",
"control", "control", "control", "control", "control",
"control", "control")),
row.names = c(NA, -13L), class = c("tbl_df", "tbl", "data.frame"))
dat
#> # A tibble: 13 x 4
#> id age income group
#> <dbl> <dbl> <dbl> <chr>
#> 1 1 10 35 case
#> 2 2 20 72 case
#> 3 3 44 11 case
#> 4 111 12 37 control
#> 5 222 11 36 control
#> 6 333 8 33 control
#> 7 444 12 70 control
#> 8 555 11 34 control
#> 9 666 22 74 control
#> 10 777 21 70 control
#> 11 888 18 44 control
#> 12 999 21 76 control
#> 13 1000 18 70 control
dat %>%
split(.$group) %>%
list2env(envir = .GlobalEnv)
#> <environment: R_GlobalEnv>
set.seed(123)
bind_rows(case, map_dfr(case$age, ~ control %>% filter(between(age, .x -2, .x +2) ) %>%
sample_n(min(n(),2))) %>% sample_n(min(n(),2)),
map_dfr(case$income, ~ control %>% filter(between(income, .x -2, .x +2)) %>%
sample_n(min(n(),2))) %>% sample_n(min(n(),2)))
#> # A tibble: 7 x 4
#> id age income group
#> <dbl> <dbl> <dbl> <chr>
#> 1 1 10 35 case
#> 2 2 20 72 case
#> 3 3 44 11 case
#> 4 222 11 36 control
#> 5 777 21 70 control
#> 6 111 12 37 control
#> 7 333 8 33 control
The code below does the same without saving individual data frames:
dat %>%
split(.$group) %>%
{bind_rows(.$case,
map_dfr(.$case$age, \(.x) .$control %>% filter(between(age, .x -2, .x +2) ) %>%
sample_n(min(n(),2))) %>% sample_n(min(n(),2)),
map_dfr(.$case$income, \(.x) .$control %>% filter(between(income, .x -2, .x +2)) %>%
sample_n(min(n(),2))) %>% sample_n(min(n(),2)))}