问题描述
限时送ChatGPT账号..我正在尝试创建一个新列,比如 test,基于 3 列的几个条件.我正在尝试仅使用 tidyverse 来实现这一目标.这是我的条件:
I am trying to create a new column, say test, with several conditions based on 3 columns. I am tryiing to achieve this with tidyverse only. Here are my conditions:
如果我在某一列中有Yes
,不管其他类别 (No/Unknown/NA
) 在同一列中的 id:previous_cabg、previous_pci、previous_ami
然后在测试变量中赋值 Yes
如果我在所有列中都有 No
相同的 id,那么为测试变量分配 NO如果我有一列 NO
和 NA/Unknown
在其他列中具有相同的 id 然后在测试中分配 No
变量如果我在所有列中都有 Yes
相同的 ID,则在测试变量中分配 Yes
如果我有是在一列和
不适用/未知对于每列中的相同ID然后分配
Yes`在测试变量中
if I have Yes
in one column, regardless of other categories (No/Unknown/NA
) in the same id across columns: previous_cabg, previous_pci, previous_ami
then assign Yes
in test variable
if I have No
in all columns for the same id then assign NO for the test variable
if I have NO
for one column and NA/Unknown
in the other columns for the same id then assign with No
in the test variable
if I have Yes
in all column for the same id then assign Yes
in the test variable
if I haveYes in one column and
NA/Unknownfor the same id in each column then assign
Yes`in test variable
这是我拥有的数据集类型:
This is the type of dataset I have:
structure(list(id = c(112139L, 43919L, 92430L, 87137L, 95417L,
66955L, 16293L, 61396L, 25379L, 79229L, 27107L, 63243L, 50627L,
17968L, 83015L, 96549L, 7332L, 4873L, 98131L, 93506L, 52894L,
59327L, 85003L, 96623L, 82999L, 65769L, 67063L, 21744L, 62961L,
2229L, 103673L, 9367L, 60215L, 74044L, 58422L, 57530L, 100399L,
46483L, 108690L, 62017L, 46467L, 79562L, 4800L, 119158L, 103222L,
32908L, 14491L, 30293L, 52558L, 122304L, 42281L, 1553L, 111771L,
23087L, 30147L, 37842L, 51552L, 20148L, 28L, 7477L), previous_cabg = structure(c(1L,
1L, 1L, NA, 1L, NA, NA, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, NA, 1L, 1L, NA, 1L, NA, 1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 3L,
1L, 1L, NA, 1L, 1L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 1L, 1L, 1L), .Label = c("No",
"Unknown", "Yes"), class = "factor"), previous_pci = structure(c(1L,
1L, 2L, NA, 1L, NA, NA, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
2L, NA, 2L, 1L, NA, 2L, NA, 1L, 2L, 1L, 1L, 1L, NA, 2L, 1L, 1L,
2L, 2L, NA, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 2L, 1L, 1L), .Label = c("No",
"Yes", "Unknown"), class = "factor"), previous_ami = structure(c(2L,
2L, 1L, 2L, 2L, NA, 2L, 1L, 2L, 2L, NA, 1L, 2L, 2L, 2L, 2L, 2L,
1L, NA, 1L, 2L, NA, 1L, NA, 2L, 1L, 2L, 2L, 2L, NA, 1L, 1L, 1L,
2L, 1L, NA, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 3L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, NA, 2L, 2L, 2L, 1L, 2L), .Label = c("Yes",
"No", "Unknown"), class = "factor")), row.names = c(NA, -60L), problems = structure(list(
row = c(34136L, 121773L, 121779L), col = c("1.01 Hospital identifier",
"1.01 Hospital identifier", "1.01 Hospital identifier"),
expected = c("value in level set", "value in level set",
"value in level set"), actual = c("CMH", "CMH", "CMH"), file = c("'../../data/changed/minap_2020_2021_second.csv'",
"'../../data/changed/minap_2020_2021_second.csv'", "'../../data/changed/minap_2020_2021_second.csv'"
)), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"
)), class = c("tbl_df", "tbl", "data.frame"))
这是它的外观,但只有前 10 行,如果你仔细看,我在 3 列中有不同的匹配组
And this is how it looks, but only first 10 rows, if you look in detail, I have different groups of matches across the 3 columns
# A tibble: 60 x 4
id previous_cabg previous_pci previous_ami
<int> <fct> <fct> <fct>
1 112139 No No No
2 43919 No No No
3 92430 No Yes Yes
4 87137 NA NA No
5 95417 No No No
6 66955 NA NA NA
7 16293 NA NA No
8 61396 No Yes Yes
9 25379 No Yes No
10 79229 No No No
我希望只能使用 tidyverse 或 tidyverse 和 r base 的混合来解决这个问题.
I am hoping to solve this only with tidyverse or a mix of tidyverse and r base.
这是我尝试过的,但我觉得它不是那么明智.我认为这是不明智的,因为此代码将成为自动化过程的一部分,如果我将获得其他类别,而不是 Yes 和 No
,例如 Unknown
,因为 this then 出现在后面下一个数据集提取,那么我希望代码能够避免我上面给出的条件中的所有其他情况.
This is what I have tried, yet I feel it is not so wise. I believe it is not wise, since this code will be part of automation process and if I will get other categories, than Yes and No
, like Unknown
as thisn appeared later in the next dataset extracts, then I wish the code will avoid all the other cases from the conditions I have given above.
dplyr::mutate(first_attack =
dplyr::case_when(previous_cabg == 'No' | previous_pci == 'No' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'No' | previous_pci == 'Yes' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'No' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'Yes' | previous_ami == 'No' ~ 'Yes',
previous_cabg == 'No' | previous_pci == 'No' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'No' | previous_pci == 'Yes' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'No' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'Yes' | previous_ami == 'No' ~ 'Yes'
# deal with the unknown category
previous_cabg == 'Unknown' | previous_pci == 'Yes' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'Unknown' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'Yes' | previous_ami == 'No' ~ 'Yes',
previous_cabg == 'Unknown' | previous_pci == 'Unknown' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'Unknown' | previous_pci == 'Yes' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'Unknown' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'Yes' | previous_ami == 'Unknown' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'No' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'No' | previous_ami == 'No' ~ 'Yes',
previous_cabg == 'No' | previous_pci == 'No' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'No' | previous_pci == 'Yes' | previous_ami == 'No' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'Unknown' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'Unknown' | previous_ami == 'Unknown' ~ 'Yes',
previous_cabg == 'Unknown' | previous_pci == 'Unknown' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'Unknown' | previous_pci == 'Yes' | previous_ami == 'Unknown' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'Unknown' | previous_ami == 'Unknown' ~ 'Yes',
previous_cabg == 'Unknown' | previous_pci == 'Yes'| previous_ami == 'Unknown' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'No' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'Unknown' | previous_pci == 'Yes'| previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'No' | previous_ami == 'No' ~ 'Yes',
previous_cabg == 'No' | previous_pci == 'Yes'| previous_ami == 'No' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'No' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'No' | previous_pci == 'Yes'| previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'Unknown' | previous_ami == 'Unknown' ~ 'Yes',
previous_cabg == 'Unknown' | previous_pci == 'Yes'| previous_ami == 'Unknown' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'Unknown' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'Unknown' | previous_pci == 'Yes'| previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'No' | previous_pci == 'No' | previous_ami == 'No' ~ 'No',
previous_cabg == 'Yes' | previous_pci == 'Yes' | previous_ami == 'Yes' ~'Yes'
))
推荐答案
这些操作是rowwise()
,所以效率不是很高,但是tidyverse
中的这个解决方案代码>应该干净地实现你想要的.
These operations are rowwise()
, so they're not very efficient, but this solution in the tidyverse
should cleanly achieve what you want.
让我们将您的示例数据集命名为 dataset
.然后是下面的工作流程
Let us call your sample dataset by the name dataset
. Then the following workflow
library(tidyverse)
# ...
# Code to generate your 'dataset'.
# ...
# Define custom logic across a single row.
get_first_attack <- function(values_across_row) {
# "Yes" overrides all other values.
if(isTRUE(any(values_across_row == "Yes"))){
return("Yes")
}
# "No" overrides all missing values: 'NA' and "Unknown".
else if(isTRUE(any(values_across_row == "No"))) {
return("No")
}
# "Unknown" overrides all other missing values: 'NA'.
else if(isTRUE(any(values_across_row == "Unknown"))) {
return("Unknown")
}
# All values are missing: 'NA'.
else {
return(as.character(NA))
}
}
dataset %>%
# Examine row by row.
dplyr::rowwise() %>%
# Compare values across each row according to the logic in 'get_first_attack()'.
dplyr::mutate(first_attack = get_first_attack(across(previous_cabg:previous_ami))) %>%
# Exit row-wise approach, to restore efficiency.
dplyr::ungroup() %>%
# Factor 'first_attack' exactly like its neighboring column.
dplyr::mutate(first_attack = factor(first_attack, levels = levels(previous_ami)))
应该给你这些结果
# A tibble: 60 x 5
id previous_cabg previous_pci previous_ami first_attack
<int> <fct> <fct> <fct> <fct>
1 112139 No No No No
2 43919 No No No No
3 92430 No Yes Yes Yes
4 87137 NA NA No No
5 95417 No No No No
6 66955 NA NA NA NA
7 16293 NA NA No No
8 61396 No Yes Yes Yes
9 25379 No Yes No Yes
10 79229 No No No No
# ... with 50 more rows
其中 first_attack
列被恰当地定义为具有三个级别的 factor
:Yes"
、No";
和 未知"
.
where the first_attack
column is fittingly defined as a factor
with three levels: "Yes"
, "No"
, and "Unknown"
.
这篇关于仅使用 dplyr 在 3 列中应用条件时如何填充 Yes 和 No?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
更多推荐
[db:关键词]
发布评论