lapply

lapply - 根据当前变量创建新变量，有条件地基于第二个数据框中的信息(lapply - Create new variables, based on current variables, conditionally based on info in 2nd data frame)

我一直在做很多新项目，我正在做很多不熟悉的数据准备和管理。

我有两个数据框：1）非常大，有数千个观察和变量（df1），2）数据框列出了df1（df2）中变量子集的采集年份范围。我需要在df1中为df1中的大量变量/列创建一个新变量。为df1创建的新变量将检查是否存在值（1），收集的年份不存在值（0），或者不存在值且年份超出列出的收集范围在df2（'NA'）。

我花了几天时间阅读大量的lapply()但我似乎无法找到满足我的需求或处理复杂类型的那些，这样我就不必用暴力来做这件事了。

这是我可行的起始数据框：

grp <- c('a', 'a', 'a', 'b', 'b') year <- c(1991, 1992, 1993, 2005, 2010) v1 <- c(20.5, 30.5, 29.6, 28.7, 26.1) v2 <- c(100.0, 101.5, 105.1, 'NA', 95.0) v3 <- c(47.2, 'NA', 'NA', 'NA', 'NA') df1 <- data.frame(grp = grp, year = year, v1 = v1, v2 = v2, v3 = v3) df1 grp year v1 v2 v3 a 1991 20.5 100 47.2 a 1992 30.5 101.5 NA a 1993 29.6 105.1 NA b 2005 28.7 NA NA b 2010 26.1 95 NA

这是我的参考数据框，其中包含df1中的变量：

vars <- c('v1', 'v2', 'v3') start <- c(1989, 2004, 1980) end <- c(2015, 2011, 1994) df2 <- data.frame(vars = vars, start = start, end = end) df2 vars start end v1 1989 2015 v2 2004 2011 v3 1980 1994

我一直在用'lapply（）'学习简单的东西，比如：

test <- df1[paste0(vars, '.cov')] <- lapply(df1[vars], function(x) as.integer(x > 0))

我在R中写道，我认为是需要满足的条件类型。我将用书面英语叙述：

收集的一年中存在的值（1）

if (!is.na(x)) { x <- 1 }

在df2（0）中列出的范围内的一年中不存在值

if (is.na(x) & year %in% seq(df2$start[df2$vars == names(df1[x]), ], df2$end[df2$vars == names(df1[x]), ], 1)) { x <- 0 }

值不存在且年份超出df2（'NA'）中列出的收集范围

if (is.na(x) & !(year %in% seq(df2$start[df2$vars == names(df1[x]), ], df2$end[df2$vars == names(df1[x]), ], 1))) { x <- 'NA' }

我在语法和索引方面做得最好，但我们很快就走出了自己的舒适区。

运行条件检查后，所需的输出/修改后的df1应如下所示：

grp year v1 v2 v3 v1.cov v2.cov v3.cov a 1991 20.5 100 47.2 1 1 1 a 1992 30.5 101.5 NA 1 1 0 a 1993 29.6 105.1 NA 1 1 0 b 2005 28.7 NA NA 1 0 NA b 2010 26.1 95 NA 1 1 NA

我对各种解决方案持开放态度，但这似乎是可能的路径。再次感谢所有的帮助。我是一位经验丰富的R建模师/科学家，但在过去的一个月里，我已经学到了很多数据准备，'data.table'和'dplyr'，并提供了所有帮助。

I've been working on a lot of new projects where I'm doing lots of unfamiliar data prep and management.

I have two data frames: 1) that is very large with thousands of observations and variables (df1), and 2) a data frame that lists ranges of collection years for a subset of variables in df1 (df2). I need to create a new variable in df1 for a large subset of variables/columns in df1. The new variables created for df1 will check if a value is present (1), a value isn't present for a year that was collected (0), or a value isn't present and the year falls outside of the collection range listed in df2 ('NA').

I've spent a couple days reading a ton of lapply() but I can't seem to find one that addresses my needs or deals with type of complexity such that I don't have to do this with brute force.

Here is my workable starting data frame:

grp <- c('a', 'a', 'a', 'b', 'b') year <- c(1991, 1992, 1993, 2005, 2010) v1 <- c(20.5, 30.5, 29.6, 28.7, 26.1) v2 <- c(100.0, 101.5, 105.1, 'NA', 95.0) v3 <- c(47.2, 'NA', 'NA', 'NA', 'NA') df1 <- data.frame(grp = grp, year = year, v1 = v1, v2 = v2, v3 = v3) df1 grp year v1 v2 v3 a 1991 20.5 100 47.2 a 1992 30.5 101.5 NA a 1993 29.6 105.1 NA b 2005 28.7 NA NA b 2010 26.1 95 NA

Here is my reference data frame with coverages for variables in df1:

vars <- c('v1', 'v2', 'v3') start <- c(1989, 2004, 1980) end <- c(2015, 2011, 1994) df2 <- data.frame(vars = vars, start = start, end = end) df2 vars start end v1 1989 2015 v2 2004 2011 v3 1980 1994

I've been learning with simple stuff with 'lapply()' like:

test <- df1[paste0(vars, '.cov')] <- lapply(df1[vars], function(x) as.integer(x > 0))

I wrote out in R, what I think are, the types of conditions that need to be met. I'll narrate with written English:

a value is present for a year that was collected (1)

if (!is.na(x)) { x <- 1 }

a value isn't present for a year that falls within the range listed in df2 (0)

if (is.na(x) & year %in% seq(df2$start[df2$vars == names(df1[x]), ], df2$end[df2$vars == names(df1[x]), ], 1)) { x <- 0 }

a value isn't present and the year falls outside of the collection range listed in df2 ('NA')

if (is.na(x) & !(year %in% seq(df2$start[df2$vars == names(df1[x]), ], df2$end[df2$vars == names(df1[x]), ], 1))) { x <- 'NA' }

I did my best with the syntax and indexing, but we're rapidly getting out of my comfort zone.

After running the conditional checks the desired output/modified df1 should look like:

grp year v1 v2 v3 v1.cov v2.cov v3.cov a 1991 20.5 100 47.2 1 1 1 a 1992 30.5 101.5 NA 1 1 0 a 1993 29.6 105.1 NA 1 1 0 b 2005 28.7 NA NA 1 0 NA b 2010 26.1 95 NA 1 1 NA

I'm open to a variety of solutions but this seemed the likely path to move through. Thanks again for all the help. I'm an experienced R modeler/scientist but I've learned so much data prep, 'data.table', and 'dplyr' in the past month with all your help.

最满意答案

使用data.table：

library(data.table) setDT(df1) DT = melt(df1, id = c("grp", "year"), meas = patterns("^v"))[, value := type.convert(as.character(value))] # mark based on whether found or not within collection periods DT[df2, on=.(variable = vars, year >= start, year <= end), found := as.integer(!is.na(value))] # also mark if found outside collection periods DT[!is.na(value) & is.na(found), found := 1L ]

这使

grp year variable value found 1: a 1991 v1 20.5 1 2: a 1992 v1 30.5 1 3: a 1993 v1 29.6 1 4: b 2005 v1 28.7 1 5: b 2010 v1 26.1 1 6: a 1991 v2 100.0 1 7: a 1992 v2 101.5 1 8: a 1993 v2 105.1 1 9: b 2005 v2 NA 0 10: b 2010 v2 95.0 1 11: a 1991 v3 47.2 1 12: a 1992 v3 NA 0 13: a 1993 v3 NA 0 14: b 2005 v3 NA NA 15: b 2010 v3 NA NA

（ type.convert用于覆盖OP对字符串'NA'的缺失数据的编码。）

melt步骤在这里才有意义，因为变量看起来是相同的类型（数字）。如果不是，可以通过循环遍历每个列来完成类似的操作：

setDT(df1) setDT(df2) for (v in unique(df2$vars)){ df1[, (v) := type.convert(as.character(get(v)))] fcol = paste0("found.",v) df1[df2[vars == v], on=.(year >= start, year <= end), (fcol) := as.integer(!is.na(get(v)))] df1[!is.na(get(v)) & is.na(get(fcol)), (fcol) := 1L ] } grp year v1 v2 v3 found.v1 found.v2 found.v3 1: a 1991 20.5 100.0 47.2 1 1 1 2: a 1992 30.5 101.5 NA 1 1 0 3: a 1993 29.6 105.1 NA 1 1 0 4: b 2005 28.7 NA NA 1 0 NA 5: b 2010 26.1 95.0 NA 1 1 NA

With data.table:

library(data.table) setDT(df1) DT = melt(df1, id = c("grp", "year"), meas = patterns("^v"))[, value := type.convert(as.character(value))] # mark based on whether found or not within collection periods DT[df2, on=.(variable = vars, year >= start, year <= end), found := as.integer(!is.na(value))] # also mark if found outside collection periods DT[!is.na(value) & is.na(found), found := 1L ]

which gives

grp year variable value found 1: a 1991 v1 20.5 1 2: a 1992 v1 30.5 1 3: a 1993 v1 29.6 1 4: b 2005 v1 28.7 1 5: b 2010 v1 26.1 1 6: a 1991 v2 100.0 1 7: a 1992 v2 101.5 1 8: a 1993 v2 105.1 1 9: b 2005 v2 NA 0 10: b 2010 v2 95.0 1 11: a 1991 v3 47.2 1 12: a 1992 v3 NA 0 13: a 1993 v3 NA 0 14: b 2005 v3 NA NA 15: b 2010 v3 NA NA

(type.convert is used to override OP's encoding of missing data with string 'NA'.)

The melt step only makes sense here because the variables seem to be of the same type (numeric). If they aren't, something similar can be done by looping over each column:

setDT(df1) setDT(df2) for (v in unique(df2$vars)){ df1[, (v) := type.convert(as.character(get(v)))] fcol = paste0("found.",v) df1[df2[vars == v], on=.(year >= start, year <= end), (fcol) := as.integer(!is.na(get(v)))] df1[!is.na(get(v)) & is.na(get(fcol)), (fcol) := 1L ] } grp year v1 v2 v3 found.v1 found.v2 found.v3 1: a 1991 20.5 100.0 47.2 1 1 1 2: a 1992 30.5 101.5 NA 1 1 0 3: a 1993 29.6 105.1 NA 1 1 0 4: b 2005 28.7 NA NA 1 0 NA 5: b 2010 26.1 95.0 NA 1 1 NA

更多推荐

lapply

最满意答案

发布评论取消回复

最近发表

热门文章

标签列表