Consider

    target <- "vs"
    value <- 1
    library(data.table)
    dt <- as.data.table(head(mtcars))

So I'm trying to pass both a column name and a value as variables into the j expression within the data.table scope, something that would be equivalent to

    dt[, vs == 1]
    # [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE

If only the value is a variable, it works nicely

    dt[, vs == value]
    # [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE

We can also call the column within the data.table scope when its name is a variable

    dt[, target, with = FALSE]
    #    vs
    # 1:  0
    # 2:  0
    # 3:  1
    # 4:  1
    # 5:  0
    # 6:  1

But I can't figure out how to combine the two in a simple way.
Note: I'm well aware that I can simply do:

    dt[[target]] == value
    # [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE

But I need it within the data.table scope so that I can modify other columns by reference, something like

    dt[, NEWCOL := sum(vs == 1), by = am]

So here are my attempts when both the column name and the value are variables

    dt[, target == value, with = FALSE]
    # Null data.table (0 rows and 0 cols)
    dt[, target == value]
    # [1] FALSE
    dt[, (target) == value]
    # [1] FALSE
    dt[, .(target == value)]
    #       V1
    # 1: FALSE
    dt[, eval(target) == value]
    # [1] FALSE
    dt[target %in% value]
    ## Empty data.table (0 rows) of 11 cols: mpg,cyl,disp,hp,drat,wt...

Eventually I came up with

    dt[, .SD[[target]] == value]
    # [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE

but it is very inefficient; here's a simple benchmark

    set.seed(123)
    n <- 1e6
    dt <- data.table(vs = sample(1L:30L, n, replace = TRUE), am = seq_len(n))
    system.time(dt[, NEWCOL := sum(.SD[[target]] == value), by = am])
    #    user  system elapsed
    #   13.00    0.02   13.12
    system.time(dt[, NEWCOL2 := sum(vs == value), by = am])
    #    user  system elapsed
    #    0.82    0.00    0.83

Question: Is there a better way of doing this that I'm missing here? Something either more idiomatic or much more efficient.
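For what it's worth, the failed attempts above appear to share one cause: `target` holds a character string, so the comparison is made against the string itself rather than the column it names. A minimal base R sketch of that, where the `cols` list is only an illustrative stand-in for the data.table:

```r
target <- "vs"
value <- 1

# `target` is the string "vs", so this compares "vs" with 1
# (the number is coerced to "1") and yields a single FALSE.
target == value

# eval() of a character string just returns the string unchanged.
eval(target) == value

# Turning the string into a symbol makes it refer to a column.
# `cols` is a hypothetical stand-in for the columns of dt.
cols <- list(vs = c(0, 0, 1, 1, 0, 1))
eval(as.name(target), cols) == value
```

Only the last form looks the name up among the columns, which is the idea the symbol-based approaches rely on.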
Edit
Originally I was looking for something idiomatic, so I thought @GGrothendieck's simple solution using get was the one, but surprisingly all of @Richard's versions beat even the version that isn't doing any evaluation of the column name

    set.seed(123)
    n <- 1e7
    dt <- data.table(vs = sample(1L:30L, n, replace = TRUE), am = seq_len(n))
    cl <- substitute(
      x == y,
      list(x = as.name(target), y = value)
    )
    cl2 <- call("==", as.name(target), value)

    system.time(dt[, NEWCOL := sum(vs == value), by = am])
    #    user  system elapsed
    #    0.83    0.00    0.82
    system.time(dt[, NEWCOL1 := sum(.SD[[target]] == value), by = am])
    #    user  system elapsed
    #    8.97    0.00    8.97
    system.time(dt[, NEWCOL2 := sum(get(target) == value), by = am])
    #    user  system elapsed
    #    2.35    0.00    2.37
    system.time(dt[, NEWCOL3 := sum(eval(cl)), by = am])
    #    user  system elapsed
    #    0.69    0.02    0.71
    system.time(dt[, NEWCOL4 := sum(eval(cl2)), by = am])
    #    user  system elapsed
    #    0.76    0.00    0.77
    system.time(dt[, NEWCOL5 := sum(eval(as.name(target)) == value), by = am])
    #    user  system elapsed
    #    0.78    0.00    0.78

Recommended answer
Here is one possible alternative.

    target <- "vs"
    value <- 1
    dt <- as.data.table(head(mtcars))

In terms of code it's not necessarily simpler, but we can set up an unevaluated call cl, defined outside the scope of dt, which is then evaluated inside the data table's environment.

    cl <- substitute(
      x == y,
      list(x = as.name(target), y = value)
    )

substitute() might be necessary for longer expressions. But in this case, call() shortens the code and creates the same cl. So cl could also be

    cl <- call("==", as.name(target), value)

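As a quick sanity check of the claim that both constructions give the same call, here is a small base R sketch. The name `cl2` follows the question's benchmark, and the `cols` list is only an assumed stand-in for the columns of dt:

```r
target <- "vs"
value <- 1

cl  <- substitute(x == y, list(x = as.name(target), y = value))
cl2 <- call("==", as.name(target), value)

# Both are unevaluated calls; print their text form to compare them.
deparse(cl)
deparse(cl2)

# Evaluating either call against a list of columns
# (a hypothetical stand-in for dt) gives the same logical vector.
cols <- list(vs = c(0, 0, 1, 1, 0, 1))
res1 <- eval(cl, cols)
res2 <- eval(cl2, cols)
```

Nothing is computed when cl is built; the column lookup happens only when eval() is given somewhere to find `vs`.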
Now we can evaluate cl inside dt. On your example this seems to work fine.

    dt[, NEWCOL := sum(eval(cl)), by = am][]
    #     mpg cyl disp  hp drat    wt  qsec vs am gear carb NEWCOL
    # 1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4      1
    # 2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4      1
    # 3: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1      1
    # 4: 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1      2
    # 5: 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2      2
    # 6: 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1      2

After thinking about this for a minute, I'm not sure value needed to be substituted, and hence the following also works. But as David notes, the first approach is more time efficient.

    dt[, eval(as.name(target)) == value]
    # [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE
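
To see that the variants discussed here agree with each other, the following sketch runs each of them on the small example and compares the resulting columns. The column names A to E are made up for illustration, and it assumes the data.table package is installed; it checks results only and says nothing about timing.

```r
library(data.table)

target <- "vs"
value <- 1
dt <- as.data.table(head(mtcars))

# Unevaluated call built outside dt, as in the answer above.
cl <- call("==", as.name(target), value)

# A..E are hypothetical column names, one per variant.
dt[, A := sum(vs == value), by = am]                     # hard-coded column
dt[, B := sum(.SD[[target]] == value), by = am]          # .SD lookup
dt[, C := sum(get(target) == value), by = am]            # get()
dt[, D := sum(eval(cl)), by = am]                        # pre-built call
dt[, E := sum(eval(as.name(target)) == value), by = am]  # symbol only

# TRUE if every variant produced the same counts per group.
same <- all(sapply(c("B", "C", "D", "E"),
                   function(nm) identical(dt[[nm]], dt[["A"]])))
same
```

For real data, the benchmark in the question suggests preferring the eval() variants over .SD[[target]] and get() when the grouping is fine-grained.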