Flag N randomly selected rows by group in data.table

Updated: 2024-10-18 16:47:29

Problem description

In a data.table, I want to add a column C3 that flags N randomly selected rows within each group (C1). Several similar questions have already been asked on SO here, here and here, but based on their answers I still cannot figure out a solution for my task.

    library(data.table)

    set.seed(1)
    dt = data.table(C1 = c("A","A","A","B","C","C","C","D","D","D"),
                    C2 = c(2,1,3,1,2,3,4,5,4,5))
    dt
        C1 C2
     1:  A  2
     2:  A  1
     3:  A  3
     4:  B  1
     5:  C  2
     6:  C  3
     7:  C  4
     8:  D  5
     9:  D  4
    10:  D  5

Here are the row indices of two randomly selected rows per group C1 (this does not work properly for group B):

    dt[, sample(.I, min(.N, 2)), by = C1]$V1
    [1]  1  3  3  7  5 10  9

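NB: for B, only one row should be selected, because group B contains only one row. In the output above, the value returned for B is 3, which is a row of group A, instead of row 4.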
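The trouble with group B most likely comes from a documented quirk of base R's `sample()`: when its first argument is a single number n >= 1, it samples from `1:n` rather than from that one value. The following is a minimal sketch of the pitfall and one common workaround; the helper name `safe_sample` is hypothetical and not part of the original post.

```r
# sample() treats a single positive number n as "sample from 1:n".
# For group B, .I is just 4, so sample(.I, 1) can return any of 1, 2, 3, 4.
set.seed(1)
draws <- replicate(200, sample(4, 1))
table(draws)  # counts are spread over 1..4, not concentrated on 4

# Hypothetical helper: sample positions and index into x, so a length-one x
# is returned as-is instead of being expanded to 1:x.
safe_sample <- function(x, size) x[sample.int(length(x), size)]

only_row <- safe_sample(4, 1)           # always 4
two_rows <- safe_sample(c(5, 6, 7), 2)  # two distinct values from 5, 6, 7
```

Sampling positions with `sample.int(length(x), size)` and then indexing keeps the behaviour the same for groups of any size.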

Here is a solution for one randomly selected row in each group, which often doesn't work for group B:

    dt[, C3 := .I == sample(.I, 1), by = C1]
    dt
        C1 C2    C3
     1:  A  2 FALSE
     2:  A  1  TRUE
     3:  A  3 FALSE
     4:  B  1 FALSE
     5:  C  2  TRUE
     6:  C  3 FALSE
     7:  C  4 FALSE
     8:  D  5  TRUE
     9:  D  4 FALSE
    10:  D  5 FALSE

What I actually want is to extend this to N rows. I've tried (for two rows):

dt[, C3 := .I==sample(.I, min(.N, 2)), by = C1]

That, of course, does not work.
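One way to see why the `==` attempt fails: `==` compares vectors position by position and recycles the shorter one, whereas a membership test is what is needed when more than one row is sampled. A small base R sketch, using made-up stand-in vectors `idx` and `picked` rather than the real `.I` values:

```r
# Stand-ins for one group of three rows and two sampled row numbers.
idx    <- 1:3
picked <- c(3, 1)

# `==` compares position by position and recycles the shorter vector
# (with a warning here, since 3 is not a multiple of 2).
eq_flags <- suppressWarnings(idx == picked)   # FALSE FALSE TRUE

# `%in%` asks, for each element of idx, whether it occurs anywhere in picked.
in_flags <- idx %in% picked                   # TRUE FALSE TRUE
```

With `==`, row 1 is missed even though it was sampled, because it is compared only against the first sampled value.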

Thanks a lot for your help!

Recommended answer

dt[, C3 := 1:.N %in% sample(.N, min(.N, 2)), by = C1]

Or use head, though I think that should be slower:

dt[, C3 := 1:.N %in% head(sample(.N), 2) , by = C1]
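As a quick sanity check of the first variant, the sketch below rebuilds the example table from the question and counts the flagged rows per group. The summary table `check` and its column names are illustrative additions, not part of the original answer.

```r
library(data.table)

set.seed(1)
dt <- data.table(C1 = c("A","A","A","B","C","C","C","D","D","D"),
                 C2 = c(2,1,3,1,2,3,4,5,4,5))

# Flag up to 2 random rows per group. 1:.N and sample(.N, ...) both refer to
# positions within the group, so the single-row group B is handled correctly.
dt[, C3 := 1:.N %in% sample(.N, min(.N, 2)), by = C1]

# Count flagged rows per group: each group should have min(.N, 2) of them.
check <- dt[, .(n_rows = .N, n_flagged = sum(C3)), by = C1]
print(check)
```

Working with within-group positions instead of the global row numbers in `.I` sidesteps the single-value behaviour of `sample()`, since `sample(1, 1)` can only return 1.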

If the number of flagged rows is not constant you can do
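The answer's own code for this case is not reproduced here. As one possible sketch only, and not necessarily what the answerer had in mind, the per-group count could come from a named vector looked up with data.table's `.BY`. The vector `n_by_group` and its values are hypothetical.

```r
library(data.table)

set.seed(1)
dt <- data.table(C1 = c("A","A","A","B","C","C","C","D","D","D"),
                 C2 = c(2,1,3,1,2,3,4,5,4,5))

# Hypothetical per-group targets (an assumption, not from the original post).
n_by_group <- c(A = 1, B = 2, C = 3, D = 2)

# .BY$C1 is the current group's value of C1; cap the target at the group size
# so that a group smaller than its target (B here) flags all of its rows.
dt[, C3 := 1:.N %in% sample(.N, min(.N, n_by_group[[.BY$C1]])), by = C1]

# Flagged rows per group, in order of first appearance: A, B, C, D.
check <- dt[, .(n_flagged = sum(C3)), by = C1]
```

Under these assumed targets, group B ends up with one flagged row because it only has one row, while the other groups get exactly the requested count.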