In R, I have two data frames that contain list columns:

    d1 <- data.table(group_id1 = 1:4)
    d1$Cat_grouped <- list(letters[1:2], letters[3:2], letters[3:6], letters[11:12])

and

    d_grouped <- data.table(group_id2 = 1:4)
    d_grouped$Cat_grouped <- list(letters[1:5], letters[6:10], letters[1:2], letters[1])

I would like to merge these two data.tables based on the vectors in d1$Cat_grouped being contained in the vectors in d_grouped$Cat_grouped.
To be more precise, there could be two matching criteria:

a) All elements of each vector of d1$Cat_grouped must be in the matched vector of d_grouped$Cat_grouped.
Resulting in the following match:

    result_a <- data.table(group_id1 = c(1, 2), group_id2 = c(1, 1))

b) At least one of the elements in each vector of d1$Cat_grouped must be in the matched vector of d_grouped$Cat_grouped.
Resulting in the following match:

    result_b <- data.table(group_id1 = c(1, 2, 3, 3), group_id2 = c(1, 1, 1, 2))

How can I implement a) or b)? Preferably in a data.table way.
EDIT1: added the expected results of a) and b)
EDIT2: added more groups to d_grouped, so grouping variables overlap. This breaks some of the proposed solutions
Solution

This answer focuses on part a) of the question.
It follows Harland's approach but tries to make better use of the data.table idiom for performance reasons, as the OP has mentioned that the production data may contain millions of observations.
Sample data

    library(data.table)
    d1 <- data.table(
      group_id1 = 1:4,
      Cat_grouped = list(letters[1:2], letters[3:2], letters[3:6], letters[11:12]))
    d_grouped <- data.table(
      group_id2 = 1:2,
      Cat_grouped = list(letters[1:5], letters[6:10]))

Result a)

    grp_cols <- c("group_id1", "group_id2")
    unique(d1[, .(unlist(Cat_grouped), lengths(Cat_grouped)), by = group_id1][
      d_grouped[, unlist(Cat_grouped), by = group_id2], on = "V1", nomatch = 0L][
        , .(V2, .N), by = grp_cols][V2 == N, ..grp_cols])

       group_id1 group_id2
    1:         1         1
    2:         2         1

Explanation
While expanding the list elements of d1 and d_grouped into long format, the number of elements in each list entry of d1 is determined using the lengths() function. lengths() (note the difference from length()) returns the length of each element of a list and was introduced in R 3.2.0.
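To make the difference between length() and lengths() concrete, here is a small base R sketch on the same list as d1$Cat_grouped. The names cat_grouped, n_vectors, n_each and long are made up for illustration, and the data.frame is only a hand-built stand-in for the long format that the data.table expression produces.

```r
# Same list as d1$Cat_grouped in the sample data
cat_grouped <- list(letters[1:2], letters[3:2], letters[3:6], letters[11:12])

n_vectors <- length(cat_grouped)   # number of list elements: one value
n_each <- lengths(cat_grouped)     # length of every list element: one value per element

# Hand-built long format: one row per (group, element) pair,
# with the original vector length repeated on each row (V2 in the answer)
long <- data.frame(
  group_id1 = rep(seq_along(cat_grouped), times = n_each),
  V1 = unlist(cat_grouped),
  V2 = rep(n_each, times = n_each)
)
print(long)
```

Here length() returns 4, while lengths() returns 2 2 4 2, so the long table has 10 rows.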
After the inner join (note the nomatch = 0L parameter), the number of rows in the result set is counted (using the special symbol .N) for each combination of grp_cols. Only those rows are kept where the count in the result set matches the original length of the list. Finally, the unique combinations of grp_cols are returned.
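The chained expression can be hard to follow, so here is the same idea split into named steps. This is only a sketch for reading along: the column names elem, n_elem and n_matched and the intermediate object names are mine, not part of the original answer, and the sample data is the two-group d_grouped used in this answer.

```r
library(data.table)

d1 <- data.table(
  group_id1 = 1:4,
  Cat_grouped = list(letters[1:2], letters[3:2], letters[3:6], letters[11:12]))
d_grouped <- data.table(
  group_id2 = 1:2,
  Cat_grouped = list(letters[1:5], letters[6:10]))

# Step 1: long format, keeping the original vector length for d1
d1_long <- d1[, .(elem = unlist(Cat_grouped), n_elem = lengths(Cat_grouped)),
              by = group_id1]
dg_long <- d_grouped[, .(elem = unlist(Cat_grouped)), by = group_id2]

# Step 2: inner join on the element; unmatched rows are dropped
joined <- d1_long[dg_long, on = "elem", nomatch = 0L]

# Step 3: count matched elements per pair of groups
counted <- joined[, .(n_elem = n_elem[1L], n_matched = .N),
                  by = .(group_id1, group_id2)]

# Step 4: keep pairs where every element of the d1 vector was matched
result_a <- counted[n_matched == n_elem, .(group_id1, group_id2)]
print(result_a)
```

One caveat worth keeping in mind: comparing counts assumes that the elements within each vector are unique. A repeated element in a d_grouped vector would be counted twice after the join and could produce a false match.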
Result b)

Result b) can be derived from the above solution by omitting the counting:

    unique(d1[, unlist(Cat_grouped), by = group_id1][
      d_grouped[, unlist(Cat_grouped), by = group_id2], on = "V1", nomatch = 0L][
        , c("group_id1", "group_id2")])

       group_id1 group_id2
    1:         1         1
    2:         2         1
    3:         3         1
    4:         3         2
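If you want to sanity-check either result on small data, a brute-force comparison in base R is easy to write. Note that this is not the approach of this answer: it tests every pair of groups with %in%, so it scales with the product of the two table sizes and is only meant as a cross-check. The names d1_cats, dg_cats, pairs, check_a and check_b are hypothetical.

```r
# Same vectors as in the sample data of this answer
d1_cats <- list(letters[1:2], letters[3:2], letters[3:6], letters[11:12])
dg_cats <- list(letters[1:5], letters[6:10])

# Every (group_id1, group_id2) pair; quadratic, so only for small inputs
pairs <- expand.grid(group_id1 = seq_along(d1_cats),
                     group_id2 = seq_along(dg_cats))

# a) all elements of the d1 vector are in the d_grouped vector
is_subset <- mapply(function(i, j) all(d1_cats[[i]] %in% dg_cats[[j]]),
                    pairs$group_id1, pairs$group_id2)

# b) at least one element of the d1 vector is in the d_grouped vector
has_overlap <- mapply(function(i, j) any(d1_cats[[i]] %in% dg_cats[[j]]),
                      pairs$group_id1, pairs$group_id2)

check_a <- pairs[is_subset, ]
check_b <- pairs[has_overlap, ]
check_b <- check_b[order(check_b$group_id1, check_b$group_id2), ]

print(check_a)
print(check_b)
```

On the sample data, check_a and check_b contain the same group pairs as Result a) and Result b) above.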