Efficiently working with sets in R

Background:

I am dealing with a combinatorial problem in R. For a given list of sets I need to generate all pairs per set without producing duplicates.

Example:

    initial_list_of_sets <- list()
    initial_list_of_sets[[1]] <- c(1,2,3)
    initial_list_of_sets[[2]] <- c(2,3,4)
    initial_list_of_sets[[3]] <- c(3,2)
    initial_list_of_sets[[4]] <- c(5,6,7)

    get_pairs(initial_list_of_sets)
    # should return (1 2),(1 3),(2 3),(2 4),(3 4),(5 6),(5 7),(6 7)

Please note that (3 2) is not included in the results, as it is mathematically equal to (2 3).

My (working but inefficient) approach so far:

    # checks if sets contain a_set
    contains <- function(sets, a_set){
      for (existing in sets) {
        if (setequal(existing, a_set)) {
          return(TRUE)
        }
      }
      return(FALSE)
    }

    get_pairs <- function(from_sets){
      all_pairs <- list()
      for (a_set in from_sets) {
        # generate all pairs for current set
        pairs <- combn(x = a_set, m = 2, simplify = FALSE)
        for (pair in pairs) {
          # only add new pairs if they are not yet included in all_pairs
          if (!contains(all_pairs, pair)) {
            all_pairs <- c(all_pairs, list(pair))
          }
        }
      }
      return(all_pairs)
    }

My question:

As I am dealing with mathematical sets, I can't use the %in% operator instead of my contains function, because then (2 3) and (3 2) would be treated as different pairs. However, it seems very inefficient to iterate over all existing sets in contains on every check. Is there a better way to implement this function?
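To make the order problem concrete, here is a minimal sketch using only base R. The variable names `a` and `b` are illustrative and do not come from the code above. It shows that an exact comparison is order-sensitive, that `setequal` is not, and that sorting both vectors first brings them into one canonical form.

```r
a <- c(2, 3)
b <- c(3, 2)

identical(a, b)              # FALSE: same elements, different order
setequal(a, b)               # TRUE: order is ignored
identical(sort(a), sort(b))  # TRUE: sorting gives a canonical order
```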

Best answer

Perhaps you can rewrite your get_pairs function as something like the following:

    myFun <- function(inlist) {
      unique(do.call(rbind, lapply(inlist, function(x) t(combn(sort(x), 2)))))
    }
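As a rough walkthrough of what that one-liner does, the sketch below splits it into steps and runs it on the example from the question. The intermediate names `pair_rows`, `stacked` and `result` are made up for illustration.

```r
initial_list_of_sets <- list(c(1, 2, 3), c(2, 3, 4), c(3, 2), c(5, 6, 7))

# Step 1: sort each set, then lay out its pairs as rows of a two-column matrix.
# Sorting first means c(3, 2) yields the row (2, 3), not (3, 2).
pair_rows <- lapply(initial_list_of_sets, function(x) t(combn(sort(x), 2)))

# Step 2: stack the per-set matrices into one matrix (one pair per row).
stacked <- do.call(rbind, pair_rows)

# Step 3: unique() on a matrix drops duplicated rows.
result <- unique(stacked)

nrow(stacked)  # 10 rows, with (2, 3) appearing three times
nrow(result)   # 8 distinct pairs
```

One caveat: `combn` interprets a single positive integer `n` as `seq_len(n)`, so sets with only one element would need to be filtered out before this is applied.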

Here's a quick time comparison.

    n <- 100
    set.seed(1)
    x <- sample(2:8, n, TRUE)
    initial_list_of_sets <- lapply(x, function(y) sample(100, y))

    system.time(get_pairs(initial_list_of_sets))
    #    user  system elapsed
    #   1.964   0.000   1.959

    system.time(myFun(initial_list_of_sets))
    #    user  system elapsed
    #   0.012   0.000   0.014

If needed, you can split the matrix by rows to get your list.

For example:

    myFun <- function(inlist) {
      temp <- unique(do.call(rbind, lapply(inlist, function(x) t(combn(sort(x), 2)))))
      # seq_len() is safe even if temp has zero rows, unlike 1:nrow(temp)
      lapply(seq_len(nrow(temp)), function(x) temp[x, ])
    }
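If a list is wanted from the start, one possible variant skips the matrix and de-duplicates on a string key built from each sorted pair. This is only a sketch, not part of the answer above: the function name `get_pairs_keyed` is hypothetical, and it assumes the elements convert to text unambiguously (for instance, that no element itself contains the `-` separator).

```r
get_pairs_keyed <- function(from_sets) {
  # All pairs of every set as one flat list; each set is sorted first,
  # so (3, 2) is emitted as (2, 3).
  pairs <- unlist(
    lapply(from_sets, function(s) combn(sort(s), 2, simplify = FALSE)),
    recursive = FALSE
  )
  # One character key per pair, e.g. "2-3" (assumption: keys are unambiguous).
  keys <- vapply(pairs, paste, character(1), collapse = "-")
  # Keep only the first occurrence of each key.
  pairs[!duplicated(keys)]
}

sets <- list(c(1, 2, 3), c(2, 3, 4), c(3, 2), c(5, 6, 7))
res <- get_pairs_keyed(sets)
length(res)  # 8
```

Here `duplicated` works on a plain character vector, so the pairwise `setequal` scan from the question is avoided.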
