确定元素是否存在于矢量中的最有效方法(Most efficient way to determine if element exists in a vector)

我有几种算法取决于确定元素是否存在于矢量中的效率。在我看来， %in% （相当于is.element() ）应该是最有效的，因为它只是返回一个布尔值。在测试了几种方法之后，令我惊讶的是，这些方法迄今为止效率最低。以下是我的分析（随着载体大小的增加，结果变得更糟）：

EfficiencyTest <- function(n, Lim) { samp1 <- sample(Lim, n) set1 <- sample(Lim, Lim) print(system.time(for(i in 1:n) {which(set1==samp1[i])})) print(system.time(for(i in 1:n) {samp1[i] %in% set1})) print(system.time(for(i in 1:n) {is.element(samp1[i], set1)})) print(system.time(for(i in 1:n) {match(samp1[i], set1)})) a <- system.time(set1 <- sort(set1)) b <- system.time(for (i in 1:n) {BinVecCheck(samp1[i], set1)}) print(a+b) } > EfficiencyTest(10^3, 10^5) user system elapsed 0.29 0.11 0.40 user system elapsed 19.79 0.39 20.21 user system elapsed 19.89 0.53 20.44 user system elapsed 20.04 0.28 20.33 user system elapsed 0.02 0.00 0.03

其中BinVecCheck是我写的返回TRUE / FALSE的二分搜索算法。请注意，我包含了使用最终方法对矢量进行排序所需的时间。这是二进制搜索的代码：

BinVecCheck <- function(tar, vec) { if (tar==vec[1] || tar==vec[length(vec)]) {return(TRUE)} size <- length(vec) size2 <- trunc(size/2) dist <- (tar - vec[size2]) if (dist > 0) { lower <- size2 - 1L upper <- size } else { lower <- 1L upper <- size2 + 1L } while (size2 > 1 && !(dist==0)) { size2 <- trunc((upper-lower)/2) temp <- lower+size2 dist <- (tar - vec[temp]) if (dist > 0) { lower <- temp-1L } else { upper <- temp+1L } } if (dist==0) {return(TRUE)} else {return(FALSE)} }

平台信息：

> sessionInfo() R version 3.2.1 (2015-06-18) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1

题

是否有更有效的方法来确定元素是否存在于R中的向量中？例如，Python 集合函数是否有一个等效的R函数，可以大大提高这种方法？此外，为什么%in%等等，即使与提供更多信息的which函数（它不仅确定存在性，还提供所有真实账户的指数）相比，效率如此低下？

I have several algorithms that depend on the efficiency of determining whether an element exists in a vector or not. It seems to me that %in% (which is equivalent to is.element()) should be the most efficient as it simply returns a Boolean value. After testing several methods, to my surprise, those methods are by far the most inefficient. Below is my analysis (the results get worse as the size of the vectors increase):

EfficiencyTest <- function(n, Lim) { samp1 <- sample(Lim, n) set1 <- sample(Lim, Lim) print(system.time(for(i in 1:n) {which(set1==samp1[i])})) print(system.time(for(i in 1:n) {samp1[i] %in% set1})) print(system.time(for(i in 1:n) {is.element(samp1[i], set1)})) print(system.time(for(i in 1:n) {match(samp1[i], set1)})) a <- system.time(set1 <- sort(set1)) b <- system.time(for (i in 1:n) {BinVecCheck(samp1[i], set1)}) print(a+b) } > EfficiencyTest(10^3, 10^5) user system elapsed 0.29 0.11 0.40 user system elapsed 19.79 0.39 20.21 user system elapsed 19.89 0.53 20.44 user system elapsed 20.04 0.28 20.33 user system elapsed 0.02 0.00 0.03

Where BinVecCheck is a binary search algorithm that I wrote that returns TRUE/FALSE. Note that I include the time it takes to sort the vector with the final method. Here is the code for the binary search:

BinVecCheck <- function(tar, vec) { if (tar==vec[1] || tar==vec[length(vec)]) {return(TRUE)} size <- length(vec) size2 <- trunc(size/2) dist <- (tar - vec[size2]) if (dist > 0) { lower <- size2 - 1L upper <- size } else { lower <- 1L upper <- size2 + 1L } while (size2 > 1 && !(dist==0)) { size2 <- trunc((upper-lower)/2) temp <- lower+size2 dist <- (tar - vec[temp]) if (dist > 0) { lower <- temp-1L } else { upper <- temp+1L } } if (dist==0) {return(TRUE)} else {return(FALSE)} }

Platform Info:

> sessionInfo() R version 3.2.1 (2015-06-18) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1

Question

Is there a more efficient way of determining whether an element exists in a vector in R? For example, is there an equivalent R function to the Python set function, that greatly improves on this approach? Also, why is %in%, and the like, so inefficient even when compared to the which function which gives more information (not only does it determine existence, but it also gives the indices of all true accounts)?

最满意答案

我的测试并未显示出你的所有主张，但似乎（？）是由于跨平台差异（这使问题更加神秘，可能值得在r-devel@r-project.org上r-devel@r-project.org ，虽然可能不是，因为fastmatch解决方案无论如何支配......）

n <- 10^3; Lim <- 10^5 set.seed(101) samp1 <- sample(Lim,n) set1 <- sample(Lim,Lim) library("rbenchmark") library("fastmatch") `%fin%` <- function(x, table) { stopifnot(require(fastmatch)) fmatch(x, table, nomatch = 0L) > 0L } benchmark(which=sapply(samp1,function(x) which(set1==x)), infun=sapply(samp1,function(x) x %in% set1), fin= sapply(samp1,function(x) x %fin% set1), brc= sapply(samp1,BinVecCheck,vec=sort(set1)), replications=20, columns = c("test", "replications", "elapsed", "relative")) ## test replications elapsed relative ## 4 brc 20 0.871 2.329 ## 3 fin 20 0.374 1.000 ## 2 infun 20 6.480 17.326 ## 1 which 20 10.634 28.433

这表示%in%速度大约是which两倍 - 您的BinVecCheck函数的性能提高了7倍，但这里的fastmatch解决方案获得了另一个因子2.我不知道专用的Rcpp实现是否可以做得更好。事实上，即使运行你的代码，我也会得到不同的答案：

## user system elapsed (which) ## 0.488 0.096 0.586 ## user system elapsed (%in%) ## 0.184 0.132 0.315 ## user system elapsed (is.element) ## 0.188 0.124 0.313 ## user system elapsed (match) ## 0.148 0.164 0.312 ## user system elapsed (BinVecCheck) ## 0.048 0.008 0.055

更新：关于r-devel Peter Dalgaard通过指向R NEWS条目解释了平台差异（这是R版本差异，而不是OS差异）：

由于Haverty的PR＃16491， x match(x, table)的速度更快，有时是一个数量级，当x的长度为1且不可match(x, table)的数量不变时。

sessionInfo() ## R Under development (unstable) (2015-10-23 r69563) ## Platform: i686-pc-linux-gnu (32-bit) ## Running under: Ubuntu precise (12.04.5 LTS)

My tests aren't bearing out all of your claims, but that seems (?) to be due to cross-platform differences (which makes the question even more mysterious, and possibly worth taking up on r-devel@r-project.org, although maybe not since the fastmatch solution below dominates anyway ...)

n <- 10^3; Lim <- 10^5 set.seed(101) samp1 <- sample(Lim,n) set1 <- sample(Lim,Lim) library("rbenchmark") library("fastmatch") `%fin%` <- function(x, table) { stopifnot(require(fastmatch)) fmatch(x, table, nomatch = 0L) > 0L } benchmark(which=sapply(samp1,function(x) which(set1==x)), infun=sapply(samp1,function(x) x %in% set1), fin= sapply(samp1,function(x) x %fin% set1), brc= sapply(samp1,BinVecCheck,vec=sort(set1)), replications=20, columns = c("test", "replications", "elapsed", "relative")) ## test replications elapsed relative ## 4 brc 20 0.871 2.329 ## 3 fin 20 0.374 1.000 ## 2 infun 20 6.480 17.326 ## 1 which 20 10.634 28.433

This says that %in% is about twice as fast as which -- your BinVecCheck function is 7x better, but the fastmatch solution from here gets another factor of 2. I don't know if a specialized Rcpp implementation could do better or not ... In fact, I get different answers even when running your code:

## user system elapsed (which) ## 0.488 0.096 0.586 ## user system elapsed (%in%) ## 0.184 0.132 0.315 ## user system elapsed (is.element) ## 0.188 0.124 0.313 ## user system elapsed (match) ## 0.148 0.164 0.312 ## user system elapsed (BinVecCheck) ## 0.048 0.008 0.055

update: on r-devel Peter Dalgaard explains the platform discrepancy (which is an R version difference, not an OS difference) by pointing to the R NEWS entry:

match(x, table) is faster, sometimes by an order of magnitude, when x is of length one and incomparables is unchanged, thanks to Haverty's PR#16491.

sessionInfo() ## R Under development (unstable) (2015-10-23 r69563) ## Platform: i686-pc-linux-gnu (32-bit) ## Running under: Ubuntu precise (12.04.5 LTS)

更多推荐