Fastest way to count the occurrences of each unique column in a matrix in R

I'm new to R (and to stackoverflow) and I would appreciate your help. I would like to count the number of occurrences of each unique column in a matrix. I have written the following code, but it is extremely slow:

    frequencyofequalcolumnsinmatrix = function(matrixM){
      # returns a matrix columnswithfrequencyofmtxM that contains each distinct column
      # and the frequency of each distinct columns on the last row.
      # Hence if the last row is c(3,5,3,2), then matrixM has 3+5+3+2=13 columns;
      # there are 4 distinct columns; and the first distinct column appears 3 times,
      # the second distinct column appears 5 times, etc.
      n = nrow(matrixM)
      columnswithfrequencyofmtxM = c()
      while (ncol(matrixM)>0){
        indexzero = which(apply(matrixM-matrixM[,1], 2, function(x) identical(as.vector(x),rep(0,n))))
        indexnotzero = setdiff(seq(1:ncol(matrixM)),indexzero)
        # vector of length n. Coordinates 1 to nrow(matrixM) contains the coordinates
        # of the given distinct column while coordinate nrow(matrixM)+1 contains the
        # frequency of appearance of that column
        frequencyofgivencolumn = c(matrixM[,1], length(indexzero))
        columnswithfrequencyofmtxM = cbind(columnswithfrequencyofmtxM,frequencyofgivencolumn, deparse.level=0)
        matrixM=matrixM[,indexnotzero]
        matrixM = as.matrix(matrixM)
      }
      return(columnswithfrequencyofmtxM)
    }

Applying it to the matrix 'testmtx' gives:

    > testmtx = matrix(c(1,2,4,0,1,1,1,2,1,1,2,4,0,1,1,0,1,1), nrow=3, ncol=6)
    > frequencyofequalcolumnsinmatrix(testmtx)
         [,1] [,2] [,3]
    [1,]    1    0    1
    [2,]    2    1    2
    [3,]    4    1    1
    [4,]    2    3    1

where the last row contains the number of occurrences of the column above.

Unhappy with my code, I browsed through stackoverflow. I found the following Question:

Fastest way to count occurrences of each unique element

It shows that the fastest way to count the occurrences of each unique element of a vector is to use the data.table package. Here is the code:

    library(data.table)

    f6 <- function(x){
      data.table(x)[, .N, keyby = x]
    }

When we run it we obtain:

    > vtr = c(1,2,3,1,1,2,4,2,4)
    > f6(vtr)
       x N
    1: 1 3
    2: 2 3
    3: 3 1
    4: 4 2

I have tried to modify this code to use it in my case. That requires creating vtr as a vector in which each element is itself a vector, but I haven't been able to do that (most likely because in R, c(c(1,2),c(3,4)) is the same as c(1,2,3,4)).

Should I try to modify the function f6? If so, how? Or should I take a completely different approach? If so, which one?

Thank you!

Best answer

One simple way would be to paste the entries of each column together into a single string, which gives a character vector with one element per column, and then use the function on that vector.

    mat <- matrix(c(1,2,4,0,1,1,1,2,1,1,2,4,0,1,1,0,1,1), nrow=3, ncol=6)
    vec <- apply(mat, 2, paste, collapse=" ")
    f6(vec)
           x N
    1: 0 1 1 3
    2: 1 2 1 1
    3: 1 2 4 2
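If the result should have the shape requested in the question (each distinct column, with its count in the last row), the pasted keys can be mapped back to the columns they came from. The following is a minimal base-R sketch of that idea; the helper name `count_unique_columns` is made up for illustration and does not appear in any of the answers.

```r
# Hypothetical helper (name not from the original post): counts distinct
# columns by pasting each column into one string key.
count_unique_columns <- function(mat) {
  # One key per column. The space separator keeps columns such as
  # c(1, 12) and c(11, 2) from collapsing to the same key.
  keys <- apply(mat, 2, paste, collapse = " ")
  first <- !duplicated(keys)   # TRUE at the first occurrence of each distinct column
  counts <- table(keys)        # counts named by key
  # Look the counts up in order of first appearance.
  freq <- as.vector(counts[keys[first]])
  # Distinct columns on top, their frequencies as the last row.
  rbind(mat[, first, drop = FALSE], freq, deparse.level = 0)
}

testmtx <- matrix(c(1,2,4,0,1,1,1,2,1,1,2,4,0,1,1,0,1,1), nrow = 3, ncol = 6)
result <- count_unique_columns(testmtx)
print(result)
```

The distinct columns come out in order of first appearance, as in the function from the question. The sketch assumes the entries print exactly with `paste`, which holds for integers and short decimals but may not for arbitrary floating-point values.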

EDIT

The answer by @RohitDas made me think: when it comes to performance, it is always best to measure. I took all the functions shown in the question the OP linked, and added

f7 <- table

I also added the f10 suggestion by @DavidArenburg:

    f10 <- function(x){
      table(unlist(data.table(x)[, lapply(.SD, paste, collapse = "")]))
    }

Here are the results:

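After adding the solution by @MaratTalipov, it is the clear winner: applied directly to the matrix, it is faster than all the vector solutions. In the timings below, f11 is the fastest entry, with a median of about 576 microseconds against about 1889 for the next best, f7.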

    set.seed(1)
    testmx <- matrix(sample(1:10, 3 * 1e3, replace=TRUE), nrow=1000)
    microbenchmark(
      f1(apply(testmx, 2, paste, collapse=" ")),
      f2(apply(testmx, 2, paste, collapse=" ")),
      f3(apply(testmx, 2, paste, collapse=" ")),
      f4(apply(testmx, 2, paste, collapse=" ")),
      f5(apply(testmx, 2, paste, collapse=" ")),
      f6(apply(testmx, 2, paste, collapse=" ")),
      f7(apply(testmx, 2, paste, collapse=" ")),
      f8(apply(testmx, 2, paste, collapse=" ")),
      f9(apply(testmx, 2, paste, collapse=" ")),
      f10(testmx),
      f11(testmx),
      f12(testmx)
    )
    Unit: microseconds
                                            expr      min        lq      mean   median        uq       max neval
     f1(apply(testmx, 2, paste, collapse = " ")) 3311.770 3511.5620 3901.0020 3612.035 3849.3600  9569.987   100
     f2(apply(testmx, 2, paste, collapse = " ")) 3044.997 3263.6515 3667.9232 3430.914 3847.2430  6721.318   100
     f3(apply(testmx, 2, paste, collapse = " ")) 2032.179 2118.0245 2371.8638 2213.301 2430.4155  6631.624   100
     f4(apply(testmx, 2, paste, collapse = " ")) 2119.949 2218.3050 2497.1513 2286.442 2425.0260  6258.987   100
     f5(apply(testmx, 2, paste, collapse = " ")) 2131.498 2221.5775 2459.9300 2309.925 2530.3115  4222.575   100
     f6(apply(testmx, 2, paste, collapse = " ")) 3121.217 3367.7815 3738.3239 3486.155 3835.1175  7979.352   100
     f7(apply(testmx, 2, paste, collapse = " ")) 1766.175 1832.9650 2040.5483 1889.169 2032.1795  3784.110   100
     f8(apply(testmx, 2, paste, collapse = " ")) 2085.303 2169.2240 2435.6932 2237.168 2404.2380  5002.109   100
     f9(apply(testmx, 2, paste, collapse = " ")) 2802.090 2988.0230 3449.0685 3056.930 3373.1710 17640.957   100
                                     f10(testmx) 4027.017 4251.6385 4865.7036 4399.461 4848.7035 11811.581   100
                                     f11(testmx)  500.058  549.1395  624.9526  576.279  636.1395  1176.809   100
                                     f12(testmx) 1827.769 1886.4740 1957.0555 1902.834 1964.4270  3600.487   100
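One way to work on the matrix directly, without building string keys, is to sort the columns lexicographically and count runs of identical neighbours. The sketch below illustrates that idea in base R. It is not the code behind f11 or f12 (those definitions are not reproduced here), the name `count_columns_sorted` is hypothetical, and it assumes a matrix with at least one column and no NA values.

```r
# Hypothetical helper (not one of the benchmarked functions): counts distinct
# columns by sorting them and measuring run lengths.
count_columns_sorted <- function(mat) {
  # Order the columns lexicographically: row 1 is the primary sort key,
  # row 2 breaks ties, and so on.
  ord <- do.call(order, lapply(seq_len(nrow(mat)), function(i) mat[i, ]))
  sorted <- mat[, ord, drop = FALSE]
  k <- ncol(sorted)
  # A run starts at column 1 and wherever a column differs from its left neighbour.
  differs <- colSums(sorted[, -1, drop = FALSE] != sorted[, -k, drop = FALSE]) > 0
  starts <- which(c(TRUE, differs))
  # Run lengths are the gaps between consecutive run starts.
  counts <- diff(c(starts, k + 1))
  rbind(sorted[, starts, drop = FALSE], counts, deparse.level = 0)
}

testmtx <- matrix(c(1,2,4,0,1,1,1,2,1,1,2,4,0,1,1,0,1,1), nrow = 3, ncol = 6)
res <- count_columns_sorted(testmtx)
print(res)
```

Unlike the paste-based approach, the distinct columns come out in sorted order rather than in order of first appearance. Whether this is faster than the functions timed above would have to be measured in the same way.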
