Fastest way to find unique values in an array

I'm trying to find the fastest way to get the unique values in an array while excluding 0 from the result.

Right now I have two solutions:

result1 = setxor(0, dataArray(1:end,1)); % This gives the correct solution
result2 = unique(dataArray(1:end,1)); % This solution is faster but doesn't give the same result as result1

dataArray is equivalent to:

dataArray = [0 0; 0 2; 0 4; 0 6; 1 0; 1 2; 1 4; 1 6; 2 0; 2 2; 2 4; 2 6];
% This is a small array, but in my case there are usually over 10 000 lines.

So in this case, result1 is equal to [1; 2] and result2 is equal to [0; 1; 2]. The unique function is faster, but I don't want 0 to be included. Is there a way to do this with unique so that 0 is not treated as a unique value? Is there another alternative?

EDIT

I wanted to time the various solutions.

clc
dataArray = floor(10*rand(10e3,10));
dataArray(mod(dataArray(:,1),3)==0)=0;

% Initial
tic
for ii = 1:10000
    FCT1 = setxor(0, dataArray(:,1));
end
toc

% My solution
tic
for ii = 1:10000
    FCT2 = unique(dataArray(dataArray(:,1)>0,1));
end
toc

% Pursuit solution
tic
for ii = 1:10000
    FCT3 = unique(dataArray(:, 1));
    FCT3(FCT3==0) = [];
end
toc

% Pursuit solution with chappjc comment
tic
for ii = 1:10000
    FCT32 = unique(dataArray(:, 1));
    FCT32 = FCT32(FCT32~=0);
end
toc

% chappjc solution
tic
for ii = 1:10000
    FCT4 = setdiff(unique(dataArray(:,1)),0);
end
toc

% chappjc 2nd solution
tic
for ii = 1:10000
    FCT5 = find(accumarray(dataArray(:,1)+1,1))-1;
    FCT5 = FCT5(FCT5>0);
end
toc

And the results:

Elapsed time is 5.153571 seconds. % FCT1 Initial
Elapsed time is 3.837637 seconds. % FCT2 My solution
Elapsed time is 3.464652 seconds. % FCT3 Pursuit solution
Elapsed time is 3.414338 seconds. % FCT32 Pursuit solution with chappjc comment
Elapsed time is 4.097164 seconds. % FCT4 chappjc solution
Elapsed time is 0.936623 seconds. % FCT5 chappjc 2nd solution

However, the solutions with sparse and accumarray only work with integers. They won't work with doubles.

Best answer

Here's a wacky suggestion with accumarray, demonstrated using Floris' test data:

a = floor(10*rand(100000, 1));
a(mod(a,3)==0)=0;
result = find(accumarray(nonzeros(a(:,1))+1,1))-1;

Thanks to Luis Mendo for pointing out that with nonzeros, it is not necessary to perform result = result(result>0)!
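To make the one-liner easier to follow, here is a minimal step-by-step sketch on a tiny vector. The sample values and the intermediate variable names (nz, counts) are made up for illustration, and the sketch assumes the data is nonnegative and integer-valued:

```matlab
% Small integer-valued example; the values are made up for illustration.
a = [0; 3; 1; 0; 3; 5; 1];

% nonzeros drops the zeros and returns a column vector.
nz = nonzeros(a);               % [3; 1; 3; 5; 1]

% accumarray needs positive integer subscripts. The +1 shift mirrors the
% one-liner; it is what lets a 0 map to bin 1 when zeros are kept.
counts = accumarray(nz + 1, 1); % counts(k+1) = number of times k occurs

% Bins with a nonzero count correspond to the values that are present.
result = find(counts) - 1;      % [1; 3; 5], already sorted
```

Because find walks the bins in increasing order, the result comes out sorted, just as it would from unique.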

Note that this solution requires integer-valued data (not necessarily an integer data type, but just not with decimal components). Comparing floating point values for equality, as unique would do, is perilous. See here and here.
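As a small illustration of that floating-point caveat, the sketch below uses made-up values where two entries are mathematically equal but differ in their binary representation. The uniquetol call is not part of the answer above; it is shown only as one possible option (available in R2015a and later), and the tolerance 1e-12 is an arbitrary assumption:

```matlab
% Two values that are mathematically equal but differ in floating point.
x = [0.3; 0.1 + 0.2; 0];

u = unique(x);          % exact comparison: 0.3 and 0.1+0.2 stay distinct
u = u(u ~= 0);          % drop the zero with logical indexing

% uniquetol merges values that lie within a tolerance of each other.
% The tolerance 1e-12 is an arbitrary choice for this illustration.
ut = uniquetol(x, 1e-12);
ut = ut(ut ~= 0);

fprintf('unique: %d values, uniquetol: %d values\n', numel(u), numel(ut));
```

The accumarray approach does not apply to data like this at all, since accumarray rejects non-integer subscripts.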

Original suggestion: Combine unique with setdiff:

result = setdiff(unique(a(:,1)),0)

Or remove with logical indexing after unique:

result = unique(a(:,1));
result = result(result>0);

I generally prefer not to assign [] as in (result(result==0)=[];) since it gets very inefficient for large data sets.

Removing zeros after unique should be faster, since the removal then operates on less data (unless every element is unique, or a/dataArray is very short).
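As a quick sanity check, the unique-based variants discussed here can be compared on a tiny example. The vectors a and b below are made up for illustration. The second half shows a caveat with the setxor approach from the question: it removes 0 only when 0 actually occurs in the data.

```matlab
a = [0; 2; 1; 0; 2; 1];      % made-up column that contains zeros
b = [2; 1; 2; 1];            % made-up column without any zero

r1 = setdiff(unique(a), 0);  % set difference after unique
r2 = unique(a);
r2 = r2(r2 ~= 0);            % logical indexing after unique
r3 = unique(a(a ~= 0));      % filter first, then unique
assert(isequal(r1, r2, r3)); % all three give [1; 2]

% setxor keeps values found in exactly one input, so 0 is removed only
% when the data contains it; otherwise 0 is added to the result.
s_with    = setxor(0, a);    % [1; 2]
s_without = setxor(0, b);    % [0; 1; 2]
```

If the data might contain no zeros at all, setdiff or logical indexing is therefore the safer choice.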
