I have an array of values, say v (e.g. v=[1,2,3,4,5,6,7,8,9,10]), and an array of group indices, say g (e.g. g=[0,0,0,0,1,1,1,1,2,2]).
I know, for instance, how to take the first element of each group in a very numpythonic way:

    import numpy as np
    v = np.array([1,2,3,4,74,73,72,71,9,10])
    g = np.array([0,0,0,0,1,1,1,1,2,2])
    mask = np.concatenate(([True], np.diff(g) != 0))
    v[mask]

which returns:

    array([ 1, 74,  9])

Is there any numpythonic way (avoiding explicit loops) to get the maximum of each subset?
Since I received two good answers, one using Python's map and one using a NumPy routine, and I was looking for the best-performing one, here are some timing tests:

    import numpy as np
    import time

    N = 10000000
    v = np.arange(N)
    Nelemes_per_group = 10
    Ngroups = N // Nelemes_per_group   # integer division keeps g an integer array
    s = np.arange(Ngroups)
    g = np.repeat(s, Nelemes_per_group)

    # First method: np.maximum.reduceat with indices from np.unique
    start1 = time.time()
    r = np.maximum.reduceat(v, np.unique(g, return_index=True)[1])
    end1 = time.time()
    print('END first method, T=', (end1-start1), 's')

    # Second method: split into groups, then map np.max over them
    start3 = time.time()
    np.array(list(map(np.max, np.split(v, np.where(np.diff(g) != 0)[0]+1))))
    end3 = time.time()
    print('END second method, (map returns an iterable) T=', (end3-start3), 's')

The result is:

    END first method, T= 1.6057236194610596 s
    END second method, (map returns an iterable) T= 8.346540689468384 s

Interestingly, most of the slowdown of the map method appears to be due to the list() call, which I cannot drop because in Python 3.x map returns an iterator (docs.python/3/library/functions.html#map).
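A note on the list() observation: in Python 3, map is lazy, so none of the np.max calls run until something consumes the iterator. The minimal sketch below uses a hypothetical counting wrapper, counted_max, that is not part of the original question. It suggests that the time attributed to list() is really the time spent in the per-group np.max calls.

```python
import numpy as np

v = np.array([1, 2, 3, 4, 74, 73, 72, 71, 9, 10])
g = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])

calls = []  # one entry is appended per np.max call

def counted_max(a):
    # Hypothetical helper: np.max plus a call counter, for illustration only.
    calls.append(1)
    return np.max(a)

# Split v at every position where the group label changes.
chunks = np.split(v, np.where(np.diff(g) != 0)[0] + 1)

lazy = map(counted_max, chunks)   # builds the iterator, runs nothing yet
calls_before = len(calls)         # no np.max call has happened so far
result = np.array(list(lazy))     # list() drives one np.max call per group
calls_after = len(calls)          # one call per group has now happened
```

So skipping list() does not make the second method faster; it only postpones the work.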
Recommended answer

You can use np.maximum.reduceat:

    >>> _, idx = np.unique(g, return_index=True)
    >>> np.maximum.reduceat(v, idx)
    array([ 4, 74, 10])

More about how the ufunc reduceat method works can be found in the NumPy documentation for numpy.ufunc.reduceat.
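As a rough illustration of what reduceat does here: for increasing start indices idx, np.maximum.reduceat(v, idx) takes the maximum of v[idx[i]:idx[i+1]] for each i, with the last segment running to the end of v. The sketch below assumes, as in the question, that g is sorted so that each group is one contiguous run. The explicit loop is only there to check the result, and the names bounds, fast and slow are illustrative.

```python
import numpy as np

v = np.array([1, 2, 3, 4, 74, 73, 72, 71, 9, 10])
g = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])

# Index of the first element of each group (g is assumed sorted).
_, idx = np.unique(g, return_index=True)     # idx is [0, 4, 8]

# Vectorized: one maximum per segment v[idx[i]:idx[i+1]].
fast = np.maximum.reduceat(v, idx)

# Equivalent explicit version, for comparison only.
bounds = np.append(idx, len(v))              # [0, 4, 8, 10]
slow = np.array([v[a:b].max() for a, b in zip(bounds[:-1], bounds[1:])])

print(fast)   # [ 4 74 10]
print(slow)   # [ 4 74 10]
```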
A comment on performance

np.maximum.reduceat is very fast. Generating the indices idx is what takes most of the time here.
While _, idx = np.unique(g, return_index=True) is an elegant way to get the indices, it is not particularly quick.
The reason is that np.unique needs to sort the array first, which is O(n log n) in complexity. For large arrays, this is much more expensive than using several O(n) operations to generate idx.
Therefore, for large arrays whose group labels are already sorted (equal labels adjacent, as in the question), it is much faster to use the following instead:

    idx = np.concatenate([[0], 1+np.diff(g).nonzero()[0]])
    np.maximum.reduceat(v, idx)
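To make the comparison concrete, here is a small sketch that checks that the two ways of building idx agree and times them with timeit. It assumes g is sorted, so each group is one contiguous run. The array size and the helper names idx_unique and idx_diff are illustrative choices, not part of the original answer. Timings depend on the machine and the NumPy version, so no numbers are quoted.

```python
import timeit
import numpy as np

def idx_unique(g):
    # Sort-based: np.unique returns the first index of each distinct value.
    return np.unique(g, return_index=True)[1]

def idx_diff(g):
    # Linear-time: a group starts at 0 and wherever g changes value.
    return np.concatenate([[0], 1 + np.diff(g).nonzero()[0]])

n_groups, per_group = 100_000, 10
g = np.repeat(np.arange(n_groups), per_group)
# Decreasing values, so the maximum of each group is its first element.
v = np.arange(g.size)[::-1].copy()

same_idx = np.array_equal(idx_unique(g), idx_diff(g))
maxima = np.maximum.reduceat(v, idx_diff(g))

t_unique = timeit.timeit(lambda: idx_unique(g), number=5)
t_diff = timeit.timeit(lambda: idx_diff(g), number=5)
print(f"np.unique: {t_unique:.4f} s, np.diff: {t_diff:.4f} s (5 runs each)")
```

If g is not sorted, neither index array is directly usable with reduceat: the data would first need a stable sort by group (for example order = np.argsort(g, kind='stable'), then working on g[order] and v[order]), which brings back the O(n log n) cost.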