I have two equal-length 1D numpy arrays, id and data, where id is a sequence of repeating, ordered integers that define sub-windows on data. For example,
id   data
1    2
1    7
1    3
2    8
2    9
2    10
3    1
3    -10
I would like to aggregate data by grouping on id and taking either the max or the min. In SQL, this would be a typical aggregation query like SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id. Is there a way I can avoid Python loops and do this in a vectorized manner, or do I have to drop down to C?
Recommended answer

I've been seeing some very similar questions on Stack Overflow the last few days. The following code is very similar to the implementation of numpy.unique, and because it takes advantage of the underlying numpy machinery, it is most likely going to be faster than anything you can do in a Python loop. It sorts by group and then by value within each group, so the minimum is the first element of each group and the maximum is the last.
```python
import numpy as np

def group_min(groups, data):
    # sort with major key groups, minor key data
    order = np.lexsort((data, groups))
    groups = groups[order]  # this is only needed if groups is unsorted
    data = data[order]
    # construct an index which marks borders between groups:
    # True at the first element of each group
    index = np.empty(len(groups), 'bool')
    index[0] = True
    index[1:] = groups[1:] != groups[:-1]
    return data[index]

# max is very similar
def group_max(groups, data):
    order = np.lexsort((data, groups))
    groups = groups[order]  # this is only needed if groups is unsorted
    data = data[order]
    # True at the last element of each group
    index = np.empty(len(groups), 'bool')
    index[-1] = True
    index[:-1] = groups[1:] != groups[:-1]
    return data[index]
```
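As a possible alternative when id is already sorted, as in the question, the same border-marking idea can be combined with a ufunc's reduceat method, which avoids the sort entirely. This is only a sketch and not part of the original answer: the helper name group_reduce_sorted is made up for illustration, and it assumes non-empty input in which equal ids are contiguous.

```python
import numpy as np

def group_reduce_sorted(ids, data, ufunc):
    """Reduce `data` within each run of equal values in `ids`.

    Hypothetical helper: assumes `ids` is non-empty and already sorted,
    so that equal ids are contiguous.
    """
    ids = np.asarray(ids)
    data = np.asarray(data)
    # True wherever a new group begins
    starts = np.empty(len(ids), dtype=bool)
    starts[0] = True
    starts[1:] = ids[1:] != ids[:-1]
    start_idx = np.flatnonzero(starts)
    # reduceat applies the ufunc over data[start_idx[i]:start_idx[i + 1]],
    # with the last slice running to the end of the array
    return ids[start_idx], ufunc.reduceat(data, start_idx)

# Sample data from the question
ids = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

keys, maxima = group_reduce_sorted(ids, data, np.maximum)
_, minima = group_reduce_sorted(ids, data, np.minimum)
print(keys)    # group ids: 1, 2, 3
print(maxima)  # per-group max: 7, 10, 1
print(minima)  # per-group min: 2, 8, -10
```

If id may be unsorted, the lexsort-based functions above remain the safer choice, since reduceat only sees contiguous slices.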