I have a huge number of entries, each of which is a floating-point number. These data x are accessible through an iterator. I need to classify all the entries using selections like 10<y<=20, 20<y<=50, ..., where y are data from another iterable. The number of entries is much larger than the number of selections. At the end I want a dictionary like:
{ 0: [all events with 10<x<=20], 1: [all events with 20<x<=50], ... }
or something similar. For example, I'm currently doing:
    for x, y in itertools.izip(variable_values, binning_values):
        thebin = binner_function(y)
        self.data[tuple(thebin)].append(x)

Usually y is multidimensional.
This is very slow. Is there a faster solution, for example with numpy? I think the problem comes from the list.append method I'm using and not from binner_function.
Recommended answer
A fast way to get the assignments in numpy is to use np.digitize:
docs.scipy/doc/numpy/reference/generated/numpy.digitize.html
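To make the bin numbering concrete, here is a minimal sketch of np.digitize applied to the selections from the question (10<y<=20, 20<y<=50). The edge values and the sample y array are made up for illustration, and right=True is an assumption chosen so that the upper edge of each interval is inclusive, as in the question:

```python
import numpy as np

# Illustrative bin edges for the selections 10<y<=20 and 20<y<=50.
edges = np.array([10.0, 20.0, 50.0])

# Made-up sample values, including values exactly on the edges.
y = np.array([10.0, 15.0, 20.0, 20.5, 50.0, 60.0])

# With right=True, index i means edges[i-1] < y <= edges[i].
# Index 0 means y <= 10 and index len(edges) means y > 50.
assign = np.digitize(y, edges, right=True)
print(assign)  # [0 1 1 2 2 3]
```

Indices 0 and len(edges) mark values that fall outside every selection, so they can be dropped afterwards if only the listed intervals matter.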
You'd still have to split the resulting assignments up into groups. If x or y is multidimensional, you will have to flatten the arrays first. You could then get the unique bin assignments and iterate over them, using np.where to split the assignments into groups. This will probably be faster if the number of bins is much smaller than the number of elements that need to be binned.
As a somewhat trivial example that you will need to tweak/elaborate on for your particular problem (but which is hopefully enough to get you started with a numpy solution):
    In [1]: import numpy as np

    In [2]: x = np.random.normal(size=(50,))

    In [3]: b = np.linspace(-20,20,50)

    In [4]: assign = np.digitize(x,b)

    In [5]: assign
    Out[5]:
    array([23, 25, 25, 25, 24, 26, 24, 26, 23, 24, 25, 23, 26, 25, 27, 25, 25,
           25, 25, 26, 26, 25, 25, 26, 24, 23, 25, 26, 26, 24, 24, 26, 27, 24,
           25, 24, 23, 23, 26, 25, 24, 25, 25, 27, 26, 25, 27, 26, 26, 24])

    In [6]: uid = np.unique(assign)

    In [7]: adict = {}

    In [8]: for ii in uid:
       ...:     adict[ii] = np.where(assign == ii)[0]
       ...:

    In [9]: adict
    Out[9]:
    {23: array([ 0,  8, 11, 25, 36, 37]),
     24: array([ 4,  6,  9, 24, 29, 30, 33, 35, 40, 49]),
     25: array([ 1,  2,  3, 10, 13, 15, 16, 17, 18, 21, 22, 26, 34, 39, 41, 42, 45]),
     26: array([ 5,  7, 12, 19, 20, 23, 27, 28, 31, 38, 44, 47, 48]),
     27: array([14, 32, 43, 46])}
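Adapted to the original question, where the entries x are grouped by the bin that the matching y falls into, the same idea might look like the sketch below. The helper name group_by_bin and the sample data are hypothetical, and right=True is again an assumption that makes the upper edges inclusive:

```python
import numpy as np

def group_by_bin(x, y, edges):
    """Return {bin_index: array of x values whose y falls in that bin}.

    Hypothetical helper: one boolean mask per occupied bin, which is
    cheap when there are far fewer bins than entries.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    assign = np.digitize(y, edges, right=True)
    return {int(b): x[assign == b] for b in np.unique(assign)}

# Made-up data: x are the entries, y decides the bin.
x = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
y = np.array([15.0, 30.0, 20.0, 5.0, 50.0])
groups = group_by_bin(x, y, edges=[10, 20, 50])
```

Here groups[1] holds the x values with 10<y<=20 and groups[2] those with 20<y<=50; key 0 collects entries with y<=10. The Python loop runs once per occupied bin rather than once per entry, which is where the speed-up over list.append comes from.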
For dealing with flattening and then unflattening numpy arrays, see: docs.scipy/doc/numpy/reference/generated/numpy.unravel_index.html
docs.scipy/doc/numpy/reference/generated/numpy.ravel_multi_index.html
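For a multidimensional y, one possible way to use these two functions is to digitize each dimension separately and then combine the per-dimension indices into a single flat bin id. This is only a sketch under assumptions: the two-column y, both sets of edges, and the choice to subtract 1 so that bins are 0-based are all illustrative, and every value is assumed to lie inside the outer edges (np.ravel_multi_index raises an error for out-of-range indices by default):

```python
import numpy as np

# Hypothetical two-dimensional binning variable: one row per entry.
y = np.array([[12.0, 0.5],
              [25.0, 1.5],
              [18.0, 1.2],
              [45.0, 0.1]])

edges_a = np.array([10.0, 20.0, 50.0])  # column 0: (10, 20], (20, 50]
edges_b = np.array([0.0, 1.0, 2.0])     # column 1: (0, 1], (1, 2]

# 0-based bin index along each dimension.
ia = np.digitize(y[:, 0], edges_a, right=True) - 1
ib = np.digitize(y[:, 1], edges_b, right=True) - 1

# Number of bins along each dimension.
shape = (len(edges_a) - 1, len(edges_b) - 1)

# Collapse the pair of indices into one flat bin id per entry.
flat = np.ravel_multi_index((ia, ib), shape)

# Recover the per-dimension indices from the flat ids.
back_a, back_b = np.unravel_index(flat, shape)
```

The flat ids can then be grouped exactly like one-dimensional bin assignments, and np.unravel_index turns a flat id back into the tuple of per-dimension bins when a tuple key is needed.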