What is a good way to bin numerical values into a certain range? For example, suppose I have a list of values and I want to bin them into N bins by their range. Right now, I do something like this:
    from scipy import *
    num_bins = 3  # number of bins to use
    values = # some array of integers...
    min_val = min(values) - 1
    max_val = max(values) + 1
    my_bins = linspace(min_val, max_val, num_bins)
    # assign point to my bins
    for v in values:
        best_bin = min_index(abs(my_bins - v))
where min_index returns the index of the minimum value. The idea is that you can find the bin the point falls into by seeing what bin it has the smallest difference with.
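For reference, a minimal runnable sketch of the nearest-edge idea described above, assuming numpy.argmin stands in for min_index (which is not defined in the question) and using made-up sample values:

```python
import numpy as np

# Hypothetical sample data; the question does not give concrete values.
values = np.array([1, 4, 5, 9])
num_bins = 3

# linspace returns num_bins points, i.e. only num_bins - 1 intervals.
edges = np.linspace(values.min() - 1, values.max() + 1, num_bins)

# For each value, the index of the closest edge (argmin plays the role of min_index).
nearest = np.abs(values[:, None] - edges[None, :]).argmin(axis=1)

print(edges)    # the three edges: 0, 5 and 10
print(nearest)  # 4 and 5 both map to index 1, although they lie on opposite sides of the edge at 5
```

This assigns each value to its closest edge rather than to the interval that contains it, so values on either side of an edge end up with the same index.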
But I think this has weird edge cases. What I am looking for is a good representation of bins, ideally ones that are half closed half open (so that there is no way of assigning one point to two bins), i.e.
    bin1 = [x1, x2)
    bin2 = [x2, x3)
    bin3 = [x3, x4)
    etc...
what is a good way to do this in Python, using numpy/scipy? I am only concerned here with binning integer values.
Thank you very much for your help.
Recommended answer
numpy.histogram() does exactly what you want.
The function signature is:
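    numpy.histogram(a, bins=10, range=None, normed=False, weights=None, new=None)
(This is the signature from the NumPy release the answer was written against. Recent releases have dropped normed and new, and the signature reads numpy.histogram(a, bins=10, range=None, density=None, weights=None).)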
We're mostly interested in a and bins. a is the input data that needs to be binned. bins can be a number of bins (your num_bins), or it can be a sequence of scalars, which denote bin edges (half open).
    import numpy
    values = numpy.arange(10, dtype=int)
    bins = numpy.arange(-1, 11)
    freq, bins = numpy.histogram(values, bins)
    # freq is now [0 1 1 1 1 1 1 1 1 1 1]
    # bins is unchanged
To quote the documentation:
All but the last (righthand-most) bin is half-open. In other words, if bins is:
    [1, 2, 3, 4]
then the first bin is [1, 2) (including 1, but excluding 2) and the second [2, 3). The last bin, however, is [3, 4], which includes 4.
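To make the closed last bin concrete, here is a small check one could run. The sample values are made up for illustration and do not come from the documentation:

```python
import numpy as np

edges = [1, 2, 3, 4]
values = np.array([1, 2, 3, 4])  # one value sitting exactly on each edge

counts, _ = np.histogram(values, bins=edges)

# [1, 2) holds 1, [2, 3) holds 2, and the closed last bin [3, 4] holds both 3 and 4.
print(counts)  # [1 1 2]
```

The value 4 is counted in the last bin instead of being dropped, which is the one place where the half-open rule does not apply.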
Edit: You want to know the index in your bins of each element. For this, you can use numpy.digitize(). If your bins are going to be integral, you can use numpy.bincount() as well.
    >>> values = numpy.random.randint(0, 20, 10)
    >>> values
    array([17, 14, 9, 7, 6, 9, 19, 4, 2, 19])
    >>> bins = numpy.linspace(-1, 21, 23)
    >>> bins
    array([ -1., 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.,
           11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21.])
    >>> pos = numpy.digitize(values, bins)
    >>> pos
    array([19, 16, 11, 9, 8, 11, 21, 6, 4, 21])
Since the interval is open on the upper limit, the indices are correct:
    >>> (bins[pos-1] == values).all()
    True
    >>> import sys
    >>> for n in range(len(values)):
    ...     sys.stdout.write("%g <= %g < %g\n"
    ...             %(bins[pos[n]-1], values[n], bins[pos[n]]))
    17 <= 17 < 18
    14 <= 14 < 15
    9 <= 9 < 10
    7 <= 7 < 8
    6 <= 6 < 7
    9 <= 9 < 10
    19 <= 19 < 20
    4 <= 4 < 5
    2 <= 2 < 3
    19 <= 19 < 20
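The numpy.bincount() route mentioned above is not shown in the answer. A minimal sketch, assuming the values are non-negative integers and each integer is its own bin (the minlength of 20 is chosen here to match the randint range used above):

```python
import numpy as np

# Same sample values as in the digitize session above.
values = np.array([17, 14, 9, 7, 6, 9, 19, 4, 2, 19])

# counts[k] is the number of times the integer k occurs in values.
counts = np.bincount(values, minlength=20)

print(counts[9], counts[19], counts[0])  # 2 2 0
```

Unlike numpy.digitize(), this returns per-bin counts, not a bin index for each element, so it suits the case where only the frequencies are needed.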