Updated: 2024-10-15 20:23:07
Balance a numpy array with over-sampling

Please help me find a clean way to create a new array from an existing one. A class should be over-sampled if it has fewer examples than the largest class. Samples should be taken from the original array (it makes no difference whether randomly or sequentially).

Let's say the initial array is this:

    [ 2, 29, 30,  1]
    [ 5, 50, 46,  0]
    [ 1,  7, 89,  1]
    [ 0, 10, 92,  9]
    [ 4, 11,  8,  1]
    [ 3, 92,  1,  0]

The last column contains the classes:

classes = [ 0, 1, 9]

The distribution of the classes is the following:

distrib = [2, 3, 1]
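
For reference, the class labels and their counts can be read off the last column directly. This is only a minimal sketch: it assumes numpy 1.9 or later (for the `return_counts` argument), and the variable names simply mirror the ones used in the question.

```python
import numpy as np

a = np.array([[2, 29, 30, 1],
              [5, 50, 46, 0],
              [1, 7, 89, 1],
              [0, 10, 92, 9],
              [4, 11, 8, 1],
              [3, 92, 1, 0]])

# Sorted unique labels in the last column, and how many rows carry each label
classes, distrib = np.unique(a[:, -1], return_counts=True)

print(classes)  # [0 1 9]
print(distrib)  # [2 3 1]
```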

What I need is a new array with an equal number of samples for every class, taken randomly from the original array, e.g.

    [ 5, 50, 46,  0]
    [ 3, 92,  1,  0]
    [ 5, 50, 46,  0] # one example added
    [ 2, 29, 30,  1]
    [ 1,  7, 89,  1]
    [ 4, 11,  8,  1]
    [ 0, 10, 92,  9]
    [ 0, 10, 92,  9] # two examples
    [ 0, 10, 92,  9] # added

Best answer

The following code does what you are after:

    import numpy as np

    a = np.array([[ 2, 29, 30,  1],
                  [ 5, 50, 46,  0],
                  [ 1,  7, 89,  1],
                  [ 0, 10, 92,  9],
                  [ 4, 11,  8,  1],
                  [ 3, 92,  1,  0]])

    # unq_idx maps every row to the index of its class in unq
    unq, unq_idx = np.unique(a[:, -1], return_inverse=True)
    unq_cnt = np.bincount(unq_idx)   # number of rows per class
    cnt = np.max(unq_cnt)            # target number of rows per class

    out = np.empty((cnt*len(unq),) + a.shape[1:], a.dtype)
    for j in range(len(unq)):        # range, not xrange, on Python 3
        # draw cnt row indices of class j, with replacement
        indices = np.random.choice(np.where(unq_idx==j)[0], cnt)
        out[j*cnt:(j+1)*cnt] = a[indices]

    >>> out
    array([[ 5, 50, 46,  0],
           [ 5, 50, 46,  0],
           [ 5, 50, 46,  0],
           [ 1,  7, 89,  1],
           [ 4, 11,  8,  1],
           [ 2, 29, 30,  1],
           [ 0, 10, 92,  9],
           [ 0, 10, 92,  9],
           [ 0, 10, 92,  9]])

With numpy 1.9 or later, np.unique accepts return_counts=True, so the np.unique and np.bincount lines can be condensed into:

unq, unq_idx, unq_cnt = np.unique(a[:, -1], return_inverse=True, return_counts=True)
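
Putting the condensed call to work, the whole procedure might be wrapped in a small helper. The sketch below is only one way to do it, not the answer's own code: the function name `oversample` and its `seed` parameter are made up for illustration, and it swaps `np.random.choice` for a seeded `np.random.default_rng` generator (numpy 1.17 or later) so that runs are reproducible.

```python
import numpy as np

def oversample(a, seed=None):
    """Return a class-balanced array; class labels are read from the last column.

    `oversample` and `seed` are illustrative names, not part of numpy.
    """
    rng = np.random.default_rng(seed)
    unq, unq_idx, unq_cnt = np.unique(a[:, -1],
                                      return_inverse=True,
                                      return_counts=True)
    cnt = int(unq_cnt.max())  # every class is sampled up to this size
    parts = []
    for j in range(len(unq)):
        rows = np.flatnonzero(unq_idx == j)     # row indices of class j
        parts.append(a[rng.choice(rows, cnt)])  # sampled with replacement
    return np.vstack(parts)

a = np.array([[2, 29, 30, 1],
              [5, 50, 46, 0],
              [1, 7, 89, 1],
              [0, 10, 92, 9],
              [4, 11, 8, 1],
              [3, 92, 1, 0]])

out = oversample(a, seed=0)
print(out.shape)  # (9, 4)
```

Because sampling is with replacement, each class ends up with exactly `cnt` rows, but some original rows may still be missing from the result.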

Note that because np.random.choice samples with replacement by default, there is no guarantee that every row of the original array will appear in the output, as the example above shows. If you need that, you could do something like:

    unq, unq_idx = np.unique(a[:, -1], return_inverse=True)
    unq_cnt = np.bincount(unq_idx)
    cnt = np.max(unq_cnt)

    # out only holds the extra rows needed to top up the smaller classes
    out = np.empty((cnt*len(unq) - len(a),) + a.shape[1:], a.dtype)
    # slices[j]:slices[j+1] is the block of out reserved for class j
    slices = np.concatenate(([0], np.cumsum(cnt - unq_cnt)))
    for j in range(len(unq)):        # range, not xrange, on Python 3
        indices = np.random.choice(np.where(unq_idx==j)[0], cnt - unq_cnt[j])
        out[slices[j]:slices[j+1]] = a[indices]
    # keep all original rows and append the extra ones
    out = np.vstack((a, out))

    >>> out
    array([[ 2, 29, 30,  1],
           [ 5, 50, 46,  0],
           [ 1,  7, 89,  1],
           [ 0, 10, 92,  9],
           [ 4, 11,  8,  1],
           [ 3, 92,  1,  0],
           [ 5, 50, 46,  0],
           [ 0, 10, 92,  9],
           [ 0, 10, 92,  9]])
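
Since the question allows sequential sampling as well, a deterministic variant is also possible. The sketch below is an assumption-laden alternative rather than part of the answer: the name `oversample_sequential` is hypothetical, and it relies on `np.resize` repeating its input cyclically when the requested size is larger than the input.

```python
import numpy as np

def oversample_sequential(a):
    """Balance classes (last column) by cycling through each class's rows in order.

    `oversample_sequential` is an illustrative name, not part of numpy.
    """
    labels = a[:, -1]
    unq, unq_cnt = np.unique(labels, return_counts=True)
    cnt = int(unq_cnt.max())  # target number of rows per class
    parts = []
    for label in unq:
        rows = np.flatnonzero(labels == label)  # row indices of this class
        # np.resize repeats `rows` from the start until it has cnt entries
        parts.append(a[np.resize(rows, cnt)])
    return np.vstack(parts)

a = np.array([[2, 29, 30, 1],
              [5, 50, 46, 0],
              [1, 7, 89, 1],
              [0, 10, 92, 9],
              [4, 11, 8, 1],
              [3, 92, 1, 0]])

balanced = oversample_sequential(a)
print(balanced)
```

For the sample array this yields the rows grouped by class, with every original row present and the same nine rows as the example output in the question. The trade-off is that the duplicates are always the first rows of each class rather than a random draw.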

Published: 2023-07-15 07:11:00
Article link: https://www.elefans.com/category/jswz/34/1111365.html