Balance a numpy array with over-sampling

Please help me find a clean way to create a new array out of an existing one. It should be over-sampled if the number of examples of any class is smaller than the maximum number of examples across the classes. Samples should be taken from the original array (it makes no difference whether randomly or sequentially).

Let's say the initial array is this:

    [ 2, 29, 30,  1]
    [ 5, 50, 46,  0]
    [ 1,  7, 89,  1]
    [ 0, 10, 92,  9]
    [ 4, 11,  8,  1]
    [ 3, 92,  1,  0]

The last column contains the classes:

    classes = [ 0, 1, 9]

The distribution of the classes is the following:

    distrib = [2, 3, 1]

What I need is to create a new array with an equal number of samples of every class, taken randomly from the original array, e.g.

    [ 5, 50, 46,  0]
    [ 3, 92,  1,  0]
    [ 5, 50, 46,  0] # one example added
    [ 2, 29, 30,  1]
    [ 1,  7, 89,  1]
    [ 4, 11,  8,  1]
    [ 0, 10, 92,  9]
    [ 0, 10, 92,  9] # two examples
    [ 0, 10, 92,  9] # added

Best answer

The following code does what you are after:

    import numpy as np

    a = np.array([[ 2, 29, 30,  1],
                  [ 5, 50, 46,  0],
                  [ 1,  7, 89,  1],
                  [ 0, 10, 92,  9],
                  [ 4, 11,  8,  1],
                  [ 3, 92,  1,  0]])

    # Class labels, per-row class index, and per-class counts
    unq, unq_idx = np.unique(a[:, -1], return_inverse=True)
    unq_cnt = np.bincount(unq_idx)
    cnt = np.max(unq_cnt)  # target number of rows per class

    out = np.empty((cnt*len(unq),) + a.shape[1:], a.dtype)
    for j in range(len(unq)):
        # Draw cnt row indices of class j, with replacement
        indices = np.random.choice(np.where(unq_idx==j)[0], cnt)
        out[j*cnt:(j+1)*cnt] = a[indices]

    # One possible result; the sampling is random
    >>> out
    array([[ 5, 50, 46,  0],
           [ 5, 50, 46,  0],
           [ 5, 50, 46,  0],
           [ 1,  7, 89,  1],
           [ 4, 11,  8,  1],
           [ 2, 29, 30,  1],
           [ 0, 10, 92,  9],
           [ 0, 10, 92,  9],
           [ 0, 10, 92,  9]])

With numpy 1.9 or later, the np.unique and np.bincount calls can be condensed into one:

    unq, unq_idx, unq_cnt = np.unique(a[:, -1], return_inverse=True, return_counts=True)
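As a quick sanity check, here is a minimal sketch of that one-call form applied to the array from the question (it assumes numpy 1.9 or later). It should reproduce the classes and the distribution listed in the question.

```python
import numpy as np

# The example array from the question; the last column holds the class label.
a = np.array([[2, 29, 30, 1],
              [5, 50, 46, 0],
              [1, 7, 89, 1],
              [0, 10, 92, 9],
              [4, 11, 8, 1],
              [3, 92, 1, 0]])

# One call returns the labels, the per-row class index, and the per-class counts.
unq, unq_idx, unq_cnt = np.unique(a[:, -1], return_inverse=True, return_counts=True)

print(unq)      # [0 1 9]
print(unq_idx)  # [1 0 1 2 1 0]
print(unq_cnt)  # [2 3 1]
```

Here unq_cnt matches what np.bincount(unq_idx) returns in the two-call version.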

Note that np.random.choice samples with replacement by default, so there is no guarantee that every row of the original array will be present in the output, as the example above shows: the row [3, 92, 1, 0] is missing. If that is needed, you could do something like:

    # `a` is the array defined in the first snippet
    unq, unq_idx = np.unique(a[:, -1], return_inverse=True)
    unq_cnt = np.bincount(unq_idx)
    cnt = np.max(unq_cnt)

    # Only the rows missing from each class are sampled
    out = np.empty((cnt*len(unq) - len(a),) + a.shape[1:], a.dtype)
    slices = np.concatenate(([0], np.cumsum(cnt - unq_cnt)))
    for j in range(len(unq)):
        indices = np.random.choice(np.where(unq_idx==j)[0], cnt - unq_cnt[j])
        out[slices[j]:slices[j+1]] = a[indices]

    # Keep every original row and append the extra samples
    out = np.vstack((a, out))

    # One possible result; the sampling is random
    >>> out
    array([[ 2, 29, 30,  1],
           [ 5, 50, 46,  0],
           [ 1,  7, 89,  1],
           [ 0, 10, 92,  9],
           [ 4, 11,  8,  1],
           [ 3, 92,  1,  0],
           [ 5, 50, 46,  0],
           [ 0, 10, 92,  9],
           [ 0, 10, 92,  9]])
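If this is needed in more than one place, the keep-every-row variant could be wrapped in a small helper. The sketch below is one possible way to do it and is not part of the original answer: the name oversample and the seed parameter are made up for illustration, and it swaps np.random.choice for np.random.default_rng (numpy 1.17 or later) so that a run can be reproduced. It assumes the class label sits in the last column.

```python
import numpy as np

def oversample(a, seed=None):
    """Return `a` plus randomly repeated rows so that every class
    (taken from the last column) ends up equally frequent.

    Hypothetical helper: the name and the `seed` parameter are illustrative.
    """
    rng = np.random.default_rng(seed)
    unq, unq_idx, unq_cnt = np.unique(a[:, -1], return_inverse=True,
                                      return_counts=True)
    cnt = unq_cnt.max()  # target number of rows per class
    extra = []
    for j in range(len(unq)):
        rows = np.flatnonzero(unq_idx == j)  # row indices of class j
        # Draw only the missing rows (with replacement), so all originals stay.
        extra.append(rng.choice(rows, size=int(cnt - unq_cnt[j])))
    return np.vstack((a, a[np.concatenate(extra)]))

a = np.array([[2, 29, 30, 1],
              [5, 50, 46, 0],
              [1, 7, 89, 1],
              [0, 10, 92, 9],
              [4, 11, 8, 1],
              [3, 92, 1, 0]])

balanced = oversample(a, seed=0)
labels, counts = np.unique(balanced[:, -1], return_counts=True)
print(balanced.shape)  # (9, 4)
print(counts)          # [3 3 3]
```

Which rows get repeated depends on the seed, but the shape and the per-class counts do not. The first six rows of the result are always the original array.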
