Chunked histograms for large arrays

Updated: 2024-10-24 18:17:11
Question

I have a bunch of CSV datasets, each about 10 GB in size. I'd like to generate histograms from their columns, but it seems the only way to do this in numpy is to first load the entire column into a numpy array and then call numpy.histogram on it. That uses far more memory than the task should need.

Does numpy support online binning? I'm hoping for something that iterates over my CSV line by line and bins the values as it reads them, so that at most one line is in memory at any one time.

It wouldn't be hard to roll my own, but I'm wondering whether someone has already invented this wheel.

Recommended answer

As you said, it's not that hard to roll your own. You'll need to set up the bins yourself and reuse them as you iterate over the file. The following ought to be a decent starting point:

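import numpy as np

datamin = -5
datamax = 5
numbins = 20
# numbins edges define numbins - 1 bins; the same edges are reused for every chunk
mybins = np.linspace(datamin, datamax, numbins)
myhist = np.zeros(numbins - 1, dtype='int64')  # int64 leaves headroom for very large files

for i in range(100):
    # random data stands in for the next chunk read from the file
    d = np.random.randn(1000, 1)
    htemp, jnk = np.histogram(d, mybins)
    myhist += htemp  # accumulate this chunk's counts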

I'm guessing performance will be an issue with files this large, and calling histogram on every single line may carry too much overhead. @doug's suggestion of a generator seems like a good way to address that problem.
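One way the generator idea might be combined with the fixed bins is sketched below. It is only an illustration, not @doug's actual suggestion: the names column_chunks, chunked_histogram and chunk_rows are hypothetical, and it assumes the CSV has a single header row and that the chosen column is purely numeric. Each chunk holds many rows, so np.histogram is called once per chunk instead of once per line.

```python
import csv
from itertools import islice

import numpy as np


def column_chunks(path, column, chunk_rows=100_000):
    """Yield one CSV column as float arrays of at most chunk_rows values.

    Hypothetical helper. Assumes the first row is a header and that
    `column` is the zero-based index of a numeric column.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the assumed header row
        while True:
            rows = list(islice(reader, chunk_rows))
            if not rows:
                break
            yield np.array([float(row[column]) for row in rows])


def chunked_histogram(chunks, bin_edges):
    """Sum np.histogram counts over an iterable of arrays.

    The same bin_edges are used for every chunk, so the per-chunk
    counts can simply be added. Values outside the edges are ignored,
    as np.histogram does when given explicit edges.
    """
    counts = np.zeros(len(bin_edges) - 1, dtype=np.int64)
    for chunk in chunks:
        chunk_counts, _ = np.histogram(chunk, bins=bin_edges)
        counts += chunk_counts
    return counts
```

With a file at a hypothetical path such as "data.csv", a call like chunked_histogram(column_chunks("data.csv", 0), np.linspace(-5, 5, 21)) would keep at most chunk_rows rows in memory at a time. The trade-off is the one mentioned above: larger chunks mean fewer calls to np.histogram but more memory per step.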

Published: 2023-11-08 15:48:51
Article link: https://www.elefans.com/category/jswz/34/1569798.html
Tags: histogram, chunked, array
