How to chunk a csv (dict) reader object in Python 3.2?

Problem description

I try to use Pool from the multiprocessing module to speed up reading in large csv files. For this, I adapted an example (from py2k), but it seems like the csv.dictreader object has no length. Does it mean I can only iterate over it? Is there a way to chunk it still?

These questions seemed relevant, but did not really answer my question: Number of lines in csv.DictReader, How to chunk a list in Python 3?

The code tries to do this:

source = open('/scratch/data.txt','r')

def csv2nodes(r):
    strptime = time.strptime
    mktime = time.mktime
    l = []
    ppl = set()
    for row in r:
        cell = int(row['cell'])
        id = int(row['seq_ei'])
        st = mktime(strptime(row['dat_deb_occupation'],'%d/%m/%Y'))
        ed = mktime(strptime(row['dat_fin_occupation'],'%d/%m/%Y'))
        # collect list
        l.append([(id,cell,{1:st,2:ed})])
        # collect separate sets
        ppl.add(id)
    return (l,ppl)

def csv2graph(source):
    r = csv.DictReader(source,delimiter=',')
    MG = nx.MultiGraph()
    l = []
    ppl = set()
    # Remember that I use integers for edge attributes, to save space! Dict above.
    # start: 1
    # end: 2
    p = Pool(processes=4)
    node_divisor = len(p._pool)*4
    node_chunks = list(chunks(r,int(len(r)/int(node_divisor))))
    num_chunks = len(node_chunks)
    pedgelists = p.map(csv2nodes, zip(node_chunks))
    ll = []
    for l in pedgelists:
        ll.append(l[0])
        ppl.update(l[1])
    MG.add_edges_from(ll)
    return (MG,ppl)
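The `chunks()` helper called in `csv2graph()` is not shown in the question; it presumably comes from the linked "How to chunk a list in Python 3?" question. A common definition (an assumption here, not taken from the question's code) is a generator that slices a sequence:

```python
def chunks(seq, n):
    """Yield successive n-sized chunks from a sequence."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]
```

Note that this helper relies on `len()` and slicing, so it only works on sequences such as lists, not on iterators like `csv.DictReader`, which is exactly where the code above breaks.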

Recommended answer

From the csv.DictReader documentation (and the csv.reader class it subclasses), the class returns an iterator. The code should have thrown a TypeError when you called len().

You can still chunk the data, but you'll have to read it entirely into memory. If you're concerned about memory, you can switch from csv.DictReader to csv.reader and skip the overhead of the dictionaries csv.DictReader creates. To improve readability in csv2nodes(), you can assign constants to address each field's index:

CELL = 0
SEQ_EI = 1
DAT_DEB_OCCUPATION = 4
DAT_FIN_OCCUPATION = 5
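Putting this together, a minimal sketch of the csv.reader approach: materialize the rows into a list (so `len()` and slicing work), then chunk by slicing. The six-column layout and the field positions are assumptions based on the constants above, not confirmed by the question:

```python
import csv
import io
import time

CELL = 0
SEQ_EI = 1
DAT_DEB_OCCUPATION = 4
DAT_FIN_OCCUPATION = 5

# Sample data with the assumed 6-column layout.
sample = io.StringIO(
    "5,42,x,y,01/01/2010,15/03/2010\n"
    "7,43,x,y,02/02/2011,20/04/2011\n"
)

# Materialize the iterator so the rows can be counted and sliced.
rows = list(csv.reader(sample, delimiter=','))

chunk_size = 1
node_chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]

for chunk in node_chunks:
    for row in chunk:
        cell = int(row[CELL])
        seq = int(row[SEQ_EI])
        start = time.mktime(time.strptime(row[DAT_DEB_OCCUPATION], '%d/%m/%Y'))
        end = time.mktime(time.strptime(row[DAT_FIN_OCCUPATION], '%d/%m/%Y'))
```

The chunk list built this way can then be handed to `Pool.map()` in place of the failing `chunks(r, ...)` call.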

I also recommend using a different variable than id, since that's a built-in function name.

Published: 2023-05-28 06:57:54