How to chunk a csv (dict) reader object in Python 3.2?

Problem description

I try to use Pool from the multiprocessing module to speed up reading in large csv files. For this, I adapted an example (from py2k), but it seems like the csv.dictreader object has no length. Does it mean I can only iterate over it? Is there a way to chunk it still?

These questions seemed relevant, but did not really answer my question: Number of lines in csv.DictReader, How to chunk a list in Python 3?

The code tries to do this:

source = open('/scratch/data.txt','r')

def csv2nodes(r):
    strptime = time.strptime
    mktime = time.mktime
    l = []
    ppl = set()
    for row in r:
        cell = int(row['cell'])
        id = int(row['seq_ei'])
        st = mktime(strptime(row['dat_deb_occupation'],'%d/%m/%Y'))
        ed = mktime(strptime(row['dat_fin_occupation'],'%d/%m/%Y'))
        # collect list
        l.append([(id,cell,{1:st,2:ed})])
        # collect separate sets
        ppl.add(id)
    return (l,ppl)

def csv2graph(source):
    r = csv.DictReader(source,delimiter=',')
    MG = nx.MultiGraph()
    l = []
    ppl = set()
    # Remember that I use integers for edge attributes, to save space! Dict above.
    # start: 1
    # end: 2
    p = Pool(processes=4)
    node_divisor = len(p._pool)*4
    node_chunks = list(chunks(r,int(len(r)/int(node_divisor))))
    num_chunks = len(node_chunks)
    pedgelists = p.map(csv2nodes, zip(node_chunks))
    ll = []
    for l in pedgelists:
        ll.append(l[0])
        ppl.update(l[1])
    MG.add_edges_from(ll)
    return (MG,ppl)
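The `chunks()` helper called in `csv2graph()` is not shown in the question; it presumably comes from the linked "How to chunk a list in Python 3?" question. A common definition (an assumption here, not taken from the question's code) is a generator that slices a sequence:

```python
def chunks(seq, n):
    """Yield successive n-sized chunks from a sequence."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]
```

Note that this helper relies on `len()` and slicing, so it only works on sequences such as lists, not on iterators like `csv.DictReader`, which is exactly where the code above breaks.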

Recommended answer

From the csv.DictReader documentation (and the csv.reader class it subclasses), the class returns an iterator. The code should have thrown a TypeError when you called len().

You can still chunk the data, but you'll have to read it entirely into memory. If you're concerned about memory, you can switch from csv.DictReader to csv.reader and skip the overhead of the dictionaries csv.DictReader creates. To improve readability in csv2nodes(), you can assign constants to address each field's index:

CELL = 0
SEQ_EI = 1
DAT_DEB_OCCUPATION = 4
DAT_FIN_OCCUPATION = 5
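Putting this together, a minimal sketch of the csv.reader approach: materialize the rows into a list (so `len()` and slicing work), then chunk by slicing. The six-column layout and the field positions are assumptions based on the constants above, not confirmed by the question:

```python
import csv
import io
import time

CELL = 0
SEQ_EI = 1
DAT_DEB_OCCUPATION = 4
DAT_FIN_OCCUPATION = 5

# Sample data with the assumed 6-column layout.
sample = io.StringIO(
    "5,42,x,y,01/01/2010,15/03/2010\n"
    "7,43,x,y,02/02/2011,20/04/2011\n"
)

# Materialize the iterator so the rows can be counted and sliced.
rows = list(csv.reader(sample, delimiter=','))

chunk_size = 1
node_chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]

for chunk in node_chunks:
    for row in chunk:
        cell = int(row[CELL])
        seq = int(row[SEQ_EI])
        start = time.mktime(time.strptime(row[DAT_DEB_OCCUPATION], '%d/%m/%Y'))
        end = time.mktime(time.strptime(row[DAT_FIN_OCCUPATION], '%d/%m/%Y'))
```

The chunk list built this way can then be handed to `Pool.map()` in place of the failing `chunks(r, ...)` call.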

I also recommend using a different variable than id, since that's a built-in function name.

Published: 2023-05-28 06:57:54