Does using the Python Dask package reduce performance?

Problem Description

I'm trying out some tests of dask.bag to prepare for a big text processing job over millions of text files. Right now, on my test sets of dozens to hundreds of thousands of text files, I'm seeing that dask is running about 5 to 6 times slower than a straight single-threaded text processing function.

Can someone explain where I'll see the speed benefits of running dask over a large number of text files? How many files would I have to process before it starts getting faster? Is 150,000 small text files simply too few? What sort of performance parameters should I be tweaking to get dask to speed up when processing files? What could account for a 5x decrease in performance over straight single-threaded text processing?

Here's an example of the code I'm using to test dask out. This is running against a test set of data from Reuters located at:

www.daviddlewis.com/resources/testcollections/reuters21578/

This data isn't exactly the same as the data I'm working against. In my other case it's a bunch of individual text files, one document per file, but the performance decrease I'm seeing is about the same. Here's the code:

import dask.bag as db
from collections import Counter
import string
import glob
import datetime

my_files = "./reuters/*.ascii"

def single_threaded_text_processor():
    # Plain single-threaded baseline: read every file and count words.
    c = Counter()
    for my_file in glob.glob(my_files):
        with open(my_file, "r") as f:
            d = f.read()
            c.update(d.split())
    return c

# Time the single-threaded version.
start = datetime.datetime.now()
print(single_threaded_text_processor().most_common(5))
print(str(datetime.datetime.now() - start))

# Time the dask.bag version of the same word count.
start = datetime.datetime.now()
b = db.read_text(my_files)
wordcount = b.str.split().concat().frequencies().topk(5, lambda x: x[1])
print(str([w for w in wordcount]))
print(str(datetime.datetime.now() - start))

Here are my results:

[('the', 119848), ('of', 72357), ('to', 68642), ('and', 53439), ('in', 49990)]
0:00:02.958721
[(u'the', 119848), (u'of', 72357), (u'to', 68642), (u'and', 53439), (u'in', 49990)]
0:00:17.877077

Recommended Answer

Dask incurs roughly 1 ms of overhead per task. By default, the dask.bag.read_text function creates one task per filename. I suspect that you're simply being swamped by that overhead.
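As a rough sanity check (a sketch based on the ~1 ms per-task figure above, not an exact measurement), you can look at how many partitions read_text creates and do the arithmetic for the 150,000-file case mentioned in the question:

import dask.bag as db

b = db.read_text("./reuters/*.ascii")
print(b.npartitions)  # by default, one partition (and at least one task) per file

# Back-of-the-envelope estimate: 150,000 tasks at ~1 ms of overhead each
# works out to roughly 150 seconds of pure scheduling cost, which can
# easily dominate a few seconds of actual word counting.
n_files = 150000
overhead_per_task = 0.001  # assumed per-task overhead, in seconds
print("estimated overhead: about %d seconds" % (n_files * overhead_per_task))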

The solution here is probably to process several files in one task. The read_text function doesn't give you any options to do this, but you can switch to dask.delayed, which provides a bit more flexibility, and then convert to a dask.bag later if preferred.
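A minimal sketch of that batching idea might look like the following. The 1,000-files-per-task batch size is an arbitrary assumption to tune for your workload, and db.from_delayed is used to turn the delayed batches back into a bag:

import glob
import dask
import dask.bag as db

filenames = glob.glob("./reuters/*.ascii")

@dask.delayed
def read_batch(paths):
    # Read a whole batch of files inside a single task so the per-task
    # overhead is paid once per batch rather than once per file.
    words = []
    for path in paths:
        with open(path, "r") as f:
            words.extend(f.read().split())
    return words

batch_size = 1000  # assumed batch size
batches = [read_batch(filenames[i:i + batch_size])
           for i in range(0, len(filenames), batch_size)]

b = db.from_delayed(batches)  # each delayed batch becomes one partition
wordcount = b.frequencies().topk(5, lambda x: x[1])
print(wordcount.compute())

With 150,000 files and batches of 1,000, this produces 150 tasks instead of 150,000, so the per-task overhead shrinks to a fraction of a second.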
