Python3 multiprocessing

I am an absolute beginner. I fumble my way through code by analogy to examples so apologies for any misuse of terminology.

I have written a small piece of code in python 3 which:

1. Takes a user input (a folder on their computer).
2. Searches the folder for pdf files.
3. Turns each page of the PDF to an image with sequential numbering.
4. Iterates through the jpgs in order of numbering, turning them black and white.
5. OCR scans the files and outputs the text into an object.
6. Saves the text contents to a .txt file (via pytesseract).
7. Deletes jpgs, leaving .txt file.

Most time is taken in converting to jpgs and possibly making them black and white.

The code works, though I am sure it could be improved. It takes a while so I thought I'd try multiprocessing using Pools.

My code appears to create pools. I can also get the function to print a list of files in the folder, so it appears to have the list passed to it in one form or another.

I cannot get it to work and have now hacked the code about repeatedly with various errors. I think the main problem is, I am clueless.

My code begins:

1. User input block (asks for a folder in the user's directory, checks it is a valid folder, etc.).
2. OCR block as a function (parses the PDF, then outputs the contents into a single .txt file).
3. For loop block as a function (is supposed to loop over each PDF in the folder and execute the OCR block on it).
4. Multiprocessing block (is supposed to feed the list of files in the directory to the loop block).

To avoid writing War and Peace, I set out the latest version of the loop block and multiprocessing block below:

    #import necessary modules

    home_path = os.path.expanduser('~')

    #ask for input with various checking mechanisms to make sure a useful pdfDir is obtained
    pdfDir = home_path + '/Documents/' + input('Please input the folder name where the PDFs are stored. The folder must be directly under the Documents folder. It cannot have a space in it. \n \n Name of folder:')

    def textExtractor():
        #convert pdf to jpeg with a tesseract friendly resolution
        with Img(filename=pdf_filename, resolution=300) as img:
            #some can be encrypted so use OCR instead of other libraries
            #various lines of code here
            compilation_temp.close()

    def per_file_process (subject_files):
        for pdf in subject_files:
            #decode the whole file name as a string
            pdf_filename = os.fsdecode(pdf)
            #check whether the string ends in .pdf
            if pdf_filename.endswith(".pdf"):
                #call the OCR function on it
                textExtractor()
            else:
                print ('nonsense')

    if __name__ == '__main__':
        pool = Pool(2)
        pool.map(per_file_process, os.listdir(pdfDir))

Is anyone willing/able to point out my errors, please?

The relevant bits of the code whilst working:

    #import necessary

    home_path = os.path.expanduser('~')

    #block accepting input
    pdfDir = home_path + '/Documents/' + input('Please input the folder name where the PDFs are stored. The folder must be directly under the Documents folder. It cannot have a space in it. \n \n Name of folder:')

    def textExtractor():
        #convert pdf to jpeg with a tesseract friendly resolution
        with Img(filename=pdf_filename, resolution=300) as img:
            #need to think about using generic expanduser or other libraries to allow portability
            #various lines of code to OCR and output .txt file
            compilation_temp.close()

    subject_files = os.listdir(pdfDir)
    for pdf in subject_files:
        #decode the whole file name as a string you can see
        pdf_filename = os.fsdecode(pdf)
        #check whether the string ends in /pdf
        if pdf_filename.endswith(".pdf"):
            textExtractor()
        else:
            #print for debugging

Best answer

Pool.map calls the worker function once for each name returned by os.listdir. In per_file_process, subject_files is therefore a single filename, and for pdf in subject_files: enumerates the individual characters of that name. Further, listdir returns only base names, without the directory they live in, so you aren't looking in the right place for the pdf. You can use glob to filter by extension and return a working path to each file.
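
To make the first point concrete, here is a minimal sketch with the OCR stripped out. The file names are made up, and per_file_fixed is a hypothetical name for a worker that treats its argument as one filename; per_file_process mirrors the loop in the question.

```python
from multiprocessing import Pool


def per_file_process(subject_files):
    # Despite the plural name, Pool.map hands this function ONE item of the
    # iterable per call, so the loop walks the characters of a single name.
    return [pdf for pdf in subject_files]


def per_file_fixed(pdf_filename):
    # Hypothetical corrected worker: the argument is already one filename.
    return pdf_filename


if __name__ == '__main__':
    names = ['a.pdf', 'b.txt']  # stand-in for os.listdir(pdfDir)
    with Pool(2) as pool:
        print(pool.map(per_file_process, names))
        # [['a', '.', 'p', 'd', 'f'], ['b', '.', 't', 'x', 't']]
        print(pool.map(per_file_fixed, names))
        # ['a.pdf', 'b.txt']
```

None of those single characters ends in ".pdf", which is why the question's version printed 'nonsense' instead of running the OCR.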

Your example is confusing... textExtractor() takes no parameters, so how is it to know which file it is processing? I'm going out on a limb and assuming that it really does take the path to the file to process. If so, you can parallelize rather easily just by feeding it the PDFs in the directory via map. Assuming processing time will vary by pdf, I am setting chunksize to 1 so that an early-finishing worker can grab extra files to process.

    from glob import glob
    import os
    from multiprocessing import Pool

    def textExtractor(pdf_filename):
        #convert pdf to jpeg with a tesseract friendly resolution
        with Img(filename=pdf_filename, resolution=300) as img:
            #some can be encrypted so use OCR instead of other libraries
            #...various lines of code here
            compilation_temp.close()

    if __name__ == '__main__':
        #pdfDir is the folder inputted by user
        with Pool(2) as pool:
            # assuming call signature: textExtractor(path_to_file)
            pool.map(textExtractor,
                     (filename for filename in glob(os.path.join(pdfDir, '*.pdf'))
                      if os.path.isfile(filename)),
                     chunksize=1)
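
If you want to try the glob plus chunksize=1 pattern before wiring in the OCR, the following is a rough, self-contained sketch. Everything named here is an assumption for illustration: fake_extractor is a hypothetical stand-in for textExtractor that only returns the file's base name and size, find_pdfs is a hypothetical helper, and the folder is a temporary one filled with dummy files.

```python
import os
import tempfile
from glob import glob
from multiprocessing import Pool


def fake_extractor(pdf_path):
    # Hypothetical stand-in for textExtractor(path_to_file): no OCR here,
    # it just reports which file it was given and how big it is.
    return os.path.basename(pdf_path), os.path.getsize(pdf_path)


def find_pdfs(pdf_dir):
    # Hypothetical helper. Unlike os.listdir, glob returns paths that
    # already include pdf_dir, so workers can open them directly.
    pattern = os.path.join(pdf_dir, '*.pdf')
    return sorted(path for path in glob(pattern) if os.path.isfile(path))


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as pdf_dir:
        # Dummy files standing in for the user's folder.
        for name in ('a.pdf', 'b.pdf', 'notes.txt'):
            with open(os.path.join(pdf_dir, name), 'w') as handle:
                handle.write('dummy')

        print(sorted(os.listdir(pdf_dir)))  # bare names, including notes.txt
        paths = find_pdfs(pdf_dir)          # full paths, PDFs only

        with Pool(2) as pool:
            # chunksize=1 hands out one file at a time, so a worker that
            # finishes early picks up the next file instead of idling.
            results = pool.map(fake_extractor, paths, chunksize=1)
        print(results)
```

Swapping fake_extractor for the real textExtractor(pdf_filename) should be the only change needed, provided it really does accept the path as its single argument.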
