是否可以从zipfile加载带有openpyxl的xlsx(Is it possible to load xlsx with openpyxl from zipfile)

编程入门行业动态更新时间:2024-10-25 16:26:29

我试图从压缩的zip文件openpyxl.load_workbook xlsx文件，但它不起作用。以下代码在openpyxl.load_workbook中失败，出现“BadZipfile：File不是zip文件”

with zipfile.ZipFile(os.path.join(root, raw)) as z: for file_info in z.infolist(): wb = openpyxl.load_workbook(z.open(file_info), read_only=True)

存档和excel文件没有任何问题，就像我将其提取到磁盘然后以下工作：

with open('report.xlsx') as f: wb = openpyxl.load_workbook(f, read_only=True)

我可以使用此解决方案并临时提取它并加载xslx，但是想了解是否可以从zipfile加载它。

I'm trying to openpyxl.load_workbook xlsx files from compressed zip file, but it doesn't work. The following code fails at openpyxl.load_workbook with "BadZipfile: File is not a zip file"

with zipfile.ZipFile(os.path.join(root, raw)) as z: for file_info in z.infolist(): wb = openpyxl.load_workbook(z.open(file_info), read_only=True)

There is nothing wrong with the archive and the excel file in it, as if i extract it to disk then the following works:

with open('report.xlsx') as f: wb = openpyxl.load_workbook(f, read_only=True)

I can go with this solution and temporary extract it somewhere and load xslx, but would like to understand if it possible to load it from zipfile.

最满意答案

问题是readonly=True并不能完成您的想法。根据文件：

幸运的是，有两种模式可以让您在（接近）恒定内存消耗的情况下读取和写入无限量的数据。

虽然没有明确说明，但我认为这涉及一些等效于内存映射文件（因为“常量内存消耗”）和随机访问（因为允许操作的范围）。

无论哪种方式，设置readonly=True并不表示您只打算读取工作簿（无论如何，所有load_workbook都可以执行，您必须覆盖现有的工作簿以进行任何“更改”）。这表明您希望直接在磁盘上访问该文件，而无需加载整个内容。

ZipFile.open似乎非常清楚（并且直观地预期）不提供随机访问文件：

注意：类文件对象是只读的，并提供以下方法： read() ， readline() ， readlines() ， __iter__() ， next() 。

在这个列表中没有提到seek的事实是非常有说服力的（双关语只是有点意图）。

您可以通过将有问题的行拆分为两个来获得有关异常的更多信息（对于嵌套函数调用，这是一种有用的常规调试技术）：

x = z.open(file_info) wb = openpyxl.load_workbook(x, readonly=True)

您会注意到这两行中的第二行发生错误。这是因为几乎所有的Microsoft开放文档格式实际上都只是花哨的zip文件。问题很可能是openpyxl无法以随机访问模式打开您的文件，而不是它实际上是一个无效的zip文件。

无论哪种方式，这是一堆非常有教育意义的猜测，导致一个简单的，一个关键字删除解决方案：

TL; DR

读取非随机访问数据（如压缩的zip条目）时，请删除readonly=True ：

wb = openpyxl.load_workbook(z.open(file_info))

附录

你应该养成写出能够证明你的问题的最小程序的习惯，这样回答你问题的人就可以集中精力完成自己的工作，而不是生气，并关闭那些本来不错的问题。我喜欢你的问题足以为你做到这一点，所以这里有一个最小的程序来演示你的问题，只需复制并粘贴即可运行：

import openpyxl, zipfile from openpyxl.workbook.workbook import Workbook wb = Workbook() wb.active['A1'] = 12 wb.active['A2'] = 13 wb.save('report.xlsx') with zipfile.ZipFile('test.zip', 'w') as z: z.write('report.xlsx') with open('report.xlsx') as f: wb = openpyxl.load_workbook(f, read_only=True) print(wb.active['A1'].value) print(wb.active['A2'].value) with zipfile.ZipFile('test.zip', 'r') as z: for file_info in z.infolist(): x = z.open(file_info, 'r') wb = openpyxl.load_workbook(x, readonly=True) print(wb.active['A1'].value) print(wb.active['A2'].value)

The problem is that readonly=True does not do quite what you think it does. According to the docs:

Fortunately, there are two modes that enable you to read and write unlimited amounts of data with (near) constant memory consumption.

While not explicitly stated, I would assume that this involves some equivalent to a memory-mapped file (because of "constant memory consumption") and random access (because of the range of allowed operations).

Either way, setting readonly=True is not an indication of the fact that you only intend to read a workbook (that's all load_workbook can do anyway, you have to overwrite the existing one to make any "changes"). It is an indication of the fact that you want to access the file directly on disk, without loading the entire contents.

It seems pretty clear (and intuitively expected) that ZipFile.open does not provide a random-access file:

Note: The file-like object is read-only and provides the following methods: read(), readline(), readlines(), __iter__(), next().

The fact that seek is not mentioned in this list is quite telling (pun only somewhat intended).

You can get more information about the exception by splitting the offending line into two (a useful general debugging technique for nested function calls):

x = z.open(file_info) wb = openpyxl.load_workbook(x, readonly=True)

You will notice that the error occurs on the second of those two lines. This is because pretty much all the Microsoft open-document formats are actually just fancy zip files. The problem is most likely that openpyxl cannot open your file in random access mode, not that it's actually an invalid zip file.

Either way, this is a bunch of very educated guesswork that leads to a simple, one-keyword-deletion solution:

TL;DR

Get rid of readonly=True when reading non-random-access data like a compressed zip entry:

wb = openpyxl.load_workbook(z.open(file_info))

Appendix

You should get in the habit of writing minimal programs that demonstrate your issue so that people answering your question can focus on doing their job instead of getting irritated and closing down what would otherwise be a perfectly good question. I liked your question enough to do that for you, so here is a minimal program that demonstrates your issue and requires nothing more than copy-and-paste to run:

更多推荐

本文发布于:2023-07-26 17:12:00，感谢您对本站的认可！

本文链接:https://www.elefans.com/category/jswz/34/1278488.html