I'm trying to pull apart a word document that looks like this:
    1.0 List item
    1.1 List item
    1.2 List item
    2.0 List item
It is stored in docx, and I'm using python-docx to try to parse it. Unfortunately, it loses all the numbering at the start. I'm trying to identify the start of each ordered list item.
The python-docx library also allows me to access styles, but I cannot figure out how to determine whether the style is a list style or not.
So far I've been messing around with a function and checking output, but the standard format is something like:
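    from docx import Document

    doc = Document("example.docx")  # placeholder path, substitute your own file
    for p in doc.paragraphs:
        # Walk up the style inheritance chain, printing each style name.
        s = p.style
        while s.base_style is not None:
            print(s.name)
            s = s.base_style
        print(s.name)

I've been using this to search up through the custom styles, but they all end at "Normal" rather than "ListNumber".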
I've tried searching styles under the document, the paragraphs, and the runs without luck. I've also tried searching p.text, but as previously mentioned the numbering does not persist.
Best answer
List items can be implemented in the XML in a variety of ways. Unfortunately, the most common way, adding list items using the toolbar (as opposed to using styles), is also probably the most complex.
Your best bet is to start by using opc-diag to have a look at the XML inside document.xml, and then formulate a strategy from there.
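If opc-diag is not at hand, a rough substitute for a first look is to read the part straight out of the package, since a .docx file is a zip archive. The following is a minimal sketch using only the standard library; the function name `dump_document_xml` and the file name in the usage line are made up for illustration.

```python
import zipfile
from xml.dom import minidom


def dump_document_xml(docx_path):
    """Return word/document.xml from a .docx package as indented XML text."""
    # docx_path may be a file path or a binary file-like object.
    with zipfile.ZipFile(docx_path) as package:
        raw = package.read("word/document.xml")
    # Pretty-print so that elements such as <w:numPr> are easy to spot.
    return minidom.parseString(raw).toprettyxml(indent="  ")
```

Calling `print(dump_document_xml("example.docx"))` and looking at the `w:pPr` element of a paragraph you know to be a list item should show how that particular document marks its lists.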
The list-handling API for python-docx hasn't really been implemented yet, so you'll need to operate at the lxml level if you want to get this done with today's version.
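As an illustration of what working at the XML level might look like, here is a sketch that finds paragraphs carrying direct numbering, i.e. a `w:numPr` element inside `w:pPr`, which is what toolbar-applied lists typically produce. It uses the standard library's ElementTree rather than lxml to stay dependency-free, and the function name `numbered_paragraphs` is hypothetical. It assumes a missing `w:ilvl` means level 0, and it will not catch paragraphs whose numbering comes only from their style definition.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = "{%s}val" % W


def numbered_paragraphs(document_xml):
    """Yield (num_id, level, text) for each paragraph with direct numbering."""
    root = ET.fromstring(document_xml)
    for p in root.iter("{%s}p" % W):
        num_pr = p.find("w:pPr/w:numPr", NS)
        if num_pr is None:
            continue  # not a list item, or numbering is inherited from a style
        num_id = num_pr.find("w:numId", NS)
        ilvl = num_pr.find("w:ilvl", NS)
        # Concatenate the text of all runs in the paragraph.
        text = "".join(t.text or "" for t in p.iter("{%s}t" % W))
        yield (
            num_id.get(VAL) if num_id is not None else None,
            int(ilvl.get(VAL, "0")) if ilvl is not None else 0,  # assumed default
            text,
        )
```

The input can be the bytes returned by `zipfile.ZipFile("example.docx").read("word/document.xml")`, where the file name is a placeholder. Note that the visible label such as "1.1" is not stored in the paragraph at all; Word computes it from the numbering definitions in word/numbering.xml, which is why it never appears in `p.text`. The `num_id` and `level` values are what you would use to reconstruct it.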