How to parse strings representing xml.dom.minidom nodes in Python?

Updated: 2024-10-22 07:19:05

I have a collection of xml.dom.Node objects created using xml.dom.minidom. I store them (individually) in a database by converting each one to a string with the Node object's toxml() method.

The problem is that I'd sometimes like to be able to convert them back to the appropriate Node object using a parser of some kind. As far as I can see, the various libraries shipped with Python use Expat, which won't parse a string like '' or indeed anything which is not a correct XML string.

So, does anyone have any ideas? I realise I could pickle the nodes in some way and then unpickle them, but that feels unpleasant and I'd much rather be storing them in a form I can read for maintenance purposes. Surely there is something that will do this?

In response to the doubt expressed that this is possible, an example of what I mean:

    >>> import xml.dom.minidom
    >>> x=xml.dom.minidom.parseString('<a>foo<b>thing</b></a>')
    >>> x.documentElement.childNodes[0]
    <DOM Text node "u'foo'">
    >>> x.documentElement.childNodes[0].toxml()
    u'foo'
    >>> xml.dom.minidom.parseString(x.documentElement.childNodes[0].toxml())
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/dom/minidom.py", line 1928, in parseString
        return expatbuilder.parseString(string)
      File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/dom/expatbuilder.py", line 940, in parseString
        return builder.parseString(string)
      File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/dom/expatbuilder.py", line 223, in parseString
        parser.Parse(string, True)
    xml.parsers.expat.ExpatError: syntax error: line 1, column 0

In other words, the .toxml() method does not create something that Expat (and hence the out-of-the-box parseString) will parse.

What I would like is something that will parse u'foo' into a text node, i.e. something that will reverse the effect of .toxml().

Best answer

What types of node do you need to store?

Obviously Element nodes should just work if serialised with .toxml('utf-8'); the results should be parseable as an XML document as-is and the element retrievable from documentElement, as long as there are no EntityReferences inside it that would need definition in the doctype.
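As a rough illustration of the element case, here is a minimal sketch of the round trip. The sample document and the variable names (`stored`, `restored`) are made up for this example, and it assumes the element contains no entity references that need a doctype.

```python
from xml.dom import minidom

# Sample document, invented for this illustration
doc = minidom.parseString('<a>foo<b>thing</b></a>')
element = doc.documentElement.getElementsByTagName('b')[0]

# Serialise the element on its own; with an encoding argument
# this yields bytes under Python 3
stored = element.toxml('utf-8')

# The serialised element is a well-formed document by itself,
# so it can be parsed directly and read back from documentElement
restored = minidom.parseString(stored).documentElement

print(restored.tagName)
print(restored.toxml('utf-8') == stored)
```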

Text nodes, on the other hand, would need either HTML-decoding or some wrapping to parse. If you only needed elements and text nodes you could guess whether it was an element from the first character, since that must always be < for an element:

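    from xml.dom import minidom

    xml = node.toxml('utf-8')  # serialise the node before storing it
    ...
    # An element always starts with '<'; anything else is treated as text
    if xml.startswith(b'<'):
        node = minidom.parseString(xml).documentElement
    else:
        # Wrap the text in a dummy element so that it parses as a document
        node = minidom.parseString(b'<x>' + xml + b'</x>').documentElement.firstChild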

Comment nodes could similarly be stored by checking for <!--.
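One way the prefix checks could be combined is sketched below. The function name `parse_node_string` and the wrapper element `<x>` are hypothetical choices for this example, not part of minidom. The sketch assumes only Element, Text, and Comment nodes are stored; other kinds (CDATA sections, for instance) are not handled, and an empty text node comes back as None because the wrapper then has no children.

```python
from xml.dom import minidom


def parse_node_string(xml):
    """Rebuild an Element, Text, or Comment node from its toxml('utf-8') bytes.

    Hypothetical helper: it guesses the node kind from the leading characters.
    """
    if xml.startswith(b'<') and not xml.startswith(b'<!--'):
        # A serialised element is already a well-formed document
        return minidom.parseString(xml).documentElement
    # Comments and text cannot stand alone as a document,
    # so wrap them in a dummy element and take its first child
    wrapped = minidom.parseString(b'<x>' + xml + b'</x>')
    return wrapped.documentElement.firstChild


# Sample document, invented for this illustration
doc = minidom.parseString('<a>x &lt; y<!-- note --><b>thing</b></a>')
originals = list(doc.documentElement.childNodes)
restored = [parse_node_string(n.toxml('utf-8')) for n in originals]

for before, after in zip(originals, restored):
    print(before.nodeType == after.nodeType, after.toxml())
```

Text that contains `<` or `&` is escaped by toxml(), so a serialised text node should never begin with `<` and the prefix test stays unambiguous for these three node kinds.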

Other node types like Attr would be more work since their XML representation is not easily distinguishable from Text. You would probably need to store an out-of-band nodeType value to remember it. OTOH minidom doesn't implement toxml() on Attr anyway so maybe that's not an issue.
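If the node type is stored out of band, the guesswork goes away. The following is only a sketch under that assumption: `dump_node` and `load_node` are hypothetical names, the pair they exchange would map to two database columns, and only Element, Text, and Comment nodes are supported.

```python
from xml.dom import minidom, Node


def dump_node(node):
    """Return a (nodeType, bytes) pair, e.g. for two database columns."""
    return node.nodeType, node.toxml('utf-8')


def load_node(node_type, xml):
    """Rebuild a node from the pair produced by dump_node (hypothetical helper)."""
    if node_type == Node.ELEMENT_NODE:
        return minidom.parseString(xml).documentElement
    if node_type == Node.TEXT_NODE and xml == b'':
        # An empty text node leaves nothing to parse, so create one directly
        return minidom.Document().createTextNode('')
    if node_type in (Node.TEXT_NODE, Node.COMMENT_NODE):
        # Wrap in a dummy element so that the fragment parses as a document
        wrapped = minidom.parseString(b'<x>' + xml + b'</x>')
        return wrapped.documentElement.firstChild
    raise ValueError('unsupported nodeType: %r' % (node_type,))


# Sample document, invented for this illustration
doc = minidom.parseString('<a>foo<b>thing</b></a>')
text_type, text_xml = dump_node(doc.documentElement.firstChild)
text_again = load_node(text_type, text_xml)
print(text_again.nodeType == Node.TEXT_NODE, text_again.data)
```

Compared with sniffing the first character, this also lets an empty text node survive the round trip, at the cost of an extra stored value.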

Published: 2023-07-31 00:37:00

Link: https://www.elefans.com/category/jswz/34/1340530.html