I'm trying to parse, in Python, a large XML file that is being received over the network.
To do that, I fetch the data and pass it to lxml.etree.iterparse.
However, if the XML has not been fully sent yet, like so:
<MyXML>
  <MyNode foo="bar">
  <MyNode foo="ba
If I run etree.iterparse(f, tag='MyNode').next(), I get an XMLSyntaxError at wherever it was cut off.
Is there any way to receive the first tag (i.e. the first MyNode) and only get an exception when I reach the truncated part of the document? (That is, to make lxml really 'stream' the contents rather than read the whole thing at the beginning.)
Recommended answer
XMLPullParser and HTMLPullParser may better suit your needs. They get their data through repeated calls to parser.feed(data). The complete tree is only usable once all of the data has come in, but parser.read_events() lets you process the elements that have been parsed so far after each feed.
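A minimal sketch of the feed-based approach, not a definitive implementation. The helper name collect_nodes, the chunk boundaries, and the self-closing sample nodes are assumptions made for illustration. The import fallback is also an assumption added for portability: the standard library's xml.etree.ElementTree.XMLPullParser exposes the same feed(), read_events() and close() methods, so the sketch filters on the tag by hand instead of relying on lxml-only options.

```python
try:
    from lxml import etree  # the library discussed in the question
except ImportError:
    # Assumption for portability: the standard library has a pull parser
    # with the same feed()/read_events()/close() methods.
    import xml.etree.ElementTree as etree


def collect_nodes(chunks, tag="MyNode"):
    """Feed chunks into a pull parser as they arrive.

    Returns (attributes of every completed `tag` element, parse error or None).
    """
    parser = etree.XMLPullParser(events=("end",))
    seen = []
    error = None
    try:
        for chunk in chunks:
            parser.feed(chunk)
            # read_events() only yields elements that are complete so far.
            for _event, elem in parser.read_events():
                if elem.tag == tag:
                    seen.append(dict(elem.attrib))
        # close() signals the end of the input; truncated XML fails here.
        parser.close()
    except etree.ParseError as exc:
        error = exc
    return seen, error


# Hypothetical sample data: the stream is cut off inside the second node.
chunks = [b'<MyXML><MyNode foo="bar"/>', b'<MyNode foo="ba']
nodes, error = collect_nodes(chunks)
print(nodes)
print("parse error raised:", error is not None)
```

In a real program the chunks would come from the socket or HTTP response, and each completed node would be handled inside the loop instead of being collected into a list. The first node should be delivered before the parser complains about the truncated one.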