Problem description
#include <QtXml>

void test()
{
QDomDocument doc("doc");
QByteArray data = "<div><p>Of course, &ldquo;Jason.&rdquo; My thoughts, exactly.</p></div>";
QString sErrorMsg;
int errLine, errCol;
if (!doc.setContent(data, &sErrorMsg, &errLine, &errCol)) {
qDebug() << sErrorMsg;
qDebug() << errLine << ":" << errCol;
return;
}
QDomNodeList pList = doc.elementsByTagName("p");
for (int i = 0; i < pList.size(); i++)
{
QDomNode p = pList.at(i);
while (!p.isNull()) {
QDomElement e = p.toElement();
if (!e.isNull()) {
QByteArray ba = e.text().toUtf8(); // Here, the left and right quotation marks are gone.
}
p = p.nextSibling();
}
}
}
I'm parsing an HTML phrase with &ldquo; and &rdquo;. When the code reaches QByteArray ba = e.text().toUtf8(); the quotation marks are gone.
How can I keep them?
Recommended answer
I must admit that this is the first time I have used QDomDocument, although I already have some experience with XML in general and libXml2 specifically.
First, I can confirm that QDomElement::text() returns the text without the typographical quotes that were encoded as entities.
I modified the OP's MCVE a bit, and now it should be obvious why this happens.
My testQDomDocument:
#include <map>
#include <QtXml>
static const char* toString(QDomNode::NodeType nodeType);
int main(int, char**)
{
QByteArray text = "<div><p>Of course, &ldquo;Jason.&rdquo; My thoughts, exactly.</p></div>";
// setup doc. DOM
QDomDocument qDomDoc("doc");
QString qErrorMsg; int errorLine = 0, errorCol = 0;
if (!qDomDoc.setContent(text, &qErrorMsg, &errorLine, &errorCol)) {
qDebug() << "Line:" << errorLine << "Col.:" << errorCol << qErrorMsg;
return 1;
}
// inspect DOM
QDomNodeList qListP = qDomDoc.elementsByTagName("p");
const int nP = qListP.size();
qDebug() << "Number of found <p> nodes:" << nP;
for (int i = 0; i < nP; ++i) {
const QDomNode qNodeP = qListP.at(i);
qDebug() << "node <p> #" << i;
qDebug() << "node.toElement().text(): " << qNodeP.toElement().text();
for (QDomNode qNode = qNodeP.firstChild(); !qNode.isNull(); qNode = qNode.nextSibling()) {
qDebug() << toString(qNode.nodeType());
switch (qNode.nodeType()) {
case QDomNode::TextNode:
#if 1 // IMHO, the correct way:
qDebug() << qNode.toText().data();
#else // works as well:
qDebug() << qNode.nodeValue();
#endif // 1
break;
case QDomNode::EntityReferenceNode:
qDebug() << qNode.nodeName();
break;
default:; // rest of types left out to keep sample short
}
}
}
// done
return 0;
}
const char* toString(QDomNode::NodeType nodeType)
{
static const std::map<QDomNode::NodeType, const char*> mapNodeTypes {
{ QDomNode::ElementNode, "QDomNode::ElementNode" },
{ QDomNode::AttributeNode, "QDomNode::AttributeNode" },
{ QDomNode::TextNode, "QDomNode::TextNode" },
{ QDomNode::CDATASectionNode, "QDomNode::CDATASectionNode" },
{ QDomNode::EntityReferenceNode, "QDomNode::EntityReferenceNode" },
{ QDomNode::EntityNode, "QDomNode::EntityNode" },
{ QDomNode::ProcessingInstructionNode, "QDomNode::ProcessingInstructionNode" },
{ QDomNode::CommentNode, "QDomNode::CommentNode" },
{ QDomNode::DocumentNode, "QDomNode::DocumentNode" },
{ QDomNode::DocumentTypeNode, "QDomNode::DocumentTypeNode" },
{ QDomNode::DocumentFragmentNode, "QDomNode::DocumentFragmentNode" },
{ QDomNode::NotationNode, "QDomNode::NotationNode" },
{ QDomNode::BaseNode, "QDomNode::BaseNode" },
{ QDomNode::CharacterDataNode, "QDomNode::CharacterDataNode" }
};
const std::map<QDomNode::NodeType, const char*>::const_iterator iter
= mapNodeTypes.find(nodeType);
return iter != mapNodeTypes.end() ? iter->second : "<ERROR>";
}
The Qt project file – testQDomDocument.pro:
SOURCES = testQDomDocument
QT += xml
Build and test:
$ qmake-qt5 testQDomDocument.pro
$ make && ./testQDomDocument
g++ -c -fno-keep-inline-dllexport -D_GNU_SOURCE -pipe -O2 -Wall -W -D_REENTRANT -DQT_NO_DEBUG -DQT_GUI_LIB -DQT_XML_LIB -DQT_CORE_LIB -I. -isystem /usr/include/qt5 -isystem /usr/include/qt5/QtGui -isystem /usr/include/qt5/QtXml -isystem /usr/include/qt5/QtCore -I. -I/usr/lib/qt5/mkspecs/cygwin-g++ -o testQDomDocument.o testQDomDocument
g++ -o testQDomDocument.exe testQDomDocument.o -lQt5Gui -lQt5Xml -lQt5Core -lGL -lpthread
Number of found <p> nodes: 1
node <p> # 0
node.toElement().text(): "Of course, Jason. My thoughts, exactly."
QDomNode::TextNode
"Of course, "
QDomNode::EntityReferenceNode
"ldquo"
QDomNode::TextNode
"Jason."
QDomNode::EntityReferenceNode
"rdquo"
QDomNode::TextNode
" My thoughts, exactly."
$
To understand what happened, it helps to know that the contents of <p> aren't stored directly in the QDomNode instance for <p>. Instead, the QDomNode instance for <p> (like that of any other element) has child nodes to store its contents, e.g. a QDomText instance to store a piece of text.
So, QDomElement::text() is a convenience function which returns only the (collected) text but seems to ignore any other nodes. In the OP's sample, not all child nodes of the QDomElement for <p> are text nodes.
The entities (&ldquo;, &rdquo;) are stored as QDomEntityReference instances (https://doc.qt.io/qt-5/qdomentityreference.html) and are obviously skipped in QDomElement::text().
I must admit I was a bit surprised because (according to my experience with libXml2) I'm used to entities being resolved into text as well.
The paragraph in the QDomEntityReference documentation:
"Moreover, the XML processor may completely expand references to entities while building the DOM tree, instead of providing QDomEntityReference objects."
supported the same expectation for QDomDocument.
However, the sample shows that this isn't true in this case.
Thinking twice, I realized that &ldquo; and &rdquo; are not predefined entities in XML.
They are predefined in HTML5 (and before) but not in general XML. The only predefined entities in XML are:
Name | Chr. | Codepoint   | Meaning
-----+------+-------------+------------------
quot | "    | U+0022 (34) | quotation mark
amp  | &    | U+0026 (38) | ampersand
apos | '    | U+0027 (39) | apostrophe
lt   | <    | U+003C (60) | less-than sign
gt   | >    | U+003E (62) | greater-than sign
So, for the replacement of HTML entities, something else is needed in QDomDocument.
Btw., while looking for a hint in this direction, I stumbled upon:
SO: QDomDocument fails to set the content of an HTML document with a tag
I thought for a while about how this can be fixed.
I wonder why I didn't immediately think of a very simple fix: replacing the entities by numeric character references (NCRs).
HTML Entity | NCR
------------+----------
&ldquo;     | &#8220;
&rdquo;     | &#8221;
A slight modification of the sample above:
int main(int, char**)
{
QByteArray text =
"<div><p>Of course, &#8220;Jason.&#8221; My thoughts, exactly.</p></div>";
// setup doc. DOM
QDomDocument qDomDoc("doc");
QString qErrorMsg; int errorLine = 0, errorCol = 0;
if (!qDomDoc.setContent(text, &qErrorMsg, &errorLine, &errorCol)) {
qDebug() << "Line:" << errorLine << "Col.:" << errorCol << qErrorMsg;
return 1;
}
// inspect DOM
QDomNodeList qListP = qDomDoc.elementsByTagName("p");
const int nP = qListP.size();
qDebug() << "Number of found <p> nodes:" << nP;
for (int i = 0; i < nP; ++i) {
const QDomNode qNodeP = qListP.at(i);
qDebug() << "node <p> #" << i;
qDebug() << "node.toElement().text(): " << qNodeP.toElement().text().toUtf8();
for (QDomNode qNode = qNodeP.firstChild(); !qNode.isNull(); qNode = qNode.nextSibling()) {
qDebug() << toString(qNode.nodeType());
switch (qNode.nodeType()) {
case QDomNode::TextNode:
qDebug() << qNode.toText().data().toUtf8();
break;
case QDomNode::EntityReferenceNode:
qDebug() << qNode.nodeName();
break;
default:; // rest of types left out to keep sample short
}
}
}
// done
return 0;
}
I got the following output:
$ make && ./testQDomDocument
g++ -c -fno-keep-inline-dllexport -D_GNU_SOURCE -pipe -O2 -Wall -W -D_REENTRANT -DQT_NO_DEBUG -DQT_GUI_LIB -DQT_XML_LIB -DQT_CORE_LIB -I. -isystem /usr/include/qt5 -isystem /usr/include/qt5/QtGui -isystem /usr/include/qt5/QtXml -isystem /usr/include/qt5/QtCore -I. -I/usr/lib/qt5/mkspecs/cygwin-g++ -o testQDomDocument.o testQDomDocument
g++ -o testQDomDocument.exe testQDomDocument.o -lQt5Gui -lQt5Xml -lQt5Core -lGL -lpthread
Number of found <p> nodes: 1
node <p> # 0
node.toElement().text(): "Of course, \xE2\x80\x9CJason.\xE2\x80\x9D My thoughts, exactly."
QDomNode::TextNode
"Of course, \xE2\x80\x9CJason.\xE2\x80\x9D My thoughts, exactly."
$
Et voilà! Now, there is only one child node in <p>, holding the complete text including the quotes which were encoded as NCRs.
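In the sample, the entities were replaced by hand in the string literal. As a minimal sketch, assuming only a small and known set of entity names occurs in the input, the same substitution could be applied to the data before it is handed to setContent(). The helper name namedEntitiesToNcr and its two-entry table are hypothetical (not part of Qt); the code uses the C++ standard library only.

```cpp
#include <map>
#include <string>

// Hypothetical helper (not a Qt API): rewrites "&name;" as "&#code;" for
// every entity name found in the table; anything else is copied unchanged.
std::string namedEntitiesToNcr(const std::string& html)
{
    // Assumption: only these two HTML entities need to be handled.
    static const std::map<std::string, unsigned> table {
        { "ldquo", 8220 }, // U+201C left double quotation mark
        { "rdquo", 8221 }  // U+201D right double quotation mark
    };
    std::string out;
    std::string::size_type pos = 0;
    while (pos < html.size()) {
        const std::string::size_type amp = html.find('&', pos);
        if (amp == std::string::npos) {
            out.append(html, pos, std::string::npos); // no more '&': copy rest
            break;
        }
        out.append(html, pos, amp - pos); // copy text before '&'
        const std::string::size_type semi = html.find(';', amp + 1);
        if (semi != std::string::npos) {
            const auto iter = table.find(html.substr(amp + 1, semi - amp - 1));
            if (iter != table.end()) {
                out += "&#" + std::to_string(iter->second) + ";";
                pos = semi + 1;
                continue;
            }
        }
        out += '&'; // unknown or unterminated reference: keep the '&' as is
        pos = amp + 1;
    }
    return out;
}
```

Predefined XML entities such as &amp; are not in the table and therefore pass through untouched. The returned string could then be converted to a QByteArray and passed to setContent() as in the sample.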
Though, the output of the quotes as \xE2\x80\x9C and \xE2\x80\x9D made me a bit uncertain. (Please note that I added .toUtf8() to the debug output because I got ? and ? before.)
A short check of a UTF-8 encoding table and the Unicode characters convinced me that these UTF-8 byte sequences are correct.
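The byte sequences can also be checked with a few lines of code instead of a table. The following sketch uses a hypothetical helper, encodeUtf8 (standard C++ only, no Qt), which encodes a single code point according to the UTF-8 bit layout. For U+201C (8220) and U+201D (8221) it yields the bytes E2 80 9C and E2 80 9D, matching the output above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8 bytes.
// Surrogates and values above U+10FFFF are rejected by the assertion.
std::string encodeUtf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {            // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```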
But why the escaping? A wrong LANG setting in my bash?
$ ./testQDomDocument 2>&1 | hexdump -C
00000000 4e 75 6d 62 65 72 20 6f 66 20 66 6f 75 6e 64 20 |Number of found |
00000010 3c 70 3e 20 6e 6f 64 65 73 3a 20 31 0a 6e 6f 64 |<p> nodes: 1.nod|
00000020 65 20 3c 70 3e 20 23 20 30 0a 6e 6f 64 65 2e 74 |e <p> # 0.node.t|
00000030 6f 45 6c 65 6d 65 6e 74 28 29 2e 74 65 78 74 28 |oElement().text(|
00000040 29 3a 20 20 22 4f 66 20 63 6f 75 72 73 65 2c 20 |): "Of course, |
00000050 5c 78 45 32 5c 78 38 30 5c 78 39 43 4a 61 73 6f |\xE2\x80\x9CJaso|
00000060 6e 2e 5c 78 45 32 5c 78 38 30 5c 78 39 44 20 4d |n.\xE2\x80\x9D M|
00000070 79 20 74 68 6f 75 67 68 74 73 2c 20 65 78 61 63 |y thoughts, exac|
00000080 74 6c 79 2e 22 0a 51 44 6f 6d 4e 6f 64 65 3a 3a |tly.".QDomNode::|
00000090 54 65 78 74 4e 6f 64 65 0a 22 4f 66 20 63 6f 75 |TextNode."Of cou|
000000a0 72 73 65 2c 20 5c 78 45 32 5c 78 38 30 5c 78 39 |rse, \xE2\x80\x9|
000000b0 43 4a 61 73 6f 6e 2e 5c 78 45 32 5c 78 38 30 5c |CJason.\xE2\x80\|
000000c0 78 39 44 20 4d 79 20 74 68 6f 75 67 68 74 73 2c |x9D My thoughts,|
000000d0 20 65 78 61 63 74 6c 79 2e 22 0a | exactly.".|
000000db
$
Aha. That rather seems to be caused by qDebug(), which escapes all bytes with values of 128 and above.
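To illustrate what such escaping does to the UTF-8 bytes, here is a small imitation in standard C++. It is only a sketch of the observed behavior under the assumption that ASCII bytes are passed through and every byte of 128 and above is written as \xHH; it is not Qt's actual implementation, and the name escapeHighBytes is hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper imitating the escaping seen in the hexdump:
// bytes below 128 are copied, all others become the four characters \xHH.
std::string escapeHighBytes(const std::string& bytes)
{
    std::string out;
    for (unsigned char c : bytes) {
        if (c < 128) {
            out += static_cast<char>(c);
        } else {
            char buf[5]; // backslash, 'x', two hex digits, terminating NUL
            const int n = std::snprintf(buf, sizeof buf, "\\x%02X",
                                        static_cast<unsigned>(c));
            assert(n == 4);
            (void)n; // avoids an unused-variable warning with NDEBUG
            out += buf;
        }
    }
    return out;
}
```

Applied to the three bytes E2 80 9C followed by "Jason.", this produces the literal text \xE2\x80\x9CJason. as it appears in the hexdump.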