I'm building an in-house, reader-style PHP app that fetches text from our pages and then manipulates it in various ways. The text on most of our HTML pages is unstructured, so the app has to grab it without relying on class names or other navigation anchors, since there are none. The only usable anchor is the title text.
I would like to fetch the text that follows a given start node (the title) and stop when I reach an img tag. The img may or may not exist; if it doesn't, all of the text should be fetched. So far I have only managed to fetch the text with XPath, without stopping at the image.
Here's a sample of the HTML:
<b>Some title</b> <br/> Important text <br/> More important text <p> More text I which should be fetched</p> <p><img src="foo.jpg"/></p> <p> Unimportant text, don't want it!</p>
This is the XPath query I'm currently using:
//*[text()="Some title"]/following::text()
The above does fetch the relevant text, but I would like it to stop at the img tag if one exists. Any idea how to do this?
Best answer
Fetch all text nodes that are not preceded by an image:
//*[text()="Some title"]/following::text()[not(preceding::img)]
If needed, you can easily restrict further which images to stop at.
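One restriction that may be worth making: preceding::img matches any img earlier in the document, so an image placed before the title would filter out every text node. The sketch below runs a restricted variant of the query from PHP using DOMDocument and DOMXPath, counting only images that come after the title. The helper name fetchTextAfterTitle, the wrapping div, and the extra logo.png image are assumptions made for illustration and are not part of the original question.

```php
<?php
// Hypothetical helper: returns the trimmed, non-empty text nodes that follow
// the element whose text equals $title, stopping at the first later <img>.
// Assumes $title contains no double quotes, since it is interpolated into
// the XPath expression as-is.
function fetchTextAfterTitle(string $html, string $title): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML often triggers parser warnings
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $start = '//*[text()="' . $title . '"]';
    // Only an <img> that itself comes after the title acts as a stop marker.
    $query = $start
           . '/following::text()[not(preceding::img[preceding::*[text()="' . $title . '"]])]';

    $parts = [];
    foreach ($xpath->query($query) as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') { // skip whitespace-only nodes between tags
            $parts[] = $text;
        }
    }
    return $parts;
}

// Sample markup from the question, with an extra image before the title.
$html = '<div><img src="logo.png"/><b>Some title</b> <br/> Important text <br/> '
      . 'More important text <p> More text I which should be fetched</p> '
      . '<p><img src="foo.jpg"/></p> <p> Unimportant text, don\'t want it!</p></div>';

print_r(fetchTextAfterTitle($html, 'Some title'));
```

With this markup the call should return the three text fragments between the title and foo.jpg. If the page contains no img after the title, the predicate never excludes anything and all following text is returned, which matches the behaviour asked for in the question.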