I am working through the Scrapy tutorial and trying to scrape the text, author and tags of each quote on the companion website. When I use CSS selectors as shown there:

    for quote in response.css('div.quote'):
        print(quote.css('span.text::text').extract())
        print(quote.css('span small::text').extract())
        print(quote.css('div.tags a.tag::text').extract())

I get the desired result: each quote's text, author and tags printed once. But when I use XPath selectors like this:

    for quote in response.xpath("//*[@class='quote']"):
        print(quote.xpath("//*[@class='text']/text()").extract())
        print(quote.xpath("//*[@class='author']/text()").extract())
        print(quote.xpath("//*[@class='tag']/text()").extract())

I get duplicate results!

I still can't work out why the two behave differently.
Best answer
Try .// instead of // for your relative searches, e.g.

    print(quote.xpath(".//*[@class='text']/text()").extract())
An expression that starts with // is an absolute path: even though you call it on quote, the search starts at the root of the document, so every iteration of the loop matches the elements of all quotes on the page. An expression that starts with .// searches from . (the current node), so the search is limited to the elements nested under quote.
As a side note, if you want exactly the same results as the CSS version, consider changing * to the tags you used in the CSS search (span or div). In this case it makes no difference, but it is a heads-up for future reference.