Scrapy: follow pagination links not working

Set-up

I'm scraping housing ads using the scrapy example given here.

In my case, I follow links to housing ad pages instead of author pages, and subsequently scrape each housing ad page for information.


Problem

My code successfully follows links to housing ad pages and scrapes the information per ad. However, it does so only for the initial page, i.e. it does not follow the pagination links.


Code so far

import scrapy


class RoomsSpider(scrapy.Spider):
    name = 'rooms'
    start_urls = ['https://www.spareroom.co.uk/flatshare/london']

    def parse(self, response):
        # follow links to ad pages
        for href in response.xpath(
                '//*[@id="maincontent"]/ul/li/article/header[1]',
        ).css('a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_ad)

        # follow pagination links
        next_page = response.xpath(
            '//*[@id="maincontent"]/div[2]/ul[2]/li/strong/a/@href',
        ).extract()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_ad(self, response):
        # code extracting ad information follows here,
        # finalising the code with a yield function.
        pass

So, I am basically following the example. I get no error from the pagination part when running the code, and the XPath query is correct (I believe).
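As a sanity check, the pagination XPath can be inspected interactively with scrapy shell; the session below is a sketch, assuming the page layout has not changed since the question was asked:

    $ scrapy shell 'https://www.spareroom.co.uk/flatshare/london'
    >>> response.xpath(
    ...     '//*[@id="maincontent"]/div[2]/ul[2]/li/strong/a/@href',
    ... ).extract()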

Have I placed the # follow pagination links part correctly in the code? I'm lost.

Accepted answer

Appeared to be a silly mistake:

    next_page = response.xpath(
        '//*[@id="maincontent"]/div[2]/ul[2]/li/strong/a/@href',
    ).extract()

provides a one-element list containing the pagination href, e.g. ['/href']. But for the code to work, a string is needed, e.g. '/href'. Therefore, in the code snippet above, replace extract() with extract()[0].
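A corrected pagination block might then look like this (a minimal sketch; extract_first() is Scrapy's shorthand that returns the first match as a string, or None when nothing matches, which plays nicely with the existing None check):

    # follow pagination links
    next_page = response.xpath(
        '//*[@id="maincontent"]/div[2]/ul[2]/li/strong/a/@href',
    ).extract_first()  # a string, or None if there is no match
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)

Note that extract()[0] raises an IndexError on the last results page, where no "next" link exists; extract_first() avoids that.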
