Set-up
I'm scraping housing ads using the scrapy example given here.
In my case, I follow links to housing ad pages instead of author pages, and subsequently scrape each housing ad page for information.
Problem
My code successfully follows links to housing ad pages and scrapes the information per ad. However, it does so only for the initial page, i.e. it does not follow the pagination links.
Code so far

    class RoomsSpider(scrapy.Spider):
        name = 'rooms'
        start_urls = ['https://www.spareroom.co.uk/flatshare/london']

        def parse(self, response):
            # follow links to ad pages
            for href in response.xpath(
                '//*[@id="maincontent"]/ul/li/article/header[1]',
            ).css('a::attr(href)').extract():
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse_ad)

            # follow pagination links
            next_page = response.xpath(
                '//*[@id="maincontent"]/div[2]/ul[2]/li/strong/a/@href',
            ).extract()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)

        def parse_ad(self, response):
            # code extracting ad information follows here,
            # finalising the code with a yield function
So I am basically following the example. I do not receive any error about the pagination-link part when running the code, and the XPath query is correct (I believe).
Have I placed the # follow pagination links part correctly in the code? I'm lost.
Best answer
It turned out to be a silly mistake:

    next_page = response.xpath(
        '//*[@id="maincontent"]/div[2]/ul[2]/li/strong/a/@href',
    ).extract()

provides a one-element list containing the pagination href, e.g. ['\href']. But for the code to work, a string is needed, e.g. '\href'. Therefore, in the code snippet above, replace extract() with extract()[0].
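The list-vs-string issue can be demonstrated without running the spider, using urllib.parse.urljoin (which backs scrapy's Response.urljoin). A minimal sketch; the href value below is a made-up example of what the pagination XPath might return:

    from urllib.parse import urljoin

    # .extract() always returns a list of matches, even when there is
    # exactly one -- e.g. a single pagination link.
    extracted = ['/flatshare/london/page2']  # hypothetical example value

    # urljoin needs a string, not a list, so take the first element --
    # this is what replacing extract() with extract()[0] does.
    next_page = extracted[0]
    url = urljoin('https://www.spareroom.co.uk/flatshare/london', next_page)
    print(url)  # https://www.spareroom.co.uk/flatshare/london/page2

Note that extract()[0] raises IndexError when nothing matches (e.g. on the last results page); scrapy's extract_first() returns None in that case, which the existing `if next_page is not None` check then handles correctly.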