Scrapy callback after redirect


I have a very basic Scrapy spider that grabs URLs from a file and downloads them. The only problem is that some of them get redirected to a slightly modified URL within the same domain. I want to pick these up in my callback function using response.meta, and that works for normal URLs, but when a URL is redirected the callback doesn't seem to get called. How can I fix this? Here's my code.

from scrapy.contrib.spiders import CrawlSpider
from scrapy import log
from scrapy import Request

class DmozSpider(CrawlSpider):
    name = "dmoz"
    handle_httpstatus_list = [302]
    allowed_domains = ["http://www.exmaple.net/"]
    f = open("C:\\python27\\1a.csv", 'r')
    url = 'http://www.exmaple.net/Query?indx='
    start_urls = [url + row for row in f.readlines()]

    def parse(self, response):
        print response.meta.get('redirect_urls', [response.url])
        print response.status
        print(response.headers.get('Location'))
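(For reference: when the default RedirectMiddleware is left to follow the redirect, the final response that reaches the callback records the URLs it passed through in response.meta['redirect_urls']. The following is a minimal sketch, not part of the original question, of reading the original URL back out; the variable name is just illustrative.)

def parse(self, response):
    # 'redirect_urls' is only present when at least one redirect was followed,
    # so fall back to the response's own url
    original_url = response.meta.get('redirect_urls', [response.url])[0]
    print original_url, '->', response.url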

I've also tried something like this:

def parse(self, response):
    return Request(response.url,
                   meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
                   callback=self.parse_my_url)

def parse_my_url(self, response):
    print response.status
    print(response.headers.get('Location'))

That doesn't work either.


Accepted answer

By default, Scrapy requests follow redirects. If you don't want a request to be redirected, override the start_requests method and add the flags to the request meta:

def start_requests(self):
    requests = [Request(self.url + u,
                        meta={'handle_httpstatus_list': [302],
                              'dont_redirect': True},
                        callback=self.parse)
                for u in self.start_urls]
    return requests
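Once the raw 302 is delivered to the callback, the redirect target can be read from its Location header and, if needed, followed by hand. The sketch below is not from the original answer; parse_redirected, original_url and the urljoin call are assumptions for illustration:

import urlparse

def parse(self, response):
    # the raw 302 reaches this callback because of dont_redirect and
    # handle_httpstatus_list, so the target sits in the Location header
    location = response.headers.get('Location')
    print response.status, location
    if location:
        # follow the redirect by hand, remembering where we started
        target = urlparse.urljoin(response.url, location)
        yield Request(target, meta={'original_url': response.url},
                      callback=self.parse_redirected)

def parse_redirected(self, response):
    print response.meta['original_url'], '->', response.url

Passing the original URL along in meta keeps the mapping between the URL read from the CSV file and the page it ultimately resolved to.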

