I've written a script in scrapy that makes proxied requests through newly generated proxies from the get_proxies() method. I used the requests module to fetch the proxies so they can be reused in the script. What I'm trying to do is parse all the movie links from its landing page and then fetch the name of each movie from its target page. The script below rotates proxies.
I know there is an easier way to change proxies, like the one described here with HttpProxyMiddleware, but I would still like to stick to the approach I'm trying here.
Website link
This is my current attempt (it keeps using new proxies to fetch a valid response, but every time it gets 503 Service Unavailable):
```python
import scrapy
import random
import requests
from itertools import cycle
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess


def get_proxies():
    response = requests.get("www.us-proxy/")
    soup = BeautifulSoup(response.text, "lxml")
    proxy = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text])
             for item in soup.select("table.table tbody tr") if "yes" in item.text]
    return proxy


class ProxySpider(scrapy.Spider):
    name = "proxiedscript"
    handle_httpstatus_list = [503]
    proxy_vault = get_proxies()
    check_url = "yts.am/browse-movies"

    def start_requests(self):
        random.shuffle(self.proxy_vault)
        proxy_url = next(cycle(self.proxy_vault))
        request = scrapy.Request(self.check_url, callback=self.parse, dont_filter=True)
        request.meta['https_proxy'] = f'{proxy_url}'
        yield request

    def parse(self, response):
        print(response.meta)
        if "DDoS protection by Cloudflare" in response.css(".attribution > a::text").get():
            random.shuffle(self.proxy_vault)
            proxy_url = next(cycle(self.proxy_vault))
            request = scrapy.Request(self.check_url, callback=self.parse, dont_filter=True)
            request.meta['https_proxy'] = f'{proxy_url}'
            yield request
        else:
            for item in response.css(".browse-movie-wrap a.browse-movie-title::attr(href)").getall():
                nlink = response.urljoin(item)
                yield scrapy.Request(nlink, callback=self.parse_details)

    def parse_details(self, response):
        name = response.css("#movie-info h1::text").get()
        yield {"Name": name}


if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    c.crawl(ProxySpider)
    c.start()
```
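Incidentally, the rotation logic above has a subtle flaw that is easy to miss: next(cycle(self.proxy_vault)) builds a brand-new cycle object on every call, so it always returns the first element of the list rather than advancing through it. A minimal demonstration with made-up proxy strings:

```python
from itertools import cycle

proxies = ["1.1.1.1:80", "2.2.2.2:80", "3.3.3.3:80"]

# Building a fresh cycle on each call always yields the first element.
first = next(cycle(proxies))
second = next(cycle(proxies))
assert first == second == "1.1.1.1:80"

# Creating the iterator once and reusing it actually rotates.
pool = cycle(proxies)
rotated = [next(pool) for _ in range(4)]
assert rotated == ["1.1.1.1:80", "2.2.2.2:80", "3.3.3.3:80", "1.1.1.1:80"]
```

In the spider this is partly masked by the random.shuffle() before each call, which puts a random proxy in first position, but the pool never truly rotates.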
To make sure whether the request is being proxied, I printed response.meta and could get results like this {'https_proxy': '142.93.127.126:3128', 'download_timeout': 180.0, 'download_slot': 'yts.am', 'download_latency': 0.237013578414917, 'retry_times': 2, 'depth': 0}.
As I've overused the link while checking how proxied requests work within scrapy, I'm getting a 503 Service Unavailable error at the moment, and I can see the phrase DDoS protection by Cloudflare within the response. However, I get a valid response when I try the requests module with the same logic I implemented here.
My earlier question: why can't I get a valid response when (I think) I'm using the proxies in the right way? [solved]
Bounty question: how can I define a try/except clause within my script so that it tries a different proxy once it throws a connection error with a certain proxy?
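One common pattern for this (a sketch, not the only way) is to attach an errback to each scrapy.Request and, inside it, drop the failing proxy from the pool before re-yielding the request with a fresh one. The pool-management part can be written independently of Scrapy; the ProxyPool name below is illustrative, not a Scrapy API:

```python
import random
from itertools import cycle


class ProxyPool:
    """Keep a rotating pool of proxies; drop the ones that fail."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        random.shuffle(self._proxies)
        self._cycle = cycle(self._proxies)

    def get(self):
        """Return the next proxy in rotation."""
        if not self._proxies:
            raise RuntimeError("proxy pool exhausted")
        return next(self._cycle)

    def discard(self, proxy):
        """Remove a proxy that raised a connection error and rebuild the cycle."""
        if proxy in self._proxies:
            self._proxies.remove(proxy)
            self._cycle = cycle(self._proxies)


pool = ProxyPool(["1.1.1.1:80", "2.2.2.2:80", "3.3.3.3:80"])
bad = pool.get()          # pretend this one raised a connection error
pool.discard(bad)
remaining = {pool.get() for _ in range(6)}
assert bad not in remaining and len(remaining) == 2
```

Wired into the spider, this would mean passing errback=self.handle_failure to scrapy.Request; inside that errback you can read the failing proxy from failure.request.meta['proxy'], call pool.discard() on it, and yield a copy of the request with pool.get(). The handle_failure name and the wiring details are assumptions for illustration.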
Answer: According to the scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware docs (and source), the proxy meta key is expected to be used (not https_proxy):
```python
# request.meta['https_proxy'] = f'{proxy_url}'
request.meta['proxy'] = f'{proxy_url}'
```
Since scrapy didn't receive a valid meta key, your scrapy application didn't use the proxies.
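A minimal sketch of the corrected meta construction, assuming the strings returned by get_proxies() are bare host:port pairs (the http:// scheme prefix is an assumption; adjust it to match the proxy type):

```python
def proxy_meta(proxy_hostport):
    """Build the request meta dict that Scrapy's HttpProxyMiddleware reads.

    The middleware looks for the 'proxy' key (not 'https_proxy'),
    and the value should carry a scheme.
    """
    return {"proxy": f"http://{proxy_hostport}"}


meta = proxy_meta("142.93.127.126:3128")
assert meta == {"proxy": "http://142.93.127.126:3128"}
```

In the spider this amounts to writing request.meta['proxy'] = f'http://{proxy_url}' in both start_requests and parse.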