Scrapy doesn't crawl all pages


This is my working code:

from scrapy.item import Item, Field

class Test2Item(Item):
    title = Field()


from scrapy.http import Request
from scrapy.conf import settings
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class Khmer24Spider(CrawlSpider):
    name = 'khmer24'
    allowed_domains = ['www.khmer24.com']
    start_urls = ['http://www.khmer24.com/']
    USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.97 Safari/537.22 AlexaToolbar/alxg-3.1"
    DOWNLOAD_DELAY = 2

    rules = (
        Rule(SgmlLinkExtractor(allow=r'ad/.+/67-\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        i = Test2Item()
        i['title'] = hxs.select('//div[@class="innerbox"]/h1/text()').extract()[0].strip(' \t\n\r')
        return i

It scrapes only 10 or 15 records, always a random number. I can't manage to get all the pages that match the pattern http://www.khmer24.com/ad/any-words/67-anynumber.html

I really suspect that Scrapy finished crawling because it filtered out duplicate requests. It has been suggested to use dont_filter=True, but I have no idea where to put it in my code. I'm a newbie to Scrapy and really need help.

Accepted answer


1."They have suggested to use dont_filter = True however, I have no idea of where to put it in my code."

This argument is handled in BaseSpider, which CrawlSpider inherits from (see scrapy/spider.py), and it is already set to True by default for the start requests.
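To make that concrete: dont_filter is a keyword argument of Request, and BaseSpider's default make_requests_from_url (in the Scrapy 0.x API that the question's code uses) already passes it for every start URL. Below is a minimal sketch of an equivalent override, purely for illustration; the spider name is hypothetical:

from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider

class ExampleSpider(CrawlSpider):
    # Hypothetical spider, shown only to illustrate where dont_filter goes.
    name = 'example'
    start_urls = ['http://www.khmer24.com/']

    def make_requests_from_url(self, url):
        # Effectively what BaseSpider already does by default:
        # start requests bypass the duplicate-request filter.
        return Request(url, dont_filter=True)

So there is normally nothing to add for the start URLs; the filter only drops requests for pages that have already been requested.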

2."It can scrap only 10 or 15 records."

Reason: the choice of start_urls is not that good. Here the spider starts crawling at http://www.khmer24.com/; let's assume it finds 10 URLs to follow that satisfy the pattern. The spider then goes on to crawl those 10 pages, but since they themselves contain very few links that satisfy the pattern, the spider ends up with only a few new URLs to follow (or even none), and the crawl stops.
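One way to check this is to count how many links on a given page actually match the pattern. A small diagnostic sketch, assuming it is run inside scrapy shell http://www.khmer24.com/, where response is predefined:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# Count the links on the current page that satisfy the ad pattern.
links = SgmlLinkExtractor(allow=r'ad/.+/67-\d+\.html').extract_links(response)
print len(links)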

Possible solution: the reasoning above simply restates icecrime's point, and so does the solution.

The suggestion is to use the 'All ads' page as start_urls. (You could also keep the home page as start_urls and use the new rules below; a combined sketch follows them.)

New rules:

rules = (
    # Extract links matching the "ad/any-words/67-anynumber.html" pattern
    # and parse them with the spider's parse_item method (do NOT follow them).
    # This rule comes first because, when several rules match the same link,
    # only the first matching rule is applied.
    Rule(SgmlLinkExtractor(allow=r'ad/.+/67-\d+\.html'), callback='parse_item'),

    # Extract and follow all remaining links
    # (no callback means follow=True by default;
    # if "allow" is not given, the extractor matches all links).
    Rule(SgmlLinkExtractor()),
)
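Putting it together, here is a minimal sketch of the revised spider. It keeps the home page as start_urls, since the URL of the 'All ads' page is not given here, and reuses the imports and item class from the question. The specific rule is listed before the catch-all one because only the first matching rule is applied to any given link:

from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class Test2Item(Item):
    title = Field()

class Khmer24Spider(CrawlSpider):
    name = 'khmer24'
    allowed_domains = ['www.khmer24.com']
    start_urls = ['http://www.khmer24.com/']

    rules = (
        # Parse (but do not follow) the ad detail pages.
        Rule(SgmlLinkExtractor(allow=r'ad/.+/67-\d+\.html'), callback='parse_item'),
        # Follow every other link so the spider keeps discovering new pages.
        Rule(SgmlLinkExtractor()),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = Test2Item()
        item['title'] = hxs.select('//div[@class="innerbox"]/h1/text()').extract()[0].strip(' \t\n\r')
        return item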

See: SgmlLinkExtractor and the CrawlSpider example in the Scrapy documentation.
