How to Schedule Scrapy Crawl Execution Programmatically

Question

I want to create a scheduler script to run the same spider multiple times in a sequence.

So far I have the following:

#!/usr/bin/python3
"""Scheduler for spiders."""
import time

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from my_project.spiders.deals import DealsSpider


def crawl_job():
    """Job to start spiders."""
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    process.crawl(DealsSpider)
    process.start()  # the script will block here until the end of the crawl


if __name__ == '__main__':
    while True:
        crawl_job()
        time.sleep(30)  # wait 30 seconds then crawl again

The first time, the spider executes properly; then, after the time delay, the spider starts up again, but right before it would begin scraping I get the following error message:

Traceback (most recent call last):
  File "scheduler.py", line 27, in <module>
    crawl_job()
  File "scheduler.py", line 17, in crawl_job
    process.start()  # the script will block here until the end of the crawl
  File "/usr/local/lib/python3.5/dist-packages/scrapy/crawler.py", line 285, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Unfortunately I'm not familiar with the Twisted framework and its Reactors, so any help would be appreciated!

Answer

You're getting the ReactorNotRestartable error because the reactor cannot be started multiple times in Twisted. Basically, each time process.start() is called, it tries to start the reactor. There's plenty of information around the web about this. Here's a simple solution:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

from my_project.spiders.deals import DealsSpider


def crawl_job():
    """
    Job to start spiders.
    Return Deferred, which will execute after crawl has completed.
    """
    settings = get_project_settings()
    runner = CrawlerRunner(settings)
    return runner.crawl(DealsSpider)


def schedule_next_crawl(null, sleep_time):
    """Schedule the next crawl."""
    reactor.callLater(sleep_time, crawl)


def crawl():
    """
    A "recursive" function that schedules a crawl 30 seconds after
    each successful crawl.
    """
    # crawl_job() returns a Deferred
    d = crawl_job()
    # call schedule_next_crawl(<scrapy response>, n) after crawl job is complete
    d.addCallback(schedule_next_crawl, 30)
    d.addErrback(catch_error)


def catch_error(failure):
    print(failure.value)


if __name__ == "__main__":
    crawl()
    reactor.run()

There are a few noticeable differences from your snippet: the reactor is invoked directly, CrawlerProcess is replaced with CrawlerRunner, time.sleep has been removed so that the reactor doesn't block, and the while loop has been replaced with a repeated call to the crawl function via callLater. It's short and should do what you want. If any parts confuse you, let me know and I'll elaborate.
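Note that CrawlerRunner, unlike CrawlerProcess, does not configure Scrapy's logging for you, so if you make this switch and suddenly see no log output, call configure_logging() once before starting the reactor. A minimal sketch:

from scrapy.utils.log import configure_logging

# CrawlerRunner does not set up logging itself (CrawlerProcess does),
# so configure it once before reactor.run().
configure_logging()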

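If you want to crawl at a specific time of day rather than on a fixed interval, the same callLater pattern works: compute the number of seconds until the target time and pass that as the delay. For example, to crawl every day at 1:30 pm, schedule_next_crawl can be adapted like this: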
import datetime as dt


def schedule_next_crawl(null, hour, minute):
    tomorrow = (
        dt.datetime.now() + dt.timedelta(days=1)
    ).replace(hour=hour, minute=minute, second=0, microsecond=0)
    sleep_time = (tomorrow - dt.datetime.now()).total_seconds()
    reactor.callLater(sleep_time, crawl)


def crawl():
    d = crawl_job()
    # crawl every day at 1:30 pm
    d.addCallback(schedule_next_crawl, hour=13, minute=30)
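One subtlety in this variant: the next run is always scheduled for tomorrow, so if you start the script before 1:30 pm the same-day slot is skipped; adjust the tomorrow calculation if that matters for your use case. Also note that this snippet omits the errback, so you may want to keep the d.addErrback(catch_error) line from the earlier version.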
