Scrapy: crawling the three-level pages of Daomu Biji (盗墓笔记)




# Today's goal

**Scrapy: crawling the three-level pages of Daomu Biji (盗墓笔记)**

Today's target is the Daomu Biji novel. Analysis of the site shows that the actual text of the novel sits on third-level pages, so we have to parse the levels one by one: the list of volumes, then each volume's chapter list, and finally the chapter page itself.

*Code implementation*

daomu.py

```
import scrapy
from ..items import DaomuItem


class DaomuSpider(scrapy.Spider):
    name = 'daomu'
    # domain and start URL as they appear in the repost (the full site URL was stripped)
    allowed_domains = ['daomubiji']
    start_urls = ['/']

    # parse the first-level page (the list of volumes)
    def parse(self, response):
        # link_list: ['http://xxx/dao-mu-bi-ji-1', '', '', '']
        link_list = response.xpath('//ul[@class="sub-menu"]/li/a/@href').extract()
        for link in link_list:
            # hand the request to the scheduler
            yield scrapy.Request(url=link, callback=self.parse_two_html)

    # parse the second-level page (volume name, chapter number, chapter name, link)
    def parse_two_html(self, response):
        # base xpath
        article_list = response.xpath('//article')
        for article in article_list:
            # create the item object
            item = DaomuItem()
            # info_list: ['七星鲁王', '第一章', '血尸']
            info_list = article.xpath('./a/text()').get().split()
            if len(info_list) == 3:
                item['volume_name'] = info_list[0]
                item['zh_num'] = info_list[1]
                item['zh_name'] = info_list[2]
            else:
                item['volume_name'] = info_list[0]
                item['zh_name'] = info_list[1]
                item['zh_num'] = ''
            # extract the chapter link and hand it to the scheduler queue
            item['zh_link'] = article.xpath('./a/@href').get()
            yield scrapy.Request(
                url=item['zh_link'],
                # meta: pass the item object on to the next parse function
                meta={'item': item},
                callback=self.parse_three_html
            )

    # parse the third-level page (the novel text)
    def parse_three_html(self, response):
        # get the item object passed in by the previous callback
        item = response.meta['item']
        # content_list: ['段落1', '段落2', '', '']
        content_list = response.xpath(
            '//article[@class="article-content"]//p/text()').extract()
        item['zh_content'] = '\n'.join(content_list)
        yield item
```
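A side note on the `meta={'item': item}` hand-off used above: it works fine, but Scrapy 1.7+ also provides `cb_kwargs`, which delivers the item to the callback as an ordinary keyword argument. A minimal sketch of the same hand-off, showing only the two places in the spider that would change (everything else stays as written above):

```
            # in parse_two_html(): pass the item via cb_kwargs instead of meta (Scrapy 1.7+)
            yield scrapy.Request(
                url=item['zh_link'],
                callback=self.parse_three_html,
                cb_kwargs={'item': item}
            )

    # the callback then receives the item directly, no response.meta lookup needed
    def parse_three_html(self, response, item):
        content_list = response.xpath(
            '//article[@class="article-content"]//p/text()').extract()
        item['zh_content'] = '\n'.join(content_list)
        yield item
```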
items.py

```
import scrapy


class DaomuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # volume name
    volume_name = scrapy.Field()
    # chapter number
    zh_num = scrapy.Field()
    # chapter name
    zh_name = scrapy.Field()
    # chapter link
    zh_link = scrapy.Field()
    # chapter (novel) text
    zh_content = scrapy.Field()
```

pipelines.py

```
class DaomuPipeline(object):
    def process_item(self, item, spider):
        # one file per chapter; the /home/tarena/daomu/ directory must exist beforehand
        filename = '/home/tarena/daomu/{}_{}_{}'.format(
            item['volume_name'], item['zh_num'], item['zh_name'])
        with open(filename, 'w') as f:
            f.write(item['zh_content'])
        return item
```
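One file the post doesn't show is settings.py: DaomuPipeline only receives items if it is registered there. A minimal sketch, assuming the Scrapy project package is named `Daomu` (the actual project name isn't given in the post, so adjust the dotted path to your own package):

```
# settings.py -- register the pipeline so process_item() runs for every yielded item
ITEM_PIPELINES = {
    # 'Daomu' is an assumed package name; the number is the pipeline's run order
    'Daomu.pipelines.DaomuPipeline': 300,
}
```

With the pipeline enabled, the spider is started from the project directory with `scrapy crawl daomu`, and each chapter ends up as its own file under /home/tarena/daomu/ (the directory has to exist beforehand, because the pipeline opens the file path directly).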

 
