Overview

In day-to-day crawling we often need to go beyond the current page and scrape the sub-pages it links to. A plain single-page spider can't do that, so today we'll try crawling sub-pages.
Getting Started
1. Create the project:

scrapy startproject pqejym
2. Generate the spider:

cd pqejym
scrapy genspider btdy www.btbtdy
3. Open PyCharm

Open the project directory in PyCharm.
4. Edit the settings.py file

ROBOTSTXT_OBEY = False
# The robots.txt rules; we set this to False (don't obey) so we can crawl more pages
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36'
# The User-Agent request header; this value can be copied from the browser's developer tools
5. Write the spider

Analysis: since we want the sub-pages, our start URL is the listing page itself. To reach the sub-pages we first have to collect their links, then request and parse each linked page for its content. Start with a method that gathers the sub-page links:

def parse(self, response):
    links = response.xpath('//div[@class="cts_ms"]/p/a/@href')
    for link in links:
        print(link.extract())
        yield response.follow(link, self.parse_content)
We use XPath to select the link nodes and loop over them; follow() is a built-in Scrapy method that schedules a request for each link.

How Scrapy handles a parse() method that uses yield: first, Scrapy calls such a method iteratively, roughly equivalent to:

for n in parse(self, response):
    pass
Second, Python treats parse() as a generator: its body does not start executing until it is first iterated, and each iteration step (the for loop above) runs the code up to the next yield, producing one value at a time.
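This lazy, one-value-at-a-time behavior can be demonstrated without Scrapy at all. Below is a minimal sketch (parse_like and the sample links are made up for illustration) of how a yield-based method behaves as a generator:

```python
def parse_like(links):
    # Stands in for a parse() method that uses yield: Python treats it
    # as a generator, so the body does not run until iteration starts.
    print("body starts")  # printed only when the first value is requested
    for link in links:
        yield link

gen = parse_like(["/a.html", "/b.html"])  # nothing has run or printed yet
results = list(gen)  # iterating drives the body, yielding one link at a time
print(results)
```

This is exactly why Scrapy can consume requests and items from parse() on demand instead of waiting for the whole method to finish.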
Let's try running it:

scrapy crawl btdy
Partial results:
/btdy/dy10862.html
/btdy/dy10598.html
/btdy/dy10186.html
/btdy/dy10216.html
/btdy/dy9749.html
/btdy/dy8611.html
/btdy/dy11748.html
/btdy/dy6403.html
/btdy/dy5165.html
/btdy/dy6219.html
/btdy/dy5164.html
/btdy/dy4356.html
/btdy/dy1670.html
/btdy/dy1669.html
/btdy/dy1668.html
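Note that the extracted hrefs are relative paths. response.follow() resolves them against the response's URL for us; a sketch with the standard library's urljoin shows the same idea (the base URL is taken verbatim from the spider, where the domain appears without a TLD):

```python
from urllib.parse import urljoin

# Base URL as written in the spider's start_urls.
base = "http://www.btbtdy/"
links = ["/btdy/dy10862.html", "/btdy/dy10598.html"]

# response.follow() performs this resolution internally before requesting.
absolute = [urljoin(base, link) for link in links]
print(absolute)
```

With older-style scrapy.Request you would have to do this joining yourself; follow() accepting relative links (and even selectors) is the main convenience here.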
Next, write a method that parses each sub-page's content:

def parse_content(self, response):
    print(response.xpath('//title'))
    movie = PqejymItem()
    title = response.xpath('//h1/text()').extract()
    content = response.xpath('//div[@class="c05"]/span/text()').extract()
    magnet = response.xpath('//*[@id="nucms_downlist"]/div[2]/ul/li/span/a/@href').extract()
    movie['title'] = title
    movie['content'] = content
    movie['magnet'] = magnet
    yield movie
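One thing worth knowing here: .extract() always returns a list of matches, even when there is a single <h1>, which is why the logged items later show the title wrapped in a list. If you want scalars, collapse each list to its first element, as extract_first() does. A small plain-Python sketch (first_or_none and the sample dict are illustrative, reusing a title from the output below):

```python
def first_or_none(values):
    # Mimics Scrapy's extract_first(): first match, or None if empty.
    return values[0] if values else None

# Shape of what parse_content stores: every field is a list from .extract().
raw = {"title": ["爱情进化论"], "content": [], "magnet": []}
clean = {key: first_or_none(vals) for key, vals in raw.items()}
print(clean)
```

Empty lists (like the content and magnet fields in the run below) become None, which makes missing data explicit instead of silently storing [].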
Define the fields in items.py:

import scrapy

class PqejymItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    magnet = scrapy.Field()
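Why bother declaring fields instead of using a plain dict? A scrapy.Item behaves like a dict, but only declared fields may be set, which catches typos in field names early. The class below is not Scrapy's real implementation, just a minimal stand-in sketching that contract:

```python
class ItemSketch(dict):
    # Hypothetical stand-in for scrapy.Item: dict-like access,
    # but assignments to undeclared fields are rejected.
    fields = {"title", "content", "magnet"}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key!r} is not a declared field")
        super().__setitem__(key, value)

movie = ItemSketch()
movie["title"] = ["夜天子"]
try:
    movie["rating"] = 9.0  # undeclared field -> KeyError, as with scrapy.Item
except KeyError as err:
    print(err)
print(movie)
```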
The complete btdy spider:

import scrapy
from pqejym.items import PqejymItem

class BtdySpider(scrapy.Spider):
    name = 'btdy'
    allowed_domains = ['www.btbtdy']
    start_urls = ['http://www.btbtdy/']

    def parse(self, response):
        links = response.xpath('//div[@class="cts_ms"]/p/a/@href')
        for link in links:
            print(link.extract())
            yield response.follow(link, self.parse_content)

    def parse_content(self, response):
        print(response.xpath('//title'))
        movie = PqejymItem()
        title = response.xpath('//h1/text()').extract()
        content = response.xpath('//div[@class="c05"]/span/text()').extract()
        magnet = response.xpath('//*[@id="nucms_downlist"]/div[2]/ul/li/span/a/@href').extract()
        movie['title'] = title
        movie['content'] = content
        movie['magnet'] = magnet
        yield movie
6. Run the spider

scrapy crawl btdy
Partial output:
2018-08-24 17:53:43 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.btbtdy/btdy/dy13375.html>
{'content': [], 'magnet': [], 'title': ['那些年,我们正年轻']}
2018-08-24 17:53:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.btbtdy/btdy/dy13350.html> (referer: http://www.btbtdy/)
[<Selector xpath='//title' data='<title>爱情进化论全集-高清BT种子下载_迅雷下载-BT电影天堂</tit'>]
2018-08-24 17:53:43 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.btbtdy/btdy/dy13315.html>
{'content': [], 'magnet': [], 'title': ['爱情进化论']}
2018-08-24 17:53:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.btbtdy/btdy/dy13379.html> (referer: http://www.btbtdy/)
[<Selector xpath='//title' data='<title>天盛长歌全集-高清BT种子下载_迅雷下载-BT电影天堂</titl'>]
2018-08-24 17:53:43 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.btbtdy/btdy/dy13350.html>
{'content': [], 'magnet': [], 'title': ['天盛长歌']}
[<Selector xpath='//title' data='<title>夜天子全集-高清BT种子下载_迅雷下载-BT电影天堂</title'>]
2018-08-24 17:53:43 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.btbtdy/btdy/dy13379.html>
{'content': [], 'magnet': [], 'title': ['夜天子']}
2018-08-24 17:53:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.btbtdy/btdy/dy13243.html> (referer: http://www.btbtdy/)
[<Selector xpath='//title' data='<title>进击的巨人 第三季全集-高清BT种子下载_迅雷下载-BT电影天堂<'>]
2018-08-24 17:53:44 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.btbtdy/btdy/dy13243.html>
{'content': [], 'magnet': [], 'title': ['进击的巨人 第三季']}
2018-08-24 17:53:44 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-24 17:53:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 45111,