
Overview

In everyday crawling we often need to follow the links on the current page and scrape its sub-pages as well. A plain single-page spider is no longer enough for this, so today we'll try crawling sub-pages.

Getting started

1. Create the project:

scrapy startproject pqejym

2. Create the spider:

cd pqejym
scrapy genspider btdy www.btbtdy

3. Open PyCharm

Open the project directory in PyCharm.

4. Configure settings.py

ROBOTSTXT_OBEY = False  
## This controls robots.txt compliance; we set it to False so the spider ignores robots.txt and can fetch more pages
USER_AGENT =  'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36'
## This is the USER_AGENT request header; a value like this can be copied from your browser's developer tools
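Taken together, this step's settings.py fragment looks like the sketch below (DOWNLOAD_DELAY is an extra, illustrative politeness option I'm adding, not part of the original steps):

```python
# settings.py -- the two options from this step, combined.
ROBOTSTXT_OBEY = False   # ignore robots.txt so more pages can be fetched
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36')
DOWNLOAD_DELAY = 1       # illustrative extra: wait 1s between requests to be polite
```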

5. Write the spider

Analysis: since we're crawling sub-pages, our start URL is the listing page itself. To reach the sub-pages, we first have to collect their links, then request and parse each one. Let's start with a function that extracts the sub-page links:

def parse(self, response):
    links = response.xpath('//div[@class="cts_ms"]/p/a/@href')
    for link in links:
        print(link.extract())
        yield response.follow(link, self.parse_content)

We use XPath to grab the link nodes, then loop over them and hand each link to response.follow, which is a built-in Scrapy method. A note on how Scrapy handles the yield loop over page URLs:

First, Scrapy calls a parse() method that contains the yield keyword iteratively, roughly equivalent to:

for n in parse(self, response):
    pass
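As an aside, this iteration is what actually drives the generator. A minimal standalone sketch (fake_parse and seen are made-up names, not Scrapy code) shows that a generator's body runs only as it is iterated:

```python
seen = []

def fake_parse():
    # The body runs only as the caller iterates, one yield at a time.
    for url in ['/btdy/dy1.html', '/btdy/dy2.html']:
        seen.append(url)
        yield url

gen = fake_parse()   # nothing has executed yet: seen is still []
print(seen)          # []
first = next(gen)    # the first iteration runs the body up to the first yield
print(seen)          # ['/btdy/dy1.html']
```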

Second, Python treats parse() as a generator: its body does not run when the function is first called; only each iteration request (the for loop above) executes the code up to the next yield, producing one value per iteration. Let's try running it:

scrapy crawl btdy

Partial output:

/btdy/dy10862.html
/btdy/dy10598.html
/btdy/dy10186.html
/btdy/dy10216.html
/btdy/dy9749.html
/btdy/dy8611.html
/btdy/dy11748.html
/btdy/dy6403.html
/btdy/dy5165.html
/btdy/dy6219.html
/btdy/dy5164.html
/btdy/dy4356.html
/btdy/dy1670.html
/btdy/dy1669.html
/btdy/dy1668.html
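Notice that the extracted links are relative paths. response.follow resolves them against the current page URL automatically; with a bare scrapy.Request you would have to join them yourself, e.g. with the standard library (a sketch; the base URL mirrors the one used in this article):

```python
from urllib.parse import urljoin

base = 'http://www.btbtdy/'    # the spider's start URL
link = '/btdy/dy10862.html'    # a relative href, as extracted above
absolute = urljoin(base, link) # what response.follow does for us internally
print(absolute)                # http://www.btbtdy/btdy/dy10862.html
```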

Next, let's write a function to parse the sub-page content:

def parse_content(self, response):
    print(response.xpath('//title'))
    movie = PqejymItem()
    title = response.xpath('//h1/text()').extract()
    content = response.xpath('//div[@class="c05"]/span/text()').extract()
    magnet = response.xpath('//*[@id="nucms_downlist"]/div[2]/ul/li/span/a/@href').extract()
    movie['title'] = title
    movie['content'] = content
    movie['magnet'] = magnet
    yield movie

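One thing worth knowing about .extract(): it always returns a list of every match, which is why title above ends up as a list even when the page has a single h1. Scrapy's selectors also offer .extract_first() (and, in newer versions, .get()/.getall()) when you want a single string. The list-vs-first distinction can be sketched with the standard library's ElementTree, which supports a small XPath subset:

```python
import xml.etree.ElementTree as ET

html = '<div><h1>那些年,我们正年轻</h1></div>'
root = ET.fromstring(html)

# Like .extract(): a list of every match (possibly empty).
titles = [h1.text for h1 in root.findall('.//h1')]
# Like .extract_first(): one value, or None when nothing matched.
first = titles[0] if titles else None
print(titles, first)
```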

Then we define the fields in items.py:

import scrapy


class PqejymItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    magnet = scrapy.Field()

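A scrapy.Item behaves like a dict, but only keys declared with scrapy.Field() may be assigned; a misspelled field name raises KeyError. A rough stdlib approximation of that behavior (StrictItem is a made-up name for illustration, not Scrapy API):

```python
class StrictItem(dict):
    # Mimics scrapy.Item: only declared fields may be set.
    FIELDS = {'title', 'content', 'magnet'}

    def __setitem__(self, key, value):
        if key not in self.FIELDS:
            raise KeyError(f'{key!r} is not a declared field')
        super().__setitem__(key, value)

movie = StrictItem()
movie['title'] = ['爱情进化论']   # fine: a declared field
try:
    movie['titel'] = ['oops']     # typo: rejected, just like scrapy.Item
except KeyError as e:
    print('rejected:', e)
```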

The complete btdy spider code:

import scrapy
from pqejym.items import PqejymItem


class BtdySpider(scrapy.Spider):
    name = 'btdy'
    allowed_domains = ['www.btbtdy']
    start_urls = ['http://www.btbtdy/']

    def parse(self, response):
        links = response.xpath('//div[@class="cts_ms"]/p/a/@href')
        for link in links:
            print(link.extract())
            yield response.follow(link, self.parse_content)

    def parse_content(self, response):
        print(response.xpath('//title'))
        movie = PqejymItem()
        title = response.xpath('//h1/text()').extract()
        content = response.xpath('//div[@class="c05"]/span/text()').extract()
        magnet = response.xpath('//*[@id="nucms_downlist"]/div[2]/ul/li/span/a/@href').extract()
        movie['title'] = title
        movie['content'] = content
        movie['magnet'] = magnet
        yield movie

6. Run the spider

scrapy crawl btdy

Partial output:

2018-08-24 17:53:43 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.btbtdy/btdy/dy13375.html>
{'content': [], 'magnet': [], 'title': ['那些年,我们正年轻']}
2018-08-24 17:53:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.btbtdy/btdy/dy13350.html> (referer: http://www.btbtdy/)
[<Selector xpath='//title' data='<title>爱情进化论全集-高清BT种子下载_迅雷下载-BT电影天堂</tit'>]
2018-08-24 17:53:43 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.btbtdy/btdy/dy13315.html>
{'content': [], 'magnet': [], 'title': ['爱情进化论']}
2018-08-24 17:53:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.btbtdy/btdy/dy13379.html> (referer: http://www.btbtdy/)
[<Selector xpath='//title' data='<title>天盛长歌全集-高清BT种子下载_迅雷下载-BT电影天堂</titl'>]
2018-08-24 17:53:43 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.btbtdy/btdy/dy13350.html>
{'content': [], 'magnet': [], 'title': ['天盛长歌']}
[<Selector xpath='//title' data='<title>夜天子全集-高清BT种子下载_迅雷下载-BT电影天堂</title'>]
2018-08-24 17:53:43 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.btbtdy/btdy/dy13379.html>
{'content': [], 'magnet': [], 'title': ['夜天子']}
2018-08-24 17:53:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.btbtdy/btdy/dy13243.html> (referer: http://www.btbtdy/)
[<Selector xpath='//title' data='<title>进击的巨人 第三季全集-高清BT种子下载_迅雷下载-BT电影天堂<'>]
2018-08-24 17:53:44 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.btbtdy/btdy/dy13243.html>
{'content': [], 'magnet': [], 'title': ['进击的巨人 第三季']}
2018-08-24 17:53:44 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-24 17:53:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 45111,


Tags: crawling sub-pages