Learning Web Scraping: Crawling Maoyan Movies with Scrapy

Updated: 2024-10-27 06:33:40



Steps

1. Create the project (run the following three commands in a cmd or shell window; the spider's module imports use the name moviesinfo, so the project is created under that name, and genspider needs both a spider name and a domain)

scrapy startproject moviesinfo
cd moviesinfo
scrapy genspider maoyanm maoyan.com

This generates the following file structure:
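The screenshot of the generated tree did not survive republication; for reference, this is the standard layout that `scrapy startproject` produces (file comments added here for orientation):

```text
moviesinfo/
    scrapy.cfg            # deploy configuration
    moviesinfo/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            maoyanm.py    # created by `scrapy genspider`
```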

 

2. Edit the relevant files

maoyanm.py

# -*- coding: utf-8 -*-
import scrapy
from moviesinfo.items import MoviesinfoItem


class MaoyanmSpider(scrapy.Spider):
    name = 'maoyanm'
    allowed_domains = ['maoyan.com']
    # NOTE: the base URL was stripped when this article was republished;
    # https://maoyan.com/films?showType=3 is assumed here.
    start_urls = ['https://maoyan.com/films?showType=3&offset={}'.format((n - 1) * 30)
                  for n in range(1, 500)]

    def parse(self, response):
        # Each <dd> on the listing page links to a film detail page.
        urls = response.xpath('//dd/div[2]/a/@href').extract()
        for url in urls:
            yield scrapy.Request('https://maoyan.com' + url, callback=self.parseContent)

    def parseContent(self, response):
        names = response.xpath('/html/body/div[3]/div/div[2]/div[1]/h3/text()').extract()
        ennames = response.xpath('//div[@class="ename ellipsis"]/text()').extract()
        movietype = response.xpath('//li[@class="ellipsis"][1]/text()').extract()
        movietime = response.xpath('//li[@class="ellipsis"][2]/text()').extract()
        releasetime = response.xpath('//li[@class="ellipsis"][3]/text()').extract()
        # Build and yield one item per film.
        movieItem = MoviesinfoItem()
        movieItem['name'] = str(names[0]) + ' ' + str(ennames[0])
        movieItem['movietype'] = movietype[0]
        movieItem['movietime'] = movietime[0].replace('\n', '').replace(' ', '')
        movieItem['releasetime'] = releasetime[0]
        yield movieItem
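The `start_urls` list enumerates the paginated film listing: Maoyan shows 30 films per page and pages are addressed by an `offset` query parameter. A minimal sketch of that scheme (the `showType=3` base URL is an assumption, since the article's links were stripped):

```python
# Maoyan lists 30 films per page and paginates via an offset parameter.
# The base URL below is assumed; the original article's link was stripped.
base = 'https://maoyan.com/films?showType=3&offset={}'

start_urls = [base.format((n - 1) * 30) for n in range(1, 500)]

print(start_urls[0])    # offset=0  -> page 1
print(start_urls[1])    # offset=30 -> page 2
print(len(start_urls))  # 499 pages requested
```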

items.py

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class MoviesinfoItem(scrapy.Item):
    # Define the fields for your item here:
    name = scrapy.Field()
    movietype = scrapy.Field()
    movietime = scrapy.Field()
    releasetime = scrapy.Field()
pipelines.py

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json


class MoviesinfoPipeline(object):
    def open_spider(self, spider):
        # One JSON object per line, appended to movies.json.
        self.f = open('movies.json', 'a', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        data = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.f.write(data)
        return item
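The pipeline appends one JSON object per line to movies.json. A quick sketch of what a single written line looks like (the field values here are made up for illustration):

```python
import json

# A hypothetical scraped item, represented as a plain dict for illustration.
item = {
    'name': '霸王别姬 Farewell My Concubine',
    'movietype': '剧情,爱情',
    'movietime': '中国大陆,中国香港 / 171分钟',
    'releasetime': '1993-07-26上映',
}

# The same serialization the pipeline uses: ensure_ascii=False keeps the
# Chinese characters readable instead of escaping them to \uXXXX.
line = json.dumps(item, ensure_ascii=False) + '\n'
print(line, end='')
```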

settings.py

# Find this block and uncomment it.
ITEM_PIPELINES = {
    'moviesinfo.pipelines.MoviesinfoPipeline': 300,
}
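The P.S. at the end of this post notes that Maoyan eventually stops serving pages; throttling the crawl in settings.py sometimes helps. These settings are suggestions, not part of the original article:

```python
# Suggested additions to settings.py (assumptions, not from the original article):
DOWNLOAD_DELAY = 1        # wait 1 s between requests to reduce the risk of blocking
ROBOTSTXT_OBEY = False    # Scrapy's project template sets this to True by default
```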

 

Changing the User-Agent (optional)

Install fake_useragent (run the following command in a cmd or shell window):
pip install fake_useragent

middlewares.py

# Add the following code:
from fake_useragent import UserAgent


class RandomUserAgentMiddleware(object):
    # Swap in a random User-Agent for every outgoing request.
    def __init__(self, crawler):
        super(RandomUserAgentMiddleware, self).__init__()
        self.ua = UserAgent()
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        def get_ua():
            return getattr(self.ua, self.ua_type)
        request.headers.setdefault('User-Agent', get_ua())
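The article never shows it, but a downloader middleware only takes effect once it is registered in settings.py. A sketch, assuming the project module is named moviesinfo and the class lives in middlewares.py:

```python
# settings.py — register the middleware, otherwise Scrapy never calls it.
DOWNLOADER_MIDDLEWARES = {
    'moviesinfo.middlewares.RandomUserAgentMiddleware': 543,
}

# Which fake_useragent attribute to use: 'random', 'chrome', 'firefox', ...
RANDOM_UA_TYPE = 'random'
```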

 

3. Run the spider (run the following command in a cmd or shell window)

scrapy crawl maoyanm

Wait for the crawl to finish...

 

P.S. The crawl did not cover all pages as expected: it turns out that beyond a certain page number the site stops rendering results. The next step is to study some anti-scraping countermeasures, or to practice on sites with weaker anti-scraping defenses.

 


Published: 2024-02-07 06:52:25