Web Crawler (Part 6)

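Scraping the Maoyan movie rankings:

Goal: extract the details of the top 100 movies on the Maoyan ranking board. The requests library is more convenient than urllib, so it is used here, and regular expressions serve as the parsing tool for now.

The board is paginated at the bottom. The URL of the first page is: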

  

Clicking through to the following pages gives:

=10
=20

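Each later page adds one parameter to the URL, offset=10 for the second page, and the value grows by 10 with every further page, so offset is presumably a pagination offset.

The rule is that offset is the number of movies skipped: with an offset of n, the page lists ranks n+1 through n+10, ten movies per page. Fetching the full top 100 therefore takes 10 separate requests, after which regular expressions pull out the relevant fields.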
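The offset rule above can be sketched in a few lines. This is only an illustration: BASE_URL is a hypothetical placeholder (the real ranking URL is not shown in this article), and build_page_urls and ranks_on_page are helper names invented for the example.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the real ranking URL is not shown in this article.
BASE_URL = "https://example.com/board"


def build_page_urls(base_url, pages=10, page_size=10):
    """Return one URL per page; the first page carries no offset parameter."""
    urls = []
    for page in range(pages):
        offset = page * page_size
        if offset == 0:
            urls.append(base_url)
        else:
            urls.append(base_url + "?" + urlencode({"offset": offset}))
    return urls


def ranks_on_page(offset, page_size=10):
    """Ranks listed for a given offset n: n+1 through n+10."""
    return list(range(offset + 1, offset + page_size + 1))


page_urls = build_page_urls(BASE_URL)
print(page_urls[1])       # https://example.com/board?offset=10
print(ranks_on_page(20))  # [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
```

Ten pages with offsets 0, 10, ..., 90 cover ranks 1 through 100, which matches the 10 requests described above.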

Fetching the first page

import requests


def get_one_page(url):
    # Send a browser-like User-Agent header with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/55.0.2883.87 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    # Return the page source only on HTTP 200
    if response.status_code == 200:
        return response.text
    return None


def main():
    url = ''
    html = get_one_page(url)
    print(html)


if __name__ == '__main__':
    main()

The result looks like this:

<!DOCTYPE html><!--[if IE 8]><html class="ie8"><![endif]-->
<!--[if IE 9]><html class="ie9"><![endif]-->
<!--[if gt IE 9]><!--><html><!--<![endif]-->
<head><title>TOP100榜 - 猫眼电影 - 一网打尽好电影</title><link rel="dns-prefetch" href="//p0.meituan"  /><link rel="dns-prefetch" href="//p1.meituan"  /><link rel="dns-prefetch" href="//ms0.meituan" /><link rel="dns-prefetch" href="//ms1.meituan" /><link rel="dns-prefetch" href="//analytics.meituan" /><link rel="dns-prefetch" href="//report.meituan" /><link rel="dns-prefetch" href="//frep.meituan" /><meta charset="utf-8"><meta name="keywords" content="猫眼电影,电影排行榜,热映口碑榜,最受期待榜,国内票房榜,北美票房榜,猫眼TOP100"><meta name="description" content="猫眼电影热门榜单,包括热映口碑榜,最受期待榜,国内票房榜,北美票房榜,猫眼TOP100,多维度为用户进行选片决策"><meta http-equiv="cleartype" content="yes" /><meta http-equiv="X-UA-Compatible" content="IE=edge" /><meta name="renderer" content="webkit" /><meta name="HandheldFriendly" content="true" /><meta name="format-detection" content="email=no" /><meta name="format-detection" content="telephone=no" /><meta name="viewport" content="width=device-width, initial-scale=1"><script>cid = "c_wx6zb55";ci = 10;
val = {"subnavId":4};    window.system = {};window.openPlatform = '';window.openPlatformSub = '';</script><link rel="stylesheet" href="//ms0.meituan/mywww/common.4b838ec3.css"/>
<link rel="stylesheet" href="//ms0.meituan/mywww/board-index.92a06072.css"/><script src="//ms0.meituan/mywww/stat.74891044.js"></script><script>if(window.devicePixelRatio >= 2) { document.write('<link rel="stylesheet" href="//ms0.meituan/mywww/image-2x.8ba7074d.css"/>') }</script><style>@font-face {font-family: stonefont;src: url('//vfile.meituan/colorstone/c9da9f1236714d40f1f6b5356268c67d3168.eot');src: url('//vfile.meituan/colorstone/c9da9f1236714d40f1f6b5356268c67d3168.eot?#iefix') format('embedded-opentype'),url('//vfile.meituan/colorstone/c7296cfa3dd2560be8c413a808900a572080.woff') format('woff');}.stonefont {font-family: stonefont;}</style>
</head>
<body><div class="header"><div class="header-inner"><a href="/" class="logo" data-act="icon-click"></a><div class="city-container" data-val="{currentcityid:10 }"><div class="city-selected"><div class="city-name">上海<span class="caret"></span></div></div><div class="city-list" data-val="{ localcityid: 10 }"><div class="city-list-header">定位城市:<a class="js-geo-city">上海</a></div></div></div><div class="nav"><ul class="navbar"><li><a href="/" data-act="home-click"  >首页</a></li><li><a href="/films" data-act="movies-click" >电影</a></li><li><a href="/cinemas" data-act="cinemas-click" >影院</a></li> <li><a href="/board" data-act="board-click"  class="active" >榜单</a></li><li><a href="/news" data-act="hotNews-click" >热点</a></li></ul></div><div class="user-info"><div class="user-avatar J-login"><img src=".png"><span class="caret"></span><ul class="user-menu"><li><a href="javascript:void 0">登录</a></li></ul></div></div><form action="/query" target="_blank" class="search-form" data-actform="search-click"><input name="kw" class="search" type="search" maxlength="32" placeholder="找影视剧、影人、影院" autocomplete="off"><input class="submit" type="submit" value=""></form><div class="app-download"><a href="/app" target="_blank"><span class="iphone-icon"></span><span class="apptext">APP下载</span><span class="caret"></span><div class="download-icon"><p class="down-title">扫码下载APP</p><p class='down-content'>选座更优惠</p></div></a></div></div>
</div>
<div class="header-placeholder"></div><div class="subnav"><ul class="navbar"><li><a data-act="subnav-click" data-val="{subnavClick:7}"href="/board/7">热映口碑榜</a></li><li><a data-act="subnav-click" data-val="{subnavClick:6}"href="/board/6">最受期待榜</a></li><li><a data-act="subnav-click" data-val="{subnavClick:1}"href="/board/1">国内票房榜</a></li><li><a data-act="subnav-click" data-val="{subnavClick:2}"href="/board/2">北美票房榜</a></li><li><a data-act="subnav-click" data-val="{subnavClick:4}"data-state-val="{subnavId:4}"class="active" href="javascript:void(0);">TOP100榜</a></li></ul>
</div><div class="container" id="app" class="page-board/index" ><div class="content"><div class="wrapper"><div class="main"><p class="update-time">2018-08-11<span class="has-fresh-text">已更新</span></p><p class="board-content">榜单规则:将猫眼电影库中的经典影片,按照评分和评分人数从高到低综合排序取前100名,每天上午10点更新。相关数据来源于“猫眼电影库”。</p><dl class="board-wrapper"><dd><i class="board-index board-index-1">1</i><a href="/films/1203" title="霸王别姬" class="image-link" data-act="boarditem-click" data-val="{movieId:1203}"><img src="//ms0.meituan/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default" /><img data-src=".jpg@160w_220h_1e_1c" alt="霸王别姬" class="board-img" /></a><div class="board-item-main"><div class="board-item-content"><div class="movie-item-info"><p class="name"><a href="/films/1203" title="霸王别姬" data-act="boarditem-click" data-val="{movieId:1203}">霸王别姬</a></p><p class="star">主演:张国荣,张丰毅,巩俐</p>
<p class="releasetime">上映时间:1993-01-01(中国香港)</p>    </div><div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>        </div></div></div></dd><dd><i class="board-index board-index-2">2</i><a href="/films/1297" title="肖申克的救赎" class="image-link" data-act="boarditem-click" data-val="{movieId:1297}"><img src="//ms0.meituan/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default" /><img data-src=".jpg@160w_220h_1e_1c" alt="肖申克的救赎" class="board-img" /></a><div class="board-item-main"><div class="board-item-content"><div class="movie-item-info"><p class="name"><a href="/films/1297" title="肖申克的救赎" data-act="boarditem-click" data-val="{movieId:1297}">肖申克的救赎</a></p><p class="star">主演:蒂姆·罗宾斯,摩根·弗里曼,鲍勃·冈顿</p>
<p class="releasetime">上映时间:1994-10-14(美国)</p>    </div><div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>        </div></div></div></dd><dd><i class="board-index board-index-3">3</i><a href="/films/2641" title="罗马假日" class="image-link" data-act="boarditem-click" data-val="{movieId:2641}"><img src="//ms0.meituan/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default" /><img data-src=".jpg@160w_220h_1e_1c" alt="罗马假日" class="board-img" /></a><div class="board-item-main"><div class="board-item-content"><div class="movie-item-info"><p class="name"><a href="/films/2641" title="罗马假日" data-act="boarditem-click" data-val="{movieId:2641}">罗马假日</a></p><p class="star">主演:格利高里·派克,奥黛丽·赫本,埃迪·艾伯特</p>
<p class="releasetime">上映时间:1953-09-02(美国)</p>    </div><div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">1</i></p>        </div></div></div></dd><dd><i class="board-index board-index-4">4</i><a href="/films/4055" title="这个杀手不太冷" class="image-link" data-act="boarditem-click" data-val="{movieId:4055}"><img src="//ms0.meituan/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default" /><img data-src=".jpg@160w_220h_1e_1c" alt="这个杀手不太冷" class="board-img" /></a><div class="board-item-main"><div class="board-item-content"><div class="movie-item-info"><p class="name"><a href="/films/4055" title="这个杀手不太冷" data-act="boarditem-click" data-val="{movieId:4055}">这个杀手不太冷</a></p><p class="star">主演:让·雷诺,加里·奥德曼,娜塔莉·波特曼</p>
<p class="releasetime">上映时间:1994-09-14(法国)</p>    </div><div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>        </div></div></div></dd><dd><i class="board-index board-index-5">5</i><a href="/films/1247" title="教父" class="image-link" data-act="boarditem-click" data-val="{movieId:1247}"><img src="//ms0.meituan/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default" /><img data-src=".jpg@160w_220h_1e_1c" alt="教父" class="board-img" /></a><div class="board-item-main"><div class="board-item-content"><div class="movie-item-info"><p class="name"><a href="/films/1247" title="教父" data-act="boarditem-click" data-val="{movieId:1247}">教父</a></p><p class="star">主演:马龙·白兰度,阿尔·帕西诺,詹姆斯·肯恩</p>
<p class="releasetime">上映时间:1972-03-24(美国)</p>    </div><div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">3</i></p>        </div></div></div></dd><dd><i class="board-index board-index-6">6</i><a href="/films/267" title="泰坦尼克号" class="image-link" data-act="boarditem-click" data-val="{movieId:267}"><img src="//ms0.meituan/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default" /><img data-src=".jpg@160w_220h_1e_1c" alt="泰坦尼克号" class="board-img" /></a><div class="board-item-main"><div class="board-item-content"><div class="movie-item-info"><p class="name"><a href="/films/267" title="泰坦尼克号" data-act="boarditem-click" data-val="{movieId:267}">泰坦尼克号</a></p><p class="star">主演:莱昂纳多·迪卡普里奥,凯特·温丝莱特,比利·赞恩</p>
<p class="releasetime">上映时间:1998-04-03</p>    </div><div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>        </div></div></div></dd><dd><i class="board-index board-index-7">7</i><a href="/films/123" title="龙猫" class="image-link" data-act="boarditem-click" data-val="{movieId:123}"><img src="//ms0.meituan/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default" /><img data-src=".jpg@160w_220h_1e_1c" alt="龙猫" class="board-img" /></a><div class="board-item-main"><div class="board-item-content"><div class="movie-item-info"><p class="name"><a href="/films/123" title="龙猫" data-act="boarditem-click" data-val="{movieId:123}">龙猫</a></p><p class="star">主演:日高法子,坂本千夏,糸井重里</p>
<p class="releasetime">上映时间:1988-04-16(日本)</p>    </div><div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">2</i></p>        </div></div></div></dd><dd><i class="board-index board-index-8">8</i><a href="/films/837" title="唐伯虎点秋香" class="image-link" data-act="boarditem-click" data-val="{movieId:837}"><img src="//ms0.meituan/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default" /><img data-src=".jpg@160w_220h_1e_1c" alt="唐伯虎点秋香" class="board-img" /></a><div class="board-item-main"><div class="board-item-content"><div class="movie-item-info"><p class="name"><a href="/films/837" title="唐伯虎点秋香" data-act="boarditem-click" data-val="{movieId:837}">唐伯虎点秋香</a></p><p class="star">主演:周星驰,巩俐,郑佩佩</p>
<p class="releasetime">上映时间:1993-07-01(中国香港)</p>    </div><div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">2</i></p>        </div></div></div></dd><dd><i class="board-index board-index-9">9</i><a href="/films/1212" title="千与千寻" class="image-link" data-act="boarditem-click" data-val="{movieId:1212}"><img src="//ms0.meituan/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default" /><img data-src=".jpg@160w_220h_1e_1c" alt="千与千寻" class="board-img" /></a><div class="board-item-main"><div class="board-item-content"><div class="movie-item-info"><p class="name"><a href="/films/1212" title="千与千寻" data-act="boarditem-click" data-val="{movieId:1212}">千与千寻</a></p><p class="star">主演:柊瑠美,入野自由,夏木真理</p>
<p class="releasetime">上映时间:2001-07-20(日本)</p>    </div><div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">3</i></p>        </div></div></div></dd><dd><i class="board-index board-index-10">10</i><a href="/films/2760" title="魂断蓝桥" class="image-link" data-act="boarditem-click" data-val="{movieId:2760}"><img src="//ms0.meituan/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default" /><img data-src=".jpg@160w_220h_1e_1c" alt="魂断蓝桥" class="board-img" /></a><div class="board-item-main"><div class="board-item-content"><div class="movie-item-info"><p class="name"><a href="/films/2760" title="魂断蓝桥" data-act="boarditem-click" data-val="{movieId:2760}">魂断蓝桥</a></p><p class="star">主演:费雯·丽,罗伯特·泰勒,露塞尔·沃特森</p>
<p class="releasetime">上映时间:1940-05-17(美国)</p>    </div><div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">2</i></p>        </div></div></div></dd></dl></div><div class="pager-main"><ul class="list-pager"><li class="active"><a class="page_1"href="javascript:void(0);" style="cursor: default">1</a></li><li ><a class="page_2"href="?offset=10">2</a></li><li ><a class="page_3"href="?offset=20">3</a></li><li ><a class="page_4"href="?offset=30">4</a></li><li ><a class="page_5"href="?offset=40">5</a></li><li class="sep">...</li><li ><a class="page_10"href="?offset=90">10</a></li><li>  <a class="page_2"href="?offset=10">下一页</a>
</li>
</ul></div></div>
</div></div><div class="footer"><p class="friendly-links">商务合作邮箱:v@maoyan客服电话:10105335违法和不良信息举报电话:4006018900<br/>投诉举报邮箱:tousujubao@meituan舞弊线索举报邮箱:wubijubao@maoyan</p><p class="friendly-links">友情链接 :<a href="" data-query="utm_source=wwwmaoyan" target="_blank">美团网</a><span></span><a href="" data-query="utm_source=wwwmaoyan" target="_blank">美团下载</a></p><p>&copy;2016猫眼电影 maoyan<a href=".aspx?type=0&keyword=京ICP证160733号&pageNo=1" target="_blank">京ICP证160733号</a><a href="" target="_blank">京ICP备16022489号-1</a><a href="=11010102003232" target="_blank">京公网安备 11010102003232号</a><a href="/about/licence" target="_blank">网络文化经营许可证</a><a href="" target="_blank">电子公告服务规则</a></p><p>北京猫眼文化传媒有限公司</p>
</div><!--[if IE 8]><script src="//ms0.meituan/mywww/es5-shim.bbad933f.js"></script><![endif]--><!--[if IE 8]><script src="//ms0.meituan/mywww/es5-sham.d6ea26f4.js"></script><![endif]--><script src="//ms0.meituan/mywww/common.dc33ab40.js"></script>
<script src="//ms0.meituan/mywww/board-index.4aa00764.js"></script>
</body>
</html>

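Note: do not inspect the source in the Elements tab of the browser's developer tools at this point, because what is shown there may have been modified by JavaScript and can differ from the raw response.

As the source shows, each movie is wrapped in a dd tag.

The rank is stored in the element with class board-index. How can a regular expression extract it?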

<dd>.*?board-index.*?>(.*?)</i>
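A quick way to try the rank pattern is to run it on a small fragment. The sample string below is made up to resemble the page source and deliberately contains line breaks. It suggests why the later code compiles the pattern with re.S: without that flag, '.' does not match a newline.

```python
import re

# Made-up fragment shaped like the page source, with line breaks inside <dd>.
sample = '<dd>\n<i class="board-index board-index-1">1</i>\n</dd>'

pattern = r'<dd>.*?board-index.*?>(.*?)</i>'

# Without re.S, '.' stops at the newline after <dd>, so nothing matches.
without_flag = re.findall(pattern, sample)
# With re.S, '.' also matches newlines and the rank is captured.
with_flag = re.findall(pattern, sample, re.S)

print(without_flag)  # []
print(with_flag)     # ['1']
```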

Next comes the movie poster. Inspection shows that the second img tag is the one holding the image link, in its data-src attribute: <dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)"

The movie title: <dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>

Then the cast, release time, and score: <dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>
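As a sanity check, the full pattern can be run against one trimmed entry. The sample below is a shortened, hand-edited copy of the first dd block (the file names loading.png and poster.jpg are placeholders, not the real image URLs), and the class name is spelled integer, as it appears in the page source.

```python
import re

# Shortened, hand-edited copy of one <dd> entry; image names are placeholders.
sample = '''<dd><i class="board-index board-index-1">1</i>
<a href="/films/1203" title="霸王别姬" class="image-link">
<img src="loading.png" class="poster-default" />
<img data-src="poster.jpg" alt="霸王别姬" class="board-img" /></a>
<p class="name"><a href="/films/1203" title="霸王别姬">霸王别姬</a></p>
<p class="star">
    主演:张国荣,张丰毅,巩俐
</p>
<p class="releasetime">上映时间:1993-01-01(中国香港)</p>
<p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
</dd>'''

pattern = re.compile(
    r'<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>'
    r'.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
    r'.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>',
    re.S,  # let '.' match newlines, since the entry spans several lines
)

items = pattern.findall(sample)
rank, image, title, star, release, integer, fraction = items[0]
print(rank, image, title, integer + fraction)
```

Each match comes back as a 7-tuple in the order rank, image, title, cast, release time, integer part of the score, fractional part of the score.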

Next, define the function that parses a page:

def parse_one_page(html):
    # re.S lets '.' match newlines, because each <dd> block spans several lines
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
                         + r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                         + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    return items

The complete script:

 
# -*- coding:UTF-8 -*-
__author__ = 'zhouli'
__date__ = '2018/8/7 23:37'
import requests
import re


def get_one_page(url):
    # Send a browser-like User-Agent header with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/55.0.2883.87 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None


def parse_one_page(html):
    # Groups: rank, image URL, title, cast, release time, score integer, score fraction
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
                         + r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                         + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    return items


def main():
    url = ''
    html = get_one_page(url)
    a = parse_one_page(html)
    print(a)


if __name__ == '__main__':
    main()
 

The result:

[('1', '.jpg@160w_220h_1e_1c', '霸王别姬', '\n                主演:张国荣,张丰毅,巩俐\n        ', '上映时间:1993-01-01(中国香港)', '9.', '6'), ('2', '.jpg@160w_220h_1e_1c', '肖申克的救赎', '\n                主演:蒂姆·罗宾斯,摩根·弗里曼,鲍勃·冈顿\n        ', '上映时间:1994-10-14(美国)', '9.', '5'), ('3', '.jpg@160w_220h_1e_1c', '罗马假日', '\n                主演:格利高里·派克,奥黛丽·赫本,埃迪·艾伯特\n        ', '上映时间:1953-09-02(美国)', '9.', '1'), ('4', '.jpg@160w_220h_1e_1c', '这个杀手不太冷', '\n                主演:让·雷诺,加里·奥德曼,娜塔莉·波特曼\n        ', '上映时间:1994-09-14(法国)', '9.', '5'), ('5', '.jpg@160w_220h_1e_1c', '教父', '\n                主演:马龙·白兰度,阿尔·帕西诺,詹姆斯·肯恩\n        ', '上映时间:1972-03-24(美国)', '9.', '3'), ('6', '.jpg@160w_220h_1e_1c', '泰坦尼克号', '\n                主演:莱昂纳多·迪卡普里奥,凯特·温丝莱特,比利·赞恩\n        ', '上映时间:1998-04-03', '9.', '5'), ('7', '.jpg@160w_220h_1e_1c', '龙猫', '\n                主演:日高法子,坂本千夏,糸井重里\n        ', '上映时间:1988-04-16(日本)', '9.', '2'), ('8', '.jpg@160w_220h_1e_1c', '唐伯虎点秋香', '\n                主演:周星驰,巩俐,郑佩佩\n        ', '上映时间:1993-07-01(中国香港)', '9.', '2'), ('9', '.jpg@160w_220h_1e_1c', '千与千寻', '\n                主演:柊瑠美,入野自由,夏木真理\n        ', '上映时间:2001-07-20(日本)', '9.', '3'), ('10', '.jpg@160w_220h_1e_1c', '魂断蓝桥', '\n                主演:费雯·丽,罗伯特·泰勒,露塞尔·沃特森\n        ', '上映时间:1940-05-17(美国)', '9.', '2')]
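
Each match is a tuple with stray whitespace and label prefixes, so the parser is changed to yield one cleaned-up dictionary per movie:

def parse_one_page(html):
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
                         + r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                         + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # drop the 3-character "主演:" label
            'time': item[4].strip()[5:],   # drop the 5-character "上映时间:" label
            'score': item[5] + item[6]     # '9.' + '6' -> '9.6'
        }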
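
The cleanup step can be tried on a single tuple without touching the network. The sketch below uses one tuple copied from the result above (with a placeholder image name) and a hypothetical helper, strip_label, that removes a label only when it is actually present. This is a slightly more defensive variant of the fixed-width slicing, not the article's own code.

```python
# One tuple shaped like the regex result; 'poster.jpg' is a placeholder.
raw = ('1', 'poster.jpg', '霸王别姬',
       '\n                主演:张国荣,张丰毅,巩俐\n        ',
       '上映时间:1993-01-01(中国香港)', '9.', '6')


def strip_label(text, label):
    """Trim whitespace, then drop the leading label if it is there (hypothetical helper)."""
    text = text.strip()
    return text[len(label):] if text.startswith(label) else text


record = {
    'index': int(raw[0]),
    'image': raw[1],
    'title': raw[2],
    'actor': strip_label(raw[3], '主演:'),
    'time': strip_label(raw[4], '上映时间:'),
    'score': float(raw[5] + raw[6]),  # '9.' + '6' -> 9.6
}
print(record)
```

Converting index and score to numbers is an extra choice made here. The article's version keeps both as strings.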

Next comes crawling page by page.

As established earlier, the page is selected by passing the offset parameter:

import json
import requests
from requests.exceptions import RequestException
import re
import time


def get_one_page(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        # Network errors are treated the same as a failed request
        return None


def parse_one_page(html):
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
                         + r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                         + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],
            'time': item[4].strip()[5:],
            'score': item[5] + item[6]
        }


def write_to_file(content):
    with open('result.txt', 'a', encoding='utf-8') as f:
        # ensure_ascii=False keeps the Chinese text readable instead of \u escapes
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(offset):
    url = '=' + str(offset)
    html = get_one_page(url)
    if html is None:
        # Skip this page if the request failed
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)


if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 10)
        time.sleep(1)  # pause between requests
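
To see what ensure_ascii=False changes, and how the one-JSON-object-per-line file can be read back, here is a small self-contained sketch. It writes to a temporary directory rather than the real result.txt, and read_json_lines is a helper name made up for this example.

```python
import json
import os
import tempfile

record = {'index': '1', 'title': '霸王别姬', 'score': '9.6'}

# Default behaviour escapes non-ASCII characters as \uXXXX sequences.
escaped = json.dumps(record)
# ensure_ascii=False writes the characters themselves.
readable = json.dumps(record, ensure_ascii=False)


def read_json_lines(path):
    """Load a file holding one JSON object per line (hypothetical helper)."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'result.txt')
    with open(path, 'a', encoding='utf-8') as f:
        f.write(readable + '\n')
    loaded = read_json_lines(path)

print(escaped)
print(readable)
print(loaded == [record])  # True
```

Both forms decode to the same dictionary. The flag only affects how readable the file is to a person.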

Running the script prints each movie's dictionary and appends it as one JSON line to result.txt.


Reposted from: .html

Published: 2024-02-13 02:37:31
Article link: https://www.elefans.com/category/jswz/34/1690372.html