Python web scraping with urllib


Contents

1. Fetch the first page of Douban movies

1.1 Steps

1.2 Saving the data locally

2. Fetch the first ten pages of Douban movies


1. Fetch the first page of Douban movies

1.1 Steps

1. Build a customized request object

2. Simulate a browser sending the request to the server

3. Read the response data

4. Print the content

import urllib.request

# The host part of the URL was stripped in the original post; the Douban ranking
# endpoint below is assumed (type=25 is the animation category).
url = 'https://movie.douban.com/j/chart/top_list?type=25&interval_id=100%3A90&action=&start=0&limit=20'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
}
request = urllib.request.Request(url=url, headers=headers)   # 1. build the request object
response = urllib.request.urlopen(request)                   # 2. send the request as a browser would
content = response.read().decode('utf-8')                    # 3. read and decode the response
print(content)                                               # 4. print it

 

Here is part of the raw response for the first page of the Douban animation ranking; ranks 1 and 2 are Spirited Away (千与千寻) and Havoc in Heaven (大闹天宫):

[{"rating":["9.4","50"],"rank":1,"cover_url":"\/view\/photo\/s_ratio_poster\/public\/p2557573348.jpg","is_playable":false,"id":"1291561","types":["剧情","动画","奇幻"],"regions":["日本"],"title":"千与千寻","url":"https:\/\/movie.douban\/subject\/1291561\/","release_date":"2019-06-21","actor_count":26,"vote_count":2222161,"score":"9.4","actors":["柊瑠美","入野自由","夏木真理","菅原文太","中村彰男","玉井夕海","神木隆之介","内藤刚志","泽口靖子","我修院达也","大泉洋","小林郁夫","上条恒彦","小野武彦","田壮壮","王琳","安田显","户次重幸","胡立成","山像香","斋藤志郎","脇田茂","彭昱畅","井柏然","周冬雨","毛毛头"],"is_watched":false},{"rating":["9.4","50"],"rank":2,"cover_url":"\/view\/photo\/s_ratio_poster\/public\/p2184505167.jpg","is_playable":true,"id":"1418019","types":["剧情","动画","奇幻","古装"],"regions":["中国大陆"],"title":"大闹天宫","url":"https:\/\/movie.douban\/subject\/1418019\/","release_date":"1961","actor_count":7,"vote_count":432074,"score":"9.4","actors":["邱岳峰","富润生","毕克","尚华","于鼎","李梓","刘广宁"],"is_watched":false},{"rating":["9.3","50"],"rank":3,"cover_url":"\/view\/photo\/s_ratio_poster\/public\/p1461851991.jpg","is_playable":true,"id":"2131459","types":

 

1.2 Saving the data locally

By default open() uses the system encoding (gbk on Chinese Windows), so to save Chinese text correctly you need to pass encoding='utf-8' to open().

There are two ways to write the file:

1.2.1

fp = open('douban.json', 'w', encoding='utf-8')
fp.write(content)
fp.close()

1.2.2

with open('douban.json', 'w', encoding='utf-8') as fp:
    fp.write(content)

2. Fetch the first ten pages of Douban movies

To fetch the first ten pages, the per-page request has to be wrapped in a function. Watching the request URL as the page loads more data shows that page 1 uses start=0, page 2 uses start=20, and so on in steps of 20; limit is always 20, meaning each page returns 20 records. The query string for a given page can be built with urllib.parse.urlencode, as sketched below.
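For example, with a hypothetical page number of 3:

import urllib.parse

page = 3
data = {'start': (page - 1) * 20, 'limit': 20}
print(urllib.parse.urlencode(data))   # prints: start=40&limit=20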

import urllib.request
import urllib.parse

# Download the first 10 pages.
# Steps: 1. build the request object  2. get the response data  3. save it to disk

# Each call returns a Request object for one page.
def create_request(page):
    # The host part of this URL was stripped in the original post; the Douban ranking
    # endpoint is assumed here (type=25 is the animation category). start/limit are
    # appended from `data` below, so they are left out of the base URL.
    base_url = 'https://movie.douban.com/j/chart/top_list?type=25&interval_id=100%3A90&action=&'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
        # the cookie below is copied verbatim from the original post; the request in
        # section 1 works without it
        'cookie': "BIDUPSID=D5B7B5DC9EC418BF086F8034419288FE; PSTM=1678029839; BAIDUID=D5B7B5DC9EC418BF5368CDA9BA195075:FG=1; BD_UPN=12314753; BDUSS=XZxTHVQbXVQUEtwdzZVQ3JqWmNzNXViR1RiSThYWnl3bWYtMEFzZTI0VzU0UzVrRVFBQUFBJCQAAAAAAQAAAAEAAADKzPFDAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAALlUB2S5VAdkZ; BDUSS_BFESS=XZxTHVQbXVQUEtwdzZVQ3JqWmNzNXViR1RiSThYWnl3bWYtMEFzZTI0VzU0UzVrRVFBQUFBJCQAAAAAAQAAAAEAAADKzPFDAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAALlUB2S5VAdkZ; MCITY=-307%3A; BAIDUID_BFESS=D5B7B5DC9EC418BF5368CDA9BA195075:FG=1; B64_BOT=1; BA_HECTOR=0k0ga585810g2g218l200l851i8lgf91p; ZFY=woKzUXT6SomN3BpQdFG:B3:AH80tsHyB8:Ajy7WZJ:AY1GU:C; COOKIE_SESSION=86716_3_8_8_5_19_1_2_5_8_0_9_86672_368_3_0_1686816164_1686728918_1686816161%7C9%23380_28_1686728915%7C9; BD_HOME=1; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; BD_CK_SAM=1; PSINO=1; delPer=0; BDRCVFR[r5ISJbddYbD]=jgHDRC9lqMRmvPhuMC8mvqV; H_PS_PSSID=; channel=baiduffz; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; BDRCVFR[dG2JNJb_ajR]=mk3SLVN4HKm; userFrom=ala; ab_sr=1.0.1_MDM4ZWY2MzJkMjRmZGU4ZGY4MTIwODMwYjhkMTljNjRkZWFjMTEzMjk5OTU1ZDQ2OWVmYmNmNTFjNzBmOWU2YzkwMzNlMzMzMmQzNjlhMjI0MTM2MWFhZjkzODAxODFiM2QxODEzZmQzZTdjMDk1ZDk1YTY5YTAyNzk3MzczZTQ5NTJhYWNlNzc0MTJkM2M3Nzc0YWE1ZTBjNTNiMzBjOQ==; BDRCVFR[-pGxjrCMryR]=mk3SLVN4HKm; sugstore=1; H_PS_645EC=5c9bbTDHjVXQaHtU%2FtigrZipb92iS2HjtKe7%2BSUaEZyclekTNob55dKrYDsCLZM1Hf0JXrX6huYv; baikeVisitId=fc790598-fc71-4c38-8490-971425ae8a13",
    }
    # page:  1   2   3   4
    # start: 0  20  40  60
    data = {
        'start': (page - 1) * 20,
        'limit': 20,
    }
    # URL-encode the query parameters and append them to the base URL
    data = urllib.parse.urlencode(data)
    url = base_url + data
    request = urllib.request.Request(url=url, headers=headers)
    return request

# Fetch the page source
def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content

# Save one page to douban_<page>.json
def down_load(page, content):
    with open('douban_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)

if __name__ == '__main__':
    start_page = int(input('Enter the start page: '))
    end_page = int(input('Enter the end page: '))
    for page in range(start_page, end_page + 1):
        request = create_request(page)
        content = get_content(request)
        down_load(page, content)

The requested pages are now saved locally as douban_<page>.json files.
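To sanity-check the result, the saved files can be read back with the json module. A minimal sketch, assuming pages 1 through 10 were downloaded with the script above:

import json

movies = []
for page in range(1, 11):
    with open('douban_' + str(page) + '.json', encoding='utf-8') as fp:
        movies.extend(json.load(fp))

print(len(movies))                              # 10 pages x 20 entries = 200
print(movies[0]['title'], movies[0]['score'])   # e.g. 千与千寻 9.4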
