Python web scraping with urllib
Contents
1. Fetching the first page of Douban movies
1.1 Steps
1.2 Saving the data locally
2. Fetching the first ten pages of Douban movies
1. Fetching the first page of Douban movies
1.1 Steps
1. Build the request object (with custom headers)
2. Simulate a browser and send the request to the server
3. Read the response data
4. Print it
```python
import urllib.request

# NOTE: only the query string survives here; prepend the full endpoint URL
# of Douban's chart API before running.
url = '=25&interval_id=100%3A90&action=&start=0&limit=20'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
}

# 1. build the request object with custom headers
request = urllib.request.Request(url=url, headers=headers)
# 2. simulate a browser and send the request to the server
response = urllib.request.urlopen(request)
# 3. read the response body and decode it
content = response.read().decode('utf-8')
# 4. print it
print(content)
```
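Before sending anything over the network, you can inspect what the Request object will transmit. A small offline sketch, using a placeholder URL rather than the real Douban endpoint:

```python
import urllib.request

# Placeholder URL purely for illustration (not the real chart API endpoint)
url = 'https://example.com/?start=0&limit=20'
headers = {'User-Agent': 'Mozilla/5.0'}

request = urllib.request.Request(url=url, headers=headers)

# The Request object records exactly what will be sent
print(request.full_url)                  # the target URL
print(request.get_method())              # 'GET', because no data= was supplied
print(request.get_header('User-agent'))  # header keys are stored capitalised
```

Note that urllib capitalises stored header names, so the lookup key is 'User-agent', not 'User-Agent'.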
For reference, here is part of the first page of Douban's animation ranking; No. 1 and No. 2 are Spirited Away (千与千寻) and Havoc in Heaven (大闹天宫):
[{"rating":["9.4","50"],"rank":1,"cover_url":"\/view\/photo\/s_ratio_poster\/public\/p2557573348.jpg","is_playable":false,"id":"1291561","types":["剧情","动画","奇幻"],"regions":["日本"],"title":"千与千寻","url":"https:\/\/movie.douban\/subject\/1291561\/","release_date":"2019-06-21","actor_count":26,"vote_count":2222161,"score":"9.4","actors":["柊瑠美","入野自由","夏木真理","菅原文太","中村彰男","玉井夕海","神木隆之介","内藤刚志","泽口靖子","我修院达也","大泉洋","小林郁夫","上条恒彦","小野武彦","田壮壮","王琳","安田显","户次重幸","胡立成","山像香","斋藤志郎","脇田茂","彭昱畅","井柏然","周冬雨","毛毛头"],"is_watched":false},{"rating":["9.4","50"],"rank":2,"cover_url":"\/view\/photo\/s_ratio_poster\/public\/p2184505167.jpg","is_playable":true,"id":"1418019","types":["剧情","动画","奇幻","古装"],"regions":["中国大陆"],"title":"大闹天宫","url":"https:\/\/movie.douban\/subject\/1418019\/","release_date":"1961","actor_count":7,"vote_count":432074,"score":"9.4","actors":["邱岳峰","富润生","毕克","尚华","于鼎","李梓","刘广宁"],"is_watched":false},{"rating":["9.3","50"],"rank":3,"cover_url":"\/view\/photo\/s_ratio_poster\/public\/p1461851991.jpg","is_playable":true,"id":"2131459","types":
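Since the body above is JSON, it can be parsed with the standard json module instead of being scanned by eye. A sketch on a trimmed, hand-made sample in the same shape as the real response:

```python
import json

# Trimmed, illustrative sample matching the shape of the Douban response
sample = ('[{"rank": 1, "title": "千与千寻", "score": "9.4"},'
          ' {"rank": 2, "title": "大闹天宫", "score": "9.4"}]')

movies = json.loads(sample)  # parse the JSON text into a list of dicts
for m in movies:
    print(m['rank'], m['title'], m['score'])
```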
1.2 Saving the data locally
open uses the platform's default encoding (gbk on a Chinese-locale Windows system), so to save Chinese text you must pass encoding='utf-8' to open.
There are two ways:
1.2.1
```python
fp = open('douban.json', 'w', encoding='utf-8')
fp.write(content)
fp.close()  # when opening manually, remember to close the file
```
1.2.2
```python
with open('douban.json', 'w', encoding='utf-8') as fp:
    fp.write(content)
```
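Of the two, the with form is generally preferable: it closes the file automatically, even if the write raises an exception. A quick check (using a temp path, since the real content comes from the network):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'douban_demo.json')

with open(path, 'w', encoding='utf-8') as fp:
    fp.write('{}')

# the context manager has already closed the handle on exit
print(fp.closed)  # True
```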
2. Fetching the first ten pages of Douban movies
To fetch the first ten pages, wrap the per-page logic in functions. Paging through while watching the request URL change shows that page 1 has start=0, page 2 has start=20, and so on, while limit is always 20: each page holds 20 records.
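The page-number-to-start mapping can be checked in isolation with urllib.parse.urlencode, which also performs the query-string encoding used when building each page's URL (page_params is a helper name introduced here for illustration):

```python
import urllib.parse

def page_params(page):
    # page 1 -> start=0, page 2 -> start=20, page 3 -> start=40, ...
    return urllib.parse.urlencode({'start': (page - 1) * 20, 'limit': 20})

print(page_params(1))  # start=0&limit=20
print(page_params(2))  # start=20&limit=20
```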
```python
import urllib.request
import urllib.parse

# Download the first 10 pages.
# Steps: 1. build the request object  2. get the response data  3. save it


# Each call returns one request object.
def create_request(page):
    # NOTE: only the query string survives here; prepend the full endpoint URL
    # of Douban's chart API before running. start and limit are appended via
    # urlencode below, so base_url must end at 'action=&'.
    base_url = '=25&interval_id=100%3A90&action=&'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
        # a Cookie header is not required for this endpoint
    }
    # page:  1   2   3   4
    # start: 0  20  40  60
    data = {
        'start': (page - 1) * 20,
        'limit': 20,
    }
    # URL-encode the query parameters and append them
    url = base_url + urllib.parse.urlencode(data)
    return urllib.request.Request(url=url, headers=headers)


# Fetch the page source.
def get_content(request):
    response = urllib.request.urlopen(request)
    return response.read().decode('utf-8')


def down_load(page, content):
    with open('douban_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)


if __name__ == '__main__':
    start_page = int(input('Enter the start page: '))
    end_page = int(input('Enter the end page: '))
    for page in range(start_page, end_page + 1):
        request = create_request(page)
        content = get_content(request)
        down_load(page, content)
```
With that, the pages have been downloaded locally.
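As a sanity check, a saved file can be read back with json.load. The sketch below writes a tiny hand-made sample the same way down_load does (to a temp path, since the real files depend on the network) and reads it back:

```python
import json
import os
import tempfile

# Hypothetical sample standing in for a real downloaded page
sample = '[{"rank": 1, "title": "千与千寻"}]'
path = os.path.join(tempfile.gettempdir(), 'douban_1.json')

# write it the same way down_load does
with open(path, 'w', encoding='utf-8') as fp:
    fp.write(sample)

# read it back and parse
with open(path, 'r', encoding='utf-8') as fp:
    movies = json.load(fp)

print(movies[0]['title'])  # 千与千寻
```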