Scraping Box Office Movie Information for Mainland China
Page Analysis
Page link:
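Scraping a website generally starts with analyzing its pages. How hard that analysis is depends on how the site is structured, and the process is difficult to describe in words, so I will not go through it in detail here.
A preliminary analysis shows that the site combines static HTML with dynamically loaded data: the content of each list page is fetched through a separate request.
We therefore need to capture the network traffic to find the request links behind the individual pages and analyze them, which gets us to the target information more quickly. (The request URL used in the code further down is part of what this analysis turned up.)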
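To make the paging idea concrete, here is a minimal sketch of building one request URL per page from the query parameters seen in the captured link. `BASE_URL` is a placeholder (the real host is not given in this article) and `build_page_url` is a hypothetical helper. The captured link also carries `timestamp` and `version` parameters; they are left out here, and whether the server requires them is not something this sketch can confirm.

```python
from urllib.parse import urlencode

# Placeholder host: the real base URL is not given in the article.
BASE_URL = "https://example.com/"


def build_page_url(page, year=2020):
    """Return the list URL for one zero-based page index (hypothetical helper)."""
    params = {
        "year": year,
        "area": "china",
        "type": "MovieRankingYear",
        "category": "all",
        "page": page,
        "display": "list",
        "dataType": "json",
    }
    # urlencode keeps the insertion order of the dict and escapes the values.
    return BASE_URL + "?" + urlencode(params)


# One URL per page, mirroring the range(10) loop used in the scraper.
urls = [build_page_url(page) for page in range(10)]
print(urls[0])
```

Building the query string from a dict keeps the parameters readable and avoids hand-editing a long URL when only `page` or `year` changes.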
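Extracting Movie Information
We use a loop to extract the information for each movie on every page of the site. Because the entries are not all formatted identically, the loop needs error handling so that an entry with a missing field still yields a record with the same structure as the others: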
for page in range(10):
    url = "/?year=2020&area=china&type=MovieRankingYear&category=all&page=%s&display=list&timestamp=1586629225983&version=07bb781100018dd58eafc3b35d42686804c6df8d&dataType=json" % page
    print("Scraping page %s..." % (page + 1))
    html_json = requests.get(url=url, headers=headers)
    # The JSON response carries the list markup in its 'html' field
    etree_html = etree.HTML(html_json.json()['html'])
    dd = etree_html.xpath('//div[@class="boxofficelist"]/div/dd')
    for item in dd:
        rank = item.xpath('./div/div[1]/i/text()')  # rank
        name = item.xpath('./div/div[2]/h3/a/text()')  # movie title
        Eng_name = item.xpath('./div/div[2]/h4/a/text()')  # English title
        try:
            # opening day
            data_day = item.xpath('./div/div[2]/p[1]/strong[1]/text()')
            unit_day = item.xpath('./div/div[2]/p[1]/text()[2]')
            day = [data_day[0] + unit_day[0]]
            # opening week
            data_week = item.xpath('./div/div[2]/p[1]/strong[2]/text()')
            unit_week = item.xpath('./div/div[2]/p[1]/text()[4]')
            week = [data_week[0] + unit_week[0]]
            # remarks
            remarks = item.xpath('./div/div[2]/p[1]/text()[5]')
            if len(remarks) == 0:
                remarks = ['']
        except IndexError:
            # a field is missing for this entry: fall back to empty values
            day = ['']
            week = ['']
            remarks = ['']
        # release date and genre
        explain = item.xpath('./div/div[2]/p[2]/text()')
        if len(explain) != 0:
            explain_cut = explain[0].split('\xa0')
        else:
            explain_cut = ["", ""]
        director = item.xpath('./div/div[2]//p[3]/a/text()')  # director
        to_star = item.xpath('./div/div[2]//p[4]/a/text()')  # lead actors
        to_star = [','.join(to_star)]
        # production company
        compay = item.xpath('./div/div[2]//p[5]/a[1]/text()')
        if len(compay) != 1:
            compay = ['']
        # box office
        booking_num = item.xpath('./div/div[4]/p[1]/strong/text()')
        booking_unit = item.xpath('./div/div[4]/p[1]/text()')
        booking = [booking_num[0] + booking_unit[0]]
        # cumulative audience (not shown on the rendered page)
        count_people = item.xpath('./div/div[4]/p[2]/text()')
        if len(count_people) != 0:
            count_people = [count_people[0].split(':')[1]]
        else:
            count_people = ['']
        # rating: joining an empty result already gives ''
        score1 = item.xpath('./div/div[3]/p[1]//text()')
        score = [''.join(score1)]
        # number of raters
        join_people = item.xpath('./div/div[3]/p[2]/text()')
        if len(join_people) == 1:
            join_people = [join_people[0].replace("人评分", "")]
        else:
            join_people = ['']
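The repeated "take element 0, fall back to an empty string" pattern in the loop above can be factored into small helpers. The following is only a sketch: `first_or_default` and `join_value_and_unit` are hypothetical names, and the sample values are made up to stand in for XPath results (which are plain Python lists of strings).

```python
def first_or_default(values, default=""):
    """Return the first item of an XPath-style result list, or a default."""
    return values[0] if values else default


def join_value_and_unit(numbers, units):
    """Combine a number list and a unit list into a one-element list.

    Returns [''] when either part is missing, so every record keeps
    the same number of columns.
    """
    if not numbers or not units:
        return [""]
    return [numbers[0].strip() + units[0].strip()]


# Made-up sample values standing in for item.xpath(...) results.
print(first_or_default(["1"]))              # a present field
print(first_or_default([]))                 # a missing field
print(join_value_and_unit(["1.5"], ["亿"]))  # number plus unit
print(join_value_and_unit([], ["亿"]))       # number missing
```

With helpers like these the checks are explicit, so the loop does not have to rely on catching an exception to detect a missing field.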
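Writing the Information to a CSV File
Writing data to a file is familiar ground by now and always takes the same three steps: create and open the file, write the target information, and close the file. The code is as follows:
# create and open the file
fp = open("./美团评论.csv",'a',newline='',encoding = 'utf-8-sig')
writer = csv.writer(fp)
# write the content (pass the tuple of field values here)
writer.writerow(())
# close the file
fp.close()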
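As a variation on the three steps above, a `with` block closes the file automatically, even if a write raises an exception. This is a minimal sketch: `write_rows`, the temporary demo path, and the sample rows are all hypothetical and only illustrate the pattern.

```python
import csv
import os
import tempfile


def write_rows(path, header, rows):
    """Append a header and data rows to a CSV file (hypothetical helper)."""
    # utf-8-sig writes a BOM at the start of a new file, which helps
    # spreadsheet programs such as Excel detect UTF-8; newline='' lets
    # the csv module control line endings itself.
    with open(path, "a", newline="", encoding="utf-8-sig") as fp:
        writer = csv.writer(fp)
        writer.writerow(header)
        writer.writerows(rows)
    # the file is closed here, with no explicit fp.close() needed


# Made-up sample data written to a temporary location.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
write_rows(demo_path, ("rank", "name"), [("1", "Movie A"), ("2", "Movie B")])

with open(demo_path, newline="", encoding="utf-8-sig") as fp:
    saved_rows = list(csv.reader(fp))
print(saved_rows)
```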
Full Code
import csv
import time

import requests
from lxml import etree  # import the required packages

headers = {
    'Cookie': '_userCode_=2020419151208002; _userIdentity_=2020419151205632; userId=0; defaultCity=%25E5%258C%2597%25E4%25BA%25AC%257C290; _tt_=7FB7CE67CC8A2FC43C4519BB411B1C36; Hm_lvt_6dd1e3b818c756974fb222f0eae5512e=1587280322; __utmc=221034756; __utmz=221034756.1587280322.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); _ydclearance=f4a43362b532e553fee6c79b-1f3e-4ece-855b-602f5f3bc84d-1587329420; __utma=221034756.1321858914.1587280322.1587322603.1587328109.4; __utmt=1; __utmt_~1=1; __utmb=221034756.2.10.1587328109; Hm_lpvt_6dd1e3b818c756974fb222f0eae5512e=1587328109',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36',
}

startTime = time.time()  # record the start time

# create and open the file
fp = open("./电影信息2020.csv", 'a', newline='', encoding='utf-8-sig')
writer = csv.writer(fp)
# header row: rank, title, director, cast, rating
writer.writerow(('排名', '名称', '导演', '演员', '评分'))

for page in range(10):
    print("Scraping page %s..." % (page + 1))
    url = "/?year=2020&area=china&type=MovieRankingYear&category=all&page=%s&display=list&timestamp=1586758678446&version=07bb781100018dd58eafc3b35d42686804c6df8d&dataType=json" % page
    response = requests.get(url=url, headers=headers)
    # parse the HTML fragment from the JSON response into an element tree
    html_etree = etree.HTML(response.json()["html"])
    # extract the information
    dd = html_etree.xpath('//div[@class="boxofficelist"]/div/dd')
    for item in dd:
        try:
            rank = item.xpath('./div/div[1]/i/text()')
            name = item.xpath('./div/div[2]/h3/a/text()')
            director = item.xpath('./div/div[2]//p[3]/a/text()')  # '//' matches at any depth
            actor = item.xpath('./div/div[2]//p[4]/a/text()')
            act = ["、".join(actor)]  # join the cast into one string
            # this node holds the rater count text; strip the "人评分" suffix
            score = item.xpath('./div/div[3]/p[2]/text()')
            score = [score[0].replace("人评分", "")]
            result = rank + name + director + act + score
            print(result)
            writer.writerow(result)
        except IndexError:
            # an entry without this field stops processing of the current page
            print("An error occurred here")
            break
fp.close()
endTime = time.time()  # record the end time
useTime = endTime - startTime
print("Scraping took %s seconds in total" % useTime)
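Summary
This lesson covered a lot of material, and several of the points are hard to grasp, so my own understanding of them is still incomplete. Readers and I both need to keep working at it.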