Scraping the Douban Movie Top 250 with Python 3.9 and BeautifulSoup

Updated: 2024-10-14 06:23:33


I have been learning Python in my spare time, and today I used it to write a script that scrapes the Douban Movie Top 250. I can't help marveling at Python's natural advantage when it comes to web scraping! All right, on to the main topic.

The data is all wrapped inside li tags, which makes things simple: we can use BeautifulSoup to grab all of them and loop over each one to pull out the fields we want.
You can learn BeautifulSoup from its documentation at .4.0/ — there is a Chinese translation, which is very friendly for those of us whose English isn't great, like me, haha. The translation isn't of the latest version, but that doesn't stop us from learning.
I'll just paste the code below; it contains the necessary comments.
It's a bit redundant — I'll see whether it can be tightened up when I have time.
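Before the full script, here is a minimal, self-contained sketch of the extraction pattern in isolation. The class names (`item`, `hd`, `title`) match the real list page, but the HTML snippet and its values here are made up for illustration:

```python
from bs4 import BeautifulSoup

# A trimmed-down stand-in for one entry of the Top 250 list page.
sample = """
<ol class="grid_view">
  <li><div class="item">
    <em>1</em>
    <div class="hd"><a href="https://movie.douban.com/subject/1292052/">
      <span class="title">肖申克的救赎</span></a></div>
  </div></li>
</ol>
"""

soup = BeautifulSoup(sample, "html.parser")
# find_all returns every matching tag; we then drill into each one.
for item in soup.find_all("div", class_="item"):
    rank = item.find("em").string
    link = item.find("div", class_="hd").a["href"]
    title = item.find("span", class_="title").string
    print(rank, title, link)
```

The full script below applies exactly this pattern, field by field, to every entry on the page.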

import requests
from bs4 import BeautifulSoup


def douban_rank250(url):
    """Fetch the page at the given url."""
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                             "AppleWebKit/537.36 (KHTML, like Gecko) "
                             "Chrome/86.0.4240.198 Safari/537.36"}
    r = requests.get(url, headers=headers)
    return r.text


def save_html(html):
    """Save the page to disk, mainly for local testing."""
    filename = "D:\\learning\\python\\douban_rank_250.html"
    with open(filename, "w", encoding="UTF-8") as file_object:
        file_object.write(html)


def open_html():
    """Read the saved page back, for local testing."""
    filename = "D:\\learning\\python\\豆瓣电影 Top 250.html"
    data = ""
    with open(filename, "r", encoding="UTF-8") as file_object:
        for line in file_object.readlines():
            data += line
    return data


def parse(html):
    """Parse the page elements."""
    soup = BeautifulSoup(html, "html.parser")
    items = soup.find_all("div", class_="item")
    for item in items:
        result = ""
        print("---*******************---")
        # ranking number
        rank = item.find("em")
        print("rank = " + rank.string)
        result += "rank " + rank.string + "\n"
        # get the image url
        img_tag = item.find("img")
        img_src = img_tag["src"]
        print("image url = " + img_src)
        result += "image url = " + img_src + "\n"
        # get the content: detail-page url and title
        hd_tag = item.find("div", class_="hd")
        movie_url = hd_tag.a["href"]
        print("movie url = " + movie_url)
        result += "movie url = " + movie_url + "\n"
        spans = hd_tag.a.find_all("span")
        movie_title = ""
        for span_content in spans:
            movie_title += span_content.string.strip()
        print("movie title = " + movie_title)
        result += "movie title = " + movie_title + "\n"
        # description and one-line quote
        bd_tag = item.find("div", class_="bd")
        p_tags = bd_tag.find_all("p")
        if len(p_tags) == 2:
            desc = p_tags[0].text.strip()
            quote = p_tags[1].text.strip()
            print("desc = ")
            result += "desc =" + "\n"
            result += desc + "\n"
            print(desc)
            print("quote = " + quote)
            result += "quote = " + quote + "\n"
        # rating and comment count
        star_tag = item.find("div", class_="star")
        spans = star_tag.find_all("span")
        print("star and comments = ")
        result += "star and comments = " + "\n"
        for span_content in spans:
            print(span_content.string)
            result += str(span_content.string) + "\n"
        result += "\n"
        result += "\n"
        save_to_file(result)


def save_to_file(content):
    """Append the result to a local file."""
    with open("D:\\learning\\python\\result.txt", "a", encoding="UTF-8") as file_object:
        file_object.write(content)


if __name__ == '__main__':
    # res = douban_rank250()
    # save_html(res)
    # res = open_html()
    # parse(res)
    max_page = 10
    size = 25
    for page in range(max_page):
        # the base URL was lost when this post was extracted;
        # only the query-string tail survives here
        url = "=" + str(page * size) + "&filter="
        res = douban_rank250(url)
        parse(res)
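On the redundancy the author mentions, one possible cleanup (a sketch of my own, not the author's code) would be to have `parse()` collect each movie into a dict and write everything at the end with the stdlib `csv` module, instead of concatenating strings and appending to a text file. The field names and the helper's name below are assumptions:

```python
import csv

# Hypothetical refactor: one dict per movie, one CSV write at the end.
FIELDS = ["rank", "title", "movie_url", "image_url"]

def write_movies_csv(movies, path):
    """Write a list of movie dicts to a CSV file with a header row."""
    with open(path, "w", encoding="UTF-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(movies)

if __name__ == "__main__":
    demo = [{"rank": "1", "title": "肖申克的救赎",
             "movie_url": "https://movie.douban.com/subject/1292052/",
             "image_url": "https://example.com/poster.jpg"}]
    write_movies_csv(demo, "result.csv")
```

A CSV also loads directly into a spreadsheet or pandas, which makes the results easier to inspect than a free-form text dump.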

The generated result file looks like this (the screenshot did not survive extraction):


Published: 2024-03-23 01:31:53 — thanks for your support of this site!
Permalink: https://www.elefans.com/category/jswz/34/1738893.html
Copyright notice: all content on this site comes from the internet and is for demonstration only; do not use it for commercial or other illegal purposes. If your rights have been infringed, please contact us and we will remove the content within 24 hours.
Tags: 豆瓣, 电影, BeautifulSoup, Top
