Scraping QQ Music Comments and Generating a Word Cloud: 《听妈妈的话》 (Listen to Mom) as an Example

Updated: 2024-10-21 05:32:30


The song we picked is Jay Chou's 《听妈妈的话》 (Listen to Mom).
Let's start with a look at the final result.

First, open QQ Music and find the song ~~NetEase Cloud Music, step up and take your beating~~
.html

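Click Comments or scroll down and the comments appear.
Press F12 to open the developer tools, select the Network tab, then click through to the next page of comments and watch the requests. A few image requests show up, along with one whose name starts with fcg. Look at its Response pane.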

Lo and behold, the comment data is tucked away in the commentlist array under the comment object, and it is JSON.
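To make that structure concrete, here is a minimal sketch of pulling the comment text out of such a response. The payload is a hand-made, hypothetical sample that only mimics the shape described above (a `comment` object holding a `commentlist` array). The field names `commentid` and `rootcommentcontent` are the ones the code later in this article reads; the values are invented, and `extract_comments` is a helper name made up for illustration.

```python
import json

# Hypothetical sample shaped like the response described above; values are made up.
sample_text = '''
{"comment": {"commentlist": [
    {"commentid": "song_1_a", "rootcommentcontent": "first comment"},
    {"commentid": "song_1_b", "rootcommentcontent": "second comment"}
]}}
'''


def extract_comments(text):
    """Return the list of comment dicts, or an empty list if the keys are missing."""
    data = json.loads(text)
    return (data.get("comment") or {}).get("commentlist") or []


comments = extract_comments(sample_text)
contents = [c["rootcommentcontent"] for c in comments]
print(contents)
```

Falling back to an empty list is just one way to cope with a response that lacks the expected keys; the real API may behave differently.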
Now take a look at the Headers:

Copy the request URL and paste it into the address bar. Luckily it opens directly, which saves a lot of trouble:

Compare the request URLs for different pages of comments:

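Almost nothing changes from page to page; only two parameters differ:

"pagenum": the page number
"lasthotcommentid": the ID of the previous hot comment (the code below fills it with the commentid of the last comment on the previous page)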
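As a rough sketch of what "only two parameters change" means in practice, the helper below copies a base parameter dict and swaps in the two page-specific values. `page_params` is a hypothetical name, and `base` holds only a handful of the real query parameters for brevity.

```python
from urllib.parse import urlencode


def page_params(base_params, pagenum, last_comment_id):
    """Return a copy of base_params with the two page-specific fields replaced."""
    params = dict(base_params)
    params["pagenum"] = str(pagenum)
    params["lasthotcommentid"] = last_comment_id
    return params


# Only a few of the real query parameters, for brevity.
base = {"topid": "102066257", "pagesize": "25", "pagenum": "1", "lasthotcommentid": ""}
page2 = page_params(base, 2, "song_102066257_18578995_1591191607")
print(urlencode(page2))
```

Returning a copy keeps the base dict unchanged, so the same starting point can be reused for every page.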

Now for the important part, so stay focused:

Right-click that comment request and choose Copy, then Copy as cURL.
You get a string like this:

curl '.fcg?g_tk_new_20200303=5381&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=GB2312&notice=0&platform=yqq.json&needNewCode=0&cid=205360772&reqtype=2&biztype=1&topid=102066257&cmd=8&needmusiccrit=0&pagenum=1&pagesize=25&lasthotcommentid=song_102066257_18578995_1591191607&domain=qq.com&ct=24&cv=10101010' \
  -H 'authority: c.y.qq.com' \
  -H 'sec-ch-ua: "\\Not\"A;Brand";v="99", "Chromium";v="84", "Microsoft Edge";v="84"' \
  -H 'accept: application/json, text/javascript, */*; q=0.01' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.30 Safari/537.36 Edg/84.0.522.11' \
  -H 'origin: ' \
  -H 'sec-fetch-site: same-site' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-dest: empty' \
  -H 'referer: .html' \
  -H 'accept-language: zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6' \
  -H 'cookie: pgv_pvid=2783427480; ts_uid=1847083617; pgv_pvi=8854632448; userAction=1; yqq_stat=0; pgv_info=ssid=s2338435723; pgv_si=s7558193152; ts_last=y.qq.com/n/yqq/song/002hXDfk0LX9KO.html' \
  --compressed

Turn it into request code:

import requests

url = '.fcg'
querystring = {
    "g_tk_new_20200303": "5381", "g_tk": "5381", "loginUin": "0", "hostUin": "0",
    "format": "json", "inCharset": "utf8", "outCharset": "GB2312", "notice": "0",
    "platform": "yqq.json", "needNewCode": "0", "cid": "205360772", "reqtype": "2",
    "biztype": "1", "topid": "102066257", "cmd": "8", "needmusiccrit": "0",
    "pagenum": "1", "pagesize": "25",
    "lasthotcommentid": "song_102066257_18578995_1591191607",
    "domain": "qq.com", "ct": "24", "cv": "10101010",
}
headers = {
    'accept': 'application/json, text/javascript, */*; q=0.01',
    'referer': '.html',
    'origin': '',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.30 Safari/537.36 Edg/84.0.522.11',
    'authority': 'c.y.qq.com',
}
response = requests.request('GET', url, headers=headers, params=querystring)
# The comments live in the commentlist array under the comment object
# print(response.text)
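If you would rather do the cURL-to-dict step by hand, the headers can be picked out of the copied command with the standard library. This is only a sketch: `curl_headers` is a made-up helper, the sample command is shortened, its URL is a placeholder, and it assumes the whole command sits on a single line (no backslash line continuations).

```python
import shlex


def curl_headers(command):
    """Collect the value of every -H option in a cURL command into a dict."""
    tokens = shlex.split(command)
    headers = {}
    # Pair each token with the one that follows it and keep the "-H value" pairs.
    for flag, value in zip(tokens, tokens[1:]):
        if flag == "-H":
            name, _, content = value.partition(":")
            headers[name.strip()] = content.strip()
    return headers


# A shortened, hypothetical command; the URL is a placeholder.
command = (
    "curl 'https://example.invalid/comment?pagenum=1' "
    "-H 'authority: c.y.qq.com' "
    "-H 'accept: application/json, text/javascript, */*; q=0.01' "
    "--compressed"
)
headers = curl_headers(command)
print(headers)
```

Splitting on the first colon only keeps header values that themselves contain colons intact.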

Save the comments to a file, remove duplicate lines, and strip the non-text emoji markers:

import re
import json

file_path = './comment.txt'
files = open(file_path, 'a', encoding='utf8')  # open the file for appending
# Matches emoji markers of the form [em]...[/em]
emoji_pattern = re.compile(r'\[em\].*?\[/em\]', re.S)

comment_json = json.loads(response.text)
comment_list = comment_json['comment']['commentlist']
last_comment = comment_list[24]
last_comment_id = last_comment['commentid']

# Walk through the pages
try:
    for i in range(500):  # pick however many pages you like
        querystring['pagenum'] = str(i)
        querystring['lasthotcommentid'] = last_comment_id
        response = requests.request('GET', url, headers=headers, params=querystring)
        comment_json = json.loads(response.text)
        comment_list = comment_json['comment']['commentlist']
        # Write the comments to the file, stripping the emoji markers
        for comment in comment_list:
            rootcommentcontent = comment['rootcommentcontent']
            rootcommentcontent = re.sub(emoji_pattern, '', rootcommentcontent)
            files.write(rootcommentcontent + '\n')
        print('Writing page', i)
        # A page with fewer than 25 comments raises IndexError here and ends the loop
        last_comment = comment_list[24]
        last_comment_id = last_comment['commentid']
except Exception as e:
    print('Index out of range, stopping:', e)
finally:
    # Close first so buffered writes reach the file before it is re-read
    files.close()
    # Remove duplicate lines
    with open(file_path, 'r+', encoding='utf8') as rwf:
        new_content = list(set(rwf.readlines()))
        rwf.seek(0)
        rwf.writelines(new_content)
        rwf.truncate(rwf.tell())
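The two clean-up steps can be tried on their own without touching the network. The sketch below uses invented sample lines (the emoji codes inside `[em]...[/em]` are hypothetical). One thing to watch in the pattern: an unescaped `[/em]` is a regex character class matching a single `/`, `e`, or `m`, so the brackets are escaped here and the match is non-greedy. For deduplication it swaps in `dict.fromkeys`, which, unlike `set`, keeps the first occurrence of each line in its original position.

```python
import re

# Matches [em]...[/em] markers; brackets are escaped and the match is non-greedy.
EM_TAG = re.compile(r"\[em\].*?\[/em\]", re.S)


def strip_em_tags(text):
    """Remove every [em]...[/em] marker from text."""
    return EM_TAG.sub("", text)


def dedupe_keep_order(lines):
    """Drop repeated lines while keeping first occurrences in order."""
    return list(dict.fromkeys(lines))


# Invented sample lines; the emoji codes are hypothetical.
raw = [
    "great song[em]e400832[/em]\n",
    "great song\n",
    "miss my mom[em]e1[/em] so much\n",
]
cleaned = dedupe_keep_order(strip_em_tags(line) for line in raw)
print(cleaned)
```

Whether comment order matters for a word cloud is debatable; keeping it simply makes the output file easier to compare between runs.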

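Generate the word cloud:

You need a Chinese font file, otherwise Chinese text may come out garbled, plus any background image you like.

The font I used is a xiaokai (small regular script) font,
along with an image I picked more or less at random (feel free to be just as casual).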

from PIL import Image
import numpy as np
import jieba
from wordcloud import WordCloud, ImageColorGenerator
from matplotlib import pyplot as plt

img = './33.jpg'
with open(file_path, 'r', encoding='utf8') as fh:
    text = fh.read()
background = np.array(Image.open(img))
# jieba.add_word('这首歌')  # add your own words to the dictionary before cutting
jieba_text = ' '.join(jieba.cut(text))  # note: joined with spaces
wordcloud = WordCloud(
    font_path='./font/xiaokai.ttf',  # Chinese font
    background_color='white',  # background color
    mask=background,
).generate(jieba_text)
# Derive the word colors from the background image
image_colors = ImageColorGenerator(background)
wordcloud.to_file('听妈妈的话.png')  # saved before recoloring, so the file keeps the default colors
plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation='bilinear')
plt.axis('off')
plt.show()
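Why join the tokens with spaces? Chinese text has no word separators, so the segmented words have to be delimited somehow before they can be counted. The sketch below shows the idea with the standard library only: the token list is a hypothetical stand-in for what `jieba.cut` might return, and the single-character filter is my own choice for illustration, not something the original code does.

```python
from collections import Counter

# Hypothetical tokens standing in for the output of jieba.cut(...)
tokens = ["妈妈", "的", "话", "听", "妈妈", "周杰伦", "妈妈", "周杰伦"]

# Join with spaces so the words can be separated again later.
spaced_text = " ".join(tokens)

# Count each word, skipping single-character tokens (mostly particles).
frequencies = Counter(w for w in spaced_text.split() if len(w) > 1)
print(frequencies.most_common(2))
```

If you prefer to control the counts yourself, a mapping like this can be handed to `WordCloud.generate_from_frequencies` in place of calling `generate` on the space-joined text.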

Full code:

import requests
import re
import json
from PIL import Image
import numpy as np
import jieba
from wordcloud import WordCloud, ImageColorGenerator
from matplotlib import pyplot as plt

url = '.fcg'
querystring = {
    "g_tk_new_20200303": "5381", "g_tk": "5381", "loginUin": "0", "hostUin": "0",
    "format": "json", "inCharset": "utf8", "outCharset": "GB2312", "notice": "0",
    "platform": "yqq.json", "needNewCode": "0", "cid": "205360772", "reqtype": "2",
    "biztype": "1", "topid": "102066257", "cmd": "8", "needmusiccrit": "0",
    "pagenum": "1", "pagesize": "25",
    "lasthotcommentid": "song_102066257_18578995_1591191607",
    "domain": "qq.com", "ct": "24", "cv": "10101010",
}
headers = {
    'accept': 'application/json, text/javascript, */*; q=0.01',
    'referer': '.html',
    'origin': '',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.30 Safari/537.36 Edg/84.0.522.11',
    'authority': 'c.y.qq.com',
}
response = requests.request('GET', url, headers=headers, params=querystring)
# The comments live in the commentlist array under the comment object
# print(response.text)

# Store the comments in a txt file
file_path = './comment.txt'
files = open(file_path, 'a', encoding='utf8')  # open the file for appending
# Matches emoji markers of the form [em]...[/em]
emoji_pattern = re.compile(r'\[em\].*?\[/em\]', re.S)


def view():
    img = './33.jpg'
    with open(file_path, 'r', encoding='utf8') as fh:
        text = fh.read()
    background = np.array(Image.open(img))
    # jieba.add_word('这首歌')  # add your own words to the dictionary before cutting
    jieba_text = ' '.join(jieba.cut(text))  # note: joined with spaces
    wordcloud = WordCloud(
        font_path='./font/xiaokai.ttf',  # Chinese font
        background_color='white',  # background color
        mask=background,
    ).generate(jieba_text)
    # Derive the word colors from the background image
    image_colors = ImageColorGenerator(background)
    wordcloud.to_file('听妈妈的话.png')  # saved before recoloring, so the file keeps the default colors
    plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation='bilinear')
    plt.axis('off')
    plt.show()


comment_json = json.loads(response.text)
comment_list = comment_json['comment']['commentlist']
last_comment = comment_list[24]
last_comment_id = last_comment['commentid']

# Walk through the pages
try:
    for i in range(500):  # pick however many pages you like
        querystring['pagenum'] = str(i)
        querystring['lasthotcommentid'] = last_comment_id
        response = requests.request('GET', url, headers=headers, params=querystring)
        comment_json = json.loads(response.text)
        comment_list = comment_json['comment']['commentlist']
        # Write the comments to the file, stripping the emoji markers
        for comment in comment_list:
            rootcommentcontent = comment['rootcommentcontent']
            rootcommentcontent = re.sub(emoji_pattern, '', rootcommentcontent)
            files.write(rootcommentcontent + '\n')
        print('Writing page', i)
        # A page with fewer than 25 comments raises IndexError here and ends the loop
        last_comment = comment_list[24]
        last_comment_id = last_comment['commentid']
except Exception as e:
    print('Index out of range, stopping:', e)
finally:
    # Close first so buffered writes reach the file before it is re-read
    files.close()
    # Remove duplicate lines
    with open(file_path, 'r+', encoding='utf8') as rwf:
        new_content = list(set(rwf.readlines()))
        rwf.seek(0)
        rwf.writelines(new_content)
        rwf.truncate(rwf.tell())

view()

In the end we get a txt file holding the comments, plus the word cloud image.

Other songs work much the same way...


Published: 2023-07-01 07:37:52

Article link: https://www.elefans.com/category/jswz/34/972721.html