[Python] Scraping cat listings and visualizing the data: cats are this cute, so who wouldn't want one?

Updated: 2024-10-24 06:24:51

Introduction

Key points

Environment:

Modules used:

Code

Visualization

Closing remarks

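Introduction

Friends show off their cats, and I pet them for all I'm worth~

I scroll through cat videos, "adopt" them online, and joke in the comments about stealing them~

With this many cats around, who could resist~

So you start thinking about getting a cat of your own, but you have no idea what they cost. What now?

Time to put your skills to work~ Let's use Python to scrape a cat listing site~

and see which breeds are on offer, what they cost, which cats are easiest to keep, and so on...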

Key points

1. Using the parsel parsing module

2. Using the requests module

3. Saving the results to CSV
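
As a minimal sketch of the third point, here is how `csv.DictWriter` turns dictionaries into CSV rows with a single header line. The two sample rows and the helper name `rows_to_csv` are made up for illustration; the real script fills each dictionary from a scraped detail page and writes to a file instead of an in-memory buffer.

```python
import csv
import io

# Hypothetical sample rows; the real script builds these from scraped pages.
rows = [
    {'品种': '英短', '价格': '1500'},
    {'品种': '布偶', '价格': '3000'},
]


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with one header row."""
    buffer = io.StringIO(newline='')
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()      # header is written exactly once
    writer.writerows(rows)    # one CSV line per dictionary
    return buffer.getvalue()


csv_text = rows_to_csv(rows, fieldnames=['品种', '价格'])
print(csv_text)
```

Note that a file opened in append mode gets a new header line every time the script runs, so it may be worth writing the header only when the file is new or empty.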

Environment:

Python 3.8

PyCharm

Modules used:

csv (standard library)

requests >>> pip install requests

parsel >>> pip install parsel

(This module can raise an error: on Python 3.7 without lxml, installing parsel fails.)

Code

import requests  # HTTP requests; third-party: pip install requests
import parsel  # HTML parsing; third-party: pip install parsel
import csv  # CSV output; standard library

# Open the CSV file: mode='a' appends, encoding sets the text encoding,
# and newline='' stops the csv module from writing blank lines between rows.
f = open('猫咪data.csv', mode='a', encoding='utf-8', newline='')
# Columns: region, shop name, title, price, views, seller promise, number for sale,
# age, breed, vaccination, contact person, contact details, out-of-town shipping fee,
# purebred or not, sex, deworming status, video call available, detail-page URL
csv_writer = csv.DictWriter(f, fieldnames=[
    '地区', '店名', '标题', '价格', '浏览次数', '卖家承诺', '在售只数', '年龄', '品种',
    '预防', '联系人', '联系方式', '异地运费', '是否纯种', '猫咪性别', '驱虫情况', '能否视频', '详情页',
])
csv_writer.writeheader()  # write the header row

# Loop over the cat listing pages
for page in range(1, 21):
    print(f'--------------------Scraping page {page}--------------------')
    url = f'*****/index.php?/chanpinliebiao_c_2_{page}--24.html'
    # Request headers: a browser User-Agent so the script looks like a normal visitor
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36'
    }
    # The listing page is fetched with a GET request
    response = requests.get(url=url, headers=headers)
    # response.text is the page's HTML source as a string
    # print(response.text)
    # Parse the HTML; the data can be extracted with CSS selectors, XPath, or regular expressions
    selector = parsel.Selector(response.text)
    href = selector.css('#content div.breeds_floor div div a::attr(href)').getall()
    areas = selector.css('#content div.breeds_floor div div a div.price_area div.area span.color_333::text').getall()
    # Walk both lists in parallel; zip() yields (href, area) tuples, read by index
    for index in zip(href, areas):
        # Detail-page URL, for example *****/index.php?/chanpinxiangqing_404056.html
        index_url = '****' + index[0]
        # strip() removes whitespace at both ends of a string; a little cleaning
        # while scraping makes the later analysis easier
        area = index[1].strip()
        html_data = requests.get(url=index_url, headers=headers).text
        selector_1 = parsel.Selector(html_data)
        title = selector_1.css('.detail_text .title::text').get().strip()  # title
        shop = selector_1.css('.dinming::text').get().strip()  # shop name
        price = selector_1.css('.info1 div:nth-child(1) span.red.size_24::text').get()  # price
        views = selector_1.css('.info1 div:nth-child(1) span:nth-child(4)::text').get()  # views
        # replace() removes the '卖家承诺: ' ("seller promise: ") label prefix
        promise = selector_1.css('.info1 div:nth-child(2) span::text').get().replace('卖家承诺: ', '')  # seller promise
        num = selector_1.css('.info2 div:nth-child(1) div.red::text').get()  # number for sale
        age = selector_1.css('.info2 div:nth-child(2) div.red::text').get()  # age
        kind = selector_1.css('.info2 div:nth-child(3) div.red::text').get()  # breed
        prevention = selector_1.css('.info2 div:nth-child(4) div.red::text').get()  # vaccination
        person = selector_1.css('div.detail_text .user_info div:nth-child(1) .c333::text').get()  # contact person
        phone = selector_1.css('div.detail_text .user_info div:nth-child(2) .c333::text').get()  # contact details
        postage = selector_1.css('div.detail_text .user_info div:nth-child(3) .c333::text').get().strip()  # out-of-town shipping fee
        purebred = selector_1.css('.xinxi_neirong div:nth-child(1) .item_neirong div:nth-child(1) .c333::text').get().strip()  # purebred or not
        sex = selector_1.css('.xinxi_neirong div:nth-child(1) .item_neirong div:nth-child(4) .c333::text').get().strip()  # sex
        video = selector_1.css('.xinxi_neirong div:nth-child(2) .item_neirong div:nth-child(4) .c333::text').get().strip()  # video call available
        worming = selector_1.css('.xinxi_neirong div:nth-child(2) .item_neirong div:nth-child(2) .c333::text').get().strip()  # deworming status
        dit = {
            '地区': area,
            '店名': shop,
            '标题': title,
            '价格': price,
            '浏览次数': views,
            '卖家承诺': promise,
            '在售只数': num,
            '年龄': age,
            '品种': kind,
            '预防': prevention,
            '联系人': person,
            '联系方式': phone,
            '异地运费': postage,
            '是否纯种': purebred,
            '猫咪性别': sex,
            '驱虫情况': worming,
            '能否视频': video,
            '详情页': index_url,
        }
        csv_writer.writerow(dit)
        print(title, area, shop, price, views, promise, num, age, kind, prevention,
              person, phone, postage, purebred, sex, video, worming, index_url, sep=' | ')

f.close()  # flush and close the CSV file once all pages are done
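
One thing to watch in the scraper: parsel's `.get()` returns `None` when a selector matches nothing, so chaining `.strip()` onto it raises `AttributeError` as soon as one listing lacks a field. A small helper can absorb that. This is only a sketch; `clean_text` is a hypothetical name, not part of parsel or of the original script.

```python
from typing import Optional


def clean_text(value: Optional[str], default: str = '') -> str:
    """Strip surrounding whitespace, falling back to `default` for None."""
    if value is None:
        return default
    return value.strip()


# In the scraper this would wrap each selector call, for example:
#   title = clean_text(selector_1.css('.detail_text .title::text').get())
print(repr(clean_text('  英短蓝猫  ')))
print(repr(clean_text(None)))
```

With this in place a missing field becomes an empty cell in the CSV instead of stopping the whole run.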

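Visualization

# Assumption: cat_info is the scraped CSV loaded as a pandas DataFrame
import pandas as pd
cat_info = pd.read_csv('猫咪data.csv')

# Listings per province: take the part of the region before the first '/'
cat_info['地区'] = cat_info['地区'].astype(str)
cat_info['province'] = cat_info['地区'].map(lambda s: s.split('/')[0])
pv = cat_info['province'].value_counts().reset_index()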
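
To make the province step concrete, here is the same idea in plain Python: keep the text before the first `/` and count how often each value occurs. The region strings are invented, and the `province/city` shape is an assumption inferred from the `split('/')` call rather than something confirmed against the site.

```python
from collections import Counter

# Hypothetical region strings in the 'province/city' shape the split assumes.
regions = ['广东/深圳', '广东/广州', '北京', '上海/上海']


def count_provinces(regions):
    """Count listings per province, taking the text before the first '/'."""
    return Counter(str(region).split('/')[0] for region in regions)


province_counts = count_provinces(regions)
print(province_counts.most_common())
```

A region without a `/` is counted under its full text, which matches what `s.split('/')[0]` does in the pandas version.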

# Treemap: share of listings by breed
from pyecharts import options as opts
from pyecharts.charts import TreeMap
from pyecharts.globals import ThemeType

pingzhong = cat_info['品种'].value_counts().reset_index()
# The column names below match pandas < 2.0 ('index' and '品种');
# pandas 2.x names them '品种' and 'count' instead.
data = [{'value': i[1], 'name': i[0]} for i in zip(list(pingzhong['index']), list(pingzhong['品种']))]
c = (
    TreeMap(init_opts=opts.InitOpts(theme=ThemeType.LIGHT))
    .add("", data)
    .set_global_opts(title_opts=opts.TitleOpts(title=""))
    .set_series_opts(label_opts=opts.LabelOpts(position="inside"))
)
c.render_notebook()

# Average price per breed, rounded to whole numbers and sorted ascending
price = cat_info.groupby('品种')['价格'].mean().reset_index()
price['价格'] = round(price['价格'], 0)
price = price.sort_values(by='价格')

# Pictorial bar chart: breeds ranked by average price
from pyecharts import options as opts
from pyecharts.charts import PictorialBar
from pyecharts.globals import SymbolType, ThemeType

location = list(price['品种'])
values = list(price['价格'])
c = (
    PictorialBar(init_opts=opts.InitOpts(theme=ThemeType.LIGHT))
    .add_xaxis(location)
    .add_yaxis(
        "",
        values,
        label_opts=opts.LabelOpts(is_show=False),
        symbol_size=18,
        symbol_repeat="fixed",
        symbol_offset=[0, 0],
        is_symbol_clip=True,
        symbol=SymbolType.ROUND_RECT,
    )
    .reversal_axis()
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Average price ranking"),
        xaxis_opts=opts.AxisOpts(is_show=False),
        yaxis_opts=opts.AxisOpts(
            axistick_opts=opts.AxisTickOpts(is_show=False),
            axisline_opts=opts.AxisLineOpts(
                linestyle_opts=opts.LineStyleOpts(opacity=0),
            ),
        ),
    )
    .set_series_opts(label_opts=opts.LabelOpts(position='insideRight'))
)
c.render_notebook()

## Scatter plot: do views go up with price?
view = cat_info['浏览次数']
money = cat_info['价格']

import pyecharts.options as opts
from pyecharts.charts import Scatter
from pyecharts.globals import ThemeType

# Plot only the first 1000 listings
x_data = list(view)[:1000]
y_data = list(money)[:1000]
c = (
    Scatter(init_opts=opts.InitOpts(theme=ThemeType.LIGHT))
    .add_xaxis(xaxis_data=x_data)
    .add_yaxis(
        series_name="",
        y_axis=y_data,
        symbol_size=20,
        label_opts=opts.LabelOpts(is_show=False),
    )
    .set_series_opts()
    .set_global_opts(
        xaxis_opts=opts.AxisOpts(
            type_="value",
            splitline_opts=opts.SplitLineOpts(is_show=True),
            name='Views',
        ),
        yaxis_opts=opts.AxisOpts(
            type_="value",
            axistick_opts=opts.AxisTickOpts(is_show=True),
            splitline_opts=opts.SplitLineOpts(is_show=True),
            name='Price',
        ),
        tooltip_opts=opts.TooltipOpts(is_show=False),
    )
)
c.render_notebook()
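
The scatter plot answers the question by eye. If a number is wanted as well, a Pearson correlation coefficient is one option; it is not part of the original analysis. The sketch below computes it by hand so that it also runs on Python 3.8, and the sample figures are invented rather than taken from the scraped data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError('need two sequences of equal length >= 2')
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError('correlation is undefined for a constant sequence')
    return cov / math.sqrt(var_x * var_y)


# Hypothetical numbers, not scraped data.
views = [120, 300, 80, 450, 200]
prices = [1500, 2600, 1200, 3800, 2000]
print(round(pearson_r(views, prices), 3))
```

A value near 1 suggests views and price rise together, a value near 0 suggests no linear relationship, and a value near -1 suggests they move in opposite directions.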

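# Is price related to age? Data preparation for a box plot
a_p = cat_info[['价格', '年龄']].copy()  # copy so the assignments below do not touch cat_info
# Drop the '个月' ("months") suffix so only the number remains
a_p['年龄'] = a_p['年龄'].map(lambda x: x.replace('个月', ''))

def ages(s):
    # Leave missing values (the string 'nan') unchanged
    if s == 'nan':
        return s
    s = int(s)
    # Bucket labels: '个月' means months, '1年以上' means over one year
    if 1 <= s < 3:
        return '1-3个月'
    if 3 <= s < 6:
        return '3-6个月'
    if 6 <= s < 9:
        return '6-9个月'
    if 9 <= s < 12:
        return '9-12个月'
    if s >= 12:
        return '1年以上'

a_p['age'] = a_p['年龄'].map(ages)
a_p.head()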
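
The bucketing above assumes every age is either the string 'nan' or a whole number of months. A slightly more defensive variant, sketched here under the hypothetical name `age_bucket`, returns `None` for anything it cannot parse instead of raising. It uses the same bucket labels, but note the behavior change: missing values become `None` rather than staying as 'nan'.

```python
def age_bucket(raw):
    """Map an age such as '3个月' or '3' to a bucket label, or None if unparseable."""
    text = str(raw).replace('个月', '').strip()
    try:
        months = int(text)
    except ValueError:
        return None  # missing or non-numeric ages stay unbucketed
    if months < 1:
        return None
    if months < 3:
        return '1-3个月'
    if months < 6:
        return '3-6个月'
    if months < 9:
        return '6-9个月'
    if months < 12:
        return '9-12个月'
    return '1年以上'


for sample in ['2个月', '3', '11', '12个月', 'nan']:
    print(sample, '->', age_bucket(sample))
```

It can be applied the same way as the original function, by passing it to `map` on the age column.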