Dataset: Google Play Store Apps. URL: https://www.kaggle/lava18/google-play-store-apps?select=googleplaystore.csv
The dataset consists of two CSV files: one with overall data on Google Play Store apps, and one with user reviews.
The user reviews are highly subjective and short, so this analysis uses only the overall app data.
The Google Play Store file contains 13 fields:
App: Application name
Category: Category the app belongs to
Rating: Overall user rating of the app (as when scraped)
Reviews: Number of user reviews for the app (as when scraped)
Size: Size of the app (as when scraped)
Installs: Number of user downloads/installs for the app (as when scraped)
Type: Paid or Free
Price: Price of the app (as when scraped)
Content Rating: Age group the app is targeted at - Children / Mature 21+ / Adult
Genres: An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game, and Family genres.
Last Updated: Date when the app was last updated on Play Store (as when scraped)
Current Ver: Current version of the app available on Play Store (as when scraped)
Android Ver: Min required Android version (as when scraped)
I. Importing the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('D:/Textbooks/Kaggle/Google Play Store Apps/googleplaystore.csv')
The data has 10,841 rows and 13 columns.
data.shape
Out[291]: (10841, 13)
The Rating column has by far the most missing values: 1,474.
data.isna().sum().sort_values(ascending=False)
Out[293]:
Rating 1474
Current Ver 8
Android Ver 3
Content Rating 1
Type 1
Last Updated 0
Genres 0
Price 0
Installs 0
Size 0
Reviews 0
Category 0
App 0
dtype: int64
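Raw counts are easier to judge as a share of all rows: 1,474 of 10,841 is roughly 13.6% of the Rating column. A minimal sketch of that calculation, using a small made-up frame (`toy` and its values are illustrative assumptions, not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.1, np.nan, np.nan, 3.9],
    "Type": ["Free", "Paid", None, "Free"],
})

# isna() yields booleans, so mean() gives the fraction missing per column.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real frame the same expression would be `data.isna().mean()`.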
II. Data cleaning
Version and update-time information is not analyzed here, so the three corresponding columns are dropped first.
data.drop(columns=['Android Ver','Current Ver','Last Updated'],
inplace=True)
1. App
data['App'].unique().size
Out[295]: 9660
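The 10,841 rows contain only 9,660 distinct app names. Each app should be counted once, so the duplicate rows are dropped by App to keep the analysis accurate.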
data.drop_duplicates('App',inplace=True)
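By default, `drop_duplicates` keeps the first row it meets for each App. The sketch below shows, on made-up data, how the duplicates could be inspected first, and one alternative policy (keeping the copy with the most reviews). The alternative is an assumption for illustration, not the author's choice, and on the real data it would require Reviews to be numeric already.

```python
import pandas as pd

# Toy data (made up): "Alpha" and "Beta" each appear twice.
toy = pd.DataFrame({
    "App": ["Alpha", "Beta", "Alpha", "Gamma", "Beta"],
    "Reviews": [120, 40, 150, 7, 45],
})

# keep=False marks every copy of a repeated name, not just the later ones.
dupes = toy[toy.duplicated("App", keep=False)]

# Behaviour used in the article: keep the first occurrence of each App.
first_kept = toy.drop_duplicates("App")

# Alternative (assumption): keep the copy with the most reviews by
# sorting before dropping, then restore the original row order.
most_reviewed = (
    toy.sort_values("Reviews", ascending=False)
       .drop_duplicates("App")
       .sort_index()
)
```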
2. Category
print(data.Category.unique())
['ART_AND_DESIGN' 'AUTO_AND_VEHICLES' 'BEAUTY' 'BOOKS_AND_REFERENCE'
'BUSINESS' 'COMICS' 'COMMUNICATION' 'DATING' 'EDUCATION' 'ENTERTAINMENT'
'EVENTS' 'FINANCE' 'FOOD_AND_DRINK' 'HEALTH_AND_FITNESS' 'HOUSE_AND_HOME'
'LIBRARIES_AND_DEMO' 'LIFESTYLE' 'GAME' 'FAMILY' 'MEDICAL' 'SOCIAL'
'SHOPPING' 'PHOTOGRAPHY' 'SPORTS' 'TRAVEL_AND_LOCAL' 'TOOLS'
'PERSONALIZATION' 'PRODUCTIVITY' 'PARENTING' 'WEATHER' 'VIDEO_PLAYERS'
'NEWS_AND_MAGAZINES' 'MAPS_AND_NAVIGATION' '1.9']
Category contains one anomalous value, '1.9'. The row that carries it is removed.
data=data[data.Category != '1.9']
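Instead of spotting the odd value by eye, a pattern check can flag it. This is only a sketch on made-up rows, and it assumes that every valid category consists solely of capital letters and underscores, which holds for the values listed above but is not stated by the dataset.

```python
import pandas as pd

# Toy rows (made up) with one malformed category.
toy = pd.DataFrame({
    "App": ["a", "b", "c"],
    "Category": ["TOOLS", "1.9", "FOOD_AND_DRINK"],
})

# Assumption: valid categories contain only capital letters and underscores.
valid = toy["Category"].str.fullmatch(r"[A-Z_]+")

suspicious = toy[~valid]   # rows to inspect by hand
cleaned = toy[valid]       # rows that pass the check
```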
3. Rating
data.Rating.isna().sum()
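Rating has many missing values. Dropping those rows would discard a lot of data, but filling them with the mean or median is not appropriate either, so the NaN values are left in place and simply ignored.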
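Leaving the NaN values in place works because pandas reductions skip them by default (`skipna=True`). A small sketch on made-up ratings:

```python
import numpy as np
import pandas as pd

# Toy ratings (made up); one GAME app has no rating.
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS"],
    "Rating": [4.0, np.nan, 3.0, 5.0],
})

# mean() ignores NaN, so the average is taken over the three rated apps.
overall_mean = toy["Rating"].mean()

# Group-wise means skip NaN in the same way.
by_category = toy.groupby("Category")["Rating"].mean()

# count() reports how many non-missing values each statistic rests on.
rated_count = toy["Rating"].count()
```

Reporting the count next to each mean makes it visible how many apps actually contributed a rating.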
4. Reviews
The column is stored as strings (dtype object), so convert it to a numeric type.
data.Reviews.dtype
Out[300]: dtype('O')
data.Reviews = pd.to_numeric(data.Reviews)
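`pd.to_numeric` raises an error if any entry cannot be parsed. When a column might still hold stray strings, `errors="coerce"` turns them into NaN so they can be located afterwards. A sketch with made-up values:

```python
import pandas as pd

# Made-up review counts; the third entry is deliberately not a number.
raw = pd.Series(["159", "967", "3.0M", "87510"])

# errors="coerce" converts unparseable entries to NaN instead of raising.
parsed = pd.to_numeric(raw, errors="coerce")

# The original strings that failed to parse.
bad_entries = raw[parsed.isna()]
```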
5. Size
data.Size.value_counts()
Out[306]:
Varies with device 1227
11M 182
12M 181
13M 177
14M 177
...
226k 1
903k 1
190k 1
400k 1
54k 1
Name: Size, Length: 461, dtype: int64
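Strip the unit from Size and convert everything to a numeric value in kilobytes (k). Entries such as 'Varies with device' become NaN.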
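def f(x):
    # Convert a size string such as '11M' or '226k' to kilobytes.
    if x[-1] == 'M':
        res = float(x[:-1])*1024
    elif x[-1] == 'k':
        res = float(x[:-1])
    else:
        # No unit suffix, e.g. 'Varies with device'.
        res = np.nan
    return res
data.Size = data.Size.apply(f)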
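The same conversion can also be written without a row-by-row function. The following is an alternative sketch, not the article's method, using `Series.str.extract` on a few made-up sizes. The names `sizes`, `parts`, and `size_kb` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up size strings in the same formats as the Size column.
sizes = pd.Series(["11M", "226k", "Varies with device", "1.5M"])

# Capture the numeric part and the unit; rows that do not match become NaN.
parts = sizes.str.extract(r"^([\d.]+)([Mk])$")

number = pd.to_numeric(parts[0])
factor = parts[1].map({"M": 1024.0, "k": 1.0})  # multiplier to kilobytes

size_kb = number * factor
```

Here anything that does not match the pattern, including 'Varies with device', ends up as NaN, mirroring the `else` branch of `f`.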
6. Installs
data.Installs.unique()
Out[308]:
array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
'50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
'1,000,000,000+', '1,000+', '500,000,000+