"Python Data Analysis and Mining in Practice" (《Python数据分析与挖掘实战》), Chapter 7

Updated: 2024-10-27 16:34:20


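This article works through the dataset from Chapter 7 of "Python Data Analysis and Mining in Practice", the "Airline Customer Value Analysis" case study. It fills in parts of the book's code and implements the radar chart shown in the original text.

1. Background and goals

The project segments an airline's customers using the data the airline provides and compares the value of the resulting customer groups, so that the airline has a basis for offering more personalized service.

2. Overall workflow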

import pandas as pd
import numpy as np
data = pd.read_csv('air_data.csv', encoding='utf8')
data.head()
   MEMBER_NO    FFP_DATE  FIRST_FLIGHT_DATE  GENDER  FFP_TIER
0      54993  2006/11/02         2008/12/24                 6
1      28065  2007/02/19         2007/08/03                 6
2      55106  2007/02/01         2007/08/30                 6
3      21189  2008/08/22         2008/08/23                 5
4      39546  2009/04/10         2009/04/15                 6

     WORK_CITY  WORK_PROVINCE  WORK_COUNTRY   AGE   LOAD_TIME  ...
0            .           北京            CN  31.0  2014/03/31  ...
1          NaN           北京            CN  42.0  2014/03/31  ...
2            .           北京            CN  40.0  2014/03/31  ...
3  Los Angeles             CA            US  64.0  2014/03/31  ...
4         贵阳           贵州            CN  48.0  2014/03/31  ...

   ADD_Point_SUM  Eli_Add_Point_Sum  L1Y_ELi_Add_Points  Points_Sum  L1Y_Points_Sum
0          39992             114452              111100      619760          370211
1          12000              53288               53288      415768          238410
2          15491              55202               51711      406361          233798
3              0              34890               34890      372204          186100
4          22704              64969               64969      338813          210365

   Ration_L1Y_Flight_Count  Ration_P1Y_Flight_Count  Ration_P1Y_BPS  Ration_L1Y_BPS  Point_NotFlight
0                 0.509524                 0.490476        0.487221        0.512777               50
1                 0.514286                 0.485714        0.489289        0.510708               33
2                 0.518519                 0.481481        0.481467        0.518530               26
3                 0.434783                 0.565217        0.551722        0.448275               12
4                 0.532895                 0.467105        0.469054        0.530943               39

5 rows × 44 columns

import chardet

# Read the raw bytes and let chardet guess the file encoding
with open('air_data.csv', 'rb') as f:
    data_1 = f.read()
print(chardet.detect(data_1))
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
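chardet makes a statistical guess, so it can help to cross-check the result by actually trying to decode the bytes. The snippet below is a minimal sketch of that idea using only the standard library; the function name `detect_encoding` and the candidate list are assumptions made for illustration, not part of the original workflow.

```python
def detect_encoding(raw: bytes, candidates=("utf-8", "gb18030")) -> str:
    """Return the first candidate encoding that decodes `raw` without error."""
    for name in candidates:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Hypothetical samples: the same text stored in two different encodings
utf8_bytes = "北京,CN".encode("utf-8")
gbk_bytes = "北京,CN".encode("gb18030")
print(detect_encoding(utf8_bytes))
print(detect_encoding(gbk_bytes))
```

The order of the candidates matters: UTF-8 is strict and rejects most non-UTF-8 byte sequences, so it is tried first, while more permissive encodings go last.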
print(type(data))
# print(data.head())
data.isnull().sum()
<class 'pandas.core.frame.DataFrame'>
MEMBER_NO                     0
FFP_DATE                      0
FIRST_FLIGHT_DATE             0
GENDER                        3
FFP_TIER                      0
WORK_CITY                  2269
WORK_PROVINCE              3248
WORK_COUNTRY                 26
AGE                         420
LOAD_TIME                     0
FLIGHT_COUNT                  0
BP_SUM                        0
EP_SUM_YR_1                   0
EP_SUM_YR_2                   0
SUM_YR_1                    551
SUM_YR_2                    138
SEG_KM_SUM                    0
WEIGHTED_SEG_KM               0
LAST_FLIGHT_DATE              0
AVG_FLIGHT_COUNT              0
AVG_BP_SUM                    0
BEGIN_TO_FIRST                0
LAST_TO_END                   0
AVG_INTERVAL                  0
MAX_INTERVAL                  0
ADD_POINTS_SUM_YR_1           0
ADD_POINTS_SUM_YR_2           0
EXCHANGE_COUNT                0
avg_discount                  0
P1Y_Flight_Count              0
L1Y_Flight_Count              0
P1Y_BP_SUM                    0
L1Y_BP_SUM                    0
EP_SUM                        0
ADD_Point_SUM                 0
Eli_Add_Point_Sum             0
L1Y_ELi_Add_Points            0
Points_Sum                    0
L1Y_Points_Sum                0
Ration_L1Y_Flight_Count       0
Ration_P1Y_Flight_Count       0
Ration_P1Y_BPS                0
Ration_L1Y_BPS                0
Point_NotFlight               0
dtype: int64
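With 44 columns, the raw counts are easier to read when they are reduced to the columns that actually have gaps, together with the share of rows affected. The following is a small sketch on made-up data; the frame `toy` and the column names `n_missing` and `ratio` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the airline table
toy = pd.DataFrame({
    "AGE": [31.0, np.nan, 40.0, 64.0],
    "WORK_CITY": ["Beijing", None, None, "Los Angeles"],
    "FLIGHT_COUNT": [210, 140, 135, 23],
})

missing = toy.isnull().sum()
report = pd.DataFrame({
    "n_missing": missing,
    "ratio": missing / len(toy),  # share of rows with a missing value
})
# Keep only columns with at least one gap, worst first
report = report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(report)
```

Applied to the real data, the same report would surface SUM_YR_1 and SUM_YR_2, the two fare columns that the cleaning step filters on.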
data.describe()
           MEMBER_NO       FFP_TIER            AGE   FLIGHT_COUNT         BP_SUM
count   62988.000000   62988.000000   62568.000000   62988.000000   62988.000000
mean    31494.500000       4.102162      42.476346      11.839414   10925.081254
std     18183.213715       0.373856       9.885915      14.049471   16339.486151
min         1.000000       4.000000       6.000000       2.000000       0.000000
25%     15747.750000       4.000000      35.000000       3.000000    2518.000000
50%     31494.500000       4.000000      41.000000       7.000000    5700.000000
75%     47241.250000       4.000000      48.000000      15.000000   12831.000000
max     62988.000000       6.000000     110.000000     213.000000  505308.000000

         EP_SUM_YR_1    EP_SUM_YR_2       SUM_YR_1       SUM_YR_2     SEG_KM_SUM  ...
count        62988.0   62988.000000   62437.000000   62850.000000   62988.000000  ...
mean             0.0     265.689623    5355.376064    5604.026014   17123.878691  ...
std              0.0    1645.702854    8109.450147    8703.364247   20960.844623  ...
min              0.0       0.000000       0.000000       0.000000     368.000000  ...
25%              0.0       0.000000    1003.000000     780.000000    4747.000000  ...
50%              0.0       0.000000    2800.000000    2773.000000    9994.000000  ...
75%              0.0       0.000000    6574.000000    6845.750000   21271.250000  ...
max              0.0   74460.000000  239560.000000  234188.000000  580717.000000  ...

       ADD_Point_SUM  Eli_Add_Point_Sum  L1Y_ELi_Add_Points     Points_Sum  L1Y_Points_Sum
count   62988.000000       62988.000000        62988.000000     62988.0000    62988.000000
mean     1355.006223        1620.695847         1080.378882     12545.7771     6638.739585
std      7868.477000        8294.398955         5639.857254     20507.8167    12601.819863
min         0.000000           0.000000            0.000000         0.0000        0.000000
25%         0.000000           0.000000            0.000000      2775.0000      700.000000
50%         0.000000           0.000000            0.000000      6328.5000     2860.500000
75%         0.000000         345.000000            0.000000     14302.5000     7500.000000
max    984938.000000      984938.000000       728282.000000    985572.0000   728282.000000

       Ration_L1Y_Flight_Count  Ration_P1Y_Flight_Count  Ration_P1Y_BPS  Ration_L1Y_BPS  Point_NotFlight
count             62988.000000             62988.000000    62988.000000    62988.000000     62988.000000
mean                  0.486419                 0.513581        0.522293        0.468422         2.728155
std                   0.319105                 0.319105        0.339632        0.338956         7.364164
min                   0.000000                 0.000000        0.000000        0.000000         0.000000
25%                   0.250000                 0.288889        0.258150        0.167954         0.000000
50%                   0.500000                 0.500000        0.514252        0.476747         0.000000
75%                   0.711111                 0.750000        0.815091        0.728375         1.000000
max                   1.000000                 1.000000        0.999989        0.999993       140.000000

8 rows × 36 columns

data = data[data['SUM_YR_1'].notnull() & data['SUM_YR_2'].notnull()]
data_1 = data['SUM_YR_1'] != 0
data_2 = data['SUM_YR_2'] != 0
data_3 = (data['SEG_KM_SUM'] == 0) & (data['avg_discount'] == 0)
data = data[data_1 | data_2 | data_3]  # keep a row if ANY of the three conditions holds (logical OR)
data.head()
data.describe()
           MEMBER_NO       FFP_TIER            AGE   FLIGHT_COUNT         BP_SUM
count   62044.000000   62044.000000   61632.000000   62044.000000   62044.000000
mean    31485.237928       4.103652      42.504300      11.971359   11057.772468
std     18188.650537       0.376322       9.885877      14.110619   16424.944888
min         1.000000       4.000000       6.000000       2.000000       0.000000
25%     15715.750000       4.000000      35.000000       3.000000    2599.000000
50%     31476.500000       4.000000      41.000000       7.000000    5816.000000
75%     47247.250000       4.000000      48.000000      15.000000   13002.250000
max     62988.000000       6.000000     110.000000     213.000000  505308.000000

         EP_SUM_YR_1    EP_SUM_YR_2       SUM_YR_1       SUM_YR_2     SEG_KM_SUM  ...
count        62044.0   62044.000000   62044.000000   62044.000000   62044.000000  ...
mean             0.0     269.732093    5389.298164    5676.826688   17321.694749  ...
std              0.0    1657.846655    8123.849287    8736.092628   21052.728111  ...
min              0.0       0.000000       0.000000       0.000000     368.000000  ...
25%              0.0       0.000000    1024.000000     856.000000    4874.000000  ...
50%              0.0       0.000000    2832.000000    2838.000000   10200.000000  ...
75%              0.0       0.000000    6617.000000    6928.000000   21522.500000  ...
max              0.0   74460.000000  239560.000000  234188.000000  580717.000000  ...

       ADD_Point_SUM  Eli_Add_Point_Sum  L1Y_ELi_Add_Points     Points_Sum  L1Y_Points_Sum
count   62044.000000       62044.000000        62044.000000   62044.000000    62044.000000
mean     1367.336729        1637.068822         1092.360228   12694.841290     6726.731416
std      7906.967903        8336.741957         5671.520660   20617.694168    12671.910738
min         0.000000           0.000000            0.000000       0.000000        0.000000
25%         0.000000           0.000000            0.000000    2856.000000      763.000000
50%         0.000000           0.000000            0.000000    6457.000000     2929.500000
75%         0.000000         381.000000            0.000000   14478.250000     7609.000000
max    984938.000000      984938.000000       728282.000000  985572.000000   728282.000000

       Ration_L1Y_Flight_Count  Ration_P1Y_Flight_Count  Ration_P1Y_BPS  Ration_L1Y_BPS  Point_NotFlight
count             62044.000000             62044.000000    62044.000000    62044.000000     62044.000000
mean                  0.489666                 0.510334        0.519388        0.471873         2.754191
std                   0.317091                 0.317091        0.337879        0.337318         7.399359
min                   0.000000                 0.000000        0.000000        0.000000         0.000000
25%                   0.250000                 0.285714        0.256762        0.179310         0.000000
50%                   0.500000                 0.500000        0.511677        0.480269         0.000000
75%                   0.714286                 0.750000        0.804233        0.730684         1.000000
max                   1.000000                 1.000000        0.999989        0.999993       140.000000

8 rows × 36 columns
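The cleaning rule is easier to follow on a handful of rows. The sketch below applies the same two steps (drop rows with a missing fare, then keep a row if either yearly fare is non-zero or if both mileage and average discount are zero) to a made-up frame; `toy`, `keep` and `cleaned` are hypothetical names used only here.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering each case of the rule
toy = pd.DataFrame({
    "SUM_YR_1":     [100.0, 0.0, 0.0, np.nan, 0.0],
    "SUM_YR_2":     [0.0, 50.0, 0.0, 10.0, 0.0],
    "SEG_KM_SUM":   [500, 300, 800, 400, 0],
    "avg_discount": [0.8, 0.9, 0.7, 0.6, 0.0],
})

# Step 1: drop rows where either yearly fare is missing
toy = toy[toy["SUM_YR_1"].notnull() & toy["SUM_YR_2"].notnull()]

# Step 2: keep a row if any one of the three conditions holds
keep = (
    (toy["SUM_YR_1"] != 0)
    | (toy["SUM_YR_2"] != 0)
    | ((toy["SEG_KM_SUM"] == 0) & (toy["avg_discount"] == 0))
)
cleaned = toy[keep]
print(cleaned.index.tolist())
```

Row 2 is the interesting case: the customer flew 800 km at a non-zero discount yet paid nothing in either year, so the record is treated as inconsistent and dropped. Row 3 is removed earlier because its first-year fare is missing.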

data.to_csv('air_data_cleaned.csv')  # export the cleaned data
data1 = pd.DataFrame(columns=['R', 'L', 'F', 'M', 'C'])
# b['time_interval']=pd.to_datetime(b['xxx'])-pd.to_datetime(b['xxx'])
data1['L'] = pd.to_datetime(data.LOAD_TIME)-pd.to_datetime(data.FFP_DATE)
data1['R'] = data.LAST_TO_END
data1['M'] = data.SEG_KM_SUM
data1['C'] = data.avg_discount
data1['F'] = data.FLIGHT_COUNT
data1.info()
data1['L'] = data1['L'] / np.timedelta64(1, 'D')  # convert the timedelta to a number of days
<class 'pandas.core.frame.DataFrame'>
Int64Index: 62044 entries, 0 to 62978
Data columns (total 5 columns):
R    62044 non-null int64
L    62044 non-null timedelta64[ns]
F    62044 non-null int64
M    62044 non-null int64
C    62044 non-null float64
dtypes: float64(1), int64(3), timedelta64[ns](1)
memory usage: 2.8 MB
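The L column starts life as a timedelta64 and has to become a plain number before it can be standardized with the other features. As a rough illustration, assuming two made-up members with the date format used in the file, dividing by a one-day timedelta gives the membership length in days:

```python
import pandas as pd

# Hypothetical enrolment and observation dates in the file's yyyy/mm/dd format
toy = pd.DataFrame({
    "FFP_DATE": ["2006/11/02", "2007/02/19"],
    "LOAD_TIME": ["2014/03/31", "2014/03/31"],
})

membership = pd.to_datetime(toy["LOAD_TIME"]) - pd.to_datetime(toy["FFP_DATE"])
days = membership / pd.Timedelta(days=1)  # timedelta64 -> float number of days
print(membership.dtype, days.dtype)
print(days.tolist())
```

For whole-day differences like these, `membership.dt.days` returns the same values as integers; the division form also works for other units, such as `pd.Timedelta(days=30)` for an approximate month.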
data2 = (data1 - data1.mean(axis=0)) / (data1.std(axis=0))  # z-score standardization
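Standardization matters here because the features live on very different scales (kilometres in the hundreds of thousands, discounts below 1), and K-Means distances would otherwise be dominated by the largest one. A quick sanity check on made-up numbers, with `toy` and `z` as hypothetical names:

```python
import numpy as np
import pandas as pd

# Hypothetical features on very different scales
toy = pd.DataFrame({
    "F": [2, 7, 15, 213],
    "M": [368.0, 9994.0, 21271.0, 580717.0],
})

z = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# After the transform every column has mean 0 and sample standard deviation 1
print(np.allclose(z.mean(), 0.0))
print(np.allclose(z.std(), 1.0))
```

One detail worth knowing: pandas `std()` uses the sample standard deviation (`ddof=1`), whereas scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`). On tens of thousands of rows the difference is negligible.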
from sklearn.cluster import KMeans
k=5
kmodel = KMeans(n_clusters=k)  # the n_jobs argument was removed in scikit-learn 1.0
kmodel.fit(data2)
print(kmodel.cluster_centers_)
print(kmodel.labels_)
[[-0.37722119  1.16066672 -0.08691852 -0.09484404 -0.1559046 ]
 [ 1.68625847 -0.31367829 -0.57401599 -0.53682019 -0.1733261 ]
 [-0.41488827 -0.70020646 -0.16114258 -0.16095751 -0.25513154]
 [-0.00266813  0.05184279 -0.22680311 -0.23125407  2.19134701]
 [-0.79938326  0.48332845  2.4832016   2.42472391  0.30863003]]
[4 4 4 ... 2 1 1]
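To make `cluster_centers_` and `labels_` less of a black box, here is a bare-bones version of the assign-then-update loop (Lloyd's algorithm) that K-Means is built on. It is a simplified teaching sketch in plain NumPy, not scikit-learn's implementation: the function `lloyd_kmeans`, its parameters and the synthetic data are all assumptions, and it omits k-means++ initialization and multiple restarts.

```python
import numpy as np


def lloyd_kmeans(X, k, n_iter=50, seed=0):
    """Simplified K-Means: returns (centers, labels) for data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center, shape (n, k)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)  # assignment step
        # Update step: each center moves to the mean of its points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so labels are final
        centers = new_centers
    return centers, labels


# Two well-separated synthetic blobs of 50 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-5, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, labels = lloyd_kmeans(X, k=2)
print(centers.round(2))
print(np.bincount(labels))
```

Because the starting centers are random, cluster numbering is arbitrary and can change between runs; the same applies to the `KMeans` fit above unless a `random_state` is fixed.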
labels = kmodel.labels_
df1 = pd.DataFrame(labels, columns = ['numbers'])
df2 = pd.DataFrame(kmodel.cluster_centers_, columns=data1.columns)
df3 = df1['numbers'].value_counts()
df4 = pd.concat([df3, df2], axis=1)
# df4
df4
   numbers         R         L         F         M         C
0    15740 -0.377221  1.160667 -0.086919 -0.094844 -0.155905
1    12125  1.686258 -0.313678 -0.574016 -0.536820 -0.173326
2    24659 -0.414888 -0.700206 -0.161143 -0.160958 -0.255132
3     4184 -0.002668  0.051843 -0.226803 -0.231254  2.191347
4     5336 -0.799383  0.483328  2.483202  2.424724  0.308630
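The summary table works because `pd.concat(..., axis=1)` aligns on the index: `value_counts()` is indexed by cluster label, and the centers frame is indexed 0 to k-1 in the same label order. A small sketch with made-up labels and centers shows the alignment; the names `counts`, `profile`, `summary` and `n_customers` are hypothetical. Renaming the counts explicitly also sidesteps a version difference, since as far as I know pandas 2.0 and later name the `value_counts()` result `count` rather than reusing the source column name.

```python
import numpy as np
import pandas as pd

# Hypothetical cluster assignments and 2-feature centers
labels = np.array([0, 2, 1, 0, 2, 2])
centers = np.array([[0.1, -0.2], [1.5, 0.3], [-0.7, 0.9]])

counts = pd.Series(labels).value_counts().sort_index().rename("n_customers")
profile = pd.DataFrame(centers, columns=["R", "L"])

# Rows are matched by index value (the cluster label), not by position
summary = pd.concat([counts, profile], axis=1)
print(summary)
```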
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import font_manager
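# SimHei is only needed to render Chinese text; adjust the path for your system
my_font = font_manager.FontProperties(fname=r"C:\Windows\Fonts\simhei.ttf")


def plot_radar(data):
    '''
    The first column is the number of customers in each cluster;
    the remaining columns describe the center of each cluster.
    '''
    kinds = list(data.index)
    kinds1 = ['Customer group 1', 'Customer group 2', 'Customer group 3',
              'Customer group 4', 'Customer group 5']
    labels = data.iloc[:, 1:].columns
    # Format strings, one per group; 'o' and 'p' are markers, not colors,
    # so those two lines take their color from the default cycle
    sam = ['r-.', 'o-.', 'g--', 'b-', 'p:']
    # Repeat the first feature at the end so each polygon closes
    centers = pd.concat([data.iloc[:, 1:], data.iloc[:, 1]], axis=1)
    centers = np.array(centers)
    n = len(labels)
    angles = np.linspace(0, 2 * np.pi, n, endpoint=False)
    angles = np.concatenate((angles, [angles[0]]))

    fig = plt.figure(figsize=(7, 7), dpi=130)
    ax = fig.add_subplot(111, polar=True)  # use polar coordinates

    # Draw concentric pentagons as grid lines
    floor = np.floor(centers.min())  # largest integer not greater than the minimum
    ceil = np.ceil(centers.max())    # smallest integer not less than the maximum
    for i in np.arange(floor, ceil + 0.5, 0.5):
        ax.plot(angles, [i] * (n + 1), '--', lw=0.5, color='black')

    # Draw the spokes, one per feature
    for i in range(n):
        ax.plot([angles[i], angles[i]], [floor, ceil], '--', lw=0.5, color='black')

    # Draw one polygon per customer group
    for i in range(len(kinds)):
        ax.plot(angles, centers[i], sam[i], lw=2, label=kinds1[i])
        ax.fill(angles, centers[i], alpha=0.25)

    # Label each spoke with its feature name; drop the duplicated closing
    # angle so the number of ticks matches the number of labels
    ax.set_thetagrids(np.degrees(angles[:-1]), labels)
    plt.title('Customer group feature profile', font_properties=my_font)
    # Place the legend outside the plotting area
    plt.legend(prop=my_font, loc='lower right', bbox_to_anchor=(1.5, 0.0))
    ax.set_theta_zero_location('N')        # put 0° at the top (rotate the axes 90° counter-clockwise)
    ax.spines['polar'].set_visible(False)  # hide the outer circle
    ax.set_ylim(floor, ceil)               # fix the radial range; otherwise the smallest value sits at the center
    ax.grid(False)                         # hide the default grid lines
    ax.set_yticks(np.arange(floor, ceil, 0.5))  # radial tick spacing
    plt.show()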
plot_radar(df4)
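The one step in a radar chart that tends to trip people up is closing the polygon: both the angle list and the value list need their first element repeated at the end, otherwise the last edge is missing. The helper below isolates that step; `close_polygon` is a hypothetical name, and the sample values are just the first cluster center rounded to two decimals.

```python
import numpy as np


def close_polygon(values):
    """Return (angles, values) with the first point repeated so the outline closes."""
    values = np.asarray(values, dtype=float)
    n = values.shape[-1]
    # n evenly spaced spokes around the circle, in radians
    angles = np.linspace(0, 2 * np.pi, n, endpoint=False)
    closed_angles = np.concatenate([angles, angles[:1]])
    closed_values = np.concatenate([values, values[..., :1]], axis=-1)
    return closed_angles, closed_values


# One cluster center over the five features R, L, F, M, C
angles, vals = close_polygon([-0.38, 1.16, -0.09, -0.09, -0.16])
print(len(angles), len(vals))
print(np.degrees(angles).round(1))
```

When labelling the spokes, only the first n angles should be passed along with the n feature names, since the closing angle duplicates the first one.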

Published: 2023-06-28 19:49:54
Article link: https://www.elefans.com/category/jswz/34/932658.html