Data Aggregation and Group Operations with pandas
This topic can be summed up in three words: split ---> apply ---> combine.

- split: the data in a pandas object is split into groups along a particular axis, based on one or more keys you provide.
- apply: a function is applied to each group, producing a new value.
- combine: the results of all those function applications are combined into the final result object.
Pictured, the process looks like this: (split-apply-combine diagram omitted)
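The three steps can also be spelled out by hand. Below is a minimal sketch with made-up data: iterate over the pieces a GroupBy yields (split), compute a value per piece (apply), and assemble the results (combine):

```python
import pandas as pd

df = pd.DataFrame({'key1': list('aabba'),
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# split: a GroupBy object yields (group name, sub-DataFrame) pairs
pieces = {name: group for name, group in df.groupby('key1')}

# apply: compute a new scalar for each piece
means = {name: group['data1'].mean() for name, group in pieces.items()}

# combine: assemble the per-group results into one object
result = pd.Series(means, name='data1')
print(result)
```

This is exactly what `df.groupby('key1')['data1'].mean()` does in one call.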
Grouping keys can take several forms, and the types need not all be the same:
- a list or array whose length matches that of the axis being grouped;
- a value giving the name of a DataFrame column;
- a dict or Series giving a correspondence between the values on the grouped axis and the group names;
- a function to be invoked on the index, or on the individual labels in the index.
Grouping Data
Let's look at a few examples:
In [81]: df = pd.DataFrame({'key1': list('aabba'),
    ...:                    'key2': ['one', 'two', 'one', 'two', 'one'],
    ...:                    'data1': np.random.randn(5),
    ...:                    'data2': np.random.randn(5)})

In [82]: grouped = df['data1'].groupby(df['key1'])  # equivalent to df.groupby('key1')['data1']; returns a grouped Series (SeriesGroupBy object)

In [83]: grouped.mean()
Out[83]:
key1
a   -0.130183
b    0.558940
Name: data1, dtype: float64
In [90]: grouped = df[['data1']].groupby(df['key1'])  # selecting with a list returns a grouped DataFrame (DataFrameGroupBy object) instead

In [91]: grouped.mean()
Out[91]:
         data1
key1
a    -0.130183
b     0.558940
In [86]: grouped = df['data1'].groupby([df['key1'], df['key2']]).mean()  # equivalent to df.groupby(['key1', 'key2'])['data1'].mean()

In [87]: grouped
Out[87]:
key1  key2
a     one    -0.134323
      two    -0.121903
b     one    -0.222980
      two     1.340860
Name: data1, dtype: float64
The grouping keys can also be any arrays of the right length:
In [94]: states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])

In [95]: years = np.array([2005, 2005, 2006, 2005, 2006])

In [96]: df['data1'].groupby([states, years]).mean()
Out[96]:
California  2005   -0.121903
            2006   -0.222980
Ohio        2005    0.522358
            2006    0.027499
Name: data1, dtype: float64
Grouping with Dicts and Series
In [97]: people = pd.DataFrame(np.random.randn(5, 5),
    ...:                       columns=['a', 'b', 'c', 'd', 'e'],
    ...:                       index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

In [99]: people.iloc[2:3, [1, 2]] = np.nan  # set Wes's 'b' and 'c' values to NaN

In [100]: people
Out[100]:
               a         b  ...         d         e
Joe    -0.947326  0.239880  ...  0.872371 -0.735662
Steve  -0.347697 -0.770544  ... -0.065098  1.408378
Wes     0.090692       NaN  ...  0.514894  1.544698
Jim    -0.023477  0.392851  ... -0.730022  2.376998
Travis  1.760169 -0.671978  ... -1.363583  0.284736

[5 rows x 5 columns]
Suppose we know the group correspondence for the columns and want to sum the columns by group:

In [102]: mappings = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue',
     ...:             'e': 'red', 'f': 'orange'}  # unused keys like 'f' are fine

In [103]: by_column = people.groupby(mappings, axis=1)

In [104]: by_column
Out[104]: <pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000002225F554748>

In [105]: by_column.sum()
Out[105]:
            blue       red
Joe     1.365578 -1.443109
Steve  -0.234023  0.290137
Wes     0.514894  1.635390
Jim    -2.368476  2.746372
Travis -3.138280  1.372926
A Series works the same way; it can be viewed as a fixed-size mapping:

In [106]: map_series = pd.Series(mappings)

In [109]: people.groupby(map_series, axis=1).count()
Out[109]:
        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3
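Note that recent pandas releases deprecate grouping along `axis=1`. A sketch of an equivalent, future-proof form (with toy data of my own) transposes the frame, groups the rows by the mapping, and transposes back:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(np.arange(15.0).reshape(5, 3),
                      columns=['a', 'b', 'c'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
mapping = {'a': 'red', 'b': 'red', 'c': 'blue'}

# group the transposed frame's rows by the column -> color mapping,
# then transpose back so people end up on the rows again
by_color = people.T.groupby(mapping).sum().T
print(by_color)
```

The result matches `people.groupby(mapping, axis=1).sum()` on versions where that still runs.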
Grouping with Functions
Any function passed as a group key will be called once per index value, and its return values are used as the group names. Suppose we want to group by the length of the names; we could compute an array of string lengths, but it is simpler to just pass the len function:
In [112]: people.groupby(len).sum()
Out[112]:
          a         b         c         d         e
3 -0.880111  0.632731 -1.145247  0.657244  3.186034
5 -0.347697 -0.770544 -0.168926 -0.065098  1.408378
6  1.760169 -0.671978 -1.774697 -1.363583  0.284736
Functions can be freely mixed with arrays, dicts, lists, or Series:

In [113]: key_list = ['one', 'one', 'one', 'two', 'two']

In [115]: people.groupby([len, key_list]).sum()
Out[115]:
              a         b  ...         d         e
3 one -0.856634  0.239880  ...  1.387265  0.809036
  two -0.023477  0.392851  ... -0.730022  2.376998
5 one -0.347697 -0.770544  ... -0.065098  1.408378
6 two  1.760169 -0.671978  ... -1.363583  0.284736

[4 rows x 5 columns]
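A small self-contained sketch (with deterministic toy data, so the sums are checkable) of the same mixed function-plus-list grouping:

```python
import pandas as pd

people = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 5.0]},
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key_list = ['one', 'one', 'one', 'two', 'two']

# len is called once per index label; its result (the name length)
# is combined with key_list to form a two-level group key
sums = people.groupby([len, key_list]).sum()
print(sums)
```

Joe (3) and Wes (3) fall in group (3, 'one'); Jim lands alone in (3, 'two'), Steve in (5, 'one'), Travis in (6, 'two').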
Grouping by Index Levels
In [117]: columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
     ...:                                      [1, 3, 5, 1, 3]],
     ...:                                     names=['city', 'tenor'])

In [118]: hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)

In [119]: hier_df
Out[119]:
city         US            ...         JP
tenor         1         3  ...          1         3
0     -1.394733  0.729459  ...  -0.596068 -0.603338
1     -1.781984 -1.218506  ...   1.466333 -1.819780
2      0.563109 -0.254265  ...  -0.084187 -0.231257
3     -0.988175  1.948536  ...   0.661451  1.788520

In [120]: hier_df.groupby(level='city', axis=1).count()
Out[120]:
city  JP  US
0      2   3
1      2   3
2      2   3
3      2   3
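Level-based grouping works on the row index too. A sketch that transposes the frame above so the cities sit on the rows, then groups by the 'city' level:

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [['US', 'US', 'US', 'JP', 'JP'], [1, 3, 5, 1, 3]],
    names=['city', 'tenor'])
hier_df = pd.DataFrame(np.ones((4, 5)), columns=columns)

# after transposing, (city, tenor) is the row index,
# so we can group rows by the named 'city' level
counts = hier_df.T.groupby(level='city').size()
print(counts)
```

`size()` here counts the tenors per city: 2 for JP, 3 for US.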
Data Aggregation
Aggregation is any data transformation that produces scalar values from arrays. You can use groupby's built-in methods or your own functions, applied one at a time or several at once.
Using groupby's built-in methods:
In [81]: df = pd.DataFrame({'key1': list('aabba'),
    ...:                    'key2': ['one', 'two', 'one', 'two', 'one'],
    ...:                    'data1': np.random.randn(5),
    ...:                    'data2': np.random.randn(5)})

In [82]: grouped = df['data1'].groupby(df['key1'])  # equivalent to df.groupby('key1')['data1']; returns a grouped Series (SeriesGroupBy object)

In [83]: grouped.mean()
Out[83]:
key1
a   -0.130183
b    0.558940
Name: data1, dtype: float64
Applying a custom function:
In [121]: df
Out[121]:
      data1     data2 key1 key2
0 -0.296144  0.110057    a  one
1 -0.121903  0.139525    a  two
2 -0.222980  0.937050    b  one
3  1.340860  0.395284    b  two
4  0.027499 -0.668304    a  one

In [122]: gd = df.groupby('key1')

In [123]: def peak_to_peak(arr):
     ...:     return arr.max() - arr.min()

In [124]: gd.agg(peak_to_peak)
Out[124]:
         data1     data2
key1
a     0.323643  0.807829
b     1.563840  0.541766
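Since pandas 0.25, the same peak-to-peak aggregation can also be written with "named aggregation", which lets you name the output column directly. A sketch with deterministic toy data (the column name `data1_range` is my own choice):

```python
import pandas as pd

df = pd.DataFrame({'key1': list('aabba'),
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# named aggregation: keyword is the output column name, the value is a
# (source column, aggregation function) pair
result = df.groupby('key1').agg(
    data1_range=('data1', lambda arr: arr.max() - arr.min()))
print(result)
```

Group 'a' holds 1, 2, 5 (range 4.0); group 'b' holds 3, 4 (range 1.0).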
Applying multiple functions:
The tips data used in these experiments can be grabbed from this address; process it yourself and treat it as practice:
In [174]: tips = pd.read_csv(r'D:\workspaces\datatest\in.csv')  # read the file; mind your own path (a raw string avoids backslash-escape trouble)

In [175]: tips['tip_pct'] = tips['tip'] / tips['total_bill']  # compute tip_pct

In [179]: tips.to_csv(r'D:\workspaces\datatest\out.csv', index=False)  # write it back out
In [130]: tips = pd.read_csv(r'D:\workspaces\datatest\out.csv')

In [131]: tips
Out[131]:
   total_bill   tip     sex smoker  day    time  size   tip_pct
0       16.99  1.01  Female     No  Sun  Dinner     2  0.059447
1       10.34  1.66    Male     No  Sun  Dinner     3  0.160542
2       21.01  3.50    Male     No  Sun  Dinner     3  0.166587
3       23.68  3.31    Male     No  Sun  Dinner     2  0.139780

In [132]: grouped = tips.groupby(['sex', 'smoker'])

In [133]: g_pct = grouped['tip_pct']

In [134]: g_pct.agg('mean')
Out[134]:
sex     smoker
Female  No        0.059447
Male    No        0.155636
Name: tip_pct, dtype: float64

In [135]: g_pct.agg(['mean', 'std'])
Out[135]:
                   mean       std
sex    smoker
Female No      0.059447       NaN
Male   No      0.155636  0.014061

In [136]: g_pct.agg([('m', 'mean'), ('s', 'std')])  # (name, function) tuples rename the result columns
Out[136]:
                      m         s
sex    smoker
Female No      0.059447       NaN
Male   No      0.155636  0.014061

In [137]: grouped[['tip_pct', 'total_bill']].agg(['count', 'mean', 'max'])
Out[137]:
              tip_pct                     total_bill
                count      mean       max      count       mean    max
sex    smoker
Female No           1  0.059447  0.059447          1  16.990000  16.99
Male   No           3  0.155636  0.166587          3  18.343333  23.68
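Different columns can also get different aggregations by passing a dict to agg. A self-contained sketch using the same four-row sample (built inline here so it runs on its own):

```python
import pandas as pd

tips = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68],
    'tip': [1.01, 1.66, 3.50, 3.31],
    'smoker': ['No', 'Yes', 'Yes', 'No']})
tips['tip_pct'] = tips['tip'] / tips['total_bill']

# the dict maps each column to its own aggregation(s); lists of
# functions produce hierarchical result columns
result = tips.groupby('smoker').agg({'tip_pct': ['mean', 'max'],
                                     'total_bill': 'sum'})
print(result)
```

The result has a MultiIndex on the columns, e.g. `('total_bill', 'sum')`.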
Returning Aggregated Data Without Row Indexes
In [140]: grouped = tips.groupby(['sex', 'smoker'], as_index=False).mean()

In [141]: grouped
Out[141]:
      sex smoker  ...      size   tip_pct
0  Female     No  ...  2.000000  0.059447
1    Male     No  ...  2.666667  0.155636

[2 rows x 6 columns]
The aggregations above are only a special case of data transformation. Next we introduce the transform and apply methods.
The following adds a column to the dataset holding each group's mean value: aggregate first, then merge.
In [33]: df
Out[33]:
      data1     data2 key1 key2
0 -1.828647  0.329238    a  one
1 -0.639952  1.532362    a  two
2  0.617105  0.906281    b  one
3  1.443470  1.419738    b  two
4 -2.031339 -0.649743    a  one

In [34]: k1_means = df.groupby('key1').mean().add_prefix('mean_')

In [35]: k1_means
Out[35]:
      mean_data1  mean_data2
key1
a      -1.499979    0.403952
b       1.030287    1.163010

In [36]: pd.merge(df, k1_means, left_on='key1', right_index=True)
Out[36]:
      data1     data2 key1 key2  mean_data1  mean_data2
0 -1.828647  0.329238    a  one   -1.499979    0.403952
1 -0.639952  1.532362    a  two   -1.499979    0.403952
4 -2.031339 -0.649743    a  one   -1.499979    0.403952
2  0.617105  0.906281    b  one    1.030287    1.163010
3  1.443470  1.419738    b  two    1.030287    1.163010

This gets the job done, but it is clumsy. Consider another example:
In [37]: key = ['one', 'two', 'one', 'two', 'one']

In [41]: people = pd.DataFrame(np.random.randn(5, 5),
    ...:                       columns=['a', 'b', 'c', 'd', 'e'],
    ...:                       index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

In [42]: people
Out[42]:
               a         b         c         d         e
Joe    -1.264617 -1.385894  0.146627 -1.225148  0.627616
Steve   0.880528  0.530060  0.453235  1.160768 -0.053416
Wes     1.033023 -0.859791 -0.629231 -1.094454 -2.073512
Jim     1.777919 -0.864824 -1.940994 -0.806969  0.504503
Travis -1.260144 -0.486910  1.180371 -0.214743  0.629261

In [43]: people.groupby(key).mean()
Out[43]:
            a         b         c         d         e
one -0.497246 -0.910865  0.232589 -0.844782 -0.272212
two  1.329224 -0.167382 -0.743879  0.176899  0.225544

In [44]: people.groupby(key).transform(np.mean)
Out[44]:
               a         b         c         d         e
Joe    -0.497246 -0.910865  0.232589 -0.844782 -0.272212
Steve   1.329224 -0.167382 -0.743879  0.176899  0.225544
Wes    -0.497246 -0.910865  0.232589 -0.844782 -0.272212
Jim     1.329224 -0.167382 -0.743879  0.176899  0.225544
Travis -0.497246 -0.910865  0.232589 -0.844782 -0.272212

In [45]: def demean(arr):
    ...:     return arr - arr.mean()

In [46]: demeaned = people.groupby(key).transform(demean)

In [47]: demeaned
Out[47]:
               a         b         c         d         e
Joe    -0.767371 -0.475029 -0.085962 -0.380366  0.899828
Steve  -0.448695  0.697442  1.197114  0.983868 -0.278960
Wes     1.530269  0.051074 -0.861820 -0.249672 -1.801300
Jim     0.448695 -0.697442 -1.197114 -0.983868  0.278960
Travis -0.762898  0.423955  0.947782  0.630038  0.901473

Quite a bit simpler than the first approach, isn't it?
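Built-in aggregations can also be passed to transform by name, which pandas can often execute faster than a user function. A deterministic sketch of the same demeaning idea on a single column:

```python
import pandas as pd

people = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 5.0]},
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']

# transform('mean') broadcasts each group's mean back to the rows
group_means = people['a'].groupby(key).transform('mean')
demeaned = people['a'] - group_means
print(demeaned)
```

Group 'one' (1, 3, 5) and group 'two' (2, 4) both have mean 3, so `demeaned` is -2, -1, 0, 1, 2, and each group's demeaned mean is zero.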
apply: General split-apply-combine
transform is a more specialized function with strict requirements: the function you pass must produce either a scalar value that can be broadcast, or a result array of the same size as the input. apply is more general: it splits the object being manipulated into pieces, invokes the passed function on each piece, and then attempts to concatenate the pieces back together.
Suppose we want to select the rows with the five largest tip_pct values from the following dataset:
total_bill   tip     sex smoker  day    time  size   tip_pct    # a few values have been modified
     16.99  1.01  Female     No  Sun  Dinner     2  0.059447
     10.34  1.66    Male    YES  Sun  Dinner     3  0.160542
     21.01  3.50  Female    YES  Sun  Dinner     3  0.166587
     23.68  3.31    Male     No  Sun  Dinner     2  0.139780
First, define the function:
In [143]: def top(df, n=5, column='tip_pct'):
     ...:     return df.sort_values(by=column)[-n:]
In [150]: top(tips, n=3)
Out[150]:
   total_bill   tip     sex  ...    time  size   tip_pct
3       23.68  3.31    Male  ...  Dinner     2  0.139780
1       10.34  1.66    Male  ...  Dinner     3  0.160542
2       21.01  3.50  Female  ...  Dinner     3  0.166587

[3 rows x 8 columns]
In [151]: tips.groupby('smoker').apply(top)
Out[151]:
          total_bill   tip  ...  size   tip_pct
smoker                      ...
No     0       16.99  1.01  ...     2  0.059447
       3       23.68  3.31  ...     2  0.139780
YES    1       10.34  1.66  ...     3  0.160542
       2       21.01  3.50  ...     3  0.166587

[4 rows x 8 columns]
The top function is called on each piece of the DataFrame; the results are then glued together with pandas.concat and labeled with the group names.
Extra arguments can be passed through apply:
In [155]: tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')
Suppressing the group keys:
In [152]: tips.groupby('smoker', group_keys=False).apply(top)
Out[152]:
   total_bill   tip     sex  ...    time  size   tip_pct
0       16.99  1.01  Female  ...  Dinner     2  0.059447
3       23.68  3.31    Male  ...  Dinner     2  0.139780
1       10.34  1.66    Male  ...  Dinner     3  0.160542
2       21.01  3.50  Female  ...  Dinner     3  0.166587

[4 rows x 8 columns]
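For the specific task of picking the n largest values of one column per group, `SeriesGroupBy.nlargest` is a built-in shortcut. A sketch on the same four-row sample (rebuilt inline so it runs standalone):

```python
import pandas as pd

tips = pd.DataFrame({
    'tip_pct': [0.059447, 0.160542, 0.166587, 0.139780],
    'smoker': ['No', 'YES', 'YES', 'No']})

# nlargest(2) returns the two largest tip_pct values within each
# smoker group, indexed by (group key, original row label)
top2 = tips.groupby('smoker')['tip_pct'].nlargest(2)
print(top2)
```

Unlike the custom `top` function, this handles only a single column, but needs no sorting helper.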
Quantile and Bucket Analysis
My mental model: use cut() or qcut() to chop the data into intervals; each interval is, roughly speaking, a bucket.
For details on these two functions, see the section on discretization and binning.
An example:
In [157]: frame = pd.DataFrame({'data1': np.random.randn(1000),
     ...:                       'data2': np.random.randn(1000)})

In [158]: frame
Out[158]:
        data1     data2
0   -0.805422 -0.369037
1    1.197099 -0.749794
..        ...       ...
998 -1.361444 -0.456945
999  0.187000  0.214123

[1000 rows x 2 columns]

In [160]: quant = pd.cut(frame.data1, 4)  # split data1's range (min to max) into 4 equal-length intervals

In [162]: quant[:2]
Out[162]:
0    (-1.226, 0.664]
1     (0.664, 2.553]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.123, -1.226] < (-1.226, 0.664] < (0.664, 2.553] < (2.553, 4.443]]

In [163]: def get_stats(group):
     ...:     return {'min': group.min(),
     ...:             'max': group.max(),
     ...:             'count': group.count(),
     ...:             'mean': group.mean()}

In [164]: group1 = frame.data2.groupby(quant)

In [165]: group1.apply(get_stats).unstack()
Out[165]:
                  count       max      mean       min
data1
(-3.123, -1.226]  113.0  1.904771 -0.077147 -2.467027
(-1.226, 0.664]   621.0  2.959166 -0.007698 -3.102885
(0.664, 2.553]    261.0  2.622070  0.010818 -2.502551
(2.553, 4.443]      5.0  0.504759  0.020948 -0.919104
To split the data into equal-size buckets based on sample quantiles, use qcut instead:
In [168]: quant = pd.qcut(frame.data1, 4)

In [169]: group1 = frame.data2.groupby(quant)

In [170]: group1.apply(get_stats).unstack()
Out[170]:
                   count  ...       min
data1                     ...
(-3.116, -0.672]   250.0  ... -2.467027
(-0.672, -0.0194]  250.0  ... -3.102885
(-0.0194, 0.73]    250.0  ... -2.815319
(0.73, 4.443]      250.0  ... -2.502551

[4 rows x 4 columns]

In [171]: quant = pd.qcut(frame.data1, 4, labels=False)  # return integer bucket numbers instead of interval labels

In [172]: group1 = frame.data2.groupby(quant)

In [173]: group1.apply(get_stats).unstack()
Out[173]:
       count       max      mean       min
data1
0      250.0  2.959166 -0.048357 -2.467027
1      250.0  2.401510  0.017249 -3.102885
2      250.0  2.386183 -0.033954 -2.815319
3      250.0  2.622070  0.022783 -2.502551
Pivot Tables and Cross-Tabulation
A pivot table is a common data summarization tool found in spreadsheet programs and other data analysis software.
pivot_table parameters:
values     | Column name or names to aggregate; by default aggregates all numeric columns
index      | Column names or other group keys to group on the rows of the resulting pivot table
columns    | Column names or other group keys to group on the columns of the resulting pivot table
aggfunc    | Aggregation function or list of functions ('mean' by default); can be any function valid in a groupby context
fill_value | Replace missing values in the result table
margins    | Add row/column subtotals and a grand total (False by default)
In [181]: tips.pivot_table(index=['sex', 'smoker'])
Out[181]:
                   size       tip   tip_pct  total_bill
sex    smoker
Female No      2.592593  2.773519  0.156921   18.105185
       Yes     2.242424  2.931515  0.182150   17.977879
Male   No      2.711340  3.113402  0.160669   19.791237
       Yes     2.500000  3.051167  0.152771   22.284500

# aggregate only tip_pct and size
In [182]: tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'], columns='smoker')
Out[182]:
                 size             tip_pct
smoker             No       Yes        No       Yes
sex    day
Female Fri   2.500000  2.000000  0.165296  0.209129
       Sat   2.307692  2.200000  0.147993  0.163817
       Sun   3.071429  2.500000  0.165710  0.237075
       Thur  2.480000  2.428571  0.155971  0.163073
Male   Fri   2.000000  2.125000  0.138005  0.144730
       Sat   2.656250  2.629630  0.162132  0.139067
       Sun   2.883721  2.600000  0.158291  0.173964
       Thur  2.500000  2.300000  0.165706  0.164417

# pass margins=True to add partial totals
In [183]: tips.pivot_table('tip_pct', index=['sex', 'smoker'], columns='day', margins=True)
Out[183]:
day                 Fri       Sat       Sun      Thur       All
sex    smoker
Female No      0.165296  0.147993  0.165710  0.155971  0.156921
       Yes     0.209129  0.163817  0.237075  0.163073  0.182150
Male   No      0.138005  0.162132  0.158291  0.165706  0.160669
       Yes     0.144730  0.139067  0.173964  0.164417  0.152771
All            0.169913  0.153152  0.166897  0.161276  0.160803

# pass a different aggregation function via aggfunc
In [184]: tips.pivot_table('tip_pct', index=['sex', 'smoker'], columns='day', aggfunc=len, margins=True)
Out[184]:
day             Fri   Sat   Sun  Thur    All
sex    smoker
Female No       2.0  13.0  14.0  25.0   54.0
       Yes      7.0  15.0   4.0   7.0   33.0
Male   No       2.0  32.0  43.0  20.0   97.0
       Yes      8.0  27.0  15.0  10.0   60.0
All            19.0  87.0  76.0  62.0  244.0

# fill NA values with fill_value
In [185]: tips.pivot_table('size', index=['time', 'sex', 'smoker'], columns='day', aggfunc='sum', fill_value=0)
Out[185]:
day                   Fri  Sat  Sun  Thur
time   sex    smoker
Dinner Female No        2   30   43     2
              Yes       8   33   10     0
       Male   No        4   85  124     0
              Yes      12   71   39     0
Lunch  Female No        3    0    0    60
              Yes       6    0    0    17
       Male   No        0    0    0    50
              Yes       5    0    0    23
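Under the hood, pivot_table with the default aggfunc is just groupby + mean + unstack. A sketch with a tiny hand-built frame to make the equivalence concrete:

```python
import pandas as pd

tips = pd.DataFrame({
    'tip_pct': [0.06, 0.16, 0.17, 0.14],
    'sex': ['Female', 'Male', 'Female', 'Male'],
    'smoker': ['No', 'No', 'Yes', 'Yes']})

# the pivot-table way
pt = tips.pivot_table('tip_pct', index='sex', columns='smoker')

# the equivalent groupby way: group, take the mean, pivot the
# inner index level up into the columns
gb = tips.groupby(['sex', 'smoker'])['tip_pct'].mean().unstack()
print(pt)
```

Both produce a sex-by-smoker grid of mean tip_pct values.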
A cross-tabulation (crosstab) is a special case of pivot table that computes group frequencies.
In [186]: pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)
Out[186]:
smoker        No  Yes  All
time   day
Dinner Fri     3    9   12
       Sat    45   42   87
       Sun    57   19   76
       Thur    1    0    1
Lunch  Fri     1    6    7
       Thur   44   17   61
All          151   93  244
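crosstab also accepts a normalize argument to turn the counts into proportions. A sketch with toy data (four rows, so the fractions are easy to verify):

```python
import pandas as pd

tips = pd.DataFrame({
    'smoker': ['No', 'No', 'Yes', 'No'],
    'time': ['Dinner', 'Dinner', 'Lunch', 'Lunch']})

# plain counts
counts = pd.crosstab(tips.time, tips.smoker)

# normalize='index' divides each row by its row total
props = pd.crosstab(tips.time, tips.smoker, normalize='index')
print(counts)
print(props)
```

Both Dinner rows are non-smokers, so the Dinner row of `props` is 1.0 / 0.0; Lunch splits evenly at 0.5 / 0.5.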