I asked a sorting question before, and someone solved it by first calling DataFrame.sort_values on both columns and then applying GroupBy.head.
Dataframe classification sorting optimization problem
Now I have a more complicated sorting problem. I need to group the dataframe by category. Within each category, I find the row where data2 is largest, keep only the rows whose data1 is less than or equal to that row's data1, then sort by data1 in descending order and keep the top 4 rows.
The code is below. How can it be optimized?
    import numpy as np
    import pandas as pd

    # Random test data: two categories, two numeric columns
    df = pd.DataFrame()
    n = 200
    df['category'] = np.random.choice(('A', 'B'), n)
    df['data1'] = np.random.rand(len(df))*100
    df['data2'] = np.random.rand(len(df))*100

    # Category A: data1 threshold taken from the row with the largest data2
    a = df[df['category'] == 'A']
    c = a[a['data2'] == a.data2.max()].data1.max()
    a = a[a['data1'] <= c]
    a = a.sort_values(by='data1', ascending=False).head(4)

    # Category B: same steps
    b = df[df['category'] == 'B']
    c = b[b['data2'] == b.data2.max()].data1.max()
    b = b[b['data1'] <= c]
    b = b.sort_values(by='data1', ascending=False).head(4)

    df = pd.concat([a, b]).sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)
    print(df)

      category      data1      data2
    0        A  28.194042  98.813271
    1        A  26.635099  82.768130
    2        A  24.345177  80.558532
    3        A  24.222105  89.596726
    4        B  60.883981  98.444699
    5        B  49.934815  90.319787
    6        B  10.751913  86.124271
    7        B   4.029914  89.802120
I tried groupby, but the code feels too complicated. Can it be simplified?
    import numpy as np
    import pandas as pd

    df = pd.DataFrame()
    n = 200
    df['category'] = np.random.choice(('A', 'B'), n)
    df['data1'] = np.random.rand(len(df))*100
    df['data2'] = np.random.rand(len(df))*100

    a = df[df['category'] == 'A']
    c = a[a['data2'] == a.data2.max()].data1.max()
    a = a[a['data1'] <= c]
    a = a.sort_values(by='data1', ascending=False).head(4)

    b = df[df['category'] == 'B']
    c = b[b['data2'] == b.data2.max()].data1.max()
    b = b[b['data1'] <= c]
    b = b.sort_values(by='data1', ascending=False).head(4)

    # Reference result from the per-category code above
    df2 = pd.concat([a, b]).sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)

    # groupby attempt: the same filter and top 4, written as one lambda
    df3 = df.groupby('category').apply(
        lambda x: x[x['data1'].isin(
            x[x['data1'] <= x[x['data2'] == x['data2'].max()].data1.max()]['data1'].nlargest(4)
        )]
    ).reset_index(drop=True)
    df3 = df3.sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)

    # Both differences are zero, so the two results match
    print((df2.data1-df3.data1).max())
    print((df2.data2-df3.data2).max())

    0.0
    0.0

Recommended answer

Use:
    import numpy as np
    import pandas as pd

    df = pd.DataFrame()
    n = 200
    df['category'] = np.random.choice(('A', 'B'), n)
    df['data1'] = np.random.rand(len(df))*100
    df['data2'] = np.random.rand(len(df))*100

    a = df[df['category'] == 'A']
    c = a[a['data2'] == a.data2.max()].data1.max()
    a = a[a['data1'] <= c]
    a = a.sort_values(by='data1', ascending=False).head(4)

    b = df[df['category'] == 'B']
    c = b[b['data2'] == b.data2.max()].data1.max()
    b = b[b['data1'] <= c]
    b = b.sort_values(by='data1', ascending=False).head(4)

    # Expected result, built with the question's original code
    df1 = pd.concat([a, b]).sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)
    print(df1)

      category      data1      data2
    0        A  87.560430  99.262452
    1        A  85.798945  99.200321
    2        A  68.614311  97.796274
    3        A  41.641961  95.544980
    4        B  69.937691  99.711156
    5        B  56.932784  99.227111
    6        B  19.903620  94.389186
    7        B  12.701288  98.455274
Here, first get the data1 value at the maximal data2 for each group, then filter the rows with <=, and finally use GroupBy.head:
    # data1 of the row with the largest data2 in each category, indexed by category
    s = (df.sort_values('data2')
           .drop_duplicates('category', keep='last')
           .set_index('category')['data1'])

    # Keep rows whose data1 does not exceed their category's threshold
    df = df[df['data1'] <= df['category'].map(s)]

    # Sort within each category and take the top 4 rows per group
    df1 = (df.sort_values(by=['category', 'data1'], ascending=[True, False])
             .groupby('category')
             .head(4)
             .reset_index(drop=True))
    print(df1)

      category      data1      data2
    0        A  87.560430  99.262452
    1        A  85.798945  99.200321
    2        A  68.614311  97.796274
    3        A  41.641961  95.544980
    4        B  69.937691  99.711156
    5        B  56.932784  99.227111
    6        B  19.903620  94.389186
    7        B  12.701288  98.455274
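The same steps can also be checked on a small hand-made frame, where the result is easy to verify by eye. The sketch below is only an illustration, not part of the original answer: the helper name `top_n_per_category` and the `small` frame are hypothetical, and it swaps the sort_values/drop_duplicates step for `GroupBy.idxmax`, which assumes the frame has unique index labels.

```python
import numpy as np
import pandas as pd


def top_n_per_category(df, n=4):
    """Hypothetical helper: per category, keep rows whose data1 is <= the
    data1 of the max-data2 row, then return the n largest data1 rows."""
    # Index label of the max-data2 row in each category (first one on ties).
    # Assumes the index labels of df are unique.
    idx = df.groupby('category')['data2'].idxmax()
    # data1 threshold per category, indexed by category.
    threshold = df.loc[idx].set_index('category')['data1']
    # Keep rows at or below their category's threshold.
    kept = df[df['data1'] <= df['category'].map(threshold)]
    return (kept.sort_values(['category', 'data1'], ascending=[True, False])
                .groupby('category')
                .head(n)
                .reset_index(drop=True))


# Small hand-made frame (illustrative values only).
small = pd.DataFrame({
    'category': ['A', 'A', 'A', 'B', 'B', 'B'],
    'data1':    [10.0, 50.0, 30.0, 5.0, 20.0, 40.0],
    'data2':    [1.0, 2.0, 9.0, 8.0, 3.0, 7.0],
})
result = top_n_per_category(small, n=2)
print(result)
```

In category A the largest data2 (9.0) sits on the row with data1 = 30.0, so the row with data1 = 50.0 is dropped. In category B the largest data2 (8.0) sits on the row with data1 = 5.0, so only that row survives the filter.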