Summarizing Dataframes with ambiguous columns with apply function

I have a dataframe and a for loop with a dictionary that defines how to handle specific column names, from my previous question: Pandas Generating dataframe based on columns being present

import pandas as pd

df = pd.DataFrame({
    'Players': ['Sam', 'Greg', 'Steve', 'Sam', 'Greg', 'Steve', 'Greg', 'Steve', 'Greg', 'Steve'],
    'Wins': [10, 5, 5, 20, 30, 20, 6, 9, 3, 10],
    'Losses': [5, 5, 5, 2, 3, 2, 16, 20, 3, 12],
    'Type': ['A', 'B', 'B', 'B', 'A', 'B', 'B', 'A', 'A', 'B'],
})
p = df.groupby('Players')

sumdict = {'Total Games': (None, 'count'),
           'Average Wins': ('Wins', 'mean'),
           'Greatest Wins': ('Wins', 'max'),
           'Unique games': ('Type', 'nunique'),
           'Max Score': ('Score', 'max')}

summary = []
for key, (column, op) in sumdict.items():
    if column is None:
        res = p.agg(op).max(axis=1)
    elif column not in df:
        continue
    else:
        res = p[column].agg(lambda x: getattr(x, op)())
    summary.append(pd.DataFrame({key: res}))
summary = pd.concat(summary, axis=1)

The code works for almost all cases except for apply functions that count specific cases inside a column:

streak = pd.DataFrame({'Streak':p.Wins.apply(lambda x: (x > 5).sum())})

Is there a way to incorporate the apply function into the dictionary sumdict?

Best answer

You have a couple of options here.

1. Check for a function and use that rather than getattr.
2. Just use the string and let the function fall through...

IMO option 2 is a little cleaner. It relies on the (perhaps lesser-known) fact that g.agg("max") is an alias for g.max(), so agg accepts either a method name as a string or a function.
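To make the alias concrete, here is a minimal sketch on a cut-down version of the question's data (the shortened frame and the variable names by_name, by_method and by_callable are only for illustration). It compares the string form with the method call and then passes a function through the same agg call:

```python
import pandas as pd

# Cut-down illustrative data, not the full frame from the question.
df = pd.DataFrame({
    "Players": ["Sam", "Greg", "Steve", "Sam", "Greg"],
    "Wins": [10, 5, 5, 20, 30],
})
g = df.groupby("Players")["Wins"]

by_name = g.agg("max")    # aggregation given as a method name
by_method = g.max()       # the same aggregation called directly

# The same agg call also takes a function, here counting wins above 5.
by_callable = g.agg(lambda x: (x > 5).sum())

print(by_name.equals(by_method))
print(by_callable)
```

Because both forms go through agg, the loop does not need to know whether an entry in sumdict holds a string or a function.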

Add the Streak entry to sumdict with a function in place of the string:

sumdict["Streak"] = "Wins", lambda x: (x > 5).sum()

Then change the loop as follows; the commented line is the only change:

summary = []
for key, (column, op) in sumdict.items():
    if column is None:
        res = p.agg(op).max(axis=1)
    elif column not in df:
        continue
    else:
        res = p[column].agg(op)  # just use the string (or it could be a func)
    summary.append(pd.DataFrame({key: res}))
summary = pd.concat(summary, axis=1)

Then Streak works perfectly:

In [23]: summary
Out[23]:
         Greatest Wins  Total Games  Streak  Average Wins  Unique games
Players
Greg                30            4       2            11             2
Sam                 20            2       2            15             2
Steve               20            4       3            11             2
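For comparison, the first option (check whether the entry is a function and only fall back to getattr when it is not) might look roughly like the sketch below. This is one possible reading of that option, not code from the answer; the func variable is introduced here for illustration, and the 'Total Games' entry is left out to keep the sketch short.

```python
import pandas as pd

df = pd.DataFrame({
    'Players': ['Sam', 'Greg', 'Steve', 'Sam', 'Greg', 'Steve', 'Greg', 'Steve', 'Greg', 'Steve'],
    'Wins': [10, 5, 5, 20, 30, 20, 6, 9, 3, 10],
    'Losses': [5, 5, 5, 2, 3, 2, 16, 20, 3, 12],
    'Type': ['A', 'B', 'B', 'B', 'A', 'B', 'B', 'A', 'A', 'B'],
})
p = df.groupby('Players')

sumdict = {
    'Average Wins': ('Wins', 'mean'),
    'Max Score': ('Score', 'max'),                # 'Score' is absent, so this is skipped
    'Streak': ('Wins', lambda x: (x > 5).sum()),  # a function instead of a method name
}

summary = []
for key, (column, op) in sumdict.items():
    if column not in df:
        continue
    if callable(op):
        func = op  # use the supplied function as is
    else:
        # Fall back to looking the method up by name, as the original loop did.
        func = lambda x, op=op: getattr(x, op)()
    res = p[column].agg(func)
    summary.append(pd.DataFrame({key: res}))
summary = pd.concat(summary, axis=1)
print(summary)
```

This gives the same Streak column, at the cost of an extra branch that the string-based version avoids.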
