Multiindex Pandas Groupby + Aggregate, keep full index

Updated: 2024-10-25 22:35:32

Problem description

I have a two-level, hierarchically indexed Series of integers.

    >> s
    id1  id2
    1    a       100
         b        10
         c         9
    2    a      2000
    3    a         5
         b        10
         c        15
         d        20
    ...

I want to group by id1 and select the maximum value in each group, but keep the full index in the result. I have tried the following:

    >> s.groupby(level=0).aggregate(np.max)
    id1
    1     100
    2    2000
    3      20

But the result is indexed by id1 only. I want my output to look like this:

    id1  id2
    1    a       100
    2    a      2000
    3    d        20

A related, but more complicated, question was asked here: "Multiindexed Pandas groupby, ignore a level?" As it states, the answer there is kind of a hack.

Does anyone know a better solution? If not, what about the special case where every value of id2 is unique?

Recommended answer

One way to select full rows after a groupby is to use groupby/transform to build a boolean mask, and then use the mask to select the full rows from s:

    In [110]: s[s.groupby(level=0).transform(lambda x: x == x.max()).astype(bool)]
    Out[110]:
    id1  id2
    1    a       100
    2    a      2000
    3    d        20
    Name: s, dtype: int64
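For readers who want to run the mask approach end to end, here is a minimal sketch. It rebuilds the sample Series from the question by hand, and it passes the string "max" to transform instead of the lambda used above; that swap is my own variation, which should select the same rows. The names group_max and mask are illustrative only.

```python
import pandas as pd

# Rebuild the sample Series from the question (values copied from the post).
index = pd.MultiIndex.from_tuples(
    [(1, "a"), (1, "b"), (1, "c"), (2, "a"),
     (3, "a"), (3, "b"), (3, "c"), (3, "d")],
    names=["id1", "id2"],
)
s = pd.Series([100, 10, 9, 2000, 5, 10, 15, 20], index=index, name="s")

# Broadcast each group's maximum back onto every row of that group.
# The result has the same (id1, id2) index as s.
group_max = s.groupby(level="id1").transform("max")

# True where a row holds its group's maximum; tied rows are all kept.
mask = s == group_max
result = s[mask]
print(result)
```

Note that if several rows in a group share the maximum, all of them are returned, so the result can contain more than one row per id1.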

Another way, which is faster in some cases (such as when there are a lot of groups), is to merge the max values m into a DataFrame along with the values in s, and then select rows based on equality between m and s:

    def using_merge(s):
        # per-group maximum, indexed by id1 only
        m = s.groupby(level=0).agg(np.max)
        # move id2 into a column so that the index is id1 alone
        df = s.reset_index(['id2'])
        # the assignment aligns m with df on id1
        df['m'] = m
        result = df.loc[df['s']==df['m']]
        del result['m']
        # restore the (id1, id2) MultiIndex
        result = result.set_index(['id2'], append=True)
        return result['s']
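The step that does the "merging" is the column assignment, which relies on pandas aligning on the index. The sketch below walks through it on the sample data from the question. It is an illustration rather than the answer's exact code: it calls .max() directly and uses drop(columns=...) in place of del, which I assume gives the same result.

```python
import pandas as pd

# Sample Series from the question.
index = pd.MultiIndex.from_tuples(
    [(1, "a"), (1, "b"), (1, "c"), (2, "a"),
     (3, "a"), (3, "b"), (3, "c"), (3, "d")],
    names=["id1", "id2"],
)
s = pd.Series([100, 10, 9, 2000, 5, 10, 15, 20], index=index, name="s")

# One maximum per group, indexed by id1 only.
m = s.groupby(level="id1").max()

# id2 becomes an ordinary column; id1 remains as a (non-unique) index.
df = s.reset_index("id2")

# Assignment aligns on the index, so every row receives its own group's max.
df["m"] = m
print(df)

# Keep the rows where the value equals the group max, then rebuild the index.
result = df.loc[df["s"] == df["m"]].drop(columns="m")
result = result.set_index("id2", append=True)["s"]
print(result)
```

Because the comparison is done once on whole columns rather than once per group, this version avoids calling a Python function for every group.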

Here is an example showing that using_merge, while more complicated, may be faster than using_transform:

    import numpy as np
    import pandas as pd

    def using_transform(s):
        return s[s.groupby(level=0).transform(lambda x: x == x.max()).astype(bool)]

    # 10**5 random values spread over 100 groups
    N = 10**5
    id1 = np.random.randint(100, size=N)
    id2 = np.random.choice(list('abcd'), size=N)
    index = pd.MultiIndex.from_arrays([id1, id2])
    ss = pd.Series(np.random.randint(100, size=N), index=index)
    ss.index.names = ['id1', 'id2']
    ss.name = 's'

Timing these two functions using IPython's %timeit magic yields:

    In [121]: %timeit using_merge(ss)
    100 loops, best of 3: 12.8 ms per loop

    In [122]: %timeit using_transform(ss)
    10 loops, best of 3: 45 ms per loop
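Outside IPython, a rough version of the same comparison can be made with the standard timeit module. The sketch below is an assumption-laden rework, not the original benchmark: it uses a seeded generator, a smaller N, .max() instead of agg(np.max), and drop(columns=...) instead of del. It also checks that both functions select the same values. Absolute timings will differ by machine and pandas version, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd


def using_transform(s):
    # Boolean mask: True where a row equals its group's maximum.
    mask = s.groupby(level=0).transform(lambda x: x == x.max()).astype(bool)
    return s[mask]


def using_merge(s):
    m = s.groupby(level=0).max()      # per-group max, indexed by id1
    df = s.reset_index(["id2"])       # id2 becomes a column
    df["m"] = m                       # aligned on id1
    result = df.loc[df["s"] == df["m"]].drop(columns="m")
    return result.set_index(["id2"], append=True)["s"]


# Seeded random data so repeated runs use the same input (N is an arbitrary choice).
rng = np.random.default_rng(0)
N = 10**4
id1 = rng.integers(100, size=N)
id2 = rng.choice(list("abcd"), size=N)
index = pd.MultiIndex.from_arrays([id1, id2], names=["id1", "id2"])
ss = pd.Series(rng.integers(100, size=N), index=index, name="s")

# Both approaches should pick the same rows in the same order.
same = using_merge(ss).tolist() == using_transform(ss).tolist()
print("same values:", same)

# Total time for 10 calls of each function, in seconds.
t_merge = timeit.timeit(lambda: using_merge(ss), number=10)
t_transform = timeit.timeit(lambda: using_transform(ss), number=10)
print(f"using_merge:     {t_merge:.4f} s")
print(f"using_transform: {t_transform:.4f} s")
```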


Published: 2023-11-22 18:45:19

Article link: https://www.elefans.com/category/jswz/34/1618575.html