I have a dataset like the following:
I would like to obtain a pandas dataframe that keeps, for each date, only the 2 rows with the highest population. The output should look like this:
I know that pandas offers a method called nlargest: pandas.pydata/pandas-docs/stable/reference/api/pandas.DataFrame.nlargest.html
but I don't think it applies directly to this use case. Is there a workaround?
Thanks a lot!
Recommended answer

I have mimicked your dataframe as shown below and provided a way to get the desired result; I hope it helps.

>>> df
         Date country  population
0  2019-12-31       A         100
1  2019-12-31       B          10
2  2019-12-31       C        1000
3  2020-01-01       A         200
4  2020-01-01       B          20
5  2020-01-01       C        3500
6  2020-01-01       D          12
7  2020-02-01       D        2000
8  2020-02-01       E          54
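To make the answer reproducible, the frame above can be rebuilt roughly as follows. This is a minimal sketch: the values are copied from the printed output, and storing Date as plain strings is an assumption (the original column may well hold datetimes).

```python
import pandas as pd

# Rebuild the sample frame from the printed output above.
# Assumption: dates are kept as ISO-formatted strings, which sort chronologically.
df = pd.DataFrame(
    {
        "Date": ["2019-12-31"] * 3 + ["2020-01-01"] * 4 + ["2020-02-01"] * 2,
        "country": list("ABCABCDDE"),
        "population": [100, 10, 1000, 200, 20, 3500, 12, 2000, 54],
    }
)

print(df)
```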
Your desired solution:

You can use the nlargest method together with the set_index and groupby methods.
This is what you will get:

>>> df.set_index('country').groupby('Date')['population'].nlargest(2)
Date        country
2019-12-31  C          1000
            A           100
2020-01-01  C          3500
            A           200
2020-02-01  D          2000
            E            54
Name: population, dtype: int64

The result is a Series indexed by Date and country. To get back a DataFrame with the original columns, reset the index, which gives you the following:
>>> df.set_index('country').groupby('Date')['population'].nlargest(2).reset_index()
         Date country  population
0  2019-12-31       C        1000
1  2019-12-31       A         100
2  2020-01-01       C        3500
3  2020-01-01       A         200
4  2020-02-01       D        2000
5  2020-02-01       E          54
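The steps above can be put together in a self-contained sketch. The helper name top_n_per_date and the string-typed Date column are assumptions made for illustration; the chain of set_index, groupby, nlargest and reset_index is the one from the answer.

```python
import pandas as pd

# Sample data mirroring the frame in the answer (dates as strings is an assumption).
df = pd.DataFrame(
    {
        "Date": ["2019-12-31"] * 3 + ["2020-01-01"] * 4 + ["2020-02-01"] * 2,
        "country": list("ABCABCDDE"),
        "population": [100, 10, 1000, 200, 20, 3500, 12, 2000, 54],
    }
)


def top_n_per_date(frame: pd.DataFrame, n: int = 2) -> pd.DataFrame:
    """Hypothetical helper: keep the n most populous countries for each date."""
    return (
        frame.set_index("country")       # carry country along as the index
        .groupby("Date")["population"]   # one group of populations per date
        .nlargest(n)                     # Series indexed by (Date, country)
        .reset_index()                   # back to flat columns
    )


top2 = top_n_per_date(df)
print(top2)
```

Note that only the column moved into the index (country here) survives alongside Date and population; a frame with more columns would need them in set_index as well, or a different approach.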
Another way around:

Use groupby with apply, then call reset_index with the level parameter and drop=True to discard the index levels that apply adds:
>>> # or: df.groupby('Date').apply(lambda p: p.nlargest(2, columns='population')).reset_index(level=['Date',1], drop=True)
>>> df.groupby('Date').apply(lambda p: p.nlargest(2, columns='population')).reset_index(level=[0,1], drop=True)
         Date country  population
0  2019-12-31       C        1000
1  2019-12-31       A         100
2  2020-01-01       C        3500
3  2020-01-01       A         200
4  2020-02-01       D        2000
5  2020-02-01       E          54
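As a further variation, not part of the answer above, the same result can be sketched without apply by sorting first and then taking the head of each group. This may be preferable because the way groupby.apply treats the grouping column differs between pandas versions. The sample data and string-typed dates are assumptions carried over from the frame shown earlier.

```python
import pandas as pd

# Sample data mirroring the frame in the answer (dates as strings is an assumption).
df = pd.DataFrame(
    {
        "Date": ["2019-12-31"] * 3 + ["2020-01-01"] * 4 + ["2020-02-01"] * 2,
        "country": list("ABCABCDDE"),
        "population": [100, 10, 1000, 200, 20, 3500, 12, 2000, 54],
    }
)

# Sort by date, then by population descending, so that the first rows of each
# date group are its most populous countries; head(2) then keeps those rows.
top2_sorted = (
    df.sort_values(["Date", "population"], ascending=[True, False])
    .groupby("Date")
    .head(2)
    .reset_index(drop=True)  # replace the original row labels with 0..n-1
)

print(top2_sorted)
```

Unlike the set_index variant, this keeps every column of the original frame without listing them explicitly.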