pandas: How to filter a DataFrame for duplicate items that occur at least n times

Updated: 2024-10-25 08:15:52

Question

I have a Pandas DataFrame that contains duplicate entries; some items are listed twice or three times. I would like to filter it so that it only shows items that are listed at least n times:

The DataFrame contains 3 columns: ['colA', 'colB', 'colC']. Only 'colB' should be considered when determining whether an item is listed multiple times.
Note: this is not drop_duplicates(). It is the opposite: I would like to drop items that appear in the DataFrame fewer than n times.
The end result should list each item only once.

Recommended answer

You can use value_counts to get the count of each item, build a boolean mask from those counts, take the index of the entries that pass the mask, and then test membership against that index using isin:

In [3]: df = pd.DataFrame({'a':[0,0,0,1,2,2,3,3,3,3,3,3,4,4,4]})
        df
Out[3]:
    a
0   0
1   0
2   0
3   1
4   2
5   2
6   3
7   3
8   3
9   3
10  3
11  3
12  4
13  4
14  4

In [8]: df[df['a'].isin(df['a'].value_counts()[df['a'].value_counts()>2].index)]
Out[8]:
    a
0   0
1   0
2   0
6   3
7   3
8   3
9   3
10  3
11  3
12  4
13  4
14  4

So breaking the above down:

In [9]: df['a'].value_counts() > 2
Out[9]:
3     True
4     True
0     True
2    False
1    False
Name: a, dtype: bool

In [10]: # construct a boolean mask
         df['a'].value_counts()[df['a'].value_counts()>2]
Out[10]:
3    6
4    3
0    3
Name: a, dtype: int64

In [11]: # we're interested in the index here, pass this to isin
         df['a'].value_counts()[df['a'].value_counts()>2].index
Out[11]: Int64Index([3, 4, 0], dtype='int64')
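The steps above can be wrapped into a small helper that takes the threshold as a parameter. This is only a sketch: the function name filter_min_count and its arguments are hypothetical, and it uses >= n to match the "at least n times" wording of the question, whereas the answer's > 2 corresponds to n = 3.

```python
import pandas as pd


def filter_min_count(df, column, n):
    """Keep rows whose value in `column` occurs at least `n` times.

    Hypothetical helper, not part of pandas.
    """
    counts = df[column].value_counts()      # occurrences of each distinct value
    frequent = counts[counts >= n].index    # values seen at least n times
    return df[df[column].isin(frequent)]    # rows whose value is in that set


df = pd.DataFrame({'a': [0, 0, 0, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4]})
result = filter_min_count(df, 'a', 3)
print(result)
```

Computing value_counts once and reusing it also avoids the repeated call that appears in the one-liner.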

Edit

As user @JonClements suggested, a simpler and faster method is to groupby on the column of interest and filter it:

In [4]: df.groupby('a').filter(lambda x: len(x) > 2)
Out[4]:
    a
0   0
1   0
2   0
6   3
7   3
8   3
9   3
10  3
11  3
12  4
13  4
14  4
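A related variant, which is not from the original answer, replaces the Python lambda with groupby(...).transform('size'). It yields a Series of group sizes aligned with the original rows, which can then be compared against a threshold. The variable names n, group_sizes and filtered below are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4]})
n = 3  # assumed threshold: keep values that occur at least n times

# For every row, the size of the group its 'a' value belongs to.
group_sizes = df.groupby('a')['a'].transform('size')

# Boolean mask on the original frame; row order and index are preserved.
filtered = df[group_sizes >= n]
print(filtered)
```

Whether this is quicker than filter with a lambda depends on the data, so it is worth timing both on a realistic frame before choosing.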

Edit 2

To get just a single entry for each repeated item, call drop_duplicates and pass the parameter subset='a':

In [2]: df.groupby('a').filter(lambda x: len(x) > 2).drop_duplicates(subset='a')
Out[2]:
    a
0   0
6   3
12  4
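Applied to the three-column layout described in the question, the same chain might look like the sketch below. Only the column names colA, colB and colC come from the question; the sample values and the threshold n = 2 are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    'colA': [1, 2, 3, 4, 5, 6],
    'colB': ['x', 'y', 'x', 'z', 'x', 'y'],
    'colC': [10, 20, 30, 40, 50, 60],
})
n = 2  # assumed threshold

result = (
    df.groupby('colB')
      .filter(lambda g: len(g) >= n)      # keep colB values seen at least n times
      .drop_duplicates(subset='colB')     # then keep one row per colB value
)
print(result)
```

By default drop_duplicates keeps the first occurrence of each colB value, so the colA and colC values shown come from that first row.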

Published: 2023-11-30 04:02:03

Article link: https://www.elefans.com/category/jswz/34/1648649.html