Set difference between dataframes on a column

Updated: 2024-10-27 13:28:21

Problem description

Note: This question is inspired by the ideas discussed in this other post: DataFrame algebra in Pandas

Say I have two dataframes A and B and that for some column col_name, their values are:

A[col_name] | B[col_name]
------------|------------
1           | 3
2           | 4
3           | 5
4           | 6

I want to compute the set difference between A and B based on col_name. The result of this operation should be:

The rows of A where A[col_name] didn't match any entries in B[col_name].

Below is the result for the above example (showing other columns of A as well):

A[col_name] | A[other_column_1] | A[other_column_2]
------------+-------------------+------------------
1           | 'foo'             | 'xyz' ....
2           | 'bar'             | 'abc'

Keep in mind that some entries in A[col_name] and B[col_name] could hold the value np.NaN. I would like to treat those entries as undefined BUT different, i.e. the set difference should return them.

How can I do this in Pandas? (generalizing to a difference on multiple columns would be great as well)

Recommended answer

One way is to use the Series isin method:

In [9]: import numpy as np
In [10]: import pandas as pd
In [11]: df1 = pd.DataFrame([[1, 'foo'], [2, 'bar'], [3, 'meh'], [4, 'baz']], columns=['A', 'B'])
In [12]: df2 = pd.DataFrame([[3, 'a'], [4, 'b']], columns=['A', 'C'])

Now you can check whether each item in df1['A'] is in df2['A']:

In [13]: df1['A'].isin(df2['A'])
Out[13]:
0    False
1    False
2     True
3     True
Name: A, dtype: bool

In [14]: df1[~df1['A'].isin(df2['A'])]  # not in df2['A']
Out[14]:
   A    B
0  1  foo
1  2  bar

I think this does what you want for NaNs too:

In [21]: df1 = pd.DataFrame([[1, 'foo'], [np.nan, 'bar'], [3, 'meh'], [np.nan, 'baz']], columns=['A', 'B'])
In [22]: df2 = pd.DataFrame([[3], [np.nan]], columns=['A'])
In [23]: df1[~df1['A'].isin(df2['A'])]
Out[23]:
     A    B
0  1.0  foo
1  NaN  bar
3  NaN  baz
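The output above reflects the pandas version used when the answer was written. As far as I know, recent pandas versions treat NaN as matching NaN inside isin, so the NaN rows of df1 would be dropped whenever df2['A'] contains a NaN. A minimal sketch that keeps them regardless of version, assuming it is acceptable to drop NaN from the lookup values first (the names matched and result are only illustrative):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 'foo'], [np.nan, 'bar'], [3, 'meh'], [np.nan, 'baz']],
                   columns=['A', 'B'])
df2 = pd.DataFrame([[3], [np.nan]], columns=['A'])

# Remove NaN from the lookup values so a NaN in df1['A'] can never match.
matched = df1['A'].isin(df2['A'].dropna())

# Keep the rows of df1 whose 'A' value was not found in df2['A'].
result = df1[~matched]
print(result)
```

Dropping NaN on the df2 side is enough here, because a NaN in df1['A'] then has nothing left to match against.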

Note: for large frames, it may be worth making these columns the index (as discussed in the other question).

One way to merge on two or more columns is to use a dummy column:

In [31]: df1 = pd.DataFrame([[1, 'foo'], [np.nan, 'bar'], [4, 'meh'], [np.nan, 'eurgh']], columns=['A', 'B'])
In [32]: df2 = pd.DataFrame([[np.nan, 'bar'], [4, 'meh']], columns=['A', 'B'])
In [33]: cols = ['A', 'B']
In [34]: df2['dummy'] = df2[cols].isnull().any(axis=1)  # rows with NaNs in cols will be True
In [35]: merged = df1.merge(df2[cols + ['dummy']], how='left')
In [36]: merged
Out[36]:
    A      B  dummy
0   1    foo    NaN
1 NaN    bar   True
2   4    meh  False
3 NaN  eurgh    NaN

Rows with a boolean in dummy were present in df2; True marks those that have a NaN in one of the merging columns. Following your spec, we should drop the rows where dummy is False:

In [37]: merged.loc[merged.dummy != False, df1.columns]
Out[37]:
    A      B
0   1    foo
1 NaN    bar
3 NaN  eurgh
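A possible alternative to the dummy column, sketched under the assumption that your pandas version supports the indicator=True option of merge. The helper name set_difference is hypothetical and not part of pandas. It drops key rows containing NaN from the right-hand frame before merging, so rows with a NaN key in the left-hand frame are always kept:

```python
import numpy as np
import pandas as pd

def set_difference(left, right, cols):
    """Return the rows of `left` whose values in `cols` match no row of `right`.

    Hypothetical helper: key rows containing NaN are treated as never matching.
    """
    # Keep only complete, unique key rows from `right`; this way the merge
    # cannot pair NaN with NaN and cannot duplicate rows of `left`.
    keys = right[cols].dropna().drop_duplicates()
    # The '_merge' column says whether each left row found a partner.
    merged = left.merge(keys, on=cols, how='left', indicator=True)
    keep = (merged['_merge'] == 'left_only').to_numpy()
    # Apply the positional mask to `left` so its original index is preserved.
    return left[keep]

df1 = pd.DataFrame([[1, 'foo'], [np.nan, 'bar'], [4, 'meh'], [np.nan, 'eurgh']],
                   columns=['A', 'B'])
df2 = pd.DataFrame([[np.nan, 'bar'], [4, 'meh']], columns=['A', 'B'])

result = set_difference(df1, df2, ['A', 'B'])
print(result)
```

Because the key rows are de-duplicated before the merge, the merged frame has one row per row of left, which is what makes the positional mask safe.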

Inelegant.

Published: 2023-10-19 19:09:03

Article link: https://www.elefans.com/category/jswz/34/1508490.html