Remove all rows that are duplicates with respect to certain columns

Updated: 2024-10-28 17:23:07

Problem description

I've seen a couple questions like this but not a satisfactory answer for my situation. Here is a sample DataFrame:

+------+-----+----+
|    id|value|type|
+------+-----+----+
|283924|  1.5|   0|
|283924|  1.5|   1|
|982384|  3.0|   0|
|982384|  3.0|   1|
|892383|  2.0|   0|
|892383|  2.5|   1|
+------+-----+----+

I want to identify duplicates by just the "id" and "value" columns, and then remove all instances.

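In this case:

Rows 1 and 2 are duplicates (again, we are ignoring the "type" column).
Rows 3 and 4 are duplicates, so only rows 5 and 6 should be kept.

The output would be: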

+------+-----+----+
|    id|value|type|
+------+-----+----+
|892383|  2.5|   1|
|892383|  2.0|   0|
+------+-----+----+

I've tried

df.dropDuplicates(subset = ['id', 'value'], keep = False)

But the "keep" feature isn't in PySpark (as it is in pandas.DataFrame.drop_duplicates).

How else can I do this?

Recommended answer

You can do that using window functions:

from pyspark.sql import Window, functions as F

# Count the rows sharing each (id, value) pair, then keep only the pairs seen once.
df.withColumn(
    'fg',
    F.count("id").over(Window.partitionBy("id", "value"))
).where("fg = 1").drop("fg").show()
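To make the logic of the window-function answer easier to follow, here is a minimal plain-Python sketch of the same idea: count how many rows share each (id, value) key, then keep only rows whose key occurs once. It is an illustration only, not PySpark code. The helper name drop_all_duplicates is hypothetical, and the rows are the sample data from the question.

```python
from collections import Counter

# Sample rows copied from the question: (id, value, type)
rows = [
    (283924, 1.5, 0),
    (283924, 1.5, 1),
    (982384, 3.0, 0),
    (982384, 3.0, 1),
    (892383, 2.0, 0),
    (892383, 2.5, 1),
]


def drop_all_duplicates(rows):
    """Hypothetical helper: keep rows whose (id, value) key occurs exactly once."""
    # Plays the role of count(...) over partitionBy("id", "value")
    counts = Counter((row[0], row[1]) for row in rows)
    # Plays the role of where("fg = 1")
    return [row for row in rows if counts[(row[0], row[1])] == 1]


unique_rows = drop_all_duplicates(rows)
print(unique_rows)  # [(892383, 2.0, 0), (892383, 2.5, 1)]
```

Unlike a Spark DataFrame, a Python list keeps its input order, so the surviving rows come back in the order they were given; Spark makes no such ordering guarantee after a window operation.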

Published: 2023-10-19 15:13:22

Article link: https://www.elefans.com/category/jswz/34/1507903.html