The drop duplicates methods of Spark DataFrames are not working, and I think it is because the index column that is part of my dataset is being treated as a column of data. There definitely are duplicates in there: I checked by comparing COUNT() and COUNT(DISTINCT()) on all the columns except the index. I'm new to Spark DataFrames, but if I were using Pandas, at this point I would call pandas.DataFrame.set_index on that column.
Does anyone know how to handle this situation?
Secondly, there appear to be two methods on a Spark DataFrame, drop_duplicates and dropDuplicates. Are they the same?
Best answer
If you don't want the index column to be considered when checking for distinct records, you can drop that column with the first command below, or select only the columns you need with the second.
df = df.drop('p_index')  # Pass the name of the column to drop
df = df.select('name', 'age')  # Pass the required columns

drop_duplicates() is an alias for dropDuplicates().
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates