自定义pySpark中的大型数据集比较

编程入门行业动态更新时间:2024-10-14 22:14:00

本文介绍了自定义pySpark中的大型数据集比较的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！问题描述

我正在使用下面的代码比较两个数据框并确定差异.但是，我注意到我只是在覆盖我的值(combine_df).我的目标是如果行值不同则标记.但不确定我在做什么错.

I'm using the code below to compare two dataframe and identified differences. However, I'm noticing that I'm simply overwriting my values ( combine_df). My goal is to Flag if row values are different. But not sure what I"m doing wrong.

#Find the overlapping columns in order to compare their values cols = set(module_df.columns) & (set(expected_df.columns))

#create filter dataframes only with the overlapping columns filter_module = expected_df.select(list(cols)) filter_expected = expected_df.select(list(cols))

#create Flag columns to serve as identifier filter_module = filter_module.withColumn('FLAG',lit('module')) filter_expected = filter_expected.withColumn('FLAG',lit('expected')) #join dataframes combine_df = filter_module.union(filter_expected) #get column names in order to iterate/partition through combine_cols = combine_df.columns combine_cols.remove('FLAG')

#leverage Windows function my_window = Window.partitionBy(combine_cols).rowsBetween(-sys.maxsize, sys.maxsize) #dataframe with validation flag combine_df = combine_df.withColumn('FLAG', when((count('*').over(my_window) > 1),'SAME').otherwise(col('FLAG'))).dropDuplicates()

module_df

+---------------+------------------------+-----------+-----------------+------------------------------+ | ClientLeaseID|ConformedLeaseStatusName|LeaseStatus|LeaseRecoveryType|ConformedLeaseRecoveryTypeName| +---------------+------------------------+-----------+-----------------+------------------------------+ | 9755| Not Specified at ...| Budweiser| N/A| Gross| | 10065| Not Specified at ...| null| N/A| Gross| | 20133| Active| N| N/A| Gross| | 20134| Active| N| N/A| Gross| | 10141| Not Specified at ...| null| N/A| Gross| |US5643ENTLAN.01| Active| N| GROSS| Gross| | 10895| Active| N| N/A| Gross| | 10590| Active| N| N/A| Gross|

expected_df

+---------------+------------------------+-----------+-----------------+------------------------------+ | ClientLeaseID|ConformedLeaseStatusName|LeaseStatus|LeaseRecoveryType|ConformedLeaseRecoveryTypeName| +---------------+------------------------+-----------+-----------------+------------------------------+ | 9755| Lookup Not Found| Budweiser| N/A| Lookup Not Found| | 10065| Not Specified at ...| null| N/A| Lookup Not Found| | 20133| Active| N| N/A| Lookup Not Found| | 20134| Active| N| N/A| Lookup Not Found| | 10141| Not Specified at ...| null| N/A| Lookup Not Found| |US5643ENTLAN.01| Active| N| GROSS| Lookup Not Found| | 10895| Active| N| N/A| Lookup Not Found| | 10590| Active| N| N/A| Lookup Not Found|

combine_df

推荐答案

您是否使用了正确的df

Have you used correct df

#instead of this filter_module = expected_df.select(list(cols)) filter_expected = expected_df.select(list(cols)) #use this filter_module = module_df.select(list(cols)) filter_expected = expected_df.select(list(cols))

更多推荐

自定义pySpark中的大型数据集比较

本文发布于:2023-11-21 23:45:32，感谢您对本站的认可！

本文链接:https://www.elefans.com/category/jswz/34/1615142.html