本文介绍了自定义pySpark中的大型数据集比较的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
我正在使用下面的代码比较两个数据框并确定差异.但是,我注意到我只是在覆盖我的值(combine_df).我的目标是如果行值不同则标记.但不确定我在做什么错.
I'm using the code below to compare two dataframe and identified differences. However, I'm noticing that I'm simply overwriting my values ( combine_df). My goal is to Flag if row values are different. But not sure what I"m doing wrong.
#Find the overlapping columns in order to compare their values cols = set(module_df.columns) & (set(expected_df.columns))#create filter dataframes only with the overlapping columns filter_module = expected_df.select(list(cols)) filter_expected = expected_df.select(list(cols))
#create Flag columns to serve as identifier filter_module = filter_module.withColumn('FLAG',lit('module')) filter_expected = filter_expected.withColumn('FLAG',lit('expected')) #join dataframes combine_df = filter_module.union(filter_expected) #get column names in order to iterate/partition through combine_cols = combine_df.columns combine_cols.remove('FLAG')
#leverage Windows function my_window = Window.partitionBy(combine_cols).rowsBetween(-sys.maxsize, sys.maxsize) #dataframe with validation flag combine_df = combine_df.withColumn('FLAG', when((count('*').over(my_window) > 1),'SAME').otherwise(col('FLAG'))).dropDuplicates()
module_df
+---------------+------------------------+-----------+-----------------+------------------------------+ | ClientLeaseID|ConformedLeaseStatusName|LeaseStatus|LeaseRecoveryType|ConformedLeaseRecoveryTypeName| +---------------+------------------------+-----------+-----------------+------------------------------+ | 9755| Not Specified at ...| Budweiser| N/A| Gross| | 10065| Not Specified at ...| null| N/A| Gross| | 20133| Active| N| N/A| Gross| | 20134| Active| N| N/A| Gross| | 10141| Not Specified at ...| null| N/A| Gross| |US5643ENTLAN.01| Active| N| GROSS| Gross| | 10895| Active| N| N/A| Gross| | 10590| Active| N| N/A| Gross|expected_df
+---------------+------------------------+-----------+-----------------+------------------------------+ | ClientLeaseID|ConformedLeaseStatusName|LeaseStatus|LeaseRecoveryType|ConformedLeaseRecoveryTypeName| +---------------+------------------------+-----------+-----------------+------------------------------+ | 9755| Lookup Not Found| Budweiser| N/A| Lookup Not Found| | 10065| Not Specified at ...| null| N/A| Lookup Not Found| | 20133| Active| N| N/A| Lookup Not Found| | 20134| Active| N| N/A| Lookup Not Found| | 10141| Not Specified at ...| null| N/A| Lookup Not Found| |US5643ENTLAN.01| Active| N| GROSS| Lookup Not Found| | 10895| Active| N| N/A| Lookup Not Found| | 10590| Active| N| N/A| Lookup Not Found|combine_df
+-------------+------------------------+-----------+-----------------+------------------------------+----+ |ClientLeaseID|ConformedLeaseStatusName|LeaseStatus|LeaseRecoveryType|ConformedLeaseRecoveryTypeName|FLAG| +-------------+------------------------+-----------+-----------------+------------------------------+----+ | 9755| Lookup Not Found| Budweiser| N/A| Lookup Not Found|SAME| +-------------+------------------------+-----------+-----------------+------------------------------+---- 10065| Not Specified at ...| null| N/A| Lookup Not Found|SAME +-------------+------------------------+-----------+-----------------+------------------------------+----+ | 20133| Active| N| N/A| Lookup Not Found|SAME| +-------------+------------------------+-----------+-----------------+------------------------------+----+-------------+------------------------+-----------+-----------------+------------------------------+----+ | 20134| Active| N| N/A| Lookup Not Found|SAME| +-------------+------------------------+-----------+-----------------+------------------------------+----+-------------+------------------------+-----------+-----------------+------------------------------+----+ | 10141| Not Specified at ...| null| N/A| Lookup Not Found|SAME| +-------------+------------------------+-----------+-----------------+------------------------------+----+---------------+------------------------+-----------+-----------------+------------------------------+----+ |US5643ENTLAN.01| Active| N| GROSS| Lookup Not Found|SAME| +---------------+------------------------+-----------+-----------------+------------------------------+----+推荐答案
您是否使用了正确的df
Have you used correct df
#instead of this filter_module = expected_df.select(list(cols)) filter_expected = expected_df.select(list(cols)) #use this filter_module = module_df.select(list(cols)) filter_expected = expected_df.select(list(cols))更多推荐
自定义pySpark中的大型数据集比较
发布评论