自定义pySpark中的大型数据集比较

编程入门 行业动态 更新时间:2024-10-14 22:14:00
本文介绍了自定义pySpark中的大型数据集比较的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧! 问题描述

我正在使用下面的代码比较两个数据框并确定差异.但是,我注意到我只是在覆盖我的值(combine_df).我的目标是如果行值不同则标记.但不确定我在做什么错.

I'm using the code below to compare two dataframe and identified differences. However, I'm noticing that I'm simply overwriting my values ( combine_df). My goal is to Flag if row values are different. But not sure what I"m doing wrong.

#Find the overlapping columns in order to compare their values cols = set(module_df.columns) & (set(expected_df.columns))

#create filter dataframes only with the overlapping columns filter_module = expected_df.select(list(cols)) filter_expected = expected_df.select(list(cols))

#create Flag columns to serve as identifier filter_module = filter_module.withColumn('FLAG',lit('module')) filter_expected = filter_expected.withColumn('FLAG',lit('expected')) #join dataframes combine_df = filter_module.union(filter_expected) #get column names in order to iterate/partition through combine_cols = combine_df.columns combine_cols.remove('FLAG')

#leverage Windows function my_window = Window.partitionBy(combine_cols).rowsBetween(-sys.maxsize, sys.maxsize) #dataframe with validation flag combine_df = combine_df.withColumn('FLAG', when((count('*').over(my_window) > 1),'SAME').otherwise(col('FLAG'))).dropDuplicates()

module_df

+---------------+------------------------+-----------+-----------------+------------------------------+ | ClientLeaseID|ConformedLeaseStatusName|LeaseStatus|LeaseRecoveryType|ConformedLeaseRecoveryTypeName| +---------------+------------------------+-----------+-----------------+------------------------------+ | 9755| Not Specified at ...| Budweiser| N/A| Gross| | 10065| Not Specified at ...| null| N/A| Gross| | 20133| Active| N| N/A| Gross| | 20134| Active| N| N/A| Gross| | 10141| Not Specified at ...| null| N/A| Gross| |US5643ENTLAN.01| Active| N| GROSS| Gross| | 10895| Active| N| N/A| Gross| | 10590| Active| N| N/A| Gross|

expected_df

+---------------+------------------------+-----------+-----------------+------------------------------+ | ClientLeaseID|ConformedLeaseStatusName|LeaseStatus|LeaseRecoveryType|ConformedLeaseRecoveryTypeName| +---------------+------------------------+-----------+-----------------+------------------------------+ | 9755| Lookup Not Found| Budweiser| N/A| Lookup Not Found| | 10065| Not Specified at ...| null| N/A| Lookup Not Found| | 20133| Active| N| N/A| Lookup Not Found| | 20134| Active| N| N/A| Lookup Not Found| | 10141| Not Specified at ...| null| N/A| Lookup Not Found| |US5643ENTLAN.01| Active| N| GROSS| Lookup Not Found| | 10895| Active| N| N/A| Lookup Not Found| | 10590| Active| N| N/A| Lookup Not Found|

combine_df

+-------------+------------------------+-----------+-----------------+------------------------------+----+ |ClientLeaseID|ConformedLeaseStatusName|LeaseStatus|LeaseRecoveryType|ConformedLeaseRecoveryTypeName|FLAG| +-------------+------------------------+-----------+-----------------+------------------------------+----+ | 9755| Lookup Not Found| Budweiser| N/A| Lookup Not Found|SAME| +-------------+------------------------+-----------+-----------------+------------------------------+---- 10065| Not Specified at ...| null| N/A| Lookup Not Found|SAME +-------------+------------------------+-----------+-----------------+------------------------------+----+ | 20133| Active| N| N/A| Lookup Not Found|SAME| +-------------+------------------------+-----------+-----------------+------------------------------+----+-------------+------------------------+-----------+-----------------+------------------------------+----+ | 20134| Active| N| N/A| Lookup Not Found|SAME| +-------------+------------------------+-----------+-----------------+------------------------------+----+-------------+------------------------+-----------+-----------------+------------------------------+----+ | 10141| Not Specified at ...| null| N/A| Lookup Not Found|SAME| +-------------+------------------------+-----------+-----------------+------------------------------+----+---------------+------------------------+-----------+-----------------+------------------------------+----+ |US5643ENTLAN.01| Active| N| GROSS| Lookup Not Found|SAME| +---------------+------------------------+-----------+-----------------+------------------------------+----+

推荐答案

您是否使用了正确的df

Have you used correct df

#instead of this filter_module = expected_df.select(list(cols)) filter_expected = expected_df.select(list(cols)) #use this filter_module = module_df.select(list(cols)) filter_expected = expected_df.select(list(cols))

更多推荐

自定义pySpark中的大型数据集比较

本文发布于:2023-11-21 23:45:32,感谢您对本站的认可!
本文链接:https://www.elefans.com/category/jswz/34/1615142.html
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
本文标签:自定义   数据   pySpark

发布评论

评论列表 (有 0 条评论)
草根站长

>www.elefans.com

编程频道|电子爱好者 - 技术资讯及电子产品介绍!