Including null values in pyspark's collect_list

Updated: 2024-10-19 06:17:37

This article covers how to include null values in pyspark's collect_list.

Problem description

I am trying to include null values in collect_list while using pyspark, but the collect_list operation excludes nulls. I have looked at the post "Pyspark - Retain null values when using collect_list". However, the answer given there is not what I am looking for.

I have a dataframe df like this:

| id | family | date       |
----------------------------
| 1  | Prod   | null       |
| 2  | Dev    | 2019-02-02 |
| 3  | Prod   | 2017-03-08 |

This is my current code:

from pyspark.sql import functions as f
df.groupby("family").agg(f.collect_list("date").alias("entry_date"))

This gives me output like this:

| family | entry_date   |
-------------------------
| Prod   | [2017-03-08] |
| Dev    | [2019-02-02] |

What I actually want is:

| family | entry_date         |
-------------------------------
| Prod   | [null, 2017-03-08] |
| Dev    | [2019-02-02]       |

Can someone please help me with this? Thank you!

Recommended answer

A possible workaround is to replace all null values with another value before collecting. (Perhaps not the best way to do this, but it is a solution nonetheless.)

# Replace null with "my_null" (a string fill value only affects string-typed columns)
df = df.na.fill("my_null")
df = df.groupby("family").agg(f.collect_list("date").alias("entry_date"))

This should give you:

| family | entry_date            |
----------------------------------
| Prod   | [my_null, 2017-03-08] |
| Dev    | [2019-02-02]          |


Published: 2023-11-22 05:48:21

Article link: https://www.elefans.com/category/jswz/34/1616199.html