I am trying to include null values in collect_list while using pyspark, however the collect_list operation excludes nulls. I have looked into the following post: Pypsark - Retain null values when using collect_list. However, the answer given is not what I am looking for.
I have a dataframe df like this.
| id | family | date       |
|----|--------|------------|
| 1  | Prod   | null       |
| 2  | Dev    | 2019-02-02 |
| 3  | Prod   | 2017-03-08 |

Here is my current code:
```python
df.groupby("family").agg(f.collect_list("date").alias("entry_date"))
```
This gives me an output like this:
| family | date         |
|--------|--------------|
| Prod   | [2017-03-08] |
| Dev    | [2019-02-02] |

What I actually want is:
| family | date               |
|--------|--------------------|
| Prod   | [null, 2017-03-08] |
| Dev    | [2019-02-02]       |
Can someone please help me with this? Thank you!
Recommended answer
A possible workaround for this could be to replace all null values with another value. (Perhaps not the best way to do this, but it's a solution nonetheless.)
```python
df = df.na.fill("my_null")  # Replace null with "my_null"
df = df.groupby("family").agg(f.collect_list("date").alias("entry_date"))
```

This should give you:
| family | date                  |
|--------|-----------------------|
| Prod   | [my_null, 2017-03-08] |
| Dev    | [2019-02-02]          |