Problem Description
I followed a post on StackOverflow about returning the maximum of a column grouped by another column, and got an unexpected Java exception.
Here is the test data:
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # predefined in Databricks notebooks and the pyspark shell

data = [('a', 5), ('a', 8), ('a', 7), ('b', 1), ('b', 3)]
df = spark.createDataFrame(data, ["A", "B"])
df.show()
+---+---+
| A| B|
+---+---+
| a| 5|
| a| 8|
| a| 7|
| b| 1|
| b| 3|
+---+---+
Here is the solution that allegedly works for other users:
from pyspark.sql import Window

# Compute the per-group max of B over a window partitioned by A,
# then keep only the rows where B equals that max.
w = Window.partitionBy('A')
df.withColumn('maxB', f.max('B').over(w))\
  .where(f.col('B') == f.col('maxB'))\
  .drop('maxB').show()
It should produce this output:
+---+---+
| A| B|
+---+---+
| a| 8|
| b| 3|
+---+---+
Instead, I get:
java.lang.UnsupportedOperationException: Cannot evaluate expression: max(input[2, bigint, false]) windowspecdefinition(input[0, string, true], specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$()))
I have only tried this on Spark 2.4 on Databricks. I tried the equivalent SQL syntax and got the same error.
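For reference, here is a minimal sketch of that equivalent SQL form (the temp view name t is illustrative, not from the original post); per the question, the SQL version hits the same error on Spark 2.4:

df.createOrReplaceTempView("t")  # "t" is a hypothetical view name
spark.sql("""
    SELECT A, B
    FROM (
        SELECT A, B, max(B) OVER (PARTITION BY A) AS maxB
        FROM t
    ) windowed
    WHERE B = maxB
""").show()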
Recommended Answer
Databricks Support was able to reproduce the issue on Spark 2.4 but not on earlier versions. Apparently, it arises from a difference in the way the physical plan is formulated (I can post their response if requested). A fix is planned.
Meanwhile, here is one alternative solution to the original problem that does not fall prey to the version 2.4 issue:
df.withColumn("maxB", f.max('B').over(w)).drop('B').distinct().show()
+---+----+
| A|maxB|
+---+----+
| b| 3|
| a| 8|
+---+----+
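If keeping the original column name B matters, a plain groupBy aggregation is another sketch that avoids the window function entirely (standard PySpark API, not part of the Databricks response):

# Aggregate the max of B per group; no window expression is evaluated.
df.groupBy('A').agg(f.max('B').alias('B')).show()

Like the distinct() workaround above, this returns one row per group, and row order is not guaranteed.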