Window function not working with PySpark SQLContext

Updated: 2024-10-25 20:17:34

Problem description

I have a data frame, and I want to roll the data up into 7-day windows and run some aggregations over some of the columns.

I have a PySpark SQL data frame like this:

|Sale_Date |P_1|P_2|P_3|G_1|G_2|G_3|Total_Sale|Sale_Amt|Promo_Disc_Amt|
|2013-04-10|  1|  9|  1|  1|  1|  1|         1|   295.0|           0.0|
|2013-04-11|  1|  9|  1|  1|  1|  1|         3|   567.0|           0.0|
|2013-04-12|  1|  9|  1|  1|  1|  1|         2|   500.0|         200.0|
|2013-04-13|  1|  9|  1|  1|  1|  1|         1|   245.0|          20.0|
|2013-04-14|  1|  9|  1|  1|  1|  1|         1|   245.0|           0.0|
|2013-04-15|  1|  9|  1|  1|  1|  1|         2|   500.0|         200.0|
|2013-04-16|  1|  9|  1|  1|  1|  1|         1|   250.0|           0.0|

I have applied a window function over the data frame as follows:

days = lambda i: i * 86400
windowSp = Window().partitionBy(dataframeOfquery3["P_1"], dataframeOfquery3["P_2"], dataframeOfquery3["P_3"], dataframeOfquery3["G_1"], dataframeOfquery3["G_2"], dataframeOfquery3["G_3"])\
    .orderBy(dataframeOfquery3["Sale_Date"].cast("timestamp").cast("long").desc())\
    .rangeBetween(-(days(7)), 0)

Now I want to perform some aggregation, i.e. apply window functions such as the following:

# .over() must be applied to the aggregate min(...), not to the column inside it
df = dataframeOfquery3.select(min(dataframeOfquery3["Sale_Date"]).over(windowSp).alias("Sale_Date"))
df.show()

But it gives the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o138.select. : org.apache.spark.sql.AnalysisException: Could not resolve window function 'min'. Note that, using window functions currently requires a HiveContext;

I am using Apache Spark 1.6.0, pre-built for Hadoop.

Recommended answer

The error pretty much says it all:

You'll need a version of Spark that supports Hive (built with Hive); then you can declare a HiveContext:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

and then use that context to run your window function.

In Python:

# sc is an existing SparkContext.
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

The difference between SQLContext and HiveContext, in brief:

Spark SQL has a SQLContext and a HiveContext. HiveContext is a superset of SQLContext, and the Spark community suggests using HiveContext. You can see this when you run spark-shell, your interactive driver application: it automatically creates a SparkContext defined as sc and a HiveContext defined as sqlContext. The HiveContext lets you execute SQL queries as well as Hive commands. The same behavior occurs for pyspark.
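As a side note on the window frame used in the question, here is a minimal pure-Python sketch of what a range frame such as rangeBetween(-days(7), 0) selects for each row. It does not use Spark at all. It assumes ascending order on the date expressed in seconds (the question orders descending), and it only looks at one partition of the sample table. The names `to_seconds`, `range_frame` and `rolled` are hypothetical helpers made up for this illustration.

```python
from datetime import date

SECONDS_PER_DAY = 86400


def days(i):
    # Same helper as in the question: a number of days expressed in seconds.
    return i * SECONDS_PER_DAY


# (Sale_Date, Sale_Amt) pairs for the single partition in the sample table.
rows = [
    ("2013-04-10", 295.0),
    ("2013-04-11", 567.0),
    ("2013-04-12", 500.0),
    ("2013-04-13", 245.0),
    ("2013-04-14", 245.0),
    ("2013-04-15", 500.0),
    ("2013-04-16", 250.0),
]


def to_seconds(day_string):
    # Hypothetical stand-in for cast("timestamp").cast("long"):
    # a timezone-independent count of seconds, one step of 86400 per day.
    year, month, day = (int(part) for part in day_string.split("-"))
    return date(year, month, day).toordinal() * SECONDS_PER_DAY


def range_frame(all_rows, current, lower, upper):
    # Keep every row whose ordering value lies in
    # [current + lower, current + upper], both bounds inclusive.
    t = to_seconds(current[0])
    return [r for r in all_rows if t + lower <= to_seconds(r[0]) <= t + upper]


# For each row: (Sale_Date, earliest date in its frame, sum of Sale_Amt in its frame)
rolled = []
for row in rows:
    frame = range_frame(rows, row, -days(7), 0)
    rolled.append((row[0], min(r[0] for r in frame), sum(r[1] for r in frame)))

for entry in rolled:
    print(entry)
```

Because both bounds are inclusive, a lower bound of -days(7) on daily data covers up to eight calendar dates per frame; -days(6) would cover seven. The sample data spans only seven dates, so here every frame reaches back to 2013-04-10.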

Published: 2023-10-25 00:05:52

Article link: https://www.elefans.com/category/jswz/34/1525376.html