This post covers how to add a column from another DataFrame in PySpark.
Problem description
In Scala Spark, I can easily add a column to an existing DataFrame by writing:
val newDf = df.withColumn("date_min", anotherDf("date_min"))
Doing the same in PySpark results in an AnalysisException.
Here is what I am doing:
minDf.show(5)
maxDf.show(5)

+--------------------+
|            date_min|
+--------------------+
|2016-11-01 10:50:...|
|2016-11-01 11:46:...|
|2016-11-01 19:23:...|
|2016-11-01 17:01:...|
|2016-11-01 09:00:...|
+--------------------+
only showing top 5 rows

+--------------------+
|            date_max|
+--------------------+
|2016-11-01 10:50:...|
|2016-11-01 11:46:...|
|2016-11-01 19:23:...|
|2016-11-01 17:01:...|
|2016-11-01 09:00:...|
+--------------------+
only showing top 5 rows

And then, this is what results in an error:
newDf = minDf.withColumn("date_max", maxDf["date_max"])

AnalysisException                         Traceback (most recent call last)
<ipython-input-13-7e19c841fa51> in <module>()
      2 maxDf.show(5)
      3
----> 4 newDf = minDf.withColumn("date_max", maxDf["date_max"])

/opt/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in withColumn(self, colName, col)
   1491         """
   1492         assert isinstance(col, Column), "col should be Column"
-> 1493         return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
   1494
   1495     @ignore_unicode_prefix

/opt/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134
   1135         for temp_arg in temp_args:

/opt/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     67                                              e.java_exception.getStackTrace()))
     68             if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
     70             if s.startswith('org.apache.spark.sql.catalyst.analysis'):
     71                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)

AnalysisException: u'resolved attribute(s) date_max#67 missing from date_min#66 in operator !Project [date_min#66, date_max#67 AS date_max#106];;\n!Project [date_min#66, date_max#67 AS date_max#106]\n+- Project [date_min#66]\n +- Project [cast((cast(date_min#6L as double) / cast(1000 as double)) as timestamp) AS date_min#66, cast((cast(date_max#7L as double) / cast(1000 as double)) as timestamp) AS date_max#67]\n +- SubqueryAlias df, `df`\n +- LogicalRDD [idvisiteur#5, date_min#6L, date_max#7L, sales_sum#8, sales_count#9L]\n'

Recommended answer
Hope this helps!
from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window

# sc is assumed to be an existing SparkContext (as in the PySpark shell)
minDf = sc.parallelize([['2016-11-01 10:50:00'], ['2016-11-01 11:46:00']]).toDF(["date_min"])
maxDf = sc.parallelize([['2016-11-01 10:50:00'], ['2016-11-01 11:46:00']]).toDF(["date_max"])

# The two DataFrames share no common column, so add a row_index to each one to join on
minDf = minDf.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))
maxDf = maxDf.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))

# Join on the helper column, then drop it
minDf = minDf.join(maxDf, on=["row_index"]).drop("row_index")
minDf.show()

The output is:
+-------------------+-------------------+
|           date_min|           date_max|
+-------------------+-------------------+
|2016-11-01 10:50:00|2016-11-01 10:50:00|
|2016-11-01 11:46:00|2016-11-01 11:46:00|
+-------------------+-------------------+
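The answer boils down to a positional join: number the rows on each side, join on that number, then drop it. As a rough illustration of that logic only, here is a plain-Python sketch with no Spark involved. Lists of dicts stand in for the two DataFrames, and the helper names `add_row_index` and `join_by_position` are hypothetical, not part of any Spark API.

```python
def add_row_index(rows):
    """Pair each row with a 1-based position, mirroring row_number() in the answer."""
    return [dict(row, row_index=i) for i, row in enumerate(rows, start=1)]


def join_by_position(left_rows, right_rows):
    """Inner-join two row lists on row_index, then drop the helper key."""
    right_by_index = {r["row_index"]: r for r in add_row_index(right_rows)}
    joined = []
    for left in add_row_index(left_rows):
        right = right_by_index.get(left["row_index"])
        if right is None:
            continue  # inner join: a row with no partner is dropped
        merged = {**left, **right}
        del merged["row_index"]
        joined.append(merged)
    return joined


# Same sample values as in the answer above
min_rows = [{"date_min": "2016-11-01 10:50:00"}, {"date_min": "2016-11-01 11:46:00"}]
max_rows = [{"date_max": "2016-11-01 10:50:00"}, {"date_max": "2016-11-01 11:46:00"}]

result = join_by_position(min_rows, max_rows)
for row in result:
    print(row)
```

Two things follow from this shape. The pairing is only meaningful if both sides have a well-defined, matching row order. And because the join is an inner join (which is also the default for `join` in Spark), rows without a partner are silently dropped when the two sides differ in length.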