I have a Spark DataFrame that looks like:
| id | value | bin |
|----|-------|-----|
| 1  | 3.4   | 2   |
| 2  | 2.6   | 1   |
| 3  | 1.8   | 1   |
| 4  | 9.6   | 2   |
I have a function f that takes an array of values and returns a number. I want to add a column to the above DataFrame where the value of the new column in each row is f applied to all the value entries that share the same bin entry, i.e.:
| id | value | bin | f_value       |
|----|-------|-----|---------------|
| 1  | 3.4   | 2   | f([3.4, 9.6]) |
| 2  | 2.6   | 1   | f([2.6, 1.8]) |
| 3  | 1.8   | 1   | f([2.6, 1.8]) |
| 4  | 9.6   | 2   | f([3.4, 9.6]) |
Since I need to aggregate all values per bin, I cannot use the withColumn function to add this new column. What is the best way to do this until user-defined aggregation functions make their way into Spark?
Answer
The code below is not tested, but it gives the idea.
In Hive, this can be done using the collect_list function:
```scala
val newDF = sqlContext.sql(
  "select bin, collect_list(value) as values from aboveDF group by bin")
```

Then join `aboveDF` and `newDF` on `bin` and you are done.
Is this what you were looking for?
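A fuller, self-contained sketch of the collect_list-plus-join approach, using the DataFrame API of more recent Spark versions. It is untested against your setup; `f` is stood in by a summing UDF purely for illustration, and the names `FPerBin` and `perBin` are made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, udf}

object FPerBin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("f-per-bin")
      .getOrCreate()
    import spark.implicits._

    val aboveDF = Seq((1, 3.4, 2), (2, 2.6, 1), (3, 1.8, 1), (4, 9.6, 2))
      .toDF("id", "value", "bin")

    // Hypothetical stand-in for f: sums the collected values.
    val f = udf((xs: Seq[Double]) => xs.sum)

    // Aggregate the values per bin, apply f, then join back onto the original rows.
    val perBin = aboveDF.groupBy("bin").agg(f(collect_list($"value")).as("f_value"))
    val result = aboveDF.join(perBin, "bin").select("id", "value", "bin", "f_value")
    result.show()
    spark.stop()
  }
}
```

The join duplicates the per-bin result onto every row of that bin, which is exactly the shape your target table has.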