I have a Spark dataframe like this:
word1 word2 co-occur
----- ----- --------
w1    w2    10
w2    w1    15
w2    w3    11

My expected result is:
word1 word2 co-occur
----- ----- --------
w1    w2    25
w2    w3    11

I tried the dataframe's groupBy and aggregate functions, but I couldn't come up with a solution.
Answer

You need a single column containing both words in sorted order; this column can then be used for the groupBy. You can create a new column with an array containing word1 and word2 as follows:
df.withColumn("words", sort_array(array($"word1", $"word2")))
  .groupBy("words")
  .agg(sum($"co-occur").as("co-occur"))

This produces the following result:
words       co-occur
-----       --------
["w1","w2"] 25
["w2","w3"] 11

If you would like to have both words as separate dataframe columns, use the getItem method afterwards. For the above example, add the following lines:
df.withColumn("word1", $"words".getItem(0))
  .withColumn("word2", $"words".getItem(1))
  .drop($"words")

The final resulting dataframe would look like this:
word1 word2 co-occur
----- ----- --------
w1    w2    25
w2    w3    11
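Putting the steps above together, a minimal end-to-end sketch in Scala (assuming a local SparkSession; the object name CoOccurSum and the local[*] master are illustrative choices, the data and column names come from the example above):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, sort_array, sum}

object CoOccurSum {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration; in a real job
    // the master is usually supplied by spark-submit.
    val spark = SparkSession.builder()
      .appName("co-occur-sum")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The input dataframe from the question.
    val df = Seq(
      ("w1", "w2", 10),
      ("w2", "w1", 15),
      ("w2", "w3", 11)
    ).toDF("word1", "word2", "co-occur")

    val result = df
      // Normalize each pair into a sorted array so (w1,w2) and (w2,w1) match.
      .withColumn("words", sort_array(array($"word1", $"word2")))
      .groupBy("words")
      .agg(sum($"co-occur").as("co-occur"))
      // Split the array back into separate word columns.
      .withColumn("word1", $"words".getItem(0))
      .withColumn("word2", $"words".getItem(1))
      .drop($"words")

    // Two rows: w1/w2 with 25 and w2/w3 with 11 (row order not guaranteed).
    result.show()

    spark.stop()
  }
}
```

Note that after the drop, the column order is co-occur, word1, word2; add a select("word1", "word2", "co-occur") at the end if the original column order matters.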