Group by and get the first value in Spark SQL

Updated: 2024-10-12 18:20:25
Problem description

I am doing a group-by in Spark SQL. Some rows contain the same value but have different IDs. In that case I want to select only the first row.

Here is my code.

    val highvalueresult = highvalue
      .select($"tagShortID", $"Timestamp", $"ListenerShortID", $"rootOrgID", $"subOrgID", $"RSSI_Weight_avg")
      .groupBy("tagShortID", "Timestamp")
      .agg(max($"RSSI_Weight_avg").alias("RSSI_Weight_avg"))

    val t2 = averageDF.join(highvalueresult, Seq("tagShortID", "Timestamp", "RSSI_Weight_avg"))

Here is my result.

    tag,timestamp,rssi,listner,rootorg,suborg
    2,1496745906,0.7,3878,4,3
    4,1496745907,0.6,362,4,3
    4,1496745907,0.6,718,4,3
    4,1496745907,0.6,1901,4,3

In the result above, three listeners have the same rssi value for timestamp 1496745907. In this case I want to select the first row.

Recommended answer

You can use the window functions that Spark SQL supports. Assuming your DataFrame is:

    +---+----------+----+-------+-------+------+
    |tag| timestamp|rssi|listner|rootorg|suborg|
    +---+----------+----+-------+-------+------+
    |  2|1496745906| 0.7|   3878|      4|     3|
    |  4|1496745907| 0.6|    362|      4|     3|
    |  4|1496745907| 0.6|    718|      4|     3|
    |  4|1496745907| 0.6|   1901|      4|     3|
    +---+----------+----+-------+-------+------+

Define a window specification (you can partition by and order by whichever columns you need):

    import org.apache.spark.sql.expressions.Window
    val window = Window.partitionBy("timestamp", "rssi").orderBy("timestamp")
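Note that this window orders by timestamp, which is also a partition column, so all rows within a partition tie and Spark does not guarantee which of them receives row number 1. If "first" has to be reproducible, one option is to add a tie-breaker to the ordering. The sketch below assumes that the smallest listner ID counts as "first" and also partitions by tag; both choices are assumptions, not part of the original answer. The local SparkSession and the names readings, deterministicWindow and ranked are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Local session for illustration; in spark-shell a session already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("window-tie-breaker")
  .getOrCreate()

// Sample rows copied from the question's result.
val readings = spark.createDataFrame(Seq(
  (2, 1496745906L, 0.7, 3878, 4, 3),
  (4, 1496745907L, 0.6, 362, 4, 3),
  (4, 1496745907L, 0.6, 718, 4, 3),
  (4, 1496745907L, 0.6, 1901, 4, 3)
)).toDF("tag", "timestamp", "rssi", "listner", "rootorg", "suborg")

// Assumption: within each (tag, timestamp, rssi) group, the row with the
// smallest listner ID is the one we want to treat as "first".
val deterministicWindow = Window
  .partitionBy("tag", "timestamp", "rssi")
  .orderBy(col("listner").asc)

// Number the rows of each partition according to the ordering above.
val ranked = readings.withColumn("rank", row_number().over(deterministicWindow))
```

With this ordering the numbering no longer depends on how Spark happens to lay out the tied rows, at the cost of having to decide explicitly which column defines "first".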

Apply the window function:

    res1.withColumn("rank", row_number().over(window))

    +---+----------+----+-------+-------+------+----+
    |tag| timestamp|rssi|listner|rootorg|suborg|rank|
    +---+----------+----+-------+-------+------+----+
    |  4|1496745907| 0.6|    362|      4|     3|   1|
    |  4|1496745907| 0.6|    718|      4|     3|   2|
    |  4|1496745907| 0.6|   1901|      4|     3|   3|
    |  2|1496745906| 0.7|   3878|      4|     3|   1|
    +---+----------+----+-------+-------+------+----+

Select the first row from each window:

    res5.where($"rank" === 1)

    +---+----------+----+-------+-------+------+----+
    |tag| timestamp|rssi|listner|rootorg|suborg|rank|
    +---+----------+----+-------+-------+------+----+
    |  4|1496745907| 0.6|    362|      4|     3|   1|
    |  2|1496745906| 0.7|   3878|      4|     3|   1|
    +---+----------+----+-------+-------+------+----+
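The same steps can also be written as a single SQL query. This is a minimal sketch, assuming the data is registered as a temporary view named readings (a hypothetical name) and that ties are broken by the smallest listner ID, which is an assumption the original answer does not make. The local SparkSession is there only to keep the example self-contained.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell a session already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("first-row-per-group-sql")
  .getOrCreate()

// Sample rows copied from the question's result.
val readings = spark.createDataFrame(Seq(
  (2, 1496745906L, 0.7, 3878, 4, 3),
  (4, 1496745907L, 0.6, 362, 4, 3),
  (4, 1496745907L, 0.6, 718, 4, 3),
  (4, 1496745907L, 0.6, 1901, 4, 3)
)).toDF("tag", "timestamp", "rssi", "listner", "rootorg", "suborg")

// Expose the DataFrame to SQL under a hypothetical view name.
readings.createOrReplaceTempView("readings")

// Number the rows inside each (tag, timestamp, rssi) group and keep row 1.
// ORDER BY listner is the assumed tie-breaker.
val firstRows = spark.sql(
  """
    |SELECT tag, `timestamp`, rssi, listner, rootorg, suborg
    |FROM (
    |  SELECT *,
    |         ROW_NUMBER() OVER (
    |           PARTITION BY tag, `timestamp`, rssi
    |           ORDER BY listner
    |         ) AS rn
    |  FROM readings
    |) numbered
    |WHERE rn = 1
  """.stripMargin)

firstRows.show()
```

The subquery is needed because a window function result cannot be filtered in the WHERE clause of the same SELECT that computes it.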


Published: 2023-11-22 06:14:49
Article link: https://www.elefans.com/category/jswz/34/1616281.html