Hive中的窗口分析函数

编程入门 行业动态 更新时间:2024-10-23 19:27:41

Hive中的窗口分析<a href=https://www.elefans.com/category/jswz/34/1771370.html style=函数"/>

Hive中的窗口分析函数

文章目录

        • 环境准备
        • SUM()、Coun()、Max()、Min()、Avg()
        • ROW_NUMBER、RANK()、DENSE_RANK()
        • NTILE(num)
        • CUME_DIST()、PERCENT_RANK()
        • LAG() 函数

环境准备

准备工作:创建表:

CREATE EXTERNAL TABLE cookie_pv (  
cookieid string,  
createtime string,  
pv INT  
) ROW FORMAT DELIMITED   
FIELDS TERMINATED BY ',' 

加载数据:

load data local inpath '/home/hadoop/data/cookie.log' into  table hive_function_analyze.cookie_pvselect * from cookie_pvcookie1	2017-04-02	8
cookie1	2017-04-03	5
cookie1	2017-04-04	12
cookie1	2017-04-05	13
cookie1	2017-04-06	3
cookie1	2017-04-07	9
cookie1	2017-04-08	12
cookie1	2017-04-09	18
。。。

SUM()、Coun()、Max()、Min()、Avg()

sum()、Count()、Max()、Min()、Avg()函数结合Over函数进行使用,over函数用于确定窗口的大小
over()函数相当于是设定一个窗口,按照窗口进行统计。
over()函数内部只有 partition by 的时候,窗口的大小是分组内的所有行。
over()函数内部存在 partition by 和 order by的时候,窗口的大小是起始行到当前行。
over()函数内部存在 partition by 和 order by的时候并且加上between and的限制条件的时候,则按照条件进行确定窗口的大小。

select * from cookie_pv
select
cookieid,
createtime,
pv,
--分组内所有行
SUM(pv) OVER(PARTITION BY cookieid) as pv1,	
-- 默认为从起点到当前行																		  
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) as pv2,		
--从起点到当前行											    
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING 
AND CURRENT ROW) AS pv3,    
--当前行+往前3行 
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 
CURRENT ROW) AS pv4,		
--当前行+往前3行+往后1行  	 
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 
FOLLOWING) AS pv5,  	
---当前行+往后所有行    		
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND 
UNBOUNDED FOLLOWING) AS pv6  	
from cookie_pv

Count、Max、Min、Avg等用法和Sum函数相同


ROW_NUMBER、RANK()、DENSE_RANK()

  • -ROW_NUMBER() 生成row的行号,比如1,2,3,4,5,6,7,8,9
  • -RANK() 排名相等则会在相应的位次中留下空位,比如1,2,3,3,5,6
  • -DENSE_RANK() 排名相等则不会在相应的位次中留下空位,比如1,2,3,3,4,5
select
cookieid,
createtime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv DESC) as pv1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv DESC) as pv2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) as pv3
FROM cookie_pvcookie1	2017-04-10	23	1	1	1
cookie1	2017-04-09	18	2	2	2
cookie1	2017-04-13	17	3	3	3
cookie1	2017-04-05	13	4	4	4
cookie1	2017-04-11	12	5	5	5
cookie1	2017-04-04	12	5	5	6
cookie1	2017-04-08	12	5	5	7
cookie1	2017-04-12	11	8	6	8
cookie1	2017-04-07	9	9	7	9
cookie1	2017-04-02	8	10	8	10
cookie1	2017-04-14	7	11	9	11

案例:分组求Top N

--分组求top 3select * from (
select cookieid,
createtime,
pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) as rn
FROM 
cookie_pv
) T
WHERE T.rn <4

NTILE(num)

  • 将分组数据按照顺序进行切成num片
select 
cookieid,
createtime,pv, 
NTILE(2) OVER(PARTITION BY cookieid ORDER BY  pv DESC) PV1,
NTILE(3) OVER(PARTITION BY cookieid ORDER BY  pv DESC) PV2,
NTILE(4) OVER(PARTITION BY cookieid ORDER BY  pv DESC) PV3,
NTILE(4) OVER(PARTITION BY cookieid ORDER BY pv DESC)  PV4
FROM cookie_pvcookie1	2017-04-10	23	1	1	1	1
cookie1	2017-04-09	18	1	1	1	1
cookie1	2017-04-13	17	1	1	1	1
cookie1	2017-04-05	13	1	1	1	1
cookie1	2017-04-11	12	1	1	2	2
cookie1	2017-04-04	12	1	2	2	2
cookie1	2017-04-08	12	1	2	2	2
cookie1	2017-04-12	11	2	2	2	2
cookie1	2017-04-07	9	2	2	3	3
cookie1	2017-04-02	8	2	2	3	3
cookie1	2017-04-14	7	2	3	3	3
cookie1	2017-04-03	5	2	3	4	4
cookie1	2017-04-06	3	2	3	4	4
cookie1	2017-04-15	2	2	3	4	4

CUME_DIST()、PERCENT_RANK()

  • CUME_DIST 小于等于当前行值的行数 / 分组内的总行数
  • PERCENT_RANK 分组内当前行的rank -1 / 分组内总行数 - 1
cookie2	2017-04-10	23	0.1	0
cookie2	2017-04-09	18	0.2	0.1111111111111111
cookie2	2017-04-13	17	0.3	0.2222222222222222
cookie2	2017-04-05	13	0.4	0.3333333333333333
cookie2	2017-04-04	12	0.7	0.4444444444444444
cookie2	2017-04-08	12	0.7	0.4444444444444444
cookie2	2017-04-11	12	0.7	0.4444444444444444
cookie2	2017-04-12	11	0.8	0.7777777777777778
cookie2	2017-04-07	9	0.9	0.8888888888888888
cookie2	2017-04-02	8	1	1

LAG() 函数

窗口内往上取N行的值 ,如果有设置default到取default,如果没有设置default就取NULL

select 
cookieid,
createtime,
pv,
LAG(createtime) OVER(PARTITION BY cookieid ORDER BY  pv DESC) PV1,
LAG(createtime,3,1) OVER(PARTITION BY cookieid ORDER BY  pv DESC) PV1,
LAG(createtime,3,2) OVER(PARTITION BY cookieid ORDER BY  pv DESC) PV2
from cookie_pvcookie1	2017-04-10	23	NULL		1			2
cookie1	2017-04-09	18	2017-04-10	1			2
cookie1	2017-04-13	17	2017-04-09	1			2
cookie1	2017-04-05	13	2017-04-13	2017-04-10	2017-04-10
cookie1	2017-04-11	12	2017-04-05	2017-04-09	2017-04-09
cookie1	2017-04-04	12	2017-04-11	2017-04-13	2017-04-13
cookie1	2017-04-08	12	2017-04-04	2017-04-05	2017-04-05
cookie1	2017-04-12	11	2017-04-08	2017-04-11	2017-04-11
cookie1	2017-04-07	9	2017-04-12	2017-04-04	2017-04-04

更多推荐

Hive中的窗口分析函数

本文发布于:2023-07-28 20:15:59,感谢您对本站的认可!
本文链接:https://www.elefans.com/category/jswz/34/1298260.html
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
本文标签:函数   窗口   Hive

发布评论

评论列表 (有 0 条评论)
草根站长

>www.elefans.com

编程频道|电子爱好者 - 技术资讯及电子产品介绍!