函数"/>
Hive中的窗口分析函数
文章目录
- 环境准备
- SUM()、Coun()、Max()、Min()、Avg()
- ROW_NUMBER、RANK()、DENSE_RANK()
- NTILE(num)
- CUME_DIST()、PERCENT_RANK()
- LAG() 函数
环境准备
准备工作:创建表:
CREATE EXTERNAL TABLE cookie_pv (
cookieid string,
createtime string,
pv INT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
加载数据:
load data local inpath '/home/hadoop/data/cookie.log' into table hive_function_analyze.cookie_pvselect * from cookie_pvcookie1 2017-04-02 8
cookie1 2017-04-03 5
cookie1 2017-04-04 12
cookie1 2017-04-05 13
cookie1 2017-04-06 3
cookie1 2017-04-07 9
cookie1 2017-04-08 12
cookie1 2017-04-09 18
。。。
SUM()、Coun()、Max()、Min()、Avg()
sum()、Count()、Max()、Min()、Avg()函数结合Over函数进行使用,over函数用于确定窗口的大小
over()函数相当于是设定一个窗口,按照窗口进行统计。
over()函数内部只有 partition by 的时候,窗口的大小是分组内的所有行。
over()函数内部存在 partition by 和 order by的时候,窗口的大小是起始行到当前行。
over()函数内部存在 partition by 和 order by的时候并且加上between and的限制条件的时候,则按照条件进行确定窗口的大小。
select * from cookie_pv
select
cookieid,
createtime,
pv,
--分组内所有行
SUM(pv) OVER(PARTITION BY cookieid) as pv1,
-- 默认为从起点到当前行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) as pv2,
--从起点到当前行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) AS pv3,
--当前行+往前3行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND
CURRENT ROW) AS pv4,
--当前行+往前3行+往后1行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1
FOLLOWING) AS pv5,
---当前行+往后所有行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND
UNBOUNDED FOLLOWING) AS pv6
from cookie_pv
Count、Max、Min、Avg等用法和Sum函数相同
ROW_NUMBER、RANK()、DENSE_RANK()
- -ROW_NUMBER() 生成row的行号,比如1,2,3,4,5,6,7,8,9
- -RANK() 排名相等则会在相应的位次中留下空位,比如1,2,3,3,5,6
- -DENSE_RANK() 排名相等则不会在相应的位次中留下空位,比如1,2,3,3,4,5
select
cookieid,
createtime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv DESC) as pv1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv DESC) as pv2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) as pv3
FROM cookie_pvcookie1 2017-04-10 23 1 1 1
cookie1 2017-04-09 18 2 2 2
cookie1 2017-04-13 17 3 3 3
cookie1 2017-04-05 13 4 4 4
cookie1 2017-04-11 12 5 5 5
cookie1 2017-04-04 12 5 5 6
cookie1 2017-04-08 12 5 5 7
cookie1 2017-04-12 11 8 6 8
cookie1 2017-04-07 9 9 7 9
cookie1 2017-04-02 8 10 8 10
cookie1 2017-04-14 7 11 9 11
案例:分组求Top N
--分组求top 3select * from (
select cookieid,
createtime,
pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) as rn
FROM
cookie_pv
) T
WHERE T.rn <4
NTILE(num)
- 将分组数据按照顺序进行切成num片
select
cookieid,
createtime,pv,
NTILE(2) OVER(PARTITION BY cookieid ORDER BY pv DESC) PV1,
NTILE(3) OVER(PARTITION BY cookieid ORDER BY pv DESC) PV2,
NTILE(4) OVER(PARTITION BY cookieid ORDER BY pv DESC) PV3,
NTILE(4) OVER(PARTITION BY cookieid ORDER BY pv DESC) PV4
FROM cookie_pvcookie1 2017-04-10 23 1 1 1 1
cookie1 2017-04-09 18 1 1 1 1
cookie1 2017-04-13 17 1 1 1 1
cookie1 2017-04-05 13 1 1 1 1
cookie1 2017-04-11 12 1 1 2 2
cookie1 2017-04-04 12 1 2 2 2
cookie1 2017-04-08 12 1 2 2 2
cookie1 2017-04-12 11 2 2 2 2
cookie1 2017-04-07 9 2 2 3 3
cookie1 2017-04-02 8 2 2 3 3
cookie1 2017-04-14 7 2 3 3 3
cookie1 2017-04-03 5 2 3 4 4
cookie1 2017-04-06 3 2 3 4 4
cookie1 2017-04-15 2 2 3 4 4
CUME_DIST()、PERCENT_RANK()
- CUME_DIST 小于等于当前行值的行数 / 分组内的总行数
- PERCENT_RANK 分组内当前行的rank -1 / 分组内总行数 - 1
cookie2 2017-04-10 23 0.1 0
cookie2 2017-04-09 18 0.2 0.1111111111111111
cookie2 2017-04-13 17 0.3 0.2222222222222222
cookie2 2017-04-05 13 0.4 0.3333333333333333
cookie2 2017-04-04 12 0.7 0.4444444444444444
cookie2 2017-04-08 12 0.7 0.4444444444444444
cookie2 2017-04-11 12 0.7 0.4444444444444444
cookie2 2017-04-12 11 0.8 0.7777777777777778
cookie2 2017-04-07 9 0.9 0.8888888888888888
cookie2 2017-04-02 8 1 1
LAG() 函数
窗口内往上取N行的值 ,如果有设置default到取default,如果没有设置default就取NULL
select
cookieid,
createtime,
pv,
LAG(createtime) OVER(PARTITION BY cookieid ORDER BY pv DESC) PV1,
LAG(createtime,3,1) OVER(PARTITION BY cookieid ORDER BY pv DESC) PV1,
LAG(createtime,3,2) OVER(PARTITION BY cookieid ORDER BY pv DESC) PV2
from cookie_pvcookie1 2017-04-10 23 NULL 1 2
cookie1 2017-04-09 18 2017-04-10 1 2
cookie1 2017-04-13 17 2017-04-09 1 2
cookie1 2017-04-05 13 2017-04-13 2017-04-10 2017-04-10
cookie1 2017-04-11 12 2017-04-05 2017-04-09 2017-04-09
cookie1 2017-04-04 12 2017-04-11 2017-04-13 2017-04-13
cookie1 2017-04-08 12 2017-04-04 2017-04-05 2017-04-05
cookie1 2017-04-12 11 2017-04-08 2017-04-11 2017-04-11
cookie1 2017-04-07 9 2017-04-12 2017-04-04 2017-04-04
更多推荐
Hive中的窗口分析函数
发布评论