Notes on Zipper Tables

Updated: 2024-10-10 02:16:14


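1. Origin of the zipper table
A zipper table (拉链表) is a table that maintains both the historical states and the latest state of the data. Depending on the zipper granularity, it drops the records that did not change between loads. From a zipper table it is easy to restore the customer records as they stood at any zipper point in time, so in practice it serves the same purpose as a snapshot.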
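2. Characteristics of a zipper table
1) It records every change to an entity, from its first appearance up to its current state;
2) Each load reports the final state of the historical records, that is, the accumulated history as of the current moment;
3) The current record holds the last change (the total) of all history before the current time;
4) Stock (balance) data is usually modeled as a zipper table (monthly, which is the most common, or daily);
5) The chain-closing time can be a far-future year such as 3000 or 9999. For example, [2022-07-01, 9999-12-31] denotes a state that is still in effect: it started on 2022-07-01 and its closing time is not yet known;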
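The interval notation in item 5 can be illustrated with a minimal Python sketch of a point-in-time lookup. The sample rows, the `state_on` helper, and the `OPEN_END` name are hypothetical, and the sketch assumes closed, non-overlapping intervals as written in `[2022-07-01, 9999-12-31]`. If a closed row's end date equals the start date of its successor, the upper bound would need a strict `<` instead.

```python
from typing import Optional

OPEN_END = "9999-12-31"  # assumed sentinel meaning "still in effect"

# Hypothetical zipper rows for one employee: (id, start_dt, end_dt, dep_id)
rows = [
    ("e1", "2022-01-01", "2022-06-30", "d10"),
    ("e1", "2022-07-01", OPEN_END, "d20"),
]


def state_on(rows, key: str, day: str) -> Optional[tuple]:
    """Return the row for `key` whose closed interval [start_dt, end_dt] contains `day`."""
    for row in rows:
        rid, start_dt, end_dt, _ = row
        # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
        if rid == key and start_dt <= day <= end_dt:
            return row
    return None


print(state_on(rows, "e1", "2022-03-15"))  # the closed historical version
print(state_on(rows, "e1", "2024-01-01"))  # the open, current version
```

Because the open row carries a far-future end date, the same range comparison finds both historical and current versions without a special case for "no end date".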
3. Scenarios where a zipper table fits

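The data volume is fairly large.
Some fields in the table get updated, such as a user's address, a bank interest rate, or an order's status.
You need to view a historical snapshot for a point in time or a time range, for example the interest rate as it stood at some past date.
The proportion and frequency of change are not high. For example, out of 10 million members in total, about 100,000 are added or changed each day.
If a full copy of such a table were kept every day, each copy would store a large amount of unchanged information, which wastes a great deal of storage;
A zipper history table both reflects the historical states of the data and saves as much storage as possible.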
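The storage argument above can be made concrete with a rough back-of-the-envelope calculation. This is only an estimate: the member and daily-change counts come from the text, while the 365-day retention window is an assumption, and the member count is treated as constant.

```python
members = 10_000_000      # total members, from the text
daily_changes = 100_000   # new plus changed rows per day, from the text
days = 365                # assumed retention window (hypothetical)

# Daily full snapshots: every member is stored again each day.
full_snapshot_rows = members * days

# Zipper table: one initial load, then one new row version per daily change.
zipper_rows = members + daily_changes * days

print(full_snapshot_rows)                          # 3650000000
print(zipper_rows)                                 # 46500000
print(round(full_snapshot_rows / zipper_rows, 1))  # 78.5
```

Under these assumptions the daily full copies hold roughly 78 times as many rows as the zipper table, and the gap widens as the retention window grows.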
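4. Example
Suppose there is an internal company employee table in which some information changes over time. We build a zipper table from it, using the end time as the partition key.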

create table if not exists test.a_hist
(
    id       string comment 'id',
    start_dt string comment 'zipper start date',
    name     string comment 'name',
    dep_id   string comment 'department id',
    type     string comment 'currently employed | Y, N'
)
comment 'zipper table'
partitioned by (end_dt string comment 'partition')
stored as orc
tblproperties ('orc.compress' = 'SNAPPY');

-- 1. Compare the day's full load with the 3000-12-31 records to extract
--    purely new rows and changed rows
drop table if exists test_tmp.tmp_a_inc;
create table test_tmp.tmp_a_inc stored as orc as
select p1.id, '${last_date}' as start_dt, p1.name, p1.dep_id, p1.type
from test_dim.a_df p1
where p1.pt_d = '${last_date}'
  and p1.id is not null  -- assumes id is the key column (the source filtered on biz_id here)
  and not exists (
      select 1
      from test.a_hist p2
      where p2.end_dt = '3000-12-31'
        and p2.id is not null
        and p1.id = p2.id
        and coalesce(p1.name, '-1') = coalesce(p2.name, '-1')
        and coalesce(p1.dep_id, '-1') = coalesce(p2.dep_id, '-1')
        and coalesce(p1.type, '-1') = coalesce(p2.type, '-1')
  );

-- 2. Records in 3000-12-31 that did not change
drop table if exists test_tmp.tmp_a_stock;
create table test_tmp.tmp_a_stock stored as orc as
select p1.id, p1.start_dt, p1.name, p1.dep_id, p1.type
from test.a_hist p1
where p1.end_dt = '3000-12-31'
  and p1.id is not null
  and not exists (select 1 from test_tmp.tmp_a_inc p2 where p1.id = p2.id);

-- 3. Records in 3000-12-31 that did change
drop table if exists test_tmp.tmp_a_upd;
create table test_tmp.tmp_a_upd stored as orc as
select p1.id, p1.start_dt, p1.name, p1.dep_id, p1.type
from test.a_hist p1
where p1.end_dt = '3000-12-31'
  and p1.id is not null
  and p1.start_dt <> '${last_date}'
  and exists (select 1 from test_tmp.tmp_a_inc p2 where p1.id = p2.id);

-- Insert the changed historical records into the end_dt = '${last_date}' partition
-- Insert the new 3000-12-31 records into the end_dt = '3000-12-31' partition
-- (a fully dynamic partition insert needs hive.exec.dynamic.partition.mode=nonstrict)
insert overwrite table test.a_hist partition (end_dt)
select *
from (
    select *, '${last_date}' as end_dt from test_tmp.tmp_a_upd
    union all
    select *, '3000-12-31' as end_dt from test_tmp.tmp_a_inc
    union all
    select *, '3000-12-31' as end_dt from test_tmp.tmp_a_stock
) x
distribute by x.end_dt;
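To make the three-step merge easier to follow, here is a minimal in-memory Python sketch of the same logic. It is an illustration under stated assumptions rather than a replacement for the Hive job: the `merge_day` function, the dict-based rows, and the sample data are hypothetical, and partition overwrite behavior is not modeled.

```python
OPEN_END = "3000-12-31"  # open-chain sentinel used in the example
ATTRS = ("name", "dep_id", "type")


def merge_day(hist, snapshot, last_date):
    """Merge one day's full snapshot into a zipper table held as a list of dicts."""
    closed = [r for r in hist if r["end_dt"] != OPEN_END]  # already closed history
    open_rows = {r["id"]: r for r in hist if r["end_dt"] == OPEN_END}

    # Step 1: rows that are new, or whose attributes differ from the open version.
    inc = [
        {"id": s["id"], "start_dt": last_date, **{a: s[a] for a in ATTRS}}
        for s in snapshot
        if s["id"] not in open_rows
        or any(s[a] != open_rows[s["id"]][a] for a in ATTRS)
    ]
    inc_ids = {r["id"] for r in inc}

    # Step 2: open rows with no change stay open.
    stock = [r for r in open_rows.values() if r["id"] not in inc_ids]

    # Step 3: open rows that changed get closed; rows opened on last_date
    # itself are skipped, mirroring the start_dt <> last_date filter.
    upd = [
        r for r in open_rows.values()
        if r["id"] in inc_ids and r["start_dt"] != last_date
    ]

    return (
        closed
        + [{**r, "end_dt": last_date} for r in upd]
        + [{**r, "end_dt": OPEN_END} for r in inc]
        + stock
    )


hist = [
    {"id": "1", "start_dt": "2024-01-01", "name": "Ann", "dep_id": "d1", "type": "Y", "end_dt": OPEN_END},
    {"id": "2", "start_dt": "2024-01-01", "name": "Bob", "dep_id": "d1", "type": "Y", "end_dt": OPEN_END},
]
snapshot = [
    {"id": "1", "name": "Ann", "dep_id": "d2", "type": "Y"},  # changed department
    {"id": "2", "name": "Bob", "dep_id": "d1", "type": "Y"},  # unchanged
    {"id": "3", "name": "Cid", "dep_id": "d3", "type": "Y"},  # new employee
]
new_hist = merge_day(hist, snapshot, "2024-01-02")
for row in sorted(new_hist, key=lambda r: (r["id"], r["start_dt"])):
    print(row["id"], row["start_dt"], row["end_dt"], row["dep_id"])
```

In this sample, employee 1 ends up with two versions (the old one closed on 2024-01-02, the new one open), employee 2 keeps a single open row, and employee 3 appears as a new open row.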

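Here last_date is equivalent to yesterday.
Now suppose the previous day's data turns out to be wrong and the job has already finished running. How can the table be recovered?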

Records with end_dt = '${last_date}' need to be restored to the 3000-12-31 partition.
Records with end_dt = '3000-12-31' and start_dt = '${last_date}' need to be removed.
drop table if exists test_tmp.tmp_a_recover;
create table test_tmp.tmp_a_recover stored as orc as
-- records closed by the bad run, to be reopened
select p1.id, p1.start_dt, p1.name, p1.dep_id, p1.type
from test.a_hist p1
where p1.end_dt = '${last_date}'
union all
-- open records that the bad run did not create
select p1.id, p1.start_dt, p1.name, p1.dep_id, p1.type
from test.a_hist p1
where p1.end_dt = '3000-12-31'
  and p1.start_dt <> '${last_date}';

insert overwrite table test.a_hist partition (end_dt = '3000-12-31')
select * from test_tmp.tmp_a_recover;

alter table test.a_hist drop partition (end_dt = '${last_date}');

Run the recovery code above first, then re-run the daily load code at the top.
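The two recovery rules can also be sketched in memory. The following Python snippet is a simplified illustration, assuming rows are held as dicts; the `rollback_day` function and the sample data are hypothetical.

```python
OPEN_END = "3000-12-31"  # open-chain sentinel used in the example


def rollback_day(hist, last_date):
    """Undo one daily load: reopen rows it closed, drop rows it created."""
    restored = []
    for r in hist:
        if r["end_dt"] == last_date:
            # Closed by the bad run: move back to the open partition.
            restored.append({**r, "end_dt": OPEN_END})
        elif r["end_dt"] == OPEN_END and r["start_dt"] == last_date:
            # Created by the bad run: remove.
            continue
        else:
            restored.append(r)
    return restored


# Hypothetical state after a bad load for 2024-01-02.
hist = [
    {"id": "1", "start_dt": "2024-01-01", "end_dt": "2024-01-02", "dep_id": "d1"},
    {"id": "1", "start_dt": "2024-01-02", "end_dt": OPEN_END, "dep_id": "d2"},
    {"id": "2", "start_dt": "2024-01-01", "end_dt": OPEN_END, "dep_id": "d1"},
    {"id": "3", "start_dt": "2024-01-02", "end_dt": OPEN_END, "dep_id": "d3"},
]
recovered = rollback_day(hist, "2024-01-02")
for row in recovered:
    print(row["id"], row["start_dt"], row["end_dt"], row["dep_id"])
```

After the rollback only the state before the bad load remains, so the daily load can be run again against corrected source data.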


Published: 2024-02-14 11:04:11

Article link: https://www.elefans.com/category/jswz/34/1762980.html