I have to insert data into a target table in which every column is populated from different source tables except the surrogate key column, which should be the current maximum value in the target table plus an auto-increment value starting at 1. I can generate the auto-increment value with the row_number() function, but how do I get the maximum surrogate key from the target table in the same query? Is there any way in Hive to select the maximum surrogate key and save it in a temporary variable? Or is there some other simple way to achieve this?
Solution
Here are two approaches that worked for me for this problem (explained with an example).
Approach 1: get the maximum with a shell script and pass it to the hive commands through a ${hiveconf} variable
Approach 2: use row_sequence(), max() and a join
My Environment:
hadoop-2.6.0
apache-hive-2.0.0-bin
Steps: (note: steps 1 and 2 are common to both approaches; the approaches differ from step 3 onward)
Step 1: create source and target tables
source
hive>create table source_table1(name string);
hive>create table source_table2(name string);
hive>create table source_table3(name string);
target
hive>create table target_table(id int, name string);
Step 2: load data into source tables
hive>load data local inpath 'source_table1.txt' into table source_table1;
hive>load data local inpath 'source_table2.txt' into table source_table2;
hive>load data local inpath 'source_table3.txt' into table source_table3;
Sample Input:
source_table1.txt
a
b
c
source_table2.txt
d
e
f
source_table3.txt
g
h
i
Approach 1:
Step 3: create a shell script hive_auto_increment.sh
    #!/bin/sh
    # Load each source table in turn. The maximum id is re-read before every
    # insert so that each load continues from the rows already in target_table.
    for src in source_table1 source_table2 source_table3; do
      # coalesce() turns the NULL returned for an empty target table into 0;
      # -S (silent mode) keeps the captured output down to the value itself.
      value=$(hive -S -e 'select coalesce(max(id), 0) from target_table')
      # \${hiveconf:mx} is escaped so that Hive, not the shell, substitutes it.
      hive --hiveconf mx="$value" -e "
        add jar /home/apache-hive-2.0.0-bin/lib/hive-contrib-2.0.0.jar;
        create temporary function row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
        set hiveconf:mx;
        INSERT INTO TABLE target_table SELECT (\${hiveconf:mx} + row_sequence()), name from $src;
      "
    done
    hive -e "select * from target_table;"
Step 4: run the shell script
> bash hive_auto_increment.sh
Approach 2:
Step 3: Add Jar
hive>add jar /home/apache-hive-2.0.0-bin/lib/hive-contrib-2.0.0.jar;
Step 4: register the row_sequence function from the hive contrib jar
hive>create temporary function row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
Step 5: load source_table1 into target_table
hive>INSERT INTO TABLE target_table select row_sequence(),name from source_table1;
Step 6: load the other sources into target_table
hive>INSERT INTO TABLE target_table SELECT M.rowcount+row_sequence(),T.name from source_table2 T join (select max(id) as rowcount from target_table) M;
hive>INSERT INTO TABLE target_table SELECT M.rowcount+row_sequence(),T.name from source_table3 T join (select max(id) as rowcount from target_table) M;
output:
INFO : OK
+------------------+--------------------+--+
| target_table.id  | target_table.name  |
+------------------+--------------------+--+
| 1                | a                  |
| 2                | b                  |
| 3                | c                  |
| 4                | d                  |
| 5                | e                  |
| 6                | f                  |
| 7                | g                  |
| 8                | h                  |
| 9                | i                  |
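Since the question mentions row_number(), here is a rough sketch of a variant that does not need the contrib jar. It assumes Hive's windowing functions are available (Hive 0.11 or later) and that the table and column names match the example above. The helper names normalize_max and build_insert_sql are hypothetical, made up for this illustration. The first helper guards against the NULL that max(id) returns for an empty target table, and the second only prints the INSERT statement so it can be reviewed before it is passed to hive -e.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not part of the original answer.

# normalize_max: turn the raw output of "select max(id) from target_table"
# into a non-negative integer. Empty output, NULL, or anything else that is
# not purely digits is mapped to 0.
normalize_max() {
  value=$(printf '%s' "$1" | tr -d '[:space:]')
  case "$value" in
    ''|*[!0-9]*) echo 0 ;;
    *) echo "$value" ;;
  esac
}

# build_insert_sql: print an INSERT that offsets row_number() by the given maximum.
#   $1 = source table name (assumed to have a "name" column)
#   $2 = current maximum id, already normalized
build_insert_sql() {
  cat <<EOF
INSERT INTO TABLE target_table
SELECT $2 + row_number() OVER (ORDER BY name), name
FROM $1;
EOF
}

# Possible usage (requires a working hive CLI, so it is left commented out):
#   mx=$(normalize_max "$(hive -S -e 'select max(id) from target_table')")
#   hive -e "$(build_insert_sql source_table2 "$mx")"
```

One trade-off to be aware of: a window with ORDER BY and no PARTITION BY is evaluated over a single partition, so all rows pass through one reducer. That gives one continuous numbering, but it may be slow for very large source tables.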