Thursday 10 October 2013

SCD Type1 implementation in Hive

Consider below tables with partitions.

Stage table Final Table
p1 p1
p2 p2
p5 p3
p4

stage_partition left join final_partition on partition columns
Set1: Consider the data from stage table where there is a match - Updates are available for the data in these partitions
Set2: Consider the data from stage table where there is no match - New records (Inserts)

create table final_partition_updates as select * from final_table where partition_col in (set1);
create table stage_partition_updates as select * from stage_table where partition_col in (set1);

WITH pkeyJoin as select a.*,b.* from final_partition_updates a full join stage_partition_updates b
on a.pkey = b.pkey

create table upserts as
select b.* from pkeyJoin where a.pkey is not null  --Updates
UNION ALL
select b.* from pkeyJoin where a.pkey is null; --Inserts

Insert into table upserts select * from stage_table where partition_col in (set2);

Insert overwrite final_table PARTITION(partition column) SELECT * from upserts;

 A good reference for Shell scripting  https://linuxcommand.org/lc3_writing_shell_scripts.php