Thursday 30 June 2016

Handling skewed data in Hive

Method 1: 
Identify the skew value and run the job for that value separately.
Example: if cust_id=100 causes the skew, split the records into cust_id=100 and cust_id!=100, then run a separate job for each set.
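To make the split concrete, here is a minimal plain-Python sketch of the idea (not Hive code). The sample rows, the `customers` lookup and the `join` helper are made up for illustration; in Hive the two parts would be separate queries whose results are combined, for example with UNION ALL.

```python
# Hypothetical sample data: (cust_id, amount) orders and a cust_id -> name lookup.
orders = [(100, 5.0), (100, 7.5), (100, 1.0), (7, 2.0), (8, 3.0)]
customers = {100: "acme", 7: "bolt", 8: "core"}

SKEWED_ID = 100  # the value identified as skewed


def join(rows, lookup):
    """Inner-join order rows against the customer lookup."""
    return [(cid, amt, lookup[cid]) for cid, amt in rows if cid in lookup]


# Split the input into the skewed value and everything else.
hot = [row for row in orders if row[0] == SKEWED_ID]
rest = [row for row in orders if row[0] != SKEWED_ID]

# Each part can run as its own job; the outputs are then concatenated.
split_result = join(hot, customers) + join(rest, customers)

# Reference: the same join done in one pass over all rows.
single_result = join(orders, customers)
```

The split run produces the same rows as the single run; only the way the work is divided changes.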
Method 2:
Identify the column that creates the skew. If it is used in a join, try to reduce the skew by joining on multiple columns instead of on that column alone.
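A small plain-Python sketch of why a multi-column key helps. The `region` column and the sample rows are assumptions for illustration, and the approach only applies when both sides of the join have the extra column and matching rows agree on it.

```python
from collections import Counter

# Hypothetical rows: (cust_id, region). On cust_id alone, 100 dominates.
rows = [
    (100, "east"), (100, "west"), (100, "north"), (100, "south"),
    (7, "east"), (8, "west"),
]

# Count how many rows share each key under the two keying schemes.
by_single = Counter(cust_id for cust_id, _ in rows)
by_composite = Counter((cust_id, region) for cust_id, region in rows)

# Size of the busiest key in each scheme.
largest_single = max(by_single.values())
largest_composite = max(by_composite.values())
```

With the single-column key the busiest key holds four of the six rows; with the two-column key no key holds more than one.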
Method 3:
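Modify the join key (salting). A simple approach that I sometimes follow is appending a small suffix such as (n%3) to the end of the key, where n is a value that varies across the rows sharing that key. Example: key1 can be divided into multiple values like key1_1, key1_2, etc. This helps distribute the rows of a hot key. The other table in the join must then carry every suffixed variant of the key so that the rows still match.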
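Here is a minimal plain-Python sketch of salting (again not Hive code). The tables, the per-row counter used as the salt source and the name `NUM_SALTS` are assumptions for illustration; a random number or another column could supply the suffix just as well.

```python
from collections import Counter

NUM_SALTS = 3  # matches the "% 3" used in the text

# Hypothetical data: a large table dominated by "key1" and a small lookup table.
fact = [("key1", i) for i in range(9)] + [("key2", 100)]
dim = {"key1": "A", "key2": "B"}

# Salt the skewed side: the suffix comes from a per-row counter,
# so rows that share a key end up with different suffixes.
salted_fact = [(f"{k}_{n % NUM_SALTS}", k, v) for n, (k, v) in enumerate(fact)]

# Replicate the other side once per suffix so every salted key finds a match.
salted_dim = {
    f"{k}_{s}": label for k, label in dim.items() for s in range(NUM_SALTS)
}

# Join on the salted key, then drop the suffix from the output.
joined = [(k, v, salted_dim[sk]) for sk, k, v in salted_fact]

# How many rows land on each salted join key.
rows_per_join_key = Counter(sk for sk, _, _ in salted_fact)
```

The nine key1 rows are spread over three join keys with three rows each, while the joined output is the same as an unsalted join would give. The cost is that the smaller table grows by a factor of `NUM_SALTS`.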
Method 4:
Divide the data into chunks and run the job on each chunk.
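A plain-Python sketch of chunked execution, assuming the job is an aggregation whose partial results can be added together. The fixed chunk size, the `chunks` helper and the sample rows are hypothetical; in practice a chunk might be a date range or a partition.

```python
from collections import Counter


def chunks(rows, size):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Hypothetical input: (key, value) pairs, heavily skewed towards "key1".
rows = [("key1", 1)] * 6 + [("key2", 1)] * 2

totals = Counter()
for chunk in chunks(rows, 3):
    # Each chunk is processed on its own, as a separate job would be.
    partial = Counter()
    for key, value in chunk:
        partial[key] += value
    # Merge the chunk's partial sums into the overall result.
    totals.update(partial)
```

This only works directly for results that can be merged afterwards, such as sums and counts.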
