Thursday 30 June 2016

Handling skewed data in Hive

Method 1:
Identify the skewed value and run the job for that value separately.
Ex: if cust_id=100 is causing the skew, split the records into cust_id=100 and cust_id!=100 and run the two jobs individually, as in the sketch below.
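
A minimal HiveQL sketch of this split, assuming illustrative tables orders and customers joined on cust_id and a hypothetical target table order_report:

-- Job 1: everything except the skewed key; this distributes normally across reducers.
INSERT INTO TABLE order_report
SELECT o.cust_id, o.order_id, c.cust_name
FROM orders o
JOIN customers c ON o.cust_id = c.cust_id
WHERE o.cust_id != 100;

-- Job 2: only the skewed key; a much smaller job that can be tuned separately
-- (for example, forced into a map-side join).
INSERT INTO TABLE order_report
SELECT o.cust_id, o.order_id, c.cust_name
FROM orders o
JOIN customers c ON o.cust_id = c.cust_id
WHERE o.cust_id = 100;
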
Method 2:
Identify the column causing the skew. If it is used in a join, try spreading the skew by joining on multiple columns instead of that column alone (see the sketch below).
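
For example (a sketch with illustrative table and column names), if both sides of the join also carry an order_date column, joining on (cust_id, order_date) instead of cust_id alone spreads a hot cust_id across more distinct reducer keys:

SELECT o.cust_id, o.order_id, p.payment_id
FROM orders o
JOIN payments p
  ON o.cust_id = p.cust_id
 AND o.order_date = p.order_date;

This only helps when the extra column is genuinely part of the join semantics; otherwise the result of the join changes.
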
Method 3:
Modify the join key (salting). A simple approach I sometimes follow is appending a small salt such as (key % 3) to the key, so key1 becomes multiple values like key1_1, key1_2, etc. This helps distribute the keys across reducers; a sketch follows.
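
A rough HiveQL sketch of the salting idea (table and column names are illustrative): on the big, skewed side the salt is derived from a second column so a single hot cust_id is spread across three buckets, and the small side is replicated once per salt value so every salted key still finds its match.

SELECT b.cust_id, b.order_id, s.cust_name
FROM (
  SELECT o.*,
         concat(cast(o.cust_id AS string), '_', cast(pmod(hash(o.order_id), 3) AS string)) AS salted_key
  FROM orders o          -- large, skewed side
) b
JOIN (
  SELECT c.*,
         concat(cast(c.cust_id AS string), '_', cast(t.salt AS string)) AS salted_key
  FROM customers c       -- small side, replicated 3x
  LATERAL VIEW explode(array(0, 1, 2)) t AS salt
) s
ON b.salted_key = s.salted_key;
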
Method 4:
Divide the data into smaller chunks (for example, by date range) and run the job on each chunk, as in the sketch below.
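
For instance (a sketch reusing the illustrative tables above), the same join can be run once per date range instead of over the whole table:

-- Chunk 1: January only; repeat (or script) the statement for the remaining ranges.
INSERT INTO TABLE order_report
SELECT o.cust_id, o.order_id, c.cust_name
FROM orders o
JOIN customers c ON o.cust_id = c.cust_id
WHERE o.order_date >= '2016-01-01' AND o.order_date < '2016-02-01';
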

Sunday 26 June 2016

Sqooping from/to MySQL, HDFS and Hive
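
For context, the commands below use a MySQL database named export with a table called exported. The real schema is not shown in these notes; the sketch below is only inferred from the column mappings and merge keys used later and may differ from the actual table.

CREATE TABLE exported (
  row_key INT PRIMARY KEY,
  value   VARCHAR(255),
  ts      TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);
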

Sqoop Import
From a MySQL table to HDFS

sqoop import --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --table exported --target-dir '/user/horton/exported' -m 1

From a MySQL table to Hive

sqoop import --hive-import --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --table exported --hive-table test.exported3 -m 1

From a MySQL table to Hive with the Hive column types explicitly mapped

sqoop import --hive-import --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --table exported --hive-table test.exported3 -m 1 --map-column-hive 'row_key=int,value=string,ts=timestamp'

From a MySQL table to Hive with custom field delimiters

sqoop import --hive-import --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --table exported --hive-table test.exported2 -m 1 --fields-terminated-by ','

Free-form query import

sqoop import --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --query 'select * from exported where $CONDITIONS' --target-dir '/user/horton/test_query' -m 1

Incremental import into HDFS

sqoop import --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --table exported --target-dir '/user/horton/test_incremental' -m 1 --check-column ts --incremental lastmodified --last-value '2017-01-24 23:11:16.0' --merge-key row_key

Incremental import into Hive

sqoop import --hive-import --connect jdbc:mysql://127.0.0.1/export --username hive --password hive --driver com.mysql.jdbc.Driver --table exported --hive-table test.exported1 -m 1 --check-column ts --incremental lastmodified --last-value '2017-01-24 23:11:10.0' --merge-key row_key

Incremental import into HDFS from a query

sqoop import --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --query 'select * from exported where $CONDITIONS' --target-dir '/user/horton/test_incremental/t2' -m 1 --check-column ts --incremental lastmodified --last-value '2017-01-24 23:11:16.0' --merge-key row_key

Incremental import into Hive from a query is not supported.
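
One possible workaround (a sketch, assuming the incremental query import above lands comma-delimited text in /user/horton/test_incremental/t2 with the three-column layout used earlier) is to point an external Hive table at the HDFS directory instead of using --hive-import:

CREATE EXTERNAL TABLE test.exported_query (
  row_key INT,
  value STRING,
  ts TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/horton/test_incremental/t2';
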

Sqoop Export
Insert/export from HDFS

sqoop export --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --table exported --export-dir /apps/hive/warehouse/test.db/exported1  -m 1 --input-fields-terminated-by '\001'

Insert/export from Hive

sqoop export --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --table exported --hcatalog-database test --hcatalog-table exported3 --input-fields-terminated-by '\001' -m 1

Update from HDFS

sqoop export --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --table exported --export-dir /apps/hive/warehouse/test.db/exported1  -m 1 --update-key row_key --input-fields-terminated-by '\001'

Using HCatalog

sqoop export --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --table exported --hcatalog-database test --hcatalog-table exported -m 1

Update from Hive using HCatalog is not supported yet; it throws a Java exception, and there are open tickets for it.

A good reference for shell scripting: https://linuxcommand.org/lc3_writing_shell_scripts.php