Thursday, 30 June 2016

Handling skewed data in Hive

Method 1:
Identify the skewed value and run the job for that value separately.
ex: If cust_id=100 has the skew problem, split the records into cust_id=100 and cust_id!=100, then run the two jobs individually.
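
A minimal HiveQL sketch of this split, assuming hypothetical orders and customers tables joined on cust_id and a hypothetical order_report target table:

-- Job 1: all non-skewed keys join normally.
INSERT INTO TABLE order_report
SELECT o.order_id, o.cust_id, c.cust_name
FROM orders o
JOIN customers c ON o.cust_id = c.cust_id
WHERE o.cust_id != 100;

-- Job 2: only the skewed key. The filtered customers side is tiny,
-- so Hive can map-join it instead of shuffling the heavy key.
INSERT INTO TABLE order_report
SELECT o.order_id, o.cust_id, c.cust_name
FROM orders o
JOIN customers c ON o.cust_id = c.cust_id
WHERE o.cust_id = 100;
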
Method 2:
Identify the column causing the skew. If it is used in a join, try to reduce the skew by joining on multiple columns together instead of the single skewed column.
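
For instance, a sketch assuming both hypothetical tables also carry a region column that is part of a valid composite join key (so the join semantics are unchanged):

-- (cust_id, region) spreads a heavy cust_id across more reducers than cust_id alone.
-- This helps only when region actually varies within the skewed cust_id.
SELECT o.order_id, c.cust_name
FROM orders o
JOIN customers c
  ON o.cust_id = c.cust_id AND o.region = c.region;
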
Method 3:
Modify the join key (salting). A simple approach I sometimes follow is appending a small salt (say 0, 1 or 2, derived from another column or a random number) to the key. ex: key1 gets spread across multiple values like key1_0, key1_1, key1_2, which helps distribute the key across reducers. The other side of the join then has to be replicated once per salt value so every salted key still finds its match.
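
A sketch in HiveQL, again with hypothetical orders and customers tables:

-- Large (skewed) side: derive a salt in 0..2 from another column (order_id here).
SELECT CONCAT(CAST(cust_id AS STRING), '_', CAST(pmod(order_id, 3) AS STRING)) AS salted_key,
       order_id
FROM orders;

-- Small side: replicate each key once per salt value so every salted key has a match.
SELECT CONCAT(CAST(c.cust_id AS STRING), '_', CAST(s.salt AS STRING)) AS salted_key,
       c.cust_name
FROM customers c
LATERAL VIEW explode(array(0, 1, 2)) s AS salt;

-- Join the two results on salted_key instead of the raw cust_id.
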
Method 4:
Divide the data into chunks (for example by date range) and execute the job on each chunk separately.
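
A sketch of the chunked run, assuming a hypothetical order_date column to split on:

-- Run the same job once per range instead of over the whole table.
INSERT INTO TABLE order_report
SELECT order_id, cust_id, order_date
FROM orders
WHERE order_date >= '2016-01-01' AND order_date < '2016-04-01';
-- Repeat with the next date range until all chunks are processed.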

Sunday, 26 June 2016

Sqooping from/to MySQL and HDFS

Sqoop Import
From a MySQL table to HDFS

sqoop import --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --table exported --target-dir '/user/horton/exported' -m 1

From a MySQL table to Hive

sqoop import --hive-import --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --table exported --hive-table test.exported3 -m 1

From a MySQL table to Hive with the schema defined

sqoop import --hive-import --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --table exported --hive-table test.exported3 -m 1 --map-column-hive 'row_key=int,value=string,ts=timestamp'

From a MySQL table to Hive with custom delimiters

sqoop import --hive-import --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --table exported --hive-table test.exported2 -m 1 --fields-terminated-by ','

Free-form query import

sqoop import --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --query 'select * from exported where $CONDITIONS' --target-dir '/user/horton/test_query' -m 1

Incremental import into HDFS

sqoop import --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --table exported --target-dir '/user/horton/test_incremental' -m 1 --check-column ts --incremental lastmodified --last-value '2017-01-24 23:11:16.0' --merge-key row_key

Incremental import into Hive

sqoop import --hive-import --connect jdbc:mysql://127.0.0.1/export --username hive --password hive --driver com.mysql.jdbc.Driver --table exported --hive-table test.exported1 -m 1 --check-column ts --incremental lastmodified --last-value '2017-01-24 23:11:10.0' --merge-key row_key

Incremental import into HDFS from Query

sqoop import --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --query 'select * from exported where $CONDITIONS' --target-dir '/user/horton/test_incremental/t2' -m 1 --check-column ts --incremental lastmodified --last-value '2017-01-24 23:11:16.0' --merge-key row_key

Incremental import into Hive from a query is not supported.

Sqoop Export
Insert/export from HDFS

sqoop export --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --table exported --export-dir /apps/hive/warehouse/test.db/exported1  -m 1 --input-fields-terminated-by '\001'

Insert/export from Hive

sqoop export --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --table exported --hcatalog-database test --hcatalog-table exported3 --input-fields-terminated-by '\001' -m 1

Update from HDFS

sqoop export --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --table exported --export-dir /apps/hive/warehouse/test.db/exported1  -m 1 --update-key row_key --input-fields-terminated-by '\001'

Using HCatalog

sqoop export --connect jdbc:mysql://127.0.0.1/export --username root --driver com.mysql.jdbc.Driver --table exported --hcatalog-database test --hcatalog-table exported -m 1

Update from Hive using HCatalog is not supported yet; it throws a Java exception, and there are open tickets for it.
