Sunday 21 January 2018

Reading XML files using Hive

Scenario 1:
Consider an XML file as below

<row Id="1" Reputation="100" creationDate="2008-07-31T14:22:31.317" DisplayName="Test 1" LastAccessDate="2016-12-10T22:12:46.367" WebsiteUrl="http://www.joelonsoftware.com/" Location="New York, NY" />
<row Id="2" Reputation="250" creationDate="2008-07-31T14:22:31.317" DisplayName="Test User2" LastAccessDate="2016-12-10T22:12:46.367" WebsiteUrl="http://www.test.com/" Location="Phoenix, AZ"  Age="25"/>

Download Jar:

From Hive CLI/Beeline:

add jar <above_jar_location>;

CREATE EXTERNAL TABLE test_xml(
Id int,
Reputation int,
creationDate timestamp,
displayName string,
location string,
age int
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.id"="/row/@Id",
"column.xpath.Reputation"="/row/@Reputation",
"column.xpath.creationDate"="/row/@CreationDate",
"column.xpath.displayName"="/row/@DisplayName",
"column.xpath.location"="/row/@Location",
"column.xpath.age"="/row/@Age"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/path/to/xml/file'
TBLPROPERTIES (
"xmlinput.start"="<row ",
"xmlinput.end"="/>"
);







Scenario 2:

Consider Below XML:

<CATALOG>
<BOOK>
<TITLE>Hadoop Defnitive Guide</TITLE>
<AUTHOR>Tom White</AUTHOR>
<CURRENCY>USD</CURRENCY>
<PRICE>34</PRICE>
<YEAR>2017</YEAR>
</BOOK>
<BOOK>
<TITLE>Programming with Spark</TITLE>
<AUTHOR>I don't know</AUTHOR>
<CURRENCY>USA</CURRENCY>
<PRICE>29</PRICE>
<YEAR>2018</YEAR>
</BOOK>
</CATALOG>

Create table as below:

CREATE EXTERNAL TABLE books (title string, price float,currency string)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.title"="/BOOK/TITLE/text()",
"column.xpath.currency"="/BOOK/CURRENCY/text()",
"column.xpath.price"="/BOOK/PRICE/text()")
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/path/to/xml'
TBLPROPERTIES ("xmlinput.start"="<BOOK","xmlinput.end"= "</BOOK>");

**** Do not specify the Root node i.e <CATALOG> in TBLPROPERTIES****


 A good reference for Shell scripting  https://linuxcommand.org/lc3_writing_shell_scripts.php