I'm working with Hive tables and have the following problem: I have more than 1 billion XML files in HDFS. Each XML file contains four different sections, and I want to split each file and load each section into its own table.
Example:
<?xml version='1.0' encoding='iso-8859-1'?>
<xml>
<section1>
<id>1233222</id>
<!-- many more XML tags -->
</section1>
<section2>
<!-- many more XML tags -->
</section2>
<section3>
<!-- many more XML tags -->
</section3>
<section4>
<!-- many more XML tags -->
</section4>
</xml>
And I have four tables:
section1Table: id, section1 // fields
section2Table: id, section2
section3Table: id, section3
section4Table: id, section4
Now I want to split the data and load it into each table.
How can I achieve this? Can anyone help me?
Thanks
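(For reference, the split described above can be sketched in Python. The section names and the id layout follow the example XML; the rest -- the sample values and the in-memory table layout -- is hypothetical.)

```python
import xml.etree.ElementTree as ET

# A minimal sketch of the desired split for one file, assuming each file
# has a single root element wrapping the four sections.
doc = """<?xml version='1.0'?>
<xml>
<section1><id>1233222</id><other>tags</other></section1>
<section2><b>y</b></section2>
<section3><c>z</c></section3>
<section4><d>w</d></section4>
</xml>"""

root = ET.fromstring(doc)
record_id = root.findtext('section1/id')

# One (id, raw section XML) row per target table
rows = {f'section{i}Table': (record_id,
                             ET.tostring(root.find(f'section{i}'),
                                         encoding='unicode').strip())
        for i in range(1, 5)}
print(rows['section1Table'])
```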
UPDATE
I have tried the following:
CREATE EXTERNAL TABLE test(name STRING) LOCATION '/user/sornalingam/zipped/output/Tagged/t1';
SELECT xpath(name, '//section1') FROM test LIMIT 1;
but I got the following error:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"name":"<?xml version='1.0' encoding='iso-8859-1'?>"}
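The error is consistent with Hive's default text SerDe splitting the file on newlines: each row then holds a single line, so xpath() only ever sees a fragment such as the declaration line, never a complete document. A small Python sketch of the same failure mode (the strings below are illustrative):

```python
import xml.etree.ElementTree as ET

# With a line-oriented SerDe, the first row contains only the XML
# declaration -- which is not a well-formed document on its own.
first_row = b"<?xml version='1.0' encoding='iso-8859-1'?>"
try:
    ET.fromstring(first_row)
    parses = True
except ET.ParseError:
    parses = False
print(parses)  # False: a declaration alone has no root element

# Parsing the whole file as one string works.
whole_file = first_row + b"<xml><section1><id>1233222</id></section1></xml>"
print(ET.fromstring(whole_file).findtext('section1/id'))  # -> 1233222
```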
You have several options:
1. Load each whole XML file into a single-column table, e.g. CREATE TABLE xmlfiles (id INT, xmlfile STRING). Then use an XPath UDF, such as xpath(xmlfile, '//section1'), to do work on the XML.
2. Alternatively, follow the instructions in the second half of this tutorial to ingest the sections directly into Hive via XPath.
It depends on your level of experience and comfort with these approaches.
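To illustrate option 1: with the whole document stored in one string column, extraction reduces to an XPath lookup over that string. A rough Python equivalent, assuming the example document structure (xml.etree supports only a limited XPath subset, so this is a sketch, not the Hive UDF itself):

```python
import xml.etree.ElementTree as ET

# The xmlfile column would hold the entire document as one string;
# an XPath-style query then pulls out the wanted section.
xmlfile = "<xml><section1><id>7</id></section1><section2/></xml>"
root = ET.fromstring(xmlfile)
section1 = ET.tostring(root.find('.//section1'), encoding='unicode')
print(section1)  # -> <section1><id>7</id></section1>
```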
Use this:
CREATE EXTERNAL TABLE test(name STRING) LOCATION '/user/sornalingam/zipped/output/Tagged/t1'
tblproperties ("skip.header.line.count"="1", "skip.footer.line.count"="1");
And then use the xpath function.
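Note that skip.header.line.count and skip.footer.line.count only drop the first and last physical lines; xpath() still runs per row, so this approach works when each section sits on a line of its own. A Python sketch of that assumption (the line layout is hypothetical):

```python
import xml.etree.ElementTree as ET

# skip.header.line.count=1 / skip.footer.line.count=1 drop the declaration
# and the closing </xml>; each remaining line becomes one Hive row.
lines = [
    "<?xml version='1.0' encoding='iso-8859-1'?>",  # skipped header
    "<section1><id>1233222</id></section1>",
    "<section2><b/></section2>",
    "<section3><c/></section3>",
    "<section4><d/></section4>",
    "</xml>",                                       # skipped footer
]
rows = lines[1:-1]  # what the `name` column would contain, row by row

# Each surviving row is well-formed on its own, so per-row XPath works.
ids = [ET.fromstring(r).findtext('id')
       for r in rows if r.startswith('<section1')]
print(ids)  # -> ['1233222']
```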
You can automate the whole process of converting complex XML to Hive. For example, the Flexter XML converter can generate Parquet or Avro files that can be queried by Hive.
Here is a blog post that shows how to automate the conversion of MISMO XML to Hive and Parquet.
Disclaimer: I work for Sonra