
Hive: best approach to consume large number of small XML files

I would like advice on the best approach to store my data in HDFS and then retrieve values from it using SQL through Hive.

I receive a lot of files in XML format, basically tens of thousands a day. Each file is about 10 kB and conforms to a given XSD schema. Currently I have more than 120 TB of these XML files stored in a filesystem.

I am considering ingesting all these XML files into HDFS in order to offer an SQL interface that lets applications perform relational queries against the data.

What key technologies do you think I'll need to build this solution?

For efficient processing, perhaps I'd need to convert these XML files into a format better suited to Hadoop (e.g., RCFile or ORC) and store them in HDFS. The problem is that the schema of these files is expected to change over time. The nature of my data seems to benefit from partitioning (e.g., by date/time or by state). Also, I don't know whether data compression is a good idea.
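For example, this is a rough sketch of the kind of target table I have in mind (a flattened one-row-per-invoice layout; the column names and partition key are only illustrative):

CREATE TABLE invoice (
  general_id      STRING,
  creationdate    TIMESTAMP,
  buyer_name      STRING,
  buyer_state     STRING,
  seller_name     STRING,
  seller_state    STRING,
  amount_products DECIMAL(10,2),
  amount_shipping DECIMAL(10,2),
  amount_total    DECIMAL(10,2)
)
PARTITIONED BY (creation_date STRING)        -- e.g. '2016-03-21'
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');     -- ORC has built-in compression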

Here is sample content from a single XML file:

<invoice schema_version="1.1">
  <general id="123456798">
    <creationdate>2016-03-21 16:25:09-03:00</creationdate>
  </general>
  <buyer id="11">
    <name>The Buyer</name>
    <address>
      <street>1st St</street>
      <city>Los Angeles</city>
      <state>CA</state>
    </address>
  </buyer>
  <seller id="22">
    <name>The Seller</name>
    <address>
      <street>2nd Ave</street>
      <city>Miami</city>
      <state>FL</state>
    </address>
  </seller>
  <items>
    <product id="123">
      <name>Blue Pen</name>
      <price>1.50</price>
      <quantity>4</quantity>
      <subtotal>6.00</subtotal>
    </product>
    <product id="456">
      <name>White Board</name>
      <price>5.20</price>
      <quantity>2</quantity>
      <subtotal>10.40</subtotal>
    </product>
  </items>
  <amount>
    <products>16.40</products>
    <shipping>2.35</shipping>
    <total>18.75</total>
  </amount>
</invoice>

Thus, I'd like to perform SQL queries similar to these:

SELECT general.creationdate, buyer.name, amount.total
FROM invoice
WHERE general.id = '123456798';

SELECT seller.address.state, count(*) AS qty, sum(amount.total) AS total
FROM invoice
WHERE general.creationdate >= '2016-03-01'
GROUP BY seller.address.state;

SELECT b.name, avg(b.price) AS avg_price, sum(b.quantity) AS sum_quantity
FROM invoice a
  JOIN invoice_items b ON (...)
WHERE a.buyer.address.state = 'CA'
GROUP BY b.name
ORDER BY sum_quantity DESC;

Thanks in advance!

You can write an XSLT file to translate the incoming XMLs into CSV format and apply it to your files, e.g. using a streaming job:

hadoop jar hadoop-streaming.jar \
    -mapper 'xsltproc file.xslt -' -file file.xslt \
    -input /path/to/your/xmls \
    -output /path/to/resulting/files
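For illustration, here is a minimal sketch of what file.xslt could look like for the sample invoice above; the column choice and the header/details record prefixes (picked up again below) are assumptions, not a fixed format:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/invoice">
    <!-- one "header" record per invoice -->
    <xsl:text>header&#9;</xsl:text>
    <xsl:value-of select="general/@id"/><xsl:text>,</xsl:text>
    <xsl:value-of select="general/creationdate"/><xsl:text>,</xsl:text>
    <xsl:value-of select="buyer/name"/><xsl:text>,</xsl:text>
    <xsl:value-of select="seller/address/state"/><xsl:text>,</xsl:text>
    <xsl:value-of select="amount/total"/><xsl:text>&#10;</xsl:text>
    <!-- one "details" record per product, keyed by the invoice id -->
    <xsl:for-each select="items/product">
      <xsl:text>details&#9;</xsl:text>
      <xsl:value-of select="../../general/@id"/><xsl:text>,</xsl:text>
      <xsl:value-of select="@id"/><xsl:text>,</xsl:text>
      <xsl:value-of select="name"/><xsl:text>,</xsl:text>
      <xsl:value-of select="price"/><xsl:text>,</xsl:text>
      <xsl:value-of select="quantity"/><xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>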

Look at https://github.com/whale2/iow-hadoop-streaming if you want to use Avro or Parquet instead of plain text. This library can also handle multiple outputs, so you can save each table in a separate folder (and, of course, in subfolders if you want partitioning).

Next, just create an external table in Hive over the resulting files and run your SQL queries.
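A minimal sketch of such a table, assuming the "header" records end up in their own directory and keep the column order of the XSLT sketch above (the names and the exact path are illustrative):

CREATE EXTERNAL TABLE invoice (
  general_id   STRING,
  creationdate STRING,
  buyer_name   STRING,
  seller_state STRING,
  amount_total DECIMAL(10,2)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/path/to/resulting/files/header';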

If your schema changes, you can just change the XSLT file.

Addendum: to make this work, you should delete the newlines from the input XMLs or write a wrapper (see http://www.science.smith.edu/dftwiki/index.php/Hadoop_Tutorial_2.1_--_Streaming_XML_Files).
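For the former, something like this would do (a sketch; the file names are illustrative):

# replace newlines with spaces so each XML document becomes a single line
tr '\n' ' ' < invoice.xml > invoice.flat.xml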

Update: you should write one XSLT that produces all the records in the file, like this:

header\tval1,val2,val3
details\tval1,val2,val3,val4

Next, add the option -outputformat net.iponweb.hadoop.streaming.io.ByKeyOutputFormat to your command, and you will get a separate file for each key.
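Put together, the command might look like this (how the iow-hadoop-streaming jar gets onto the classpath, here via -libjars, is an assumption):

hadoop jar hadoop-streaming.jar \
    -libjars iow-hadoop-streaming.jar \
    -outputformat net.iponweb.hadoop.streaming.io.ByKeyOutputFormat \
    -mapper 'xsltproc file.xslt -' -file file.xslt \
    -input /path/to/your/xmls \
    -output /path/to/resulting/files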

As for what Hadoop buys you in this task: distributed processing. If you have only a small amount of data, you don't need Hadoop.
