
Processing XML with Hadoop MapReduce

Original post 2014-12-17 06:47:46 7 2 xml/ hadoop/ xml-parsing/ mapreduce

I want to load and parse several petabytes of XML data. After a lot of research on how to process XML in Hadoop, I have come to understand that XML has to be processed as a whole file in MapReduce.

If I feed a whole XML file to my MapReduce job as a single input split, it will not take advantage of Hadoop's distributed, parallel processing, because only one mapper will do all the work.

Have I understood this correctly? How can I overcome this problem?

Please suggest.

2 Answers

You could try Mahout's XMLInputFormat. XMLInputFormat takes care of figuring out the record boundaries within your XML input files using the specified start and end tags.

You could use this link as a reference on how to use XMLInputFormat to parse your XML files.
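To make the record-boundary idea concrete, here is a minimal Python sketch of how a reader can cut records out of one input split using a start tag and an end tag. It is not Mahout's code: the function name `records_in_split`, the `<rec>` tags, and the in-memory `bytes` input are all assumptions made for illustration. The convention assumed here is that a record belongs to the split in which its start tag begins, and is read to its end tag even if that lies past the split boundary.

```python
def records_in_split(data: bytes, split_start: int, split_end: int,
                     start_tag: bytes, end_tag: bytes):
    """Yield each record whose start tag begins inside [split_start, split_end).

    Hypothetical helper: works on an in-memory buffer, whereas a real
    record reader would stream from the file system. It matches the
    start tag literally, so records are assumed not to nest and the
    start tag is assumed to carry no attributes.
    """
    pos = split_start
    while True:
        begin = data.find(start_tag, pos)
        # Stop when no start tag is left or the next one belongs to a later split.
        if begin == -1 or begin >= split_end:
            return
        finish = data.find(end_tag, begin + len(start_tag))
        if finish == -1:
            return  # unterminated record: nothing complete left to emit
        finish += len(end_tag)
        # The record may extend past split_end; it is still emitted whole.
        yield data[begin:finish]
        pos = finish


# Usage: cut one small document into two splits at an arbitrary byte offset.
sample = b"<root><rec>a</rec><rec>bb</rec><rec>ccc</rec></root>"
first = list(records_in_split(sample, 0, 20, b"<rec>", b"</rec>"))
second = list(records_in_split(sample, 20, len(sample), b"<rec>", b"</rec>"))
```

Because each split only claims records that start inside it, the splits can be processed by different mappers without any record being lost or read twice, which is what lets a large XML file be processed in parallel.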

If you have a single block of XML data that is a petabyte in size, you have a problem. More likely you have millions or billions of individual XML records. If that is the case, you have a rather straightforward approach: create millions of XML files, each roughly the same size as (a little smaller than) the block size of your HDFS system. Then write a set of MapReduce jobs where the first mapper extracts the XML data and outputs whatever (name,value) pairs are useful, and the reducer collects, by name, the pairs from the various XML files that require correlation.
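The mapper and reducer roles described in this answer can be sketched locally in Python. This is only a simulation under stated assumptions, not Hadoop API code: `map_record`, `reduce_pairs`, and `run_job` are hypothetical names, the dictionary stands in for Hadoop's shuffle phase, and "useful pairs" is assumed to mean every leaf element with non-empty text.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def map_record(xml_record: str):
    """Mapper: parse one XML record and emit (name, value) pairs.

    Assumption: a pair is emitted for every leaf element with text.
    """
    root = ET.fromstring(xml_record)
    for elem in root.iter():
        text = (elem.text or "").strip()
        if len(elem) == 0 and text:  # leaf element with non-empty text
            yield elem.tag, text


def reduce_pairs(name, values):
    """Reducer: collect every value seen for one name."""
    return name, sorted(values)


def run_job(records):
    """Run map, group by key (stand-in for the shuffle), then reduce."""
    grouped = defaultdict(list)
    for record in records:
        for name, value in map_record(record):
            grouped[name].append(value)
    return dict(reduce_pairs(name, values) for name, values in grouped.items())


# Usage: two small records that share element names.
records = [
    "<rec><id>1</id><city>Oslo</city></rec>",
    "<rec><id>2</id><city>Rome</city></rec>",
]
result = run_job(records)
```

In a real job each mapper would receive the records of one file or split, and the framework would route all pairs with the same name to the same reducer.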

If the XML dataset changes over time, you may wish to look at support for streaming datasets.