
Hadoop: One mapper for each tar/zip file

I have several directories on which I want to compute statistics, i.e. my mapper function takes one folder tree as input and spits out some statistics based on the contents of the directory and all its sub-directories. The computation takes a long time on each directory. There is no reducer.

I can create one tar/zip file for each directory I want to process and copy it into HDFS. But how do I ensure that a mapper will be created for each tar file, and that the entire contents of the tar file are sent to that mapper (so that I can traverse the contents of the tar file and generate statistics for it)?

I would prefer to do this in Hadoop Streaming if possible. Is that doable?

I take it you have a number of tar/zip files in HDFS as the input to your map/reduce job?

In that case you'll have to implement your own InputFormat to handle them. The input format implementation (getSplits()) determines the number of splits; each split gets an individual mapper. So if you return a single split for each input file, you'll be all set.

As far as I can see in the documentation, nothing in Hadoop Streaming prevents you from specifying your own InputFormat; it does require you to write a Java class, though. (The question is how the InputFormat and the script-based mapper should interact: as far as I understand, Hadoop Streaming requires the mapper to receive its input via stdin, i.e. you can't easily pass the tar file itself for the script to operate on.)
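One way to reconcile the two is to have the InputFormat emit a single record per archive whose value is just the HDFS path; the streaming mapper then reads that path from stdin and fetches/untars the archive itself. Below is a minimal sketch of that idea using the older `org.apache.hadoop.mapred` API (the one Hadoop Streaming works with). The class name `WholeTarInputFormat` and the path-as-record convention are my own assumptions, not something the answer above prescribes:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// One mapper per archive: mark every input file as non-splittable, so
// FileInputFormat.getSplits() produces exactly one split per file.
public class WholeTarInputFormat extends FileInputFormat<Text, Text> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false; // never split a tar/zip archive across mappers
    }

    @Override
    public RecordReader<Text, Text> getRecordReader(InputSplit split,
            JobConf job, Reporter reporter) throws IOException {
        final Path path = ((FileSplit) split).getPath();
        // Emit a single record: key = HDFS path of the archive, value = empty.
        // A streaming mapper receives the path on stdin and can pull down the
        // archive itself (e.g. with `hadoop fs -get`) and walk its contents.
        return new RecordReader<Text, Text>() {
            private boolean done = false;

            public boolean next(Text key, Text value) {
                if (done) return false;
                key.set(path.toString());
                value.set("");
                done = true;
                return true;
            }
            public Text createKey()     { return new Text(); }
            public Text createValue()   { return new Text(); }
            public long getPos()        { return done ? 1 : 0; }
            public float getProgress()  { return done ? 1.0f : 0.0f; }
            public void close()         { }
        };
    }
}
```

Such a format could then be plugged into a streaming job with `-inputformat WholeTarInputFormat` (shipping the compiled class in a jar passed via `-libjars`), leaving the actual tar traversal and statistics to the streaming script.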

