
Multiple Output Files for Hadoop Streaming with Python Mapper

I am looking for a little clarification on the answers to this question here:

Generating Separate Output files in Hadoop Streaming

My use case is as follows:

I have a map-only mapreduce job that takes an input file, does a lot of parsing and munging, and then writes back out. However, certain lines may or may not be in the correct format, and if that is the case, I would like to write the original line to a separate file.

It seems that one way to do this would be to prepend the name of the file to the line I am printing and use the multipleOutputFormat parameter. For example, if I originally had:

if line_is_valid(line):
    print name + '\t' + comments

I could instead do:

if line_is_valid(line):
    print valid_file_name + '\t' + name + '\t' + comments
else:
    print err_file_name + '\t' + line
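
For reference, a complete mapper along those lines might look like this minimal sketch. It assumes tab-separated two-field input; the line_is_valid check and the tag values here are illustrative placeholders, not part of the original question:

#!/usr/bin/env python
import sys

valid_file_name = 'valid'  # tag for well-formed lines (illustrative)
err_file_name = 'err'      # tag for malformed lines (illustrative)

def line_is_valid(line):
    # placeholder check: expect exactly two tab-separated fields
    return len(line.split('\t')) == 2

for line in sys.stdin:
    line = line.rstrip('\n')
    if line_is_valid(line):
        name, comments = line.split('\t')
        print('\t'.join([valid_file_name, name, comments]))
    else:
        print('\t'.join([err_file_name, line]))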

The only problem I have with this solution is that I don't want the file_name to appear as the first column in the text files. I suppose I could run another job to strip out the first column of each file, but that seems kind of silly. So:

1) Is this the correct way to manage multiple output files with a python mapreduce job?

2) What is the best way to get rid of that initial column?

You can do something like the following, but it involves a little Java compiling, which I think shouldn't be a problem if you want your use case done with Python anyway. As far as I know, from Python alone it is not directly possible to drop the file name from the final output in a single job, as your use case demands. But what's shown below makes it possible with ease!

Here is the Java class that needs to be compiled:

package com.custom;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class CustomMultiOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    /**
     * Use the key as part of the path for the final output file.
     */
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String leaf) {
        return new Path(key.toString(), leaf).toString();
    }

    /**
     * Discard the key, as per your requirement.
     */
    @Override
    protected Text generateActualKey(Text key, Text value) {
        return null;
    }
}
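
The two overrides divide the work: generateFileNameForKeyValue turns the tag your mapper emits as the key into a subdirectory of the job's output path, while returning null from generateActualKey makes the underlying text writer emit only the value, which is exactly what removes the tag column from the final records.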

Steps to compile:

  1. Save the text above to a file named exactly CustomMultiOutputFormat.java.
  2. From the directory containing that file, type:

    $JAVA_HOME/bin/javac -cp $(hadoop classpath) -d . CustomMultiOutputFormat.java

  3. Make sure JAVA_HOME is set to /path/to/your/SUNJDK before attempting the above command.

  4. Build your custom.jar file using (type exactly):

    $JAVA_HOME/bin/jar cvf custom.jar com/custom/CustomMultiOutputFormat.class

  5. Finally, run your job like:

    hadoop jar /path/to/your/hadoop-streaming-*.jar -libjars custom.jar -outputformat com.custom.CustomMultiOutputFormat -file your_script.py -input inputpath -numReduceTasks 0 -output outputpath -mapper your_script.py
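
Before running the command in step 5, note one practical point (standard Hadoop Streaming behavior, not specific to this answer): because your_script.py is invoked directly as the mapper, it should be executable (chmod +x your_script.py) and begin with an interpreter line such as #!/usr/bin/env python.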

After doing these steps you should see two directories inside your outputpath, one named valid_file_name and the other err_file_name. All records tagged with valid_file_name will go to the valid_file_name directory, and all records tagged with err_file_name will go to the err_file_name directory.
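
Concretely, assuming the mapper emits the tags valid_file_name and err_file_name as keys, the output layout would look something like this (part file names vary with the number of map tasks):

outputpath/
    valid_file_name/
        part-00000
    err_file_name/
        part-00000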

I hope all this makes sense.
