
Get input file name in a streaming Hadoop program

Original post: 2011-09-16 19:59:17 | Tags: python, input, streaming, hadoop, filesplitting

I am able to find the name of the input file in a mapper class using FileSplit when writing the program in Java.

Is there a corresponding way to do this when I write the program in Python (using streaming)?

I found the following in the Hadoop streaming documentation on Apache:

See Configured Parameters. During the execution of a streaming job, the names of the "mapred" parameters are transformed. The dots ( . ) become underscores ( _ ). For example, mapred.job.id becomes mapred_job_id and mapred.jar becomes mapred_jar. In your code, use the parameter names with the underscores.

But I still can't understand how to make use of this inside my mapper.

Any help is highly appreciated.

Thanks

3 Solutions

According to "Hadoop: The Definitive Guide":

Hadoop sets job configuration parameters as environment variables for Streaming programs. However, it replaces non-alphanumeric characters with underscores to make sure they are valid names. The following Python expression illustrates how you can retrieve the value of the mapred.job.id property from within a Python Streaming script:

os.environ["mapred_job_id"]
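As a minimal sketch of the renaming rule quoted above, the helper below derives the environment variable name from a property name and looks it up. The function names `streaming_env_name` and `get_job_property` are hypothetical, and the regular expression is an assumption based only on the quoted wording ("replaces non-alphanumeric characters with underscores"), so treat it as an approximation of what Streaming does rather than its actual implementation.

```python
import os
import re


def streaming_env_name(property_name):
    """Approximate the environment variable name for a job property."""
    # Assumption taken from the quoted text: every character that is not
    # an ASCII letter or digit is replaced by an underscore.
    return re.sub(r"[^A-Za-z0-9]", "_", property_name)


def get_job_property(property_name, environ=os.environ, default=None):
    """Look up a job property through its transformed variable name."""
    # environ is a parameter so the lookup can be tried with a plain dict
    # outside of a running Streaming job.
    return environ.get(streaming_env_name(property_name), default)
```

Inside a mapper, `get_job_property("mapred.job.id")` would then be equivalent to `os.environ.get("mapred_job_id")`, returning `None` instead of raising `KeyError` when the variable is not set.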

You can also set environment variables for the Streaming process launched by MapReduce by applying the -cmdenv option to the Streaming launcher program (once for each variable you wish to set). For example, the following sets the MAGIC_PARAMETER environment variable:

-cmdenv MAGIC_PARAMETER=abracadabra
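On the script side, a variable passed this way can be read like any other environment variable. The sketch below assumes the variable keeps the exact name given on the command line (MAGIC_PARAMETER, from the example above); the function name `read_magic_parameter` is made up for illustration.

```python
import os


def read_magic_parameter(environ=os.environ, default=None):
    """Return the value passed with -cmdenv MAGIC_PARAMETER=..., if any."""
    # Assumption: the name is used as written on the command line, so the
    # upper-case spelling has to match exactly.
    return environ.get("MAGIC_PARAMETER", default)
```

Using `.get()` with a default keeps the script usable when it is run locally without the launcher setting the variable.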

By parsing the mapreduce_map_input_file (new) or map_input_file (deprecated) environment variable, you will get the map input file name.

Notice:
The names of the two environment variables are case-sensitive; all letters are lower-case.
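A mapper could use these variables roughly as follows. This is only a sketch: the helper names `input_file_name` and `tag_lines` are hypothetical, it assumes the variable holds the full path or URI of the input file, and prefixing each line with the base name is just one illustrative use of the value.

```python
import os
import sys


def input_file_name(environ=os.environ):
    """Return the map input file, trying the new variable name first."""
    for key in ("mapreduce_map_input_file", "map_input_file"):
        value = environ.get(key)
        if value:
            return value
    return None  # neither variable is set, e.g. when run outside Hadoop


def tag_lines(lines, file_name):
    """Yield each input line prefixed with the base name of its file."""
    # Assumption: the value looks like a path or URI, so the last
    # slash-separated component is the file name.
    base = os.path.basename(file_name) if file_name else "unknown"
    for line in lines:
        yield "%s\t%s" % (base, line.rstrip("\n"))


# In a real Streaming mapper the helpers would be wired to standard input:
#     for record in tag_lines(sys.stdin, input_file_name()):
#         print(record)
```

Trying the new name first and falling back to the deprecated one lets the same script run on clusters that set either variable.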