
Get input file name in a streaming Hadoop program

When writing the program in Java, I am able to find the name of the input file in a mapper class using FileSplit.

Is there a corresponding way to do this when I write the program in Python (using streaming)?

I found the following in the Hadoop streaming documentation on Apache:

See Configured Parameters. During the execution of a streaming job, the names of the "mapred" parameters are transformed. The dots (.) become underscores (_). For example, mapred.job.id becomes mapred_job_id and mapred.jar becomes mapred_jar. In your code, use the parameter names with the underscores.

But I still can't understand how to make use of this inside my mapper.

Any help is highly appreciated.

Thanks

According to "Hadoop: The Definitive Guide":

Hadoop sets job configuration parameters as environment variables for Streaming programs. However, it replaces non-alphanumeric characters with underscores to make sure they are valid names. The following Python expression illustrates how you can retrieve the value of the mapred.job.id property from within a Python Streaming script:

os.environ["mapred_job_id"]

You can also set environment variables for the Streaming process launched by MapReduce by applying the -cmdenv option to the Streaming launcher program (once for each variable you wish to set). For example, the following sets the MAGIC_PARAMETER environment variable:

-cmdenv MAGIC_PARAMETER=abracadabra
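A minimal sketch of how the two points in the quoted passage might be used together inside a Python Streaming script. The helper names env_name and get_job_param are hypothetical (they are not part of Hadoop), and the regular expression simply mirrors the "non-alphanumeric characters become underscores" rule described above.

```python
import os
import re


def env_name(property_name):
    """Convert a Hadoop property name to the environment-variable form
    described above: every non-alphanumeric character becomes "_"."""
    return re.sub(r"[^A-Za-z0-9]", "_", property_name)


def get_job_param(property_name, default=None, environ=os.environ):
    """Look up a job configuration property from the environment.

    Returns `default` when the variable is not set, for example when
    the script is run locally outside of Hadoop."""
    return environ.get(env_name(property_name), default)


if __name__ == "__main__":
    # Under Hadoop Streaming this should print the job id;
    # run locally, it prints None because the variable is absent.
    print(get_job_param("mapred.job.id"))
    # A variable passed with -cmdenv is read under the name it was given.
    print(os.environ.get("MAGIC_PARAMETER"))
```

Using .get() with a default instead of os.environ[...] avoids a KeyError when the script is tested on a workstation where Hadoop has not set these variables.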

By parsing the mapreduce_map_input_file (new) or map_input_file (deprecated) environment variable, you will get the map input file name.

Notice:
The two environment variables are case-sensitive; all letters are lower-case.
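To make this concrete, here is a rough sketch of a streaming mapper that tags each input line with the name of the file it came from. It assumes the two lower-case variable names mentioned above; the function name input_file_name, the "unknown" fallback, and the tab-separated output format are illustrative choices, not anything Hadoop requires.

```python
#!/usr/bin/env python
import os
import sys


def input_file_name(environ=os.environ):
    """Return the current map input file path, or "unknown" if
    neither environment variable is set."""
    # Try the newer name first, then the deprecated one.
    for key in ("mapreduce_map_input_file", "map_input_file"):
        value = environ.get(key)
        if value:
            return value
    return "unknown"


def main():
    path = input_file_name()
    # Keep only the last path component of the input path.
    name = os.path.basename(path)
    for line in sys.stdin:
        # Emit "<file name>\t<line>" so later stages can group by source file.
        sys.stdout.write("%s\t%s\n" % (name, line.rstrip("\n")))


if __name__ == "__main__":
    main()
```

The script can be tried locally before submitting a job by setting the variable by hand, for example: mapreduce_map_input_file=/tmp/sample.txt python mapper.py < /tmp/sample.txt (the file names here are placeholders).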

The new environment variable for Hadoop 2.x is MAPREDUCE_MAP_INPUT_FILE.
