
hadoop job output files

I currently have one hadoop oozie job running. The output files are generated automatically. The expected number of output files is just ONE; however, there are two output files, called part-r-00000 and part-r-00001. Sometimes the first one (part-r-00000) has data and the second one (part-r-00001) doesn't; sometimes it is the other way around. Can anyone tell me why? Also, how can I set the output to a single part-r-00000 file?

The number of output files depends on the number of mappers and reducers. In your case, the number and names of the files indicate that your output came from 2 reducers.

How to limit the number of mappers or reducers depends on your language (Hive, Java, etc.), but each has a property you can set to limit them. See here for Java MapReduce jobs.
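For a Java MapReduce job, a minimal driver sketch (assuming the standard org.apache.hadoop.mapreduce.Job API; the class name and job name here are hypothetical) could force a single reducer like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SingleReducerDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "single-reducer-job"); // job name is arbitrary
        // Force exactly one reduce task, so all output lands in part-r-00000.
        job.setNumReduceTasks(1);
        // ... set mapper/reducer classes and input/output paths, then:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The same effect can usually be had without recompiling, by passing the job property `mapreduce.job.reduces=1` on the command line (for drivers that use ToolRunner).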

Files can be empty if that particular mapper or reducer task had no resulting data on the given data node.

Finally, I don't think you want to limit your mappers and reducers; that would defeat the point of using Hadoop. If you're aiming to read all the files as one, make sure they are consolidated in a given directory and pass the directory as the file name. The files will then be treated as one.

In Hadoop, the output files are a product of the Reducers (or of the Mappers if it's a map-only job, in which case the files are named part-m-xxxxx). If your job uses two reducers, then after each one has finished with its portion, it writes to the output directory as part-r-xxxxx, where the number denotes which reducer wrote it out.

That said, you cannot specify a single output file, only the output directory. To merge all of the files in the output directory into a single local file, use:

hdfs dfs -getmerge <src> <localdst> [addnl]

Or, if you're using an older version of Hadoop:

hadoop fs -getmerge <src> <localdst> [addnl]

See the shell guide for more info.

As to why one of your output files is empty: data is routed from Mappers to Reducers by the partitioner, and grouped within each reducer by the grouping comparator. If you specify two reducers but all of the data falls into a single partition, one reducer receives no data and writes nothing. Alternatively, if some logic within the reducer prevents a write operation, that is another reason data may not be written from one reducer.
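This routing can be illustrated with a plain-Java sketch. The formula below matches Hadoop's default HashPartitioner.getPartition; the class name PartitionDemo and the sample keys are hypothetical. With 2 reducers but only one distinct key, every record lands on the same reducer, and the other reducer's part file comes out empty:

```java
// Simulates Hadoop's default HashPartitioner: records with the same key
// always land on the same reducer, so a skewed key space can leave one
// reducer (and its part-r-xxxxx file) empty.
public class PartitionDemo {
    static int getPartition(String key, int numReduceTasks) {
        // Same formula as org.apache.hadoop.mapreduce.lib.partition.HashPartitioner
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"user42", "user42", "user42"}; // a single key group
        for (String k : keys) {
            System.out.println(k + " -> reducer " + getPartition(k, 2));
        }
        // With 2 reducers but one key, one reducer receives every record
        // and the other receives none, leaving its output file empty.
    }
}
```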

By default, the output files are named part-x-yyyyy, where:

  • x is either 'm' or 'r', depending on whether the file was generated by a map or a reduce task
  • yyyyy is the mapper or reducer task number (zero-based)
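The naming scheme above can be reproduced with a one-line format string (pure Java, nothing Hadoop-specific; the helper name partFile is hypothetical):

```java
// Reconstructs Hadoop's default output file name: part-<m|r>-<5-digit task id>.
public class OutputName {
    static String partFile(char taskType, int taskId) {
        // %c is 'm' or 'r'; %05d zero-pads the task number to five digits
        return String.format("part-%c-%05d", taskType, taskId);
    }

    public static void main(String[] args) {
        System.out.println(partFile('r', 0));  // part-r-00000
        System.out.println(partFile('m', 13)); // part-m-00013
    }
}
```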

The number of tasks has nothing to do with the number of physical nodes in the cluster. For map task output, the number of tasks is given by the input splits. The number of reducer tasks is usually set with job.setNumReduceTasks() or passed as an input parameter.

A job with 100 reducers will have files named part-r-00000 to part-r-00099, one for each reducer task. A map-only job with 100 input splits will have files named part-m-00000 to part-m-00099, one for each map task.
