
Control number of hadoop mapper output files

I have a job for hadoop. When the job is started, I have some number of mappers started, and each mapper writes some file to disk, like part-m-00000, part-m-00001. As I understand it, each mapper creates one part file. I have a large amount of data, so there must be more than one mapper, but can I somehow control the number of these output files? I mean, hadoop will start, for example, 10 mappers, but there will be only three part files?

I found this post: How do multiple reducers output only one part-file in Hadoop? But that one uses an old version of the hadoop library. I'm using classes from org.apache.hadoop.mapreduce.* and not from org.apache.hadoop.mapred.*

I'm using hadoop version 0.20, and hadoop-core:1.2.0.jar

Is there any possibility to do this using the new hadoop API?

The number of output files equals the number of reducers, or the number of mappers if there aren't any reducers.

You can add a single reducer to your job so that the output from all the mappers is directed to it and you get a single output file. Note that this will be less efficient, as all the data (the output of the mappers) will be sent over the wire (network IO) to the node where the reducer runs. Also, since a single process will (eventually) get all the data, it would probably run slower.
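A minimal sketch of that, assuming the new org.apache.hadoop.mapreduce API (the driver class name and job name below are made up for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SingleOutputDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "single-output-job"); // hadoop 1.x-style constructor
            // ... set mapper/reducer classes, input and output paths as usual ...
            job.setNumReduceTasks(1); // all mapper output funnels into one reducer,
                                      // so the job writes a single part-r-00000
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }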

By the way, the fact that there are multiple parts shouldn't be very significant, as you can pass the directory containing them to subsequent jobs.

I'm not sure you can do it (your link is about multiple outputs, not about converging to only one), and why use only one output? You would lose all parallelism on the sort.

I'm also working on big files (~10GB each), and my MR jobs process almost 100GB each. So to lower the number of mappers, I set a higher block size in HDFS (this applies only to newer files) and a higher value of mapred.min.split.size in mapred-site.xml.
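For reference, the mapred-site.xml entry would look roughly like this (268435456 bytes = 256MB is purely an illustrative value, not a recommendation):

    <property>
      <name>mapred.min.split.size</name>
      <!-- minimum input split size in bytes; larger splits mean fewer mappers -->
      <value>268435456</value>
    </property>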

You might want to look at MultipleOutputFormat.

Part of what the Javadoc says:

This abstract class extends the FileOutputFormat, allowing to write the output data to different output files.

Both Mapper and Reducer can use this.

Check this link for how you can specify one or more output file names from different mappers to output to HDFS.

NOTE: Also, make sure you don't use context.write(), so that 10 files from 10 mappers don't get created. Use only MultipleOutputFormat to output.
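MultipleOutputFormat lives in the old org.apache.hadoop.mapred API; since the question uses the new API, the closest equivalent there is org.apache.hadoop.mapreduce.lib.output.MultipleOutputs. A rough sketch of a mapper using it (the named output "combined", the class name, and the key/value types are illustrative assumptions):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class NamedOutputMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private MultipleOutputs<Text, IntWritable> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<Text, IntWritable>(context);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // write through MultipleOutputs instead of context.write()
            mos.write("combined", value, new IntWritable(1));
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            mos.close(); // must be closed, or output may be lost
        }
    }

On the driver side, the named output would be registered before submitting the job with MultipleOutputs.addNamedOutput(job, "combined", TextOutputFormat.class, Text.class, IntWritable.class).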

If the job has no reducers, partitioners, or combiners, each mapper writes one output file. At some point, you should run some post-processing to collect the outputs into a large file.
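A simple way to do that collection step, assuming the merged result fits on local disk, is the standard getmerge shell command, which concatenates all the part files in a directory into one local file (the paths here are illustrative):

    hadoop fs -getmerge /user/me/job-output /tmp/merged-output.txt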
