
Hadoop MapReduce Out of Memory on Small Files

I'm running a MapReduce job against about 3 million small files on Hadoop (I know, I know, but there's nothing we can do about it - it's the nature of our source system).

Our code is nothing special - it uses CombineFileInputFormat to wrap a bunch of these files together, then parses the file name to add it into the contents of the file, and spits out some results. Easy peasy.
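For reference, a minimal sketch of that kind of driver - assuming the stock CombineTextInputFormat and a simple pass-through mapper, with the actual file-name parsing omitted - might look like this:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFileDriver {

    // Pass-through mapper; the real job would also tag each record with its source file name.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(NullWritable.get(), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small-file-job");
        job.setJarByClass(SmallFileDriver.class);

        // Pack many small files into each input split instead of one split per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024); // ~128 MB per split

        job.setMapperClass(PassThroughMapper.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The setMaxInputSplitSize call sets the upper bound CombineFileInputFormat uses when packing files into a single split.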

So, we have about 3 million ~7 KB files in HDFS. If we run our task against a small subset of these files (one folder, maybe 10,000 files), we get no trouble. If we run it against the full list of files, we get an out of memory error.

The error comes out on STDOUT:

#
# java.lang.OutOfMemoryError: GC overhead limit exceeded
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 15690"...

I'm assuming what's happening is this - whatever JVM is running the process that defines the input splits is getting totally overwhelmed trying to handle 3 million files; it's using too much memory, and YARN is killing it. I'm willing to be corrected on this theory.

So, what I need to know is how to increase the YARN memory limit for the container that's calculating the input splits, not for the mappers or reducers. Then, I need to know how to make this take effect. (I've Googled pretty extensively on this, but with all the iterations of Hadoop over the years, it's hard to find a solution that works with the most recent versions...)

This is Hadoop 2.6.0, using the MapReduce API and the YARN framework, on AWS Elastic MapReduce 4.2.0.

I would spin up a new EMR cluster and throw a larger master instance at it to see if that is the issue.

--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.4xlarge InstanceGroupType=CORE,InstanceCount=1,InstanceType=m3.xlarge

If the master is running out of memory when computing the input splits, you can modify the relevant settings through the EMR Configuration mechanism.
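For example (untested, with placeholder values you would tune for your instance size), an EMR configurations JSON along these lines raises the Hadoop client heap used at job submission and the MapReduce ApplicationMaster's container and heap sizes; the property names are the standard hadoop-env / mapred-site ones:

[
  {
    "Classification": "hadoop-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "HADOOP_HEAPSIZE": "4096"
        }
      }
    ]
  },
  {
    "Classification": "mapred-site",
    "Properties": {
      "yarn.app.mapreduce.am.resource.mb": "4096",
      "yarn.app.mapreduce.am.command-opts": "-Xmx3276m"
    }
  }
]

This can be passed via the --configurations option of aws emr create-cluster when the cluster is launched.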

Instead of running the MapReduce job on 3 million individual files, you can merge them into manageable bigger files using any of the following approaches:

1. Create Hadoop Archive (HAR) files from the small files.
2. Create a sequence file for every 10K-20K files using a MapReduce program (a minimal sketch of this approach follows the list).
3. Create a sequence file from your individual small files using the forqlift tool.
4. Merge your small files into bigger files using Hadoop-Crush.
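As a rough sketch of option 2 (a standalone packer rather than a full MapReduce job, just to show the idea), the program below packs every file in one directory into a single SequenceFile keyed by file name; it assumes each file is small enough to buffer in memory, and the class and path names are only illustrative:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs every file directly under an input directory into one SequenceFile,
// keyed by file name so the original name is still available downstream.
public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws IOException {
        Path inputDir = new Path(args[0]);   // e.g. a folder of ~10K small files
        Path outputFile = new Path(args[1]); // e.g. the packed .seq file to create

        Configuration conf = new Configuration();
        FileSystem fs = inputDir.getFileSystem(conf);

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(outputFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isDirectory()) {
                    continue; // only pack regular files
                }
                byte[] contents = new byte[(int) status.getLen()];
                FSDataInputStream in = fs.open(status.getPath());
                try {
                    IOUtils.readFully(in, contents, 0, contents.length);
                } finally {
                    in.close();
                }
                writer.append(new Text(status.getPath().getName()),
                        new BytesWritable(contents));
            }
        } finally {
            writer.close();
        }
    }
}

You would run something like this per folder (or per batch of 10K-20K files) and then point the real job at the packed files with SequenceFileInputFormat.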

Once you have the bigger files ready, you can run the MapReduce on your whole data set.
