
HDFS - loading a massive amount of files

For testing purposes I'm trying to load a massive amount of small files into HDFS. Actually we are talking about 1 million (1'000'000) files with a size from 1KB to 100KB. I generated those files with an R script on a Linux system, all in one folder. Every file has an information structure consisting of a header with product information and a varying number of columns with numeric information.

The problem is when I try to upload those local files into HDFS with the command:

hdfs dfs -copyFromLocal /home/user/Documents/smallData /

Then I get one of the following Java heap size errors:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

I use the Cloudera CDH5 distribution with a Java heap size of about 5 GB. Is there another way than increasing this Java heap size even more? Maybe a better way to load this mass amount of data into HDFS?

I'm very thankful for every helpful comment!

Even if you increase the memory and manage to store the files in HDFS, you will run into many problems when it comes to processing them.

Problems with small files and HDFS

A small file is one which is significantly smaller than the HDFS block size (default 64MB). If you're storing small files, then you probably have lots of them (otherwise you wouldn't turn to Hadoop), and the problem is that HDFS can't handle lots of files.

Every file, directory and block in HDFS is represented as an object in the namenode's memory, each of which occupies about 150 bytes, as a rule of thumb. So 10 million files, each using a block, would use about 3 gigabytes of memory (10 million file objects plus 10 million block objects at roughly 150 bytes each). Scaling up much beyond this level is a problem with current hardware. Certainly a billion files is not feasible.

Furthermore, HDFS is not geared up to efficiently accessing small files: it is primarily designed for streaming access of large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.

Problems with small files and MapReduce

Map tasks usually process a block of input at a time (using the default FileInputFormat). If the files are very small and there are a lot of them, then each map task processes very little input, and there are a lot more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1GB file broken into 16 64MB blocks with 10,000 or so 100KB files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file.

There are a couple of features to help alleviate the bookkeeping overhead: task JVM reuse for running multiple map tasks in one JVM, thereby avoiding some JVM startup overhead (see the mapred.job.reuse.jvm.num.tasks property), and MultiFileInputSplit, which can run more than one split per map.
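
A rough sketch of switching on JVM reuse, assuming the old mapred (MRv1) API; the wrapper class here is made up for illustration:

import org.apache.hadoop.mapred.JobConf;

public class JvmReuseExample {
    public static JobConf configure(JobConf conf) {
        // -1 = reuse the task JVM for an unlimited number of map tasks of this job
        // (this sets the mapred.job.reuse.jvm.num.tasks property mentioned above)
        conf.setNumTasksToExecutePerJvm(-1);
        return conf;
    }
}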

Solution

Hadoop Archives (HAR files)

Create a .HAR file. Hadoop Archives (HAR files) were introduced to HDFS in 0.18.0 to alleviate the problem of lots of files putting pressure on the namenode's memory. HAR files work by building a layered filesystem on top of HDFS. A HAR file is created using the hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files:

hadoop archive -archiveName name -p <parent> <src>* <dest> 
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
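
Once created, the archive is addressed through the har:// scheme. A minimal sketch of listing its contents from Java, assuming the foo.har archive produced by the command above (error handling omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHar {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dir1 inside the archive created by the example command above
        Path dir = new Path("har:///user/zoo/foo.har/dir1");
        FileSystem fs = dir.getFileSystem(conf);
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath() + "  " + status.getLen());
        }
    }
}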

Sequence Files

The usual response to questions about "the small files problem" is: use a SequenceFile. The idea here is that you use the filename as the key and the file contents as the value. This works very well in practice. Going back to the 10,000 100KB files, you can write a program to put them into a single SequenceFile, and then you can process them in a streaming fashion (directly or using MapReduce) operating on the SequenceFile. There are a couple of bonuses too. SequenceFiles are splittable, so MapReduce can break them into chunks and operate on each chunk independently. They support compression as well, unlike HARs. Block compression is the best option in most cases, since it compresses blocks of several records (rather than per record).
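
A minimal sketch of packing a local directory of small files into one block-compressed SequenceFile, with the filename as key and the file bytes as value (the class name and both paths are made up for illustration; assumes a Hadoop client configuration on the classpath):

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/user/hadoop/smallData.seq");   // illustrative output path
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new DefaultCodec()))) {
            for (File f : new File("/home/user/Documents/smallData").listFiles()) {
                byte[] content = Files.readAllBytes(f.toPath());
                // filename as key, whole file content as value
                writer.append(new Text(f.getName()), new BytesWritable(content));
            }
        }
    }
}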

HBase

If you are producing lots of small files, then, depending on the access pattern, a different type of storage might be more appropriate. HBase stores data in MapFiles (indexed SequenceFiles), and is a good choice if you need to do MapReduce-style streaming analyses with the occasional random lookup. If latency is an issue, then there are lots of other choices.
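
A minimal sketch of storing one small file as a row in HBase, with the filename as row key and the file bytes in a single cell (the table name, column family and file path are made up for illustration):

import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutSmallFile {
    public static void main(String[] args) throws Exception {
        byte[] content = Files.readAllBytes(Paths.get("/home/user/Documents/smallData/file1.txt"));
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("small_files"))) {
            // row key = filename, one cell holds the whole file content
            Put put = new Put(Bytes.toBytes("file1.txt"));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("content"), content);
            table.put(put);
        }
    }
}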

Try to increase HEAPSIZE

HADOOP_HEAPSIZE=2048 hdfs dfs -copyFromLocal /home/user/Documents/smallData 

Look here.

First of all: if this isn't a stress test on your namenode, it's ill-advised to do this. But I assume you know what you are doing. (Expect slow progress on this.)

If the objective is just to get the files onto HDFS, try doing this in smaller batches or set a higher heap size on your Hadoop client.

You do this, as rpc1 mentioned in his answer, by prefixing HADOOP_HEAPSIZE=<mem in MB here> to your hadoop fs -put command.

The Hadoop Distributed File System is not good with many small files, but it is good with many big files. HDFS keeps a record in a lookup table that points to every file/block in HDFS, and this lookup table is usually loaded into memory. So you should not just increase the Java heap size of the client but also increase the heap size of the namenode inside hadoop-env.sh; these are the defaults:

export HADOOP_HEAPSIZE=1000
export HADOOP_NAMENODE_INIT_HEAPSIZE="1000"

If you are going to do processing on those files, you should expect low performance on the first MapReduce job you run on them (Hadoop creates as many map tasks as there are files/blocks, and this will overload your system unless you use CombineFileInputFormat). I advise you to either merge the files into big files (64MB/128MB) or use another data source (not HDFS).
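
A minimal sketch of the CombineFileInputFormat route, using CombineTextInputFormat so that many small files are packed into each split (the 128MB cap and job name are just example values; the input path assumes the files were copied to /smallData on HDFS as in the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CombineSmallFilesJob {
    public static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "combine-small-files");
        // Pack many small files into splits of at most ~128MB each
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.addInputPath(job, new Path("/smallData"));
        return job;
    }
}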

To solve this problem, I build a single file with a certain format. The content of the file is all of the small files. The format looks like this:

<DOC>
  <DOCID>1</DOCID>
  <DOCNAME>Filename</DOCNAME>
  <DOCCONTENT>
    Content of file 1
  </DOCCONTENT>
</DOC>

This structure could have more or fewer fields, but the idea is the same. For example, I have used this structure:

<DOC>
  <DOCID>1</DOCID>
  Content of file 1
</DOC>

And handled more than six million files.

If you want each file to be processed by one map task, you can delete the \n characters between the <DOC> and </DOC> tags. After this, you only parse the structure and have the doc identifier and content.
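
A minimal sketch of the packing step described above, following the first format and dropping newlines so that each <DOC> record stays on one line (both paths are illustrative):

import java.io.File;
import java.io.PrintWriter;
import java.nio.file.Files;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        File[] inputs = new File("/home/user/Documents/smallData").listFiles();
        try (PrintWriter out = new PrintWriter("/home/user/Documents/packed.txt", "UTF-8")) {
            int id = 1;
            for (File f : inputs) {
                // Strip newlines so each <DOC> record occupies a single line
                String content = new String(Files.readAllBytes(f.toPath()), "UTF-8")
                        .replace("\r", " ").replace("\n", " ");
                out.println("<DOC><DOCID>" + id++ + "</DOCID>"
                        + "<DOCNAME>" + f.getName() + "</DOCNAME>"
                        + content + "</DOC>");
            }
        }
    }
}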
